Jupyter notebook source on GitHub: conversation.ipynb

Conversations on Twitter: Before and after July 6th 2012

According to a blog post by Bob Leggitt: "With older conversations, there won’t be any connectivity on Twitter between the different components of the dialogue. Originally, there was no such thing as an @reply or @mention, and for a long time after the function was introduced, the site did not formally link a reply to a ‘parent tweet’. In fact, nothing prior to 6th July 2012 will link together the dialogue components."

The first #doc(s)toctoc status was posted on 2012-06-06.

List of statuses posted up to and including 2012-07-06:

23 statuses (from 210290960695959553 to 221152797289226241):

221152797289226241 221142366302642176 220773627438694400 220555448007397376 219708917700247554 218635541023956992 218621627158642688 218295811006676992 217871430564577281 217660388374880256 217517250763165697 216418177138167808 216230683201372160 216114884105084928 215359264812838912 215353726960017409 215005937742782464 215005696280895489 214992412496498689 212504641613729792 211040882802163713 210419080203747329 210290960695959553
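
The date range of these IDs can be checked without calling the API. The sketch below assumes the IDs follow Twitter's snowflake layout, in which the bits above the lowest 22 hold a millisecond timestamp relative to a fixed epoch. The function name is hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone

# Snowflake epoch in milliseconds (2010-11-04 UTC); assumption: the IDs
# listed above were generated with the snowflake scheme.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_datetime(status_id: int) -> datetime:
    """Decode the creation time embedded in a snowflake status ID."""
    timestamp_ms = (status_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


first = snowflake_to_datetime(210290960695959553)
last = snowflake_to_datetime(221152797289226241)
print(first.date(), last.date())  # 2012-06-06 2012-07-06
```

Under that assumption, the lowest and highest IDs in the list fall on 2012-06-06 and 2012-07-06 (UTC), which matches the dates given above.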

Get the conversation tree from a question

Each #d status with a reply_count > 0 is the start of a diagnostic conversation tree.


  • Not available in the standard dev API
  • Available from TweetScraper
  • dataset/
  • Nota bene: the reply count includes only direct replies to the original tweet. It excludes self-replies (threads) and replies to replies.

The standard (free of charge) Twitter API does not provide a way to retrieve all replies to a specific status.

Two methods to work around this limitation:

Brute force collection of conversation
  1. Use TweetScraper.
  2. Search for all replies to the user who posted the question status after a given date and time.
  3. These replies need to be filtered on "in_reply_to_status_id", but this field is not present in the JSON object obtained with TweetScraper.
  4. Get the full Twitter object for each reply with the standard API.
  5. Store those objects in the database to limit API throttling and speed up later lookups.
  6. Filter all collected replies with status["in_reply_to_status_id"] == status_id.
  7. If true, add the reply to the corpus database.
  8. Repeat the process recursively for each reply with a non-zero reply_count.
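
Steps 6 to 8 can be sketched in memory as follows. This is a minimal illustration, not the notebook's implementation: it assumes the scraped replies have already been hydrated into full API objects (plain dicts with "id" and "in_reply_to_status_id" keys), and it stands in a Python list for the corpus database. The function name collect_conversation is hypothetical.

```python
from typing import Dict, Iterable, List


def collect_conversation(root_id: int, candidates: Iterable[dict]) -> List[dict]:
    """Return every status in the reply tree below root_id, depth first."""
    # Index the candidates by the status they reply to, so each lookup
    # is the filter status["in_reply_to_status_id"] == status_id.
    children: Dict[int, List[dict]] = {}
    for status in candidates:
        parent_id = status.get("in_reply_to_status_id")
        if parent_id is not None:
            children.setdefault(parent_id, []).append(status)

    corpus: List[dict] = []  # stand-in for the corpus database

    def walk(status_id: int) -> None:
        for reply in children.get(status_id, []):
            corpus.append(reply)
            walk(reply["id"])  # recurse into replies to this reply

    walk(root_id)
    return corpus


# Made-up IDs for illustration: 4 replies to a status outside the tree.
statuses = [
    {"id": 2, "in_reply_to_status_id": 1},
    {"id": 3, "in_reply_to_status_id": 2},
    {"id": 4, "in_reply_to_status_id": 9},
    {"id": 5, "in_reply_to_status_id": 1},
]
print([s["id"] for s in collect_conversation(1, statuses)])  # [2, 3, 5]
```

In the real process each recursion level would probably need its own "to:" search for the author of the reply, since the candidates collected for the original poster do not contain replies addressed to other users.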

The original tweet is the first doc(s)toctoc tweet, posted on 2012-06-06. The request is "to:DrKoibo since:2012-06-06".

# using pipenv
pipenv run scrapy crawl TweetScraper -a query="to:DrKoibo since:2012-06-06"

This returns 8111 statuses (as of 2018-03-29).

Collect conversation through web API

Treeverse is a Chromium extension written in TypeScript that visualizes a conversation as a tree. It reconstructs the conversation tree by parsing the HTML response to ${this.username}/status/${}. We could write a Python parser based on urllib to reconstruct the conversation tree and feed it to our Django mptt table. This parser should use the @doctoctocbot credentials, as many users have made their accounts private.
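
Once such a parser exists, its output would have to be flattened before it can be stored in a table where each row points to its parent. The sketch below shows one way to do that, assuming a hypothetical nested node shape ({"id": ..., "children": [...]}); the real parser output and the model fields are not defined in this notebook.

```python
from typing import Iterator, Optional, Tuple


def iter_rows(node: dict, parent_id: Optional[int] = None) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (status_id, parent_id) pairs, parents before their children."""
    yield node["id"], parent_id
    for child in node.get("children", []):
        yield from iter_rows(child, node["id"])


# Hypothetical parsed tree with made-up IDs.
tree = {
    "id": 1,
    "children": [
        {"id": 2, "children": [{"id": 3, "children": []}]},
        {"id": 5},
    ],
}
print(list(iter_rows(tree)))  # [(1, None), (2, 1), (3, 2), (5, 1)]
```

Emitting parents first means a parent row always exists by the time one of its children is inserted.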

Username change

Scrapers rely on user_name, which can change over time. Method to look for old user_names:

Update conversation trees

  • Use TweetScraper to collect all doc(s)toctoc statuses.
  • Compare the current and previous "nbr_reply" integer values: if present.nbr_reply > past.nbr_reply, rebuild the conversation tree (without deleting anything).
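
The comparison can be sketched as below, assuming each scrape is reduced to a dict mapping status ID to its "nbr_reply" value. The function name and the snapshot shape are hypothetical. A status missing from the previous scrape is treated as having had 0 replies, which is an assumption of this sketch.

```python
from typing import Dict, List


def trees_to_rebuild(past: Dict[int, int], present: Dict[int, int]) -> List[int]:
    """Return the IDs of statuses whose reply count has grown since the last scrape."""
    return [
        status_id
        for status_id, nbr_reply in present.items()
        if nbr_reply > past.get(status_id, 0)  # unseen statuses count as 0
    ]


# Made-up snapshots for illustration.
past = {1: 2, 2: 0, 3: 5}
present = {1: 3, 2: 0, 3: 5, 4: 1}
print(trees_to_rebuild(past, present))  # [1, 4]
```

Only the returned statuses need their conversation tree rebuilt. Statuses whose count is unchanged or lower are left alone, consistent with never deleting anything.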

Database structure

  • PostgreSQL