Database

I haven't been posting for the last few weeks because I was busy improving my database skills. I had never really pushed very deep into SQL or relational algebra before. Relational algebra is the theoretical foundation upon which RDBMSs (Relational Database Management Systems) are built.

The projects I've worked on already had a stable ORM (Object Relational Mapper).

I used Stanford's "Databases" MOOC by Jennifer Widom. It's free and it's brilliant. I recommend it to every dev who, like me, didn't have any formal CS education.

Then I had to find and learn an ORM for Python. I chose SQLAlchemy and followed the introductory course "Introduction to SQLAlchemy" presented at PyCon 2013. Links to the video, slides and prerequisite material are on the tutorials page of sqlalchemy.org. The prerequisite material contains instructions for the interactive Python slides (sliderepl): half slide deck, half interactive Python shell, where you can try your own commands and do the embedded exercises. You can also get it from https://github.com/tony/introduction_to_sqlalchemy.
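To give a flavour of what the tutorial covers, here is a minimal sketch of SQLAlchemy's declarative pattern. The User model and its fields are hypothetical illustrations, not part of the course material; an in-memory SQLite database keeps the demo self-contained.

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    # hypothetical table used only to illustrate the declarative pattern
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(String)

# in-memory SQLite so the example runs without any server
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(User(name='ed'))
session.commit()
names = [u.name for u in session.query(User).all()]
print(names)
```

The same session / query API works unchanged against Postgres once the engine URL is swapped.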

Now I am ready to embark on the next step: retrieving all conversation trees from Twitter's database and storing the tree structure efficiently in Postgres for fast retrieval. Each conversation tree must be displayed near-instantaneously for crowdsourced analysis. Physicians' time is a scarce and precious resource.
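One common way to store such a tree is an adjacency list: each status row keeps its parent's id, and the tree is rebuilt in a single pass (or server-side with a recursive query in Postgres). A minimal sketch, with hypothetical status ids standing in for real tweets:

```python
from collections import defaultdict

# hypothetical (status_id, in_reply_to_status_id) rows; None marks the root
rows = [(1, None), (2, 1), (3, 1), (4, 2)]

children = defaultdict(list)
roots = []
for status_id, parent_id in rows:
    if parent_id is None:
        roots.append(status_id)
    else:
        children[parent_id].append(status_id)

def tree(node):
    # recursively expand a node into {id: [subtrees]}
    return {node: [tree(c) for c in children[node]]}

print(tree(roots[0]))
```

For the "fast retrieval" requirement, the equivalent server-side option in Postgres is a `WITH RECURSIVE` query over the same parent-id column.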

Classification

Classify status

Clinical

diagnosis

Question related to diagnosis.

  • clinical description asking for diagnosis
  • question pertaining to best diagnostic tests to confirm given differential

therapy

Determination of appropriate therapy for a problem already diagnosed.

Non-clinical

practical advice

Conversation

Jupyter notebook source on GitHub: conversation.ipynb

get conversation tree from question

Each #d status with a reply_count > 0 is the start of a diagnostic conversation tree.

reply_count

  • Not available in the standard API
  • Available from TweetScraper
  • dataset/replycount.py

The standard (free of charge) Twitter API doesn't allow you to retrieve all replies to a specific status. The method to work around this limitation:

  1. Use TweetScraper.
  2. Search for all replies to the user who posted the question status after a certain date and time.
  3. We need to filter those answers on "in_reply_to_status_id", but this field is not present in the JSON object obtained with TweetScraper...
  4. Get the full Twitter object with the standard API.
  5. Store those objects in the database to save API throttling and speed up further lookups.
  6. Filter all collected answers with status["in_reply_to_status_id"] == status_id.
  7. If it matches, add the answer to the corpus database.
  8. Repeat the process recursively for each answer with a non-null reply_count.
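The steps above can be sketched as a recursive function. scrape_replies and fetch_full_status are hypothetical stand-ins for the TweetScraper query (steps 1-2) and the standard-API hydration (step 4); the database caching of step 5 is elided, and the fake statuses exist only to make the sketch runnable.

```python
# hypothetical in-memory stand-in for Twitter: id -> full status object
FAKE_STATUSES = {
    1: {'id': 1, 'user': 'DrKoibo', 'in_reply_to_status_id': None, 'reply_count': 2},
    2: {'id': 2, 'user': 'a', 'in_reply_to_status_id': 1, 'reply_count': 1},
    3: {'id': 3, 'user': 'b', 'in_reply_to_status_id': 1, 'reply_count': 0},
    4: {'id': 4, 'user': 'c', 'in_reply_to_status_id': 2, 'reply_count': 0},
}

def scrape_replies(user):
    # stand-in for: scrapy crawl TweetScraper -a query="to:<user> since:<date>"
    return [s for s in FAKE_STATUSES.values()
            if s['in_reply_to_status_id'] is not None]

def fetch_full_status(status_id):
    # stand-in for hydrating the scraped status via the standard API (step 4)
    return FAKE_STATUSES[status_id]

def collect_tree(root, corpus):
    for candidate in scrape_replies(root['user']):
        full = fetch_full_status(candidate['id'])
        if full['in_reply_to_status_id'] == root['id']:  # step 6
            corpus.append(full)                          # step 7
            if full['reply_count'] > 0:                  # step 8: recurse
                collect_tree(full, corpus)

corpus = []
collect_tree(FAKE_STATUSES[1], corpus)
print([s['id'] for s in corpus])
```

Note the depth-first order: each answer's own replies are collected before moving on to its siblings.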

The original tweet is the first doc(s)toctoc tweet, posted on 2012-06-06: https://twitter.com/DrKoibo/status/210290960695959553. The search query is "to:DrKoibo since:2012-06-06":

# using pipenv
pipenv run scrapy crawl TweetScraper -a query="to:DrKoibo since:2012-06-06"

This returns 8,111 statuses (as of 2018-03-29).

Database structure

  • PostgreSQL
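A plausible schema for the corpus, sketched with SQLAlchemy, is a self-referential table: each status keeps a foreign key to the status it replies to, which PostgreSQL can expand into a whole conversation tree with a recursive query. The model below is an assumption on my part, not the final schema; SQLite stands in for the Postgres connection so the sketch runs anywhere.

```python
from sqlalchemy import create_engine, Column, BigInteger, Text, ForeignKey
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Status(Base):
    # hypothetical adjacency-list model for the conversation corpus
    __tablename__ = 'status'
    id = Column(BigInteger, primary_key=True)  # Twitter status id
    text = Column(Text)
    in_reply_to_status_id = Column(BigInteger, ForeignKey('status.id'))
    replies = relationship('Status')  # one-to-many: direct children

engine = create_engine('sqlite://')  # stand-in for the Postgres URL
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add_all([
    Status(id=1, text='question'),
    Status(id=2, text='first answer', in_reply_to_status_id=1),
    Status(id=3, text='second answer', in_reply_to_status_id=1),
])
session.commit()
root = session.get(Status, 1)
print([r.id for r in root.replies])
```

With this layout, fetching a full tree in Postgres is a single `WITH RECURSIVE` query seeded on the root status id, which should keep retrieval fast enough for interactive display.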