Documentation by Raghava
- Went through introductory tutorials for NetworkX, Neo4j and the ICD 11 API (both the Foundation data and the MMS data)
- Introduced to MongoDB and Elasticsearch.
- Ran Python scripts to download data from the ICD 11 API
- Parsed the raw JSON files into a readable format
- Pushed local data to MongoDB
- Imported the JSON file into a NetworkX graph, with each disease as a single node and parent-child relationships as edges
- Created a list of dictionaries corresponding to the diseases in ICD 11
- Wrote a recursive function to extract all cardiovascular diseases from the nodes and stored the IDs in a separate list
- Used the list of cardiovascular diseases created last week to create a hierarchical tree in NetworkX starting from the Root Cardiovascular code as taken from the ICD 11 Browser
- Plotted the tree to visualize the hierarchy and the spread of diseases
- Created a list of all shortest paths from the root node to the different leaf nodes to gauge the number of children per root
- Conducted some EDA to map codes to their titles and get a count of first-degree children per root disease
- Plotted simple bar and bubble plots to visualize the diseases with the maximum number of children
- Extracted lists of disease hierarchies for 4 main Cardiovascular disease groups: Cardiomyopathy, Ischaemic Heart Disease, Cardiac Arrhythmia and Heart Valve Disease using a recursive algorithm
- Fed the resulting dictionaries into d3.js to plot the tree hierarchies
- Familiarized myself with Elasticsearch
- Got set up on an AWS server to work with the big ICD 11 data
- Created a list of all Cardiovascular diseases using recursion
- Uploaded this file to the AWS server and ran the Elasticsearch indexing and search Python scripts, which output a list of dictionaries with the pmid, title and abstract for each disease
- Then used the pmid and title to build a dataframe of the number of occurrences of each disease in our case records
- Then performed TF-IDF and K-Means clustering on this dataframe; the results were inconclusive
- Since last week's results were inconclusive, performed t-SNE on the titles to see visually how they clustered
- Found a large number of very small clusters, which is probably why the K-Means results were inconclusive
- Then narrowed down to 4 particular clusters, plotted them again using t-SNE and found the most commonly used words per cluster
- Repeated the same process for titles related to heart failures and, as expected, found that "heart" and "failures" were the most commonly used words in these titles
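
The recursive extraction and the d3.js hierarchy steps in the log can be illustrated roughly as follows. This is only a minimal sketch, not the actual scripts: the record layout (`id`, `title`, `children`), the IDs and the toy records are all assumptions made for illustration, and the real ICD 11 API fields may differ.

```python
import json

# Toy records standing in for the parsed ICD 11 data.
# The field names "id", "title" and "children" are assumed, not taken from the API.
records = [
    {"id": "root", "title": "Cardiovascular root (placeholder)", "children": ["a", "b"]},
    {"id": "a", "title": "Cardiomyopathy", "children": ["a1"]},
    {"id": "a1", "title": "Dilated cardiomyopathy", "children": []},
    {"id": "b", "title": "Cardiac arrhythmia", "children": []},
    {"id": "x", "title": "Unrelated disease", "children": []},
]
by_id = {r["id"]: r for r in records}


def collect_descendants(root_id, by_id):
    """Return root_id plus every descendant ID, in depth-first order."""
    ordered, visited = [], set()

    def visit(node_id):
        if node_id in visited:  # a disease can sit under more than one parent
            return
        visited.add(node_id)
        ordered.append(node_id)
        for child_id in by_id[node_id]["children"]:
            visit(child_id)

    visit(root_id)
    return ordered


def to_d3(node_id, by_id):
    """Build the nested {"name": ..., "children": [...]} shape that d3 hierarchy layouts read."""
    node = by_id[node_id]
    out = {"name": node["title"]}
    kids = [to_d3(child_id, by_id) for child_id in node["children"]]
    if kids:
        out["children"] = kids
    return out


cardio_ids = collect_descendants("root", by_id)
tree = to_d3("root", by_id)
print(cardio_ids)
print(json.dumps(tree, indent=2))
```

Calling `to_d3` with the ID of one disease group (for example the Cardiomyopathy node) instead of the root gives the sub-hierarchy for that group alone. For very deep hierarchies, Python's recursion limit may need attention.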
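
The root-to-leaf paths and the first-degree children counts can be sketched without any library. The log used NetworkX for this; the version below swaps in a plain adjacency dictionary to stay self-contained, and the node names are invented. In a tree there is exactly one path from the root to each leaf, so that path is also the shortest one.

```python
# Hypothetical hierarchy: parent ID -> list of child IDs (names are made up).
children = {
    "root": ["cm", "ihd", "arr"],
    "cm": ["cm1", "cm2"],
    "ihd": ["ihd1"],
    "arr": [],
    "cm1": [],
    "cm2": [],
    "ihd1": [],
}


def root_to_leaf_paths(node, children, prefix=()):
    """Return every path from `node` down to a leaf as a list of IDs."""
    path = prefix + (node,)
    kids = children.get(node, [])
    if not kids:  # leaf reached: the accumulated path is complete
        return [list(path)]
    paths = []
    for kid in kids:
        paths.extend(root_to_leaf_paths(kid, children, path))
    return paths


paths = root_to_leaf_paths("root", children)

# Count of first-degree (direct) children per node, largest first.
child_counts = {node: len(kids) for node, kids in children.items() if kids}
ranked = sorted(child_counts.items(), key=lambda item: item[1], reverse=True)

for p in paths:
    print(" -> ".join(p))
print(ranked)
```

The `ranked` list is the kind of table that can go straight into a bar or bubble plot once the IDs are mapped to their titles.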
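
Finding the most commonly used words per cluster comes down to a word count grouped by cluster label. The sketch below assumes the cluster labels already exist (for example from K-Means) and uses invented titles and a small hand-picked stopword list; it is not the code or the data from the case records.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stopword list (an assumption, not the one used in the analysis).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "with", "for", "to", "on"}

# (cluster label, title) pairs; both the labels and the titles are made up.
titles_with_cluster = [
    (0, "Heart failure with preserved ejection fraction"),
    (0, "Management of acute heart failure"),
    (0, "Heart failure in the elderly"),
    (1, "Atrial fibrillation and stroke risk"),
    (1, "Catheter ablation for atrial fibrillation"),
]


def top_words(pairs, n=2):
    """Return the n most frequent non-stopword tokens for each cluster."""
    counts = defaultdict(Counter)
    for cluster, title in pairs:
        words = re.findall(r"[a-z]+", title.lower())
        counts[cluster].update(w for w in words if w not in STOPWORDS)
    return {c: [w for w, _ in ctr.most_common(n)] for c, ctr in counts.items()}


top = top_words(titles_with_cluster)
print(top)
```

Raw counts favour words that are frequent everywhere, so a domain-specific stopword list (or TF-IDF weights in place of counts) may give more distinctive words per cluster.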