Krishna Shravya Gade | CSci 5751
ii) COVID-19 case counts dataset
1. Countries contributing strains for the study
IV. Challenges and lessons learnt
In the battle against COVID-19, governments all over the world are trying to understand the nature of the virus as well as possible in order to mitigate its consequences. Like the Zika virus, which has been classified into three lineages, each showing different symptoms, the Novel Coronavirus strains are being analysed by biomedical labs all over the world to construct a lineage tree, or phylogenetic tree. This analysis might help governments deal with each strain differently by understanding its specific features.
With this as the basis of the study, a dataset of the most dominant strains all over the world was combined with the COVID-19 case and mortality dataset for those locations, to find whether a particular strain is more deadly or has less impact on the patient after infection.
The paper primarily covers the data manipulation and the steps taken to validate the data available for each analysis. The methods used and the conclusions drawn are explained in detail for every analysis.
Below is a phylogenetic tree of the coronavirus strains taken from the GISAID website [4]:
Figure 1: Phylogenetic tree of coronavirus strains [4]
Two datasets were used in the analysis – nCov strains dataset and COVID-19 case counts dataset. The detailed description of the attributes can be found in Appendix (VII 2). The details of the dataset are explained below:
- Source: hCoV-19 GISAID dataset [5]
- Size: 15800 records
- Format: TSV
- Relevant attributes: strain(string), date(date), country(string), division(string), host(string), originating lab(string)
Initially, the time series version of this dataset was considered instead of the cumulative dataset. But given the dependencies between the strains dataset and the case counts dataset, it was observed that the strains dataset lacked the richness required for a time series analysis. This is covered in detail in the Analysis section (III.3).
- Source: COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [6]
- Size: 3192 rows
- Format: CSV
- Relevant attributes: province(string), country(string), confirmed(longint), deaths(longint), recovered(longint), active(longint)
Some common validations were done on both the datasets:
- Number of null / na / missing values in every column of the data
- Uniqueness of the "key" value
- Total unique countries and divisions
- The nCov strains data had a division_exposure attribute specifying the division (state) where the patient was likely exposed to the virus. Upon exploration, it was found that this attribute was either the same as the division attribute or null for 92% of the data, meaning it carried no extra information beyond the division attribute. Hence, only the division attribute was considered for analysis.
- The strains data included non-human host strains, i.e. samples collected from bats, canines etc. These rows were removed from the analysis, as the motive behind the study is to analyse the effect of nCov on humans.
- Many country names in the COVID-19 case counts dataset were misspelled. The cleaning process is described in the Analysis section (III 2. a.).
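The validation and cleaning steps above can be sketched as follows. The report ran them in Spark on Databricks; this is a minimal pandas stand-in with toy rows, and the column names follow the dataset description in the Appendix (VII 2):

```python
import pandas as pd

# Toy stand-in for the strains TSV (the real file has 15800 rows)
strains = pd.DataFrame({
    "strain":   ["hCoV-19/A", "hCoV-19/B", "hCoV-19/C", "hCoV-19/D"],
    "country":  ["USA", "USA", None, "Australia"],
    "division": ["Washington", "New York", "Victoria", "Victoria"],
    "division_exposure": ["Washington", None, None, "Victoria"],
    "host":     ["Human", "Human", "Canine", "Human"],
})

# 1. Null / NA / missing values in every column
null_counts = strains.isna().sum()

# 2. Uniqueness of the "key" value (the strain name)
key_is_unique = strains["strain"].is_unique

# 3. Total unique countries and divisions
n_countries = strains["country"].nunique()
n_divisions = strains["division"].nunique()

# 4. Share of rows where division_exposure adds no information
#    (null, or identical to division)
redundant = (strains["division_exposure"].isna()
             | (strains["division_exposure"] == strains["division"]))
redundant_share = redundant.mean()

# 5. Keep only human hosts, as in the study
human_only = strains[strains["host"] == "Human"]
```

The same checks translate directly to Spark dataframe operations (`isNull` counts, `distinct().count()`, and a `filter` on host).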
The initial idea was to identify the specific virology labs contributing the most strains, but due to inconsistencies in the "originating lab" field of the strains data, the study pivoted to countries instead.
The processed strains data was categorized by the country that contributed each strain to the dataset. This was done by running a group-by query.
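The group-by step can be sketched as below (a pandas stand-in for the Spark group-by query used in the study; the rows are illustrative):

```python
import pandas as pd

# One row per sequenced strain, as in the processed strains data (toy rows)
strains = pd.DataFrame({"country": ["United Kingdom", "United Kingdom",
                                    "USA", "USA", "USA", "Australia"]})

# Strains contributed per country, sorted for the bar chart
contributions = (strains.groupby("country").size()
                 .sort_values(ascending=False)
                 .rename("strains"))
```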
Below is the bar graph for the top 20 countries that contributed the most strains. The United Kingdom, the United States of America and Australia were the top three contributing countries.
The potential conclusion from this visualization is that the UK, the USA and Australia are the countries researching the virus most extensively to find a vaccine, compared to other countries. The sharp dip in the bar chart shows that the USA and UK lead substantially over the other countries in this respect.
The motivation behind this analysis was to test the hypothesis that the number of confirmed cases in a country is correlated with the number of strains contributed by that country.
- To weed out misspelled entries in the COVID-19 counts dataset, the set of "country" values in the counts data was subtracted from the "country" values in the strains dataset. After finding possibly missing or wrongly spelled countries, regular-expression queries were used to build a mapping of country names between the two datasets. This also checked whether a country was missing from the COVID-19 counts dataset entirely.
- To parallelize the process of updating all the country names, a dictionary mapping each incorrect name to the correct one was cast to a broadcast variable, which can be shared by all the tasks. A user-defined function that looks up values in the broadcast variable was written and called to update the country attribute.
- The counts data was grouped by country and aggregated to create a new dataframe with country-wise sums of confirmed, deaths, recovered and active cases. This dataframe was inner-joined with the strains data on the country column.
- The Pearson correlation coefficient between strains contributed and confirmed cases was calculated to be 0.5857. This is not a strong correlation, hence a relationship cannot be established between these attributes.
- Below is the scatter plot of strains contributed by each country against confirmed cases in that country.
- As per the scatter plot above, there is no significant relationship between the strains contributed by a country and its confirmed cases.
- The weak correlation signifies the lack of any clear relationship between the strains contributed by a country and its number of infections.
- This could be due to various other factors, such as lack of research investment by governments, private research not publishing the genomes found, or not enough tests being done and published to reflect real counts.
- Only strains with high coverage were considered in the dataset, which could have contributed to the lack of enough data to establish a relationship.
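The pipeline described above can be sketched end to end. The report implemented the name fix with a Spark broadcast variable and a UDF; the pandas version below mirrors the same steps, and the name mapping and all counts are purely illustrative:

```python
import pandas as pd
from scipy.stats import pearsonr

# Province-level case counts with an illustrative misspelling ("US" vs "USA")
counts = pd.DataFrame({
    "country":   ["US", "US", "Australia", "Vietnam"],
    "confirmed": [300, 200, 50, 10],
    "deaths":    [15, 10, 1, 0],
})

# Strains contributed per country (toy numbers)
strains = pd.DataFrame({"country": ["USA", "Australia", "Vietnam"],
                        "strains": [40, 20, 2]})

# In Spark, this dict was shared as a broadcast variable and applied through
# a UDF; a plain replace does the same job here (mapping is illustrative)
name_fix = {"US": "USA"}
counts["country"] = counts["country"].replace(name_fix)

# Country-wise sums of the counts, then an inner join with the strains data
by_country = counts.groupby("country", as_index=False).sum()
joined = by_country.merge(strains, on="country", how="inner")

# Pearson correlation between strains contributed and confirmed cases
r, _ = pearsonr(joined["strains"], joined["confirmed"])
```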
The motivation behind this analysis was a research paper [2] published on bioRxiv speculating that the virus strain spreading on the west coast of the USA is less lethal than the one on the east coast, which is causing more deaths and more severe symptoms. The idea was to study the lethality of each strain in order to classify them. This analysis could bolster vaccine research, since individuals who have recovered could still be susceptible to a mutant of the virus, making vaccination a vain effort.
A time series analysis using the date each sample was collected and the case counts thereafter could have served as a good study, but the strains from the USA were only available for the month of January, when case counts in the country were in single digits. Hence the study was pivoted to a cumulative analysis instead. This is still significant, as these strains could be the parent strains of the ones currently active in the region.
- While validating the strains data, it was found that for many strains the finest location detail was at county level, but not enough records had that level of information. Hence the study shifted to state-level analysis to maintain uniformity.
- Some strains collected by the University of Washington Virology lab had the state recorded as "USA"; these records were disregarded.
- After correcting the date strings for some entries, the date-collected column was converted to a date type. It was found that all strains collected in the USA were from January 2020 only.
- A new column called category was created to classify strains as older, mid and newer, based on the date each strain was collected.
- The COVID-19 case counts and strains data were inner-joined on the state column.
- Using the VectorAssembler transformer in Spark MLlib, the columns confirmed, deaths, recovered, active and category were vectorised into a new column called features.
- Once the dataframe was cleaned and processed and the features were vectorized, a model had to be chosen for the data.
- KMeans: Initially, a KMeans clustering model from Spark MLlib was fit on the strains data to find similar strains, but the clustering failed and resulted in an insignificant classification. This could be attributed to the limitations of KMeans given the distribution of the data points and the presence of outliers in the data.
- Gaussian Mixture Model: The data was then fit to a Gaussian Mixture Model, which showed some pattern in the clustering: the lower end of the spectrum, with fewer confirmed cases and fewer deaths, formed one cluster; the mid level, where most strains were classified, formed another; and the higher numbers formed a third.
Below is the plot of Gaussian Mixture Model clusters:
- DBSCAN: Lastly, the data was fit with a DBSCAN model, which rendered significant information in the classification. After multiple trials, the parameters chosen for DBSCAN were an epsilon of 1 and a minimum of 20 samples per cluster, which gave around 388 clusters. A higher epsilon could render a better clustering in our case, but DBSCAN is a memory-intensive algorithm, and due to the memory limitation in Databricks the process was getting terminated.
Below is the plot for the DBSCAN results:
- The distribution of the data points under DBSCAN is clearly similar to GMM, but the clustering is more diverse and informative. Visually, the strains that are comparatively less lethal are in brown, the mid level is in yellow, and the blue ones appear relatively more lethal.
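The modelling steps above can be sketched as follows. The report used Spark MLlib's VectorAssembler, KMeans and GaussianMixture on Databricks (DBSCAN is not part of Spark MLlib, so it was presumably run outside it); this scikit-learn sketch mirrors the pipeline on toy feature vectors, and every number below is illustrative rather than taken from the study:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Toy feature vectors per strain: confirmed, deaths, recovered, active,
# category (0 = older, 1 = mid, 2 = newer), as assembled into "features"
noise = [5, 1, 3, 3, 0.1]
low  = rng.normal([ 10,  1,  5,   4, 0], noise, size=(30, 5))
mid  = rng.normal([100, 10, 50,  40, 1], noise, size=(30, 5))
high = rng.normal([500, 80, 90, 330, 2], noise, size=(30, 5))
features = StandardScaler().fit_transform(np.vstack([low, mid, high]))

# Gaussian Mixture Model, as in the study's second attempt
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(features)

# DBSCAN; eps and min_samples here are tuned to the toy data, not the
# report's epsilon of 1 with a minimum of 20 samples on the real data
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(features)
```

On well-separated toy data both models recover the three groups; on the real strains data the cluster structure was far less clean, which is why the parameter trials described above were needed.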
Combining the results from DBSCAN and GMM, we could potentially claim that some strains currently spreading in the USA display different properties compared to others. Even though the results of the study could be skewed by many other factors, the overall distribution and classification of the strains could point us in a direction. Scientists could use this result to probe further into the protein composition of these strains and make other observations to support or disprove the claim.
- The first major challenge was that my prior knowledge of genetics was not sufficient to deeply understand the data and perform the analysis. I referred to biomedical research papers and articles, which was both interesting and challenging for me as a computer science student.
- There were many inconsistencies in both datasets, and each time one was discovered I had to pivot the study according to the data available.
- Joining the two datasets on country led me to probe into spelling mistakes, missing country names and anomalies such as one dataset treating Hong Kong as a state of China. It took quite a bit of time and effort to find and fix these. Even though the COVID-19 case counts dataset had FIPS and ISO codes for countries, the strains data did not, which complicated things.
- Running against a hardware limit of 16 GB of memory on the Databricks community edition, I learnt to modify the same dataframe instead of creating a new one each time, to save memory. I also encountered memory issues while running DBSCAN clustering and adjusted its parameters to be memory-efficient while still delivering good clustering.
- Knowing enough about the subject matter is essential for data analysis.
The Novel Coronavirus strains analysis is a crucial step towards developing a vaccine and aiding decision-making for governments all over the world. This study could have produced even better results if the data were richer and more labs all over the world actively contributed to the dataset. The lack of a definite correlation between the number of strains contributed and confirmed cases in countries suggests that not all countries are reporting numbers accurately and contributing to the greater good. Even though this is an argument by contradiction, more research could lead to conclusive results.
The second half of the analysis, finding a deadlier strain, could serve as an early warning against a stronger and deadlier mutant of the Coronavirus. A protein-level analysis of these strains could reveal more specific features that make the virus more dangerous.
[1] Structure of a Data Analysis Report http://www.stat.cmu.edu/~brian/701/notes/paper-structure.pdf
[2] Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2 https://www.biorxiv.org/content/10.1101/2020.04.29.069054v1
[3] Mutant coronavirus strain has emerged that's even more contagious than original, study says https://www.post-gazette.com/news/science/2020/05/05/coronavirus-strain-study-scientists-new-mutant-more-contagious/stories/202005050156
[4] GISAID phylogenetic tree of coronavirus strains https://www.epicov.org/epi3/frontend#lightbox1597646798
[5] Genomic epidemiology https://www.epicov.org/epi3/frontend#5e283e
[6] COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University https://github.com/CSSEGISandData/COVID-19
[7] How genomic epidemiology is tracking the spread of COVID-19 locally and globally https://cen.acs.org/biological-chemistry/genomics/genomic-epidemiology-tracking-spread-COVID/98/i17
[8] Reconstructing evolutionary trees in parallel for massive sequences https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5751538/
[9] Stack Overflow https://stackoverflow.com/
[10] Spark documentation https://spark.apache.org/docs/2.2.0/ml-clustering.html
[11] DWgeek https://dwgeek.com/
[12] Next strain data description https://github.com/nextstrain/ncov/blob/master/docs/metadata.md
[13] CSSEGISandData data description https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
Link to the databricks code: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3847040375540205/1221180919542404/3031952302722608/latest.html
Strains dataset: I gratefully acknowledge the authors and researchers in originating and submitting labs of sequence data on which the analysis is based and GISAID for hosting and publishing the dataset.
COVID-19 counts dataset: I gratefully acknowledge Center for Systems Science and Engineering (CSSE) at Johns Hopkins University for providing the data to support this analysis.
Strains data [12]:
- Column 1: strain
- Column 2: virus
  Type of virus – nCov in our case.
- Column 3: gisaid_epi_isl
  If this genome is shared via GISAID, the EPI ISL is included here. In our example this is "EPI_ISL_413490".
- Column 4: genbank_accession
  If this genome is shared via GenBank, the accession number is included here. In our example this is "?", indicating that it hasn't (yet) been deposited in GenBank.
- Column 5: date (really important!)
  The sample collection date (not the sequencing date!), formatted as YYYY-MM-DD.
- Column 6: region
  The region the sample was collected in – for our example this is "Oceania". One of "Africa", "Asia", "Europe", "North America", "Oceania" or "South America".
- Column 7: country
  The country the sample was collected in.
- Column 8: division
  Division does not have a precise definition and is used differently for different regions. For samples in the USA, division is the state in which the sample was collected. For other countries, it might be a county, region or other administrative sub-division.
- Column 9: location
  Similar to division, but at a smaller geographic resolution. This data is often unavailable; missing data here is typically represented by an empty field, or the same value as division is used.
- Column 10: region_exposure
  If the sample has a known travel history and infection is thought to have occurred in that location, it is represented here. If there is no travel history, this is set to the same value as region.
- Column 11: country_exposure
  Analogous to region_exposure but for country. In our example, given the patient's travel history, this is set to "Iran".
- Column 12: division_exposure
  Analogous to region_exposure but for division. If the exposure division is unknown, the value of country_exposure may be specified here as well.
- Column 13: segment
  Unused. Typically the value "genome" is set here.
- Column 14: length
  Unused. Genome length (numeric value).
- Column 15: host
  Host from which the sample was collected. The dataset has multiple values, including "Human", "Canine", "Manis javanica" and "Rhinolophus affinis".
- Column 16: age
  Numeric age of the patient from whom the sample was collected, rounded to an integer value.
- Column 17: sex
  Sex of the patient from whom the sample was collected.
- Column 18: originating_lab
  Please see GISAID for more information.
- Column 19: submitting_lab
  Please see GISAID for more information.
- Column 20: authors
  Author of the genome sequence, or of the paper which announced this genome. Typically written as "LastName et al".
- Column 21: url
  The URL, if available, pointing to the genome data. For most SARS-CoV-2 data this is https://www.gisaid.org.
- Column 22: title
  The URL, if available, of the publication announcing these genomes.
- Column 23: date_submitted
  Date the genome was submitted to a public database (most often GISAID), in YYYY-MM-DD format.
COVID-19 case counts [13]:
- FIPS: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.
- Admin2: County name. US only.
- Province_State: Province, state or dependency name.
- Country_Region: Country, region or sovereignty name. The names of locations included on the website correspond with the official designations used by the U.S. Department of State.
- Last Update: MM/DD/YYYY HH:mm:ss (24-hour format, in UTC).
- Lat and Long_: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.
- Confirmed: Confirmed cases include presumptive positive cases and probable cases, in accordance with CDC guidelines as of April 14.
- Deaths: Death totals in the US include confirmed and probable, in accordance with CDC guidelines as of April 14.
- Recovered: Recovered cases outside China are estimates based on local media reports, and on state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from the COVID Tracking Project.
- Active: Active cases = total confirmed - total recovered - total deaths.
- Combined_Key: Admin2 + Province_State + Country_Region.
- Incidence_Rate: confirmed cases per 100,000 persons.
- Case-Fatality Ratio (%): number of recorded deaths / number of cases.
- US Testing Rate: total test results per 100,000 persons. The "total test results" equals "Total test results (Positive + Negative)" from the COVID Tracking Project.
- US Hospitalization Rate (%): total number hospitalized / number of confirmed cases. The "total number hospitalized" is the "Hospitalized – Cumulative" count from the COVID Tracking Project. The hospitalization rate and "Hospitalized – Cumulative" data are only presented for states which provide cumulative hospital data.
1 Coverage of a sequence is the average number of reads encoding the genome sequence. High coverage in the case of this dataset implies there are less than 0.1% Ns, which indicate an unknown nucleotide base (the equivalent of N/A in computer science terminology).
2 FIPS and ISO codes are internationally recognized codes for states in the USA and for countries, respectively.