|
6 | 6 |
|
7 | 7 | ## Dataset attributes of version 2023-12-15
|
8 | 8 |
|
9 |
| -Data was downloaded using the [CELLxGENE Discover Census version `2023-12-15`](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_data_release_info.html#lts-2023-12-15). We first downloaded cellxgene census version 2023-12-15 using the `cellxgene_census` python API. We limited cell data to `organism=”Homo sapiens”`, with a non “na” `suspension_type`, `is_primary_data=True`, and `disease=”normal”` to limit to non-diseased tissues that are also the primary data source per cell to make sure that cells are only included once in the download. We tracked metadata including “assay”, “sex”, “development_stage”, “tissue_general”, “dataset_id” and “self_reported_ethnicity”. The metadata “assay”, “tissue_general”, and “dataset_id” were used to construct dataset splits into train, validation, and test sets. The training set represented 99% of the downloaded cells. We partitioned the data by dataset_id into a train set (99%) and a hold-out set (1%), to make sure that the hold-out datasets were independently collected single cell experiments, which helps evaluate generalizability to new future datasets. In this training split, we made sure that all “assay” and “tissue_general” labels were present in the training set so that our model would have maximal visibility into different tissues and assay biases. Finally the 1% hold-out set was split further into a validation and test set. This final split was mostly done randomly by cell, however we set aside a full dataset into the test split so that we could evaluate performance after training on a completely unseen dataset, including when monitoring the validation loss during training. |
| 9 | +Data was downloaded using the [CELLxGENE Discover Census version `2023-12-15`](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_data_release_info.html#lts-2023-12-15). We first downloaded CELLxGENE census version 2023-12-15 using the `cellxgene_census` python API. We limited cell data to `organism="Homo sapiens"`, with a non "na" `suspension_type`, `is_primary_data=True`, and `disease="normal"` to limit to non-diseased tissues that are also the primary data source per cell to make sure that cells are only included once in the download. We tracked metadata including "assay", "sex", "development_stage", "tissue_general", "dataset_id" and "self_reported_ethnicity". The metadata "assay", "tissue_general", and "dataset_id" were used to construct dataset splits into train, validation, and test sets. The training set represented 99% of the downloaded cells. We partitioned the data by dataset_id into a train set (99%) and a hold-out set (1%), to make sure that the hold-out datasets were independently collected single cell experiments, which helps evaluate generalizability to new future datasets. In this training split, we made sure that all "assay" and "tissue_general" labels were present in the training set so that our model would have maximal visibility into different tissues and assay biases. Finally the 1% hold-out set was split further into a validation and test set. This final split was mostly done randomly by cell, however we set aside a full dataset into the test split so that we could evaluate performance after training on a completely unseen dataset, including when monitoring the validation loss during training. |
10 | 10 |
|
11 |
| -These parameters resulted in 23.87 Million single cells collected from a variety of public datasets, all hosted by CZI cell x gene census. After the splitting procedure we had: |
| 11 | +These parameters resulted in 23.87 Million single cells collected from a variety of public datasets, all hosted by CZI CELLxGENE census. After the splitting procedure we had: |
12 | 12 |
|
13 | 13 | - 23.64 Million cells in the training split
|
14 | 14 | - 0.13 Million cells in the validation split
|
@@ -53,11 +53,11 @@ Different assays have different ranges of reported gene measurements. On the low
|
53 | 53 |
|
54 | 54 | #### Dataset distribution
|
55 | 55 |
|
56 |
| -Dataset (eg a publication that produces data and uploads to cellxgene) leads to known batch effects due to different handling proceedures, collection procedures, etc. We stratify our training vs hold-out split by this covariate for this reason. Exploring the breakdown of datasets we see that the top 10 datsets represent approximately 10 million cells of the full cellxgene datset. The largest dataset alone has 4 million cells. |
| 56 | +Dataset (for example, a publication that produces data and uploads to CELLxGENE) leads to known batch effects due to different handling procedures, collection procedures, and more. Hence, we stratify our training rather than hold out split by this covariate. Exploring the breakdown of datasets, we see that the top 10 datasets represent approximately 10 million cells of the full CELLxGENE dataset. The largest dataset alone has 4 million cells. |
57 | 57 |
|
58 | 58 | 
|
59 | 59 |
|
60 |
| -Looking at the makeup of these top datasets, we see that most represent single tissue categories predominately. Most of these tend to be nervous system datsets with the exception of one which is balanced between many cell types. |
| 60 | +Looking at the makeup of these top datasets, we see that they represent single tissue categories predominately. Most of these tend to be nervous system datasets, with the exception of one that is balanced between many cell types. |
61 | 61 | 
|
62 | 62 |
|
63 | 63 | ## References
|
@@ -87,7 +87,7 @@ Our training, validation and test data, including subsets made available for tes
|
87 | 87 | * Publication Reference: Cheng et al. (2018) Cell Reports; Publication: https://doi.org/10.1016/j.celrep.2018.09.006 Dataset Version: https://datasets.cellxgene.cziscience.com/912d943b-9060-4fd3-a12c-ad641a89f0e4.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/43d4bb39-21af-4d05-b973-4c1fed7b916c
|
88 | 88 | * Publication Reference: Cowan et al. (2020) Cell; Publication: https://doi.org/10.1016/j.cell.2020.08.013 Dataset Version: https://datasets.cellxgene.cziscience.com/b1989183-5808-46ab-87f5-978febb2d26e.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/2f4c738f-e2f3-4553-9db2-0582a38ea4dc
|
89 | 89 | * Publication Reference: Cowan et al. (2020) Cell; Publication: https://doi.org/10.1016/j.cell.2020.08.013 Dataset Version: https://datasets.cellxgene.cziscience.com/c0d3867e-1a7b-4e57-af62-c563f1934226.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/2f4c738f-e2f3-4553-9db2-0582a38ea4dc
|
90 |
| -* Publication Reference: Dom\u00ednguez Conde et al. (2022) Science; Publication: https://doi.org/10.1126/science.abl5197 Dataset Version: https://datasets.cellxgene.cziscience.com/08f58b32-a01b-4300-8ebc-2b93c18f26f7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3 |
| 90 | +* Publication Reference: Domínguez Conde et al. (2022) Science; Publication: https://doi.org/10.1126/science.abl5197 Dataset Version: https://datasets.cellxgene.cziscience.com/08f58b32-a01b-4300-8ebc-2b93c18f26f7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3 |
91 | 91 | * Publication Reference: Easter et al. (2024) Nat Commun; Publication: https://doi.org/10.1038/s41467-024-49037-y Dataset Version: https://datasets.cellxgene.cziscience.com/221dff56-a47d-4563-90ed-51b60e2f16d5.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/71f4bccf-53d4-4c12-9e80-e73bfb89e398
|
92 | 92 | * Publication Reference: Egozi et al. (2021) Nat Med; Publication: https://doi.org/10.1038/s41591-021-01586-1 Dataset Version: https://datasets.cellxgene.cziscience.com/e3a84fef-b6df-49b2-b0ca-ecaf444773ec.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/7651ac1a-f947-463a-9223-a9e408a41989
|
93 | 93 | * Publication Reference: Elmentaite et al. (2020) Developmental Cell; Publication: https://doi.org/10.1016/j.devcel.2020.11.010 Dataset Version: https://datasets.cellxgene.cziscience.com/3aedefc0-401a-4ee8-a1b5-a0ffc20e1ff2.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/17481d16-ee44-49e5-bcf0-28c0780d8c4a
|
@@ -282,7 +282,7 @@ Our training, validation and test data, including subsets made available for tes
|
282 | 282 | * Publication Reference: Smillie et al. (2019) Cell; Publication: https://doi.org/10.1016/j.cell.2019.06.029 Dataset Version: https://datasets.cellxgene.cziscience.com/6c483976-30de-4835-97f0-2b9bc93614e7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/33d19f34-87f5-455b-8ca5-9023a2e5453d
|
283 | 283 | * Publication Reference: Smith et al. (2021) Proc. Natl. Acad. Sci. U.S.A.; Publication: https://doi.org/10.1073/pnas.2023333118 Dataset Version: https://datasets.cellxgene.cziscience.com/bf50dbfb-9ca0-4f0d-8deb-a1a810a0e313.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e02201d7-f49f-401f-baf0-1eb1406546c0
|
284 | 284 | * Publication Reference: Smith et al. (2021) Proc. Natl. Acad. Sci. U.S.A.; Publication: https://doi.org/10.1073/pnas.2023333118 Dataset Version: https://datasets.cellxgene.cziscience.com/ff7778bf-7a65-4d23-a9f4-b26c47926c28.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e02201d7-f49f-401f-baf0-1eb1406546c0
|
285 |
| -* Publication Reference: Sol\u00e9-Boldo et al. (2020) Commun Biol; Publication: https://doi.org/10.1038/s42003-020-0922-4 Dataset Version: https://datasets.cellxgene.cziscience.com/bc8d7152-3b69-4153-9314-7342ae58fbde.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/c353707f-09a4-4f12-92a0-cb741e57e5f0 |
| 285 | +* Publication Reference: Solé-Boldo et al. (2020) Commun Biol; Publication: https://doi.org/10.1038/s42003-020-0922-4 Dataset Version: https://datasets.cellxgene.cziscience.com/bc8d7152-3b69-4153-9314-7342ae58fbde.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/c353707f-09a4-4f12-92a0-cb741e57e5f0 |
286 | 286 | * Publication Reference: Stephenson et al. (2021) Nat Med; Publication: https://doi.org/10.1038/s41591-021-01329-2 Dataset Version: https://datasets.cellxgene.cziscience.com/46586a98-b75d-4557-9cc4-839fc28e67d5.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/ddfad306-714d-4cc0-9985-d9072820c530
|
287 | 287 | * Publication Reference: Stewart et al. (2019) Science; Publication: https://doi.org/10.1126/science.aat5031 Dataset Version: https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec
|
288 | 288 | * Publication Reference: Stewart et al. (2019) Science; Publication: https://doi.org/10.1126/science.aat5031 Dataset Version: https://datasets.cellxgene.cziscience.com/fe7e4408-7390-4f93-95aa-ffe472843421.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec
|
|
0 commit comments