diff --git a/docs/docs/datasets/CELLxGENE.md b/docs/docs/datasets/CELLxGENE.md
Lines changed: 6 additions & 6 deletions
diff --git a/docs/docs/datasets/index.md b/docs/docs/datasets/index.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/docs/datasets/uniprot.md b/docs/docs/datasets/uniprot.md
Lines changed: 4 additions & 4 deletions
diff --git a/docs/docs/models/ESM-2/index.md b/docs/docs/models/ESM-2/index.md
Lines changed: 20 additions & 21 deletions
@@ -6,9 +6,9 @@
 
 ## Dataset attributes of version 2023-12-15
 
-Data was downloaded using the [CELLxGENE Discover Census version `2023-12-15`](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_data_release_info.html#lts-2023-12-15). We first downloaded cellxgene census version 2023-12-15 using the `cellxgene_census` python API. We limited cell data to `organism=”Homo sapiens”`, with a non “na” `suspension_type`, `is_primary_data=True`, and `disease=”normal”` to limit to non-diseased tissues that are also the primary data source per cell to make sure that cells are only included once in the download. We tracked metadata including “assay”, “sex”, “development_stage”, “tissue_general”, “dataset_id” and “self_reported_ethnicity”. The metadata “assay”, “tissue_general”, and “dataset_id” were used to construct dataset splits into train, validation, and test sets. The training set represented 99% of the downloaded cells. We partitioned the data by dataset_id into a train set (99%) and a hold-out set (1%), to make sure that the hold-out datasets were independently collected single cell experiments, which helps evaluate generalizability to new future datasets. In this training split, we made sure that all “assay” and “tissue_general” labels were present in the training set so that our model would have maximal visibility into different tissues and assay biases. Finally the 1% hold-out set was split further into a validation and test set. This final split was mostly done randomly by cell, however we set aside a full dataset into the test split so that we could evaluate performance after training on a completely unseen dataset, including when monitoring the validation loss during training.
+Data was downloaded using the [CELLxGENE Discover Census version `2023-12-15`](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_data_release_info.html#lts-2023-12-15). We first downloaded CELLxGENE census version 2023-12-15 using the `cellxgene_census` Python API. We limited cell data to `organism="Homo sapiens"`, with a non "na" `suspension_type`, `is_primary_data=True`, and `disease="normal"` to limit to non-diseased tissues that are also the primary data source per cell to make sure that cells are only included once in the download. We tracked metadata including "assay", "sex", "development_stage", "tissue_general", "dataset_id" and "self_reported_ethnicity". The metadata "assay", "tissue_general", and "dataset_id" were used to construct dataset splits into train, validation, and test sets. The training set represented 99% of the downloaded cells. We partitioned the data by dataset_id into a train set (99%) and a hold-out set (1%), to make sure that the hold-out datasets were independently collected single-cell experiments, which helps evaluate generalizability to new future datasets. In this training split, we made sure that all "assay" and "tissue_general" labels were present in the training set so that our model would have maximal visibility into different tissues and assay biases. Finally, the 1% hold-out set was split further into a validation and test set. This final split was mostly done randomly by cell; however, we set aside a full dataset into the test split so that we could evaluate performance after training on a completely unseen dataset, including when monitoring the validation loss during training.
 
-These parameters resulted in 23.87 Million single cells collected from a variety of public datasets, all hosted by CZI cell x gene census. After the splitting procedure we had:
+These parameters resulted in 23.87 Million single cells collected from a variety of public datasets, all hosted by CZI CELLxGENE census. After the splitting procedure we had:
 
 - 23.64 Million cells in the training split
 - 0.13 Million cells in the validation split
@@ -53,11 +53,11 @@ Different assays have different ranges of reported gene measurements. On the low
 
 #### Dataset distribution
 
-Dataset (eg a publication that produces data and uploads to cellxgene) leads to known batch effects due to different handling proceedures, collection procedures, etc. We stratify our training vs hold-out split by this covariate for this reason. Exploring the breakdown of datasets we see that the top 10 datsets represent approximately 10 million cells of the full cellxgene datset. The largest dataset alone has 4 million cells.
+Dataset (for example, a publication that produces data and uploads to CELLxGENE) leads to known batch effects due to different handling procedures, collection procedures, and more. Hence, we stratify our training versus hold-out split by this covariate. Exploring the breakdown of datasets, we see that the top 10 datasets represent approximately 10 million cells of the full CELLxGENE dataset. The largest dataset alone has 4 million cells.
 
 ![Top datasets make up a large fraction of cells](../assets/old_images/cellxgene/num_cells_by_dataset.png)
 
-Looking at the makeup of these top datasets, we see that most represent single tissue categories predominately. Most of these tend to be nervous system datsets with the exception of one which is balanced between many cell types.
+Looking at the makeup of these top datasets, we see that they represent single tissue categories predominantly. Most of these tend to be nervous system datasets, with the exception of one that is balanced between many cell types.
 ![Top 9 datasets are largely biased toward single cell types](../assets/old_images/cellxgene/top9_datasets_tissue_distribution.png)
 
 ## References
@@ -87,7 +87,7 @@ Our training, validation and test data, including subsets made available for tes
 * Publication Reference: Cheng et al. (2018) Cell Reports; Publication: https://doi.org/10.1016/j.celrep.2018.09.006 Dataset Version: https://datasets.cellxgene.cziscience.com/912d943b-9060-4fd3-a12c-ad641a89f0e4.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/43d4bb39-21af-4d05-b973-4c1fed7b916c
 * Publication Reference: Cowan et al. (2020) Cell; Publication: https://doi.org/10.1016/j.cell.2020.08.013 Dataset Version: https://datasets.cellxgene.cziscience.com/b1989183-5808-46ab-87f5-978febb2d26e.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/2f4c738f-e2f3-4553-9db2-0582a38ea4dc
 * Publication Reference: Cowan et al. (2020) Cell; Publication: https://doi.org/10.1016/j.cell.2020.08.013 Dataset Version: https://datasets.cellxgene.cziscience.com/c0d3867e-1a7b-4e57-af62-c563f1934226.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/2f4c738f-e2f3-4553-9db2-0582a38ea4dc
-* Publication Reference: Dom\u00ednguez Conde et al. (2022) Science; Publication: https://doi.org/10.1126/science.abl5197 Dataset Version: https://datasets.cellxgene.cziscience.com/08f58b32-a01b-4300-8ebc-2b93c18f26f7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3
+* Publication Reference: Domínguez Conde et al. (2022) Science; Publication: https://doi.org/10.1126/science.abl5197 Dataset Version: https://datasets.cellxgene.cziscience.com/08f58b32-a01b-4300-8ebc-2b93c18f26f7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3
 * Publication Reference: Easter et al. (2024) Nat Commun; Publication: https://doi.org/10.1038/s41467-024-49037-y Dataset Version: https://datasets.cellxgene.cziscience.com/221dff56-a47d-4563-90ed-51b60e2f16d5.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/71f4bccf-53d4-4c12-9e80-e73bfb89e398
 * Publication Reference: Egozi et al. (2021) Nat Med; Publication: https://doi.org/10.1038/s41591-021-01586-1 Dataset Version: https://datasets.cellxgene.cziscience.com/e3a84fef-b6df-49b2-b0ca-ecaf444773ec.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/7651ac1a-f947-463a-9223-a9e408a41989
 * Publication Reference: Elmentaite et al. (2020) Developmental Cell; Publication: https://doi.org/10.1016/j.devcel.2020.11.010 Dataset Version: https://datasets.cellxgene.cziscience.com/3aedefc0-401a-4ee8-a1b5-a0ffc20e1ff2.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/17481d16-ee44-49e5-bcf0-28c0780d8c4a
@@ -282,7 +282,7 @@ Our training, validation and test data, including subsets made available for tes
 * Publication Reference: Smillie et al. (2019) Cell; Publication: https://doi.org/10.1016/j.cell.2019.06.029 Dataset Version: https://datasets.cellxgene.cziscience.com/6c483976-30de-4835-97f0-2b9bc93614e7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/33d19f34-87f5-455b-8ca5-9023a2e5453d
 * Publication Reference: Smith et al. (2021) Proc. Natl. Acad. Sci. U.S.A.; Publication: https://doi.org/10.1073/pnas.2023333118 Dataset Version: https://datasets.cellxgene.cziscience.com/bf50dbfb-9ca0-4f0d-8deb-a1a810a0e313.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e02201d7-f49f-401f-baf0-1eb1406546c0
 * Publication Reference: Smith et al. (2021) Proc. Natl. Acad. Sci. U.S.A.; Publication: https://doi.org/10.1073/pnas.2023333118 Dataset Version: https://datasets.cellxgene.cziscience.com/ff7778bf-7a65-4d23-a9f4-b26c47926c28.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e02201d7-f49f-401f-baf0-1eb1406546c0
-* Publication Reference: Sol\u00e9-Boldo et al. (2020) Commun Biol; Publication: https://doi.org/10.1038/s42003-020-0922-4 Dataset Version: https://datasets.cellxgene.cziscience.com/bc8d7152-3b69-4153-9314-7342ae58fbde.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/c353707f-09a4-4f12-92a0-cb741e57e5f0
+* Publication Reference: Solé-Boldo et al. (2020) Commun Biol; Publication: https://doi.org/10.1038/s42003-020-0922-4 Dataset Version: https://datasets.cellxgene.cziscience.com/bc8d7152-3b69-4153-9314-7342ae58fbde.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/c353707f-09a4-4f12-92a0-cb741e57e5f0
 * Publication Reference: Stephenson et al. (2021) Nat Med; Publication: https://doi.org/10.1038/s41591-021-01329-2 Dataset Version: https://datasets.cellxgene.cziscience.com/46586a98-b75d-4557-9cc4-839fc28e67d5.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/ddfad306-714d-4cc0-9985-d9072820c530
 * Publication Reference: Stewart et al. (2019) Science; Publication: https://doi.org/10.1126/science.aat5031 Dataset Version: https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec
 * Publication Reference: Stewart et al. (2019) Science; Publication: https://doi.org/10.1126/science.aat5031 Dataset Version: https://datasets.cellxgene.cziscience.com/fe7e4408-7390-4f93-95aa-ffe472843421.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec
 
@@ -4,7 +4,7 @@ The BioNeMo Framework provides access to a variety of high-quality datasets for
 
 | **Dataset**                                              | **Modality**   | **Uses**                                         |
 | -------------------------------------------------------- | -------------- | ------------------------------------------------ |
-| [CELLxGENE](./CELLxGENE.md)                              | Single Cell    | Single-Cell Gene Expression
+| [CELLxGENE](./CELLxGENE.md)                              | Single Cell    | Single-Cell Gene Expression                      |
 | [UniProt](./uniprot.md)                                  | Protein        | Protein Sequence and Function Analysis           |
 
 For more information about the datasets included in the BioNeMo Framework, refer to the Dataset Cards linked in the table above or the original sources referenced in the respective dataset descriptions.
@@ -21,9 +21,9 @@ randomly chosen UniRef90 sequence from each.
 
 ## Data Availability
 
-Two versions of the dataset are distributed, a full training dataset (~80Gb) and a 10,000 UniRef50 cluster random slice
-(~150Mb). To load and use the sanity dataset, the [bionemo.core.data.load][bionemo.core.data.load.load] function
-can be used to materialize the sanity dataset in the BioNeMo2 cache directory:
+Two versions of the dataset are distributed, a full training dataset (~80GB) and a 10,000 UniRef50 cluster random slice
+(~150MB). To load and use the sanity dataset, use the [bionemo.core.data.load][bionemo.core.data.load.load] function
+to materialize the sanity dataset in the BioNeMo2 cache directory:
 
 ```python
 from bionemo.core.data.load import load
@@ -36,7 +36,7 @@ sanity_data_dir = load("esm2/testdata_esm2_pretrain:2.0")
 * [Sanity Dataset](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/esm2_pretrain_nemo2_testdata/files)
 * [Full Dataset]
 
-## Reference
+## References
 
 1. UniProt Consortium. (2023). UniProt: The universal protein knowledgebase in 2023. Nucleic Acids Research, 51(D1),
    D523–D531. doi:10.1093/nar/gkac1052
 
@@ -14,9 +14,9 @@ These models are ready for commercial use.
 
 ### Third-Party Community Consideration
 
-This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements
+This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements
 for this application and use case [1]; see link to [Non-NVIDIA Model Card for ESM-2 3B model](
-    https://huggingface.co/facebook/esm2_t36_3B_UR50D) and [non-NVIDIA Model Card for ESM-2 650M model](
+    https://huggingface.co/facebook/esm2_t36_3B_UR50D) and [Non-NVIDIA Model Card for ESM-2 650M model](
         https://huggingface.co/facebook/esm2_t33_650M_UR50D)
 
 ### References
@@ -27,7 +27,7 @@ Santos Costa, A., 2023. Evolutionary-scale prediction of atomic-level protein st
 
 [2] "UniProt: the universal protein knowledgebase in 2021." Nucleic acids research 49, no. D1 (2021): D480-D489.
 
-[3] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for
+[3] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for
 language understanding. arXiv preprint arXiv:1810.04805.
 
 ### Model Architecture
@@ -47,7 +47,7 @@ length 1022. Longer sequences are automatically truncated to this length.
 
 ### Output
 
-**Output Type(s):** Embeddings (Amino-acid and sequence-level)
+**Output Type(s):** Embeddings (Amino acid and sequence-level)
 
 **Output Parameters:** 1D
 
@@ -63,15 +63,15 @@ acid.
 
 **Supported Hardware Microarchitecture Compatibility**
 
-* [Ampere]
-* [Hopper]
-* [Volta]
+* NVIDIA Ampere
+* NVIDIA Hopper
+* NVIDIA Volta
 
 **[Preferred/Supported] Operating System(s)**
 
-* [Linux]
+* Linux
 
-### Model Version(s)
+### Model Versions
 
 * [esm2/650m:2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/esm2nv650m)
 * [esm2/3b:2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/esm2nv3b)
@@ -81,30 +81,30 @@ acid.
 ### Training Dataset
 
 Original ESM-2 checkpoints from HuggingFace were trained with the UniProt 2021_04 sequence database. For more details on
-the training dataset, see Lin *et al.* 2023. The train / test splits used by the original authors were not distributed.
+the training dataset, see Lin *et al.* 2023. The train/test splits used by the original authors were not distributed.
 A pre-training database compiled by NVIDIA following a similar approach is described in [UniProt
-Dataset](../datasets/uniprot.md).
+Dataset](../../datasets/uniprot.md).
 
 ### Inference
 
 **Engine:** BioNeMo, NeMo
 
 **Test Hardware**
 
-* [Ampere]
-* [Hopper]
-* [Volta]
+* NVIDIA Ampere
+* NVIDIA Hopper
+* NVIDIA Volta
 
 ## License
 
-ESM-2 is as provided under the Apache 2.0 license.
+ESM-2 is provided under the Apache 2.0 license.
 
 ## Competitive Benchmarking
 
 ### Accuracy
 
 A validation set of 328,360 UniRef50 representative sequences were randomly selected from UniRef 2024_03 (see [UniProt
-Dataset](../datasets/uniprot.md)). This validation set was used to ensure that the output of BioNeMo-converted
+Dataset](../../datasets/uniprot.md)). This validation set was used to ensure that the output of BioNeMo-converted
 checkpoints is consistent with their outputs when evaluated with the HuggingFace Transformers library.
 
 | Checkpoint | HuggingFace | BioNeMo2 | Lin *et al.* 2023                    |
@@ -123,24 +123,23 @@ checkpoints is consistent with their outputs when evaluated with the HuggingFace
 
 ![ESM-2 Single-Device Training Performance](../../assets/images/esm2/esm2_single_node_training_perf.png)
 
-The pure-pytorch baseline (compiled with `torch.compile()`) raised an out-of-memory error for batch sizes larger than 16
-at the ESM2-650M model size. The `bionemo2` model could handle batch sizes of 46, reaching a model flops utilization of
+The pure-PyTorch baseline (compiled with `torch.compile()`) raised an out-of-memory error for batch sizes larger than 16
+at the ESM2-650M model size. The `bionemo2` model could handle batch sizes of 46, reaching a model FLOPs utilization of
 59.2% on an NVIDIA A100.
 
 #### Model Scaling
 
 ![ESM-2 Model Scaling](../../assets/images/esm2/esm2_model_scaling.png)
 
 Training ESM-2 at the 650M, 3B, and 15B model variants show improved performance with the BioNeMo2 framework over the
-pure-pytorch baseline. These experiments were conducted on 16x NVIDIA A100 or 16x NVIDIA H100 GPUs split across two
+pure-PyTorch baseline. These experiments were conducted on 16x NVIDIA A100 or 16x NVIDIA H100 GPUs split across two
 nodes. <sup>*</sup>*Note:* 15B model variants were trained on 64 GPUs with the BioNeMo2 framework.
 
 #### Device Scaling
 
 ![ESM-2 Device Scaling](../../assets/images/esm2/esm2_device_scaling.png)
 
-Training ESM-3B on 256 NVIDIA A100s on 32 nodes achieved 96.85% of the theoretical linear throughput expected from
-extrapolating single-node (8 GPU) performance, representing a model flops utilization of 60.6% at 256 devices.
+Training ESM-3B on 256 NVIDIA A100s on 32 nodes achieved 96.85% of the theoretical linear throughput expected from extrapolating single-node (8 GPU) performance, representing a model FLOPs utilization of 60.6% at 256 devices.
 
 ### LoRA Fine-tuning Performace