This repository provides a full pipeline for preparing and vectorizing
SNOMED CT data using a combination of R and Python.
- Data Cleaning: Handled in R using targets, quarto, and tidyverse tools.
- Vectorization: Performed in Python using Google's text-embedding-004 model.
- Storage: Vectorized SNOMED CT records are upserted into a Pinecone index for later use.
The purpose of vectorizing SNOMED CT is to support more effective structuring of unstructured clinical text. By transforming clinical terms into dense vector representations, we make it easier to match, classify, or retrieve relevant medical concepts from free-text medical records.
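To make the matching idea concrete, here is a minimal sketch of nearest-concept lookup by cosine similarity. It is an illustration only: the three-dimensional vectors and the concept names are made up, and in this project the lookup would presumably be served by the Pinecone index rather than by code like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def nearest_concept(query_vector, concept_vectors):
    """Return the (concept_id, score) pair with the highest similarity."""
    scored = (
        (concept_id, cosine_similarity(query_vector, vector))
        for concept_id, vector in concept_vectors.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy vectors, invented for illustration; real embeddings are much longer.
concept_vectors = {
    "concept_a": [0.9, 0.1, 0.0],
    "concept_b": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for an embedded free-text phrase

best_id, best_score = nearest_concept(query_vector, concept_vectors)
print(best_id, round(best_score, 3))
```

The free-text phrase and the SNOMED CT terms must be embedded with the same model for the similarity scores to be meaningful.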
Before you begin, make sure the following tools are installed on your system:
- git
- R
- quarto
- RStudio IDE (optional but recommended)
You can use RStudio to fork and clone this repository easily. Refer to this guide (start from Step 2).
Alternatively, use Git in the terminal:
git clone https://github.com/your-username/your-repo.git
cd your-repo
Open RStudio and run:
install.packages("renv") # Only required once
renv::restore() # Install all required R packages
Restart your R session after all packages are installed.
Get two files from your SNOMED CT release:
- description.txt → Contains the descriptions of SNOMED CT terms
- relationship.txt → Contains the relationships between SNOMED CT terms
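As a rough picture of what the description file holds, the sketch below reads tab-separated rows and groups active terms by concept. It assumes the standard RF2 column layout (id, effectiveTime, active, moduleId, conceptId, languageCode, typeId, term, caseSignificanceId); the sample rows and all identifiers in them are placeholders, and the actual cleaning in this repository is done by the R pipeline, not by this code.

```python
import csv
import io

# Placeholder rows in the assumed RF2 description layout (not real SNOMED CT data).
SAMPLE = (
    "id\teffectiveTime\tactive\tmoduleId\tconceptId\tlanguageCode\ttypeId\tterm\tcaseSignificanceId\n"
    "D1\t20240101\t1\tM1\tC1\ten\tT1\tExample term one\tS1\n"
    "D2\t20240101\t0\tM1\tC1\ten\tT1\tRetired example term\tS1\n"
    "D3\t20240101\t1\tM1\tC2\ten\tT1\tExample term two\tS1\n"
)


def active_terms(handle):
    """Group the terms of active description rows by conceptId."""
    # RF2 files are tab-delimited and unquoted, so quoting is disabled.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    terms = {}
    for row in reader:
        if row["active"] == "1":
            terms.setdefault(row["conceptId"], []).append(row["term"])
    return terms


terms_by_concept = active_terms(io.StringIO(SAMPLE))
for concept_id, terms in terms_by_concept.items():
    print(concept_id, terms)
```

To try it on a real file, pass an open file handle instead of the in-memory sample, for example `open("data/raw/description.txt", encoding="utf-8", newline="")`.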
Place both files in the following directory:
data/
└── raw/
    ├── description.txt
    └── relationship.txt
This directory structure is required for the targets pipeline to run properly.
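Before running the pipeline, it can help to confirm that both files are where they are expected. The following is a small optional check, not part of the repository; the function name `missing_raw_files` is hypothetical, and the paths simply mirror the layout shown above.

```python
from pathlib import Path

REQUIRED_FILES = ("description.txt", "relationship.txt")


def missing_raw_files(project_root="."):
    """Return the required file names that are absent from data/raw/."""
    raw_dir = Path(project_root) / "data" / "raw"
    return [name for name in REQUIRED_FILES if not (raw_dir / name).is_file()]


missing = missing_raw_files()  # run from the repository root
if missing:
    print("Missing from data/raw/:", ", ".join(missing))
else:
    print("All required raw files are in place.")
```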
Run the pipeline to clean and process the SNOMED CT data:
targets::tar_make()
This will execute the steps defined in _targets.R, and the cleaned dataset will be saved to data/processed/.
Once data cleaning is complete, run the Python script to vectorize and upsert the data to Pinecone:
python src/python/upsert.py
Make sure your .env file contains valid credentials for the Google Generative AI and Pinecone APIs.
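The sketch below shows one way the credential check and a batched embed-then-upsert loop could look. It is not the contents of src/python/upsert.py. The environment variable names `GOOGLE_API_KEY` and `PINECONE_API_KEY`, the helper names, and the batch size are all assumptions; `embed` is any callable that turns text into a vector, and `index` is any object with an `upsert(vectors=...)` method, which is how a Pinecone index handle is assumed to be used here.

```python
import os

# Assumed variable names; check your own .env file for the real ones.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "PINECONE_API_KEY")


def missing_credentials(env=None):
    """Return the required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_records(records, embed, index, batch_size=100):
    """Embed (concept_id, term) pairs and upsert them batch by batch.

    Returns the number of vectors sent to the index.
    """
    total = 0
    for batch in batches(records, batch_size):
        vectors = [
            {"id": concept_id, "values": embed(term), "metadata": {"term": term}}
            for concept_id, term in batch
        ]
        index.upsert(vectors=vectors)
        total += len(vectors)
    return total


if missing_credentials():
    print("Missing credentials:", ", ".join(missing_credentials()))
```

Passing `embed` and `index` in as arguments keeps the loop independent of any particular client library, so it can be exercised with stand-ins before real API calls are made.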