konsulin-care/snomed-vectorizer

Getting Started

This repository provides a full pipeline for preparing and vectorizing SNOMED CT data using a combination of R and Python.

Overview

  • Data Cleaning: Handled in R using targets, Quarto, and tidyverse tools.
  • Vectorization: Performed in Python using Google’s text-embedding-004 model.
  • Storage: Vectorized SNOMED CT records are upserted into a Pinecone index for later use.

The purpose of vectorizing SNOMED CT is to support more effective structuring of unstructured clinical text. By transforming clinical terms into dense vector representations, we make it easier to match, classify, or retrieve relevant medical concepts from free-text medical records.
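As a rough illustration of how dense vectors support matching, the sketch below picks the concept whose vector is closest to a query vector by cosine similarity. The three-dimensional vectors and the helper names (`cosine_similarity`, `nearest_concept`) are made up for this example. Real embeddings come from the embedding model and have many more dimensions, and in this project the similarity search would be handled by the Pinecone index rather than by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_concept(query_vec, concept_vecs):
    """Return the label whose vector is most similar to the query vector."""
    return max(
        concept_vecs,
        key=lambda label: cosine_similarity(query_vec, concept_vecs[label]),
    )


# Toy vectors, invented for illustration only (not real embeddings)
concepts = {
    "Myocardial infarction": [0.9, 0.1, 0.0],
    "Fracture of femur": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of "heart attack"

best = nearest_concept(query, concepts)
print(best)
```

With these toy values the query lands nearest to "Myocardial infarction", which is the behavior free-text matching relies on.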


Requirements

Before you begin, make sure the tools used in the steps below are installed on your system: R (optionally with RStudio), Python, and Git.

Setup Instructions

Clone the Repository

You can use RStudio to fork and clone this repository easily. Refer to this guide (start from Step 2).

Alternatively, use Git in the terminal:

git clone https://github.com/your-username/your-repo.git
cd your-repo

Set Up the R Environment

Open RStudio and run:

install.packages("renv")  # Only required once
renv::restore()           # Install all required R packages

Restart your R session after all packages are installed.

Place the Raw Data

Obtain two files from your SNOMED CT release:

  • description.txt → Contains the description of SNOMED CT terms
  • relationship.txt → Contains the relationship of SNOMED CT terms

Place both files, named exactly as shown, in the following directory:

data/
└── raw/
    ├── description.txt
    └── relationship.txt

This directory structure is required for the targets pipeline to run properly.
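Before running the pipeline, it may help to confirm that both files are where the pipeline expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_raw_files` is hypothetical, and it assumes you run it from the repository root so that `data/raw` resolves correctly.

```python
from pathlib import Path

# File names taken from the directory layout above
REQUIRED_FILES = ("description.txt", "relationship.txt")


def missing_raw_files(raw_dir="data/raw"):
    """Return the required file names that are not present in raw_dir."""
    raw = Path(raw_dir)
    return [name for name in REQUIRED_FILES if not (raw / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing from data/raw/:", ", ".join(missing))
    else:
        print("Both raw files are in place.")
```

An empty list means the layout matches what the targets pipeline needs.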

Run the Data Pipeline

Run the pipeline to clean and process the SNOMED CT data:

targets::tar_make()

This will execute the steps defined in _targets.R, and the cleaned dataset will be saved to data/processed/.


Vectorization with Python

Once data cleaning is complete, run the Python script to vectorize and upsert the data to Pinecone:

python src/python/upsert.py

Make sure your .env file contains valid credentials for the Google Generative AI and Pinecone APIs.
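A quick pre-flight check can catch empty or missing credentials before any API call is made. The sketch below is an assumption-heavy illustration: the variable names `GOOGLE_API_KEY` and `PINECONE_API_KEY` are hypothetical (check `src/python/upsert.py` for the names it actually reads), and it inspects the process environment, so it assumes the values from .env have already been loaded into it.

```python
import os

# Hypothetical variable names; replace with the ones upsert.py actually reads
REQUIRED_VARS = ("GOOGLE_API_KEY", "PINECONE_API_KEY")


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All expected credentials are set.")
```

This only confirms that the values are present. Whether they are valid is decided by the APIs themselves on the first request.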

About

Enabling semantic search of unstructured text input by transforming SNOMED CT terms into vectors.
