Getting Started

This repository provides a full pipeline for preparing and vectorizing SNOMED CT data using a combination of R and Python.

Overview

Data Cleaning: Handled in R using targets, quarto, and tidyverse tools.
Vectorization: Performed in Python using Google’s text-embedding-004 model.
Storage: Vectorized SNOMED CT records are upserted into a Pinecone index for later use.

The purpose of vectorizing SNOMED CT is to support more effective structuring of unstructured clinical text. By transforming clinical terms into dense vector representations, we make it easier to match, classify, or retrieve relevant medical concepts from free-text medical records.
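To make the matching idea concrete, here is a minimal sketch of how a free-text phrase could be matched to a concept by cosine similarity. The three-dimensional vectors are made up for illustration; they are not output from text-embedding-004, and the repository's own matching logic may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors invented for this example; real embeddings have many more dimensions
concept_vectors = {
    "Myocardial infarction": [0.9, 0.1, 0.0],
    "Fracture of femur": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for the embedding of "heart attack"

# Pick the concept whose vector points in the most similar direction
best_match = max(
    concept_vectors,
    key=lambda term: cosine_similarity(query_vector, concept_vectors[term]),
)
print(best_match)
```

In practice a vector index such as Pinecone performs this nearest-neighbour search over all stored concepts.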

Requirements

Before you begin, make sure the following tools are installed on your system:

git
R
quarto
RStudio IDE (optional but recommended)
Python (for the vectorization step)

Setup Instructions

Clone the Repository

You can use RStudio to fork and clone this repository easily. Refer to this guide (start from Step 2).

Alternatively, use Git in the terminal:

git clone https://github.com/your-username/your-repo.git
cd your-repo

Set Up the R Environment

Open RStudio and run:

install.packages("renv")  # Only required once
renv::restore()           # Install all required R packages

Restart your R session after all packages are installed.

Place the Raw Data

Obtain two files from your SNOMED CT release:

description.txt: contains the descriptions of SNOMED CT terms
relationship.txt: contains the relationships between SNOMED CT terms

Place both files in the following directory:

data/
└── raw/
    ├── description.txt
    └── relationship.txt

This directory structure is required for the targets pipeline to run properly.
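Before running the pipeline, it can help to confirm that both files are where they are expected. The following is a small sketch, not part of the repository; the helper name `missing_raw_files` is hypothetical, and it assumes you run it from the repository root.

```python
from pathlib import Path

# File names taken from the directory layout described above
EXPECTED_FILES = ("description.txt", "relationship.txt")

def missing_raw_files(raw_dir="data/raw"):
    """Return the expected file names that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in EXPECTED_FILES if not (raw_path / name).is_file()]

if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing raw files:", ", ".join(missing))
    else:
        print("All raw files are in place.")
```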

Run the Data Pipeline

Run the pipeline to clean and process the SNOMED CT data:

targets::tar_make()

This will execute the steps defined in _targets.R, and the cleaned dataset will be saved to data/processed/.

Vectorization with Python

Once data cleaning is complete, run the Python script to vectorize and upsert the data to Pinecone:

python src/python/upsert.py
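The general shape of an embed-then-upsert step can be sketched as follows. This is an illustration under stated assumptions, not the contents of upsert.py: the record fields `id` and `term`, the batch size, and the `embed` and `upsert` callables are all hypothetical placeholders for the real embedding and Pinecone calls.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def vectorize_and_upsert(records, embed, upsert, batch_size=100):
    """Embed each record's term and pass the vectors to `upsert` batch by batch.

    `embed` maps a list of strings to a list of vectors; `upsert` receives a
    list of {"id", "values", "metadata"} dicts. Both are placeholders.
    """
    total = 0
    for batch in batched(records, batch_size):
        vectors = embed([record["term"] for record in batch])
        payload = [
            {"id": record["id"], "values": vector, "metadata": {"term": record["term"]}}
            for record, vector in zip(batch, vectors)
        ]
        upsert(payload)
        total += len(payload)
    return total

# Stand-ins so the sketch runs without network access or API keys
fake_embed = lambda texts: [[float(len(text))] for text in texts]
stored = []
records = [
    {"id": "1", "term": "Asthma"},  # placeholder ids, not real SNOMED CT identifiers
    {"id": "2", "term": "Fever"},
    {"id": "3", "term": "Cough"},
]
count = vectorize_and_upsert(records, fake_embed, stored.extend, batch_size=2)
print(count, len(stored))
```

Batching keeps each request small, which is a common way to stay within API payload and rate limits.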

Make sure your .env file contains valid credentials for the Google Generative AI and Pinecone APIs.
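A quick check for missing credentials might look like the sketch below. The variable names `GOOGLE_API_KEY` and `PINECONE_API_KEY` are assumptions; consult the repository's .env-example file for the names the script actually expects. The check reads the process environment, so the values from .env need to be loaded or exported first.

```python
import os

# Assumed variable names; the ones the script actually reads may differ
REQUIRED_VARS = ("GOOGLE_API_KEY", "PINECONE_API_KEY")

def missing_credentials(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All expected credentials are set.")
```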
