ggcr/atlas-sync

AtlasSync

Stream documents from S3, embed them with VoyageAI, and store them in MongoDB Atlas in bulk.

Overview

atlas-sync is a tiny Go pipeline that:

  1. Loads objects from Amazon S3 (either by prefix or from an archive such as a .tgz)
  2. Embeds their contents using an embeddings provider (currently VoyageAI)
  3. Inserts each document + embedding into MongoDB Atlas
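The three stages above compose naturally as interfaces. Here is a minimal runnable sketch with in-memory stand-ins; `Loader`, `Embedder`, `Store`, and `Document` are illustrative names and signatures, not the repository's actual types:

```go
package main

import "fmt"

// Document pairs an S3 object's content with its embedding
// (illustrative struct, not the repository's actual type).
type Document struct {
	Key       string
	Content   string
	Embedding []float64
}

// The three pipeline stages as interfaces.
type Loader interface{ Load() ([]Document, error) }
type Embedder interface{ Embed(texts []string) ([][]float64, error) }
type Store interface{ InsertMany(docs []Document) error }

// Run wires the stages together: load, embed, insert.
func Run(l Loader, e Embedder, s Store) (int, error) {
	docs, err := l.Load()
	if err != nil {
		return 0, err
	}
	texts := make([]string, len(docs))
	for i, d := range docs {
		texts[i] = d.Content
	}
	vecs, err := e.Embed(texts)
	if err != nil {
		return 0, err
	}
	for i := range docs {
		docs[i].Embedding = vecs[i]
	}
	if err := s.InsertMany(docs); err != nil {
		return 0, err
	}
	return len(docs), nil
}

// In-memory stand-ins so the sketch runs without S3, VoyageAI, or Atlas.
type fakeLoader struct{}

func (fakeLoader) Load() ([]Document, error) {
	return []Document{{Key: "a.txt", Content: "hello"}, {Key: "b.txt", Content: "world"}}, nil
}

type fakeEmbedder struct{}

func (fakeEmbedder) Embed(texts []string) ([][]float64, error) {
	out := make([][]float64, len(texts))
	for i, t := range texts {
		out[i] = []float64{float64(len(t))} // toy 1-dim "embedding"
	}
	return out, nil
}

type fakeStore struct{ inserted int }

func (s *fakeStore) InsertMany(docs []Document) error { s.inserted += len(docs); return nil }

func main() {
	st := &fakeStore{}
	n, err := Run(fakeLoader{}, fakeEmbedder{}, st)
	if err != nil {
		panic(err)
	}
	fmt.Println("inserted", n, "documents")
}
```

Swapping any stand-in for a real S3 loader, VoyageAI client, or Atlas writer leaves `Run` unchanged, which is the point of keeping the stages behind interfaces.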

Note

The embedding provider interface is deliberately designed to be extended with other providers: OpenAI, Cohere, and so on.

It is intentionally small and hackable so you can adapt it to your own data source, embedding model, or destination.
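One common way to wire such an extension point is a small name-to-constructor registry keyed by the `embedding.provider` config value. This is a hypothetical sketch, not the project's actual API; the interface name, constructor shape, and `register` helper are all assumptions:

```go
package main

import "fmt"

// EmbeddingProvider is the extension point a new backend (OpenAI,
// Cohere, ...) would implement. Name and signature are illustrative
// assumptions, not the project's actual interface.
type EmbeddingProvider interface {
	Embed(texts []string) ([][]float64, error)
}

// registry maps a provider name (the `embedding.provider` config value)
// to a constructor taking the API key.
var registry = map[string]func(apiKey string) EmbeddingProvider{}

func register(name string, ctor func(string) EmbeddingProvider) { registry[name] = ctor }

// toyProvider stands in for a real backend so the sketch runs offline.
type toyProvider struct{ key string }

func (p toyProvider) Embed(texts []string) ([][]float64, error) {
	vecs := make([][]float64, len(texts))
	for i, t := range texts {
		vecs[i] = []float64{float64(len(t))} // 1-dim toy "embedding"
	}
	return vecs, nil
}

func main() {
	register("toy", func(key string) EmbeddingProvider { return toyProvider{key: key} })
	p := registry["toy"]("sk-demo")
	vecs, _ := p.Embed([]string{"hello"})
	fmt.Println("provider returned", len(vecs), "vector(s)")
}
```

A new provider then only needs one `register` call plus an `Embed` implementation; the rest of the pipeline never mentions it by name.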

Quick start

Install

git clone https://github.com/ggcr/atlas-sync
cd atlas-sync

Create a config.yaml

aws:
  # Supports either s3:// or https://<bucket>.s3[.<region>].amazonaws.com/<key>
  s3_url: "s3://my-bucket/path/to/data.tgz"  # or a prefix like s3://my-bucket/path/
  aws_region: "us-east-1"                      # optional if https URL contains region
  anonymous_client: true                       # set false to use your configured AWS creds
  batch_size: 64                               # batch size used by loaders

atlas:
  mongodb_uri: "mongodb+srv://<user>:<pass>@cluster0.mongodb.net/?retryWrites=true&w=majority"
  batch_size: 64

embedding:
  provider: "voyage"
  model: "voyage-3.5"
  # API key can be set here or via VOYAGE_API_KEY env var
  key: "${VOYAGE_API_KEY}"                    # optional if env var is set
  batch_size: 64

embedding.key can be omitted; the loader falls back to the ${PROVIDER}_API_KEY environment variable (e.g. VOYAGE_API_KEY).

Run the main pipeline

go run .
