# atlas-sync

Stream documents from S3, embed them with VoyageAI, and bulk-insert them into MongoDB Atlas.

`atlas-sync` is a tiny Go pipeline that:
- Loads objects from Amazon S3 (either by prefix or from an archive format)
- Embeds their contents using an embeddings provider (currently VoyageAI)
- Inserts each document + embedding into MongoDB Atlas
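Conceptually, the three stages compose as in the sketch below. All names here are illustrative stubs, not the repo's actual API:

```go
package main

import (
	"context"
	"fmt"
)

type Document struct {
	Key     string
	Content string
}

// loadFromS3 stands in for the S3 loader (prefix or archive). Stub only.
func loadFromS3(ctx context.Context, s3URL string) ([]Document, error) {
	return []Document{{Key: "example.txt", Content: "hello"}}, nil
}

// embed stands in for the embeddings provider (e.g. VoyageAI). Stub only.
func embed(ctx context.Context, docs []Document) ([][]float32, error) {
	vecs := make([][]float32, len(docs))
	for i := range docs {
		vecs[i] = []float32{0, 0, 0} // placeholder vectors
	}
	return vecs, nil
}

// insertIntoAtlas stands in for the bulk insert into MongoDB Atlas. Stub only.
func insertIntoAtlas(ctx context.Context, docs []Document, vecs [][]float32) error {
	for i, d := range docs {
		fmt.Printf("insert %s with %d-dim embedding\n", d.Key, len(vecs[i]))
	}
	return nil
}

func main() {
	ctx := context.Background()
	docs, err := loadFromS3(ctx, "s3://my-bucket/path/")
	if err != nil {
		panic(err)
	}
	vecs, err := embed(ctx, docs)
	if err != nil {
		panic(err)
	}
	if err := insertIntoAtlas(ctx, docs, vecs); err != nil {
		panic(err)
	}
}
```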
> [!NOTE]
> The embedding provider interface is designed to be extended with other providers: OpenAI, Cohere, and so on.

The project is intentionally small and hackable, so you can adapt it to your own data source, embedding model, or destination.
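As a sketch of what extending it might look like, a new provider only needs to satisfy a small interface along these lines (the names below are hypothetical; check the repo for the actual definitions):

```go
package main

import (
	"context"
	"fmt"
)

// Embedder is the kind of interface a new provider (OpenAI, Cohere, ...)
// would implement. Hypothetical shape, for illustration only.
type Embedder interface {
	// Embed turns a batch of texts into one vector per text.
	Embed(ctx context.Context, texts []string) ([][]float32, error)
}

// fakeEmbedder is a stand-in implementation used only to show the shape.
type fakeEmbedder struct{ dim int }

func (f fakeEmbedder) Embed(_ context.Context, texts []string) ([][]float32, error) {
	out := make([][]float32, len(texts))
	for i := range texts {
		out[i] = make([]float32, f.dim) // all-zero vectors, just to satisfy the interface
	}
	return out, nil
}

func main() {
	var e Embedder = fakeEmbedder{dim: 4}
	vecs, err := e.Embed(context.Background(), []string{"hello", "world"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vecs), len(vecs[0])) // 2 4
}
```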
## Install

```bash
git clone https://github.com/ggcr/atlas-sync
cd atlas-sync
```
## Create a `config.yaml`

```yaml
aws:
  # Supports either s3:// or https://<bucket>.s3[.<region>].amazonaws.com/<key>
  s3_url: "s3://my-bucket/path/to/data.tgz" # or a prefix like s3://my-bucket/path/
  aws_region: "us-east-1" # optional if the https URL contains the region
  anonymous_client: true # set to false to use your configured AWS creds
  batch_size: 64 # batch size used by the loaders

atlas:
  mongodb_uri: "mongodb+srv://<user>:<pass>@cluster0.mongodb.net/?retryWrites=true&w=majority"
  batch_size: 64

embedding:
  provider: "voyage"
  model: "voyage-3.5"
  # API key can be set here or via the VOYAGE_API_KEY env var
  key: "${VOYAGE_API_KEY}" # optional if the env var is set
  batch_size: 64
```
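For illustration, the two accepted `s3_url` forms could be normalized to a bucket, key, and optional region roughly as below. This is a hypothetical helper written against the URL shapes stated in the config comment, not the repo's actual parser:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseS3URL accepts the s3://<bucket>/<key> and
// https://<bucket>.s3[.<region>].amazonaws.com/<key> forms.
func parseS3URL(raw string) (bucket, key, region string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", err
	}
	switch u.Scheme {
	case "s3":
		return u.Host, strings.TrimPrefix(u.Path, "/"), "", nil
	case "https":
		host := strings.TrimSuffix(u.Host, ".amazonaws.com")
		parts := strings.SplitN(host, ".s3", 2)
		if len(parts) != 2 {
			return "", "", "", fmt.Errorf("unrecognized host: %s", u.Host)
		}
		// parts[1] is either empty or ".<region>"
		return parts[0], strings.TrimPrefix(u.Path, "/"), strings.TrimPrefix(parts[1], "."), nil
	}
	return "", "", "", fmt.Errorf("unsupported scheme: %s", u.Scheme)
}

func main() {
	b, k, r, err := parseS3URL("https://my-bucket.s3.us-east-1.amazonaws.com/path/to/data.tgz")
	if err != nil {
		panic(err)
	}
	fmt.Println(b, k, r) // my-bucket path/to/data.tgz us-east-1
}
```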
`embedding.key` can be omitted; the loader then looks for `${PROVIDER}_API_KEY` (e.g. `VOYAGE_API_KEY`).
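A minimal sketch of that fallback, assuming the variable name is just the upper-cased provider name plus `_API_KEY` (`resolveAPIKey` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveAPIKey prefers the key from config and falls back to the
// ${PROVIDER}_API_KEY environment variable. Hypothetical helper.
func resolveAPIKey(provider, configured string) (string, error) {
	if configured != "" {
		return configured, nil
	}
	envVar := strings.ToUpper(provider) + "_API_KEY" // e.g. VOYAGE_API_KEY
	if key, ok := os.LookupEnv(envVar); ok && key != "" {
		return key, nil
	}
	return "", fmt.Errorf("no API key: set embedding.key or %s", envVar)
}

func main() {
	key, err := resolveAPIKey("voyage", "")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("loaded key of length", len(key))
}
```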
## Run the main pipeline

```bash
go run .
```