ggcr/atlas-sync

AtlasSync

Stream documents from S3, embed them with VoyageAI, and store them in MongoDB Atlas in bulk.

Overview

atlas-sync is a tiny Go pipeline that:

  1. Loads objects from Amazon S3 (either by prefix or from an archive such as a .tgz)
  2. Embeds their contents using an embeddings provider (currently VoyageAI)
  3. Inserts each document + embedding into MongoDB Atlas
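The three stages above compose naturally as interfaces. Here is a minimal runnable sketch with in-memory stand-ins; `Loader`, `Embedder`, `Store`, and `Document` are illustrative names and signatures, not the repository's actual types:

```go
package main

import "fmt"

// Document pairs an S3 object's content with its embedding
// (illustrative struct, not the repository's actual type).
type Document struct {
	Key       string
	Content   string
	Embedding []float64
}

// The three pipeline stages as interfaces.
type Loader interface{ Load() ([]Document, error) }
type Embedder interface{ Embed(texts []string) ([][]float64, error) }
type Store interface{ InsertMany(docs []Document) error }

// Run wires the stages together: load, embed, insert.
func Run(l Loader, e Embedder, s Store) (int, error) {
	docs, err := l.Load()
	if err != nil {
		return 0, err
	}
	texts := make([]string, len(docs))
	for i, d := range docs {
		texts[i] = d.Content
	}
	vecs, err := e.Embed(texts)
	if err != nil {
		return 0, err
	}
	for i := range docs {
		docs[i].Embedding = vecs[i]
	}
	if err := s.InsertMany(docs); err != nil {
		return 0, err
	}
	return len(docs), nil
}

// In-memory stand-ins so the sketch runs without S3, VoyageAI, or Atlas.
type fakeLoader struct{}

func (fakeLoader) Load() ([]Document, error) {
	return []Document{{Key: "a.txt", Content: "hello"}, {Key: "b.txt", Content: "world"}}, nil
}

type fakeEmbedder struct{}

func (fakeEmbedder) Embed(texts []string) ([][]float64, error) {
	out := make([][]float64, len(texts))
	for i, t := range texts {
		out[i] = []float64{float64(len(t))} // toy 1-dim "embedding"
	}
	return out, nil
}

type fakeStore struct{ inserted int }

func (s *fakeStore) InsertMany(docs []Document) error { s.inserted += len(docs); return nil }

func main() {
	st := &fakeStore{}
	n, err := Run(fakeLoader{}, fakeEmbedder{}, st)
	if err != nil {
		panic(err)
	}
	fmt.Println("inserted", n, "documents")
}
```

Swapping any stand-in for a real S3 loader, VoyageAI client, or Atlas writer leaves `Run` unchanged, which is the point of keeping the stages behind interfaces.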

Note

The embedding provider interface is deliberately designed to be extended with other providers: OpenAI, Cohere, and so on.

It is intentionally small and hackable so you can adapt it to your own data source, embedding model, or destination.
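One common way to wire such an extension point is a small name-to-constructor registry keyed by the `embedding.provider` config value. This is a hypothetical sketch, not the project's actual API; the interface name, constructor shape, and `register` helper are all assumptions:

```go
package main

import "fmt"

// EmbeddingProvider is the extension point a new backend (OpenAI,
// Cohere, ...) would implement. Name and signature are illustrative
// assumptions, not the project's actual interface.
type EmbeddingProvider interface {
	Embed(texts []string) ([][]float64, error)
}

// registry maps a provider name (the `embedding.provider` config value)
// to a constructor taking the API key.
var registry = map[string]func(apiKey string) EmbeddingProvider{}

func register(name string, ctor func(string) EmbeddingProvider) { registry[name] = ctor }

// toyProvider stands in for a real backend so the sketch runs offline.
type toyProvider struct{ key string }

func (p toyProvider) Embed(texts []string) ([][]float64, error) {
	vecs := make([][]float64, len(texts))
	for i, t := range texts {
		vecs[i] = []float64{float64(len(t))} // 1-dim toy "embedding"
	}
	return vecs, nil
}

func main() {
	register("toy", func(key string) EmbeddingProvider { return toyProvider{key: key} })
	p := registry["toy"]("sk-demo")
	vecs, _ := p.Embed([]string{"hello"})
	fmt.Println("provider returned", len(vecs), "vector(s)")
}
```

A new provider then only needs one `register` call plus an `Embed` implementation; the rest of the pipeline never mentions it by name.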

Quick start

Install

git clone https://github.com/ggcr/atlas-sync
cd atlas-sync

Create a config.yaml

aws:
  # Supports either s3:// or https://<bucket>.s3[.<region>].amazonaws.com/<key>
  s3_url: "s3://my-bucket/path/to/data.tgz"  # or a prefix like s3://my-bucket/path/
  aws_region: "us-east-1"                      # optional if https URL contains region
  anonymous_client: true                       # set false to use your configured AWS creds
  batch_size: 64                               # batch size used by loaders

atlas:
  mongodb_uri: "mongodb+srv://<user>:<pass>@cluster0.mongodb.net/?retryWrites=true&w=majority"
  batch_size: 64

embedding:
  provider: "voyage"
  model: "voyage-3.5"
  # API key can be set here or via VOYAGE_API_KEY env var
  key: "${VOYAGE_API_KEY}"                    # optional if env var is set
  batch_size: 64

embedding.key can be omitted; the loader falls back to the ${PROVIDER}_API_KEY environment variable (e.g. VOYAGE_API_KEY).

Run the main pipeline

go run .
