A powerful tool to vectorize codebases and store them in PostgreSQL with pgvector extension for semantic search and LLM integration.
- 🔍 Code Discovery: Automatically discovers and processes code files from Git repositories
- 🧠 Smart Chunking: Intelligent text chunking with configurable overlap for better context
- 🔢 Vector Embeddings: Generates embeddings using OpenAI's text-embedding-ada-002 model
- 🗄️ PostgreSQL Storage: Stores vectors in PostgreSQL with pgvector for efficient similarity search
- 🐳 Docker Support: Easy setup with Docker Compose
- 📊 Rich CLI: Beautiful command-line interface with progress tracking and statistics
- 🔐 Git Authentication: Support for private repositories with GitHub tokens
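The chunk/overlap idea behind "Smart Chunking" can be illustrated with a short sketch. This is a simplified stand-in, not the actual `FileProcessor.chunk_text()` implementation (which operates on tokenizer output), but it shows how consecutive chunks share context:

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=200):
    """Split a token sequence into chunks of at most chunk_size tokens,
    where consecutive chunks share `overlap` tokens of context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults (1000-token chunks, 200-token overlap), each chunk repeats the last 200 tokens of its predecessor, so code split across a chunk boundary still appears intact in at least one chunk.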
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Git Repo     │───▶│   Code Files    │───▶│   Text Chunks   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   PostgreSQL    │    │   OpenAI API    │
                       │   (pgvector)    │◀───│   Embeddings    │
                       └─────────────────┘    └─────────────────┘
```
- Docker and Docker Compose
- OpenAI API key
- GitHub token (for private repositories)
```bash
# Clone the repository
git clone <your-repo-url>
cd VectoriseCodeBase

# Run the startup script
./start.sh
```

The script will:
- Create a `.env` file from the template
- Prompt you to add your OpenAI API key
- Start all services automatically
- Show you how to use the API
```bash
git clone <your-repo-url>
cd VectoriseCodeBase
cp env.example .env
```

Edit `.env` with your OpenAI API key:

```bash
# Required
OPENAI_API_KEY=your_openai_api_key_here

# Optional (for private repositories)
GITHUB_TOKEN=your_github_token_here
GITHUB_USERNAME=your_github_username_here
```

Then start the services:

```bash
docker-compose up -d

# Check services
docker-compose ps

# Test API
curl http://localhost:8000/api/health
```

Visit http://localhost:8000/docs for the Swagger UI.
```bash
# Vectorize a repository
curl -X POST "http://localhost:8000/api/vectorize" \
  -H "Content-Type: application/json" \
  -d '{
    "repo_url": "https://github.com/username/repo-name",
    "username": "test_user"
  }'

# Search for code
curl -X POST "http://localhost:8000/api/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "function to parse JSON",
    "username": "test_user"
  }'
```

```bash
# Vectorize repository
make vectorize REPO_URL=https://github.com/username/repo USERNAME=test_user

# Search code
make search QUERY="authentication function" USERNAME=test_user

# List repositories
make list-repos USERNAME=test_user
```
```bash
# Public repository
python main.py vectorize --repo-url https://github.com/username/repo-name

# Private repository
python main.py vectorize --repo-url https://github.com/username/repo-name --github-token your_token

# Custom repository name
python main.py vectorize --repo-url https://github.com/username/repo-name --repo-name my-custom-name

# List repositories
python main.py list-repos

# Show repository statistics
python main.py stats repo-name

# Delete a repository
python main.py delete repo-name
```

```bash
# Development mode (auto-reload)
make server-dev

# Production mode
make server
```

- `POST /api/vectorize` - Start vectorizing a repository
- `GET /api/job/{job_id}` - Get job status and progress
- `POST /api/search` - Search for code using semantic similarity
- `GET /api/user/{username}/repos` - Get a user's repositories
- `DELETE /api/user/{username}/repo/{repo_name}` - Delete a repository
- `GET /api/health` - Health check
```python
import requests

# Start vectorization
response = requests.post("http://localhost:8000/api/vectorize", json={
    "repo_url": "https://github.com/username/repo",
    "username": "john_doe"
})
job_id = response.json()["job_id"]

# Check status
status = requests.get(f"http://localhost:8000/api/job/{job_id}").json()

# Search code
results = requests.post("http://localhost:8000/api/search", json={
    "query": "authentication function",
    "username": "john_doe"
}).json()
```

Visit http://localhost:8000/docs for interactive Swagger documentation.
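Since vectorization runs as a background job, a client typically polls the job endpoint until it reaches a terminal state. A small sketch, assuming the job payload carries a `status` field with the same values documented for repositories (pending, processing, completed, failed); the status callable is injected so the loop itself stays testable:

```python
import time

def wait_for_job(get_status, poll_interval=2.0, timeout=600.0):
    """Poll a zero-argument callable returning a job dict, e.g.
    {"status": "pending" | "processing" | "completed" | "failed", ...},
    until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status()
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")
```

In practice `get_status` would be `lambda: requests.get(f"http://localhost:8000/api/job/{job_id}").json()`.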
- `repositories`: Stores repository metadata
  - `id`: Primary key
  - `repo_name`: Repository name
  - `repo_url`: Repository URL
  - `clone_path`: Local clone path
  - `status`: Processing status (pending, processing, completed, failed)
  - `created_at`, `updated_at`: Timestamps
- `code_files`: Stores information about code files
  - `id`: Primary key
  - `repository_id`: Foreign key to `repositories`
  - `file_path`: Relative file path
  - `file_name`: File name
  - `file_extension`: File extension
  - `file_size`: File size in bytes
  - `content_hash`: SHA-256 hash of the file content
- `code_chunks`: Stores code chunks with embeddings
  - `id`: Primary key
  - `file_id`: Foreign key to `code_files`
  - `chunk_index`: Chunk index within the file
  - `content`: Text content
  - `start_line`, `end_line`: Line numbers
  - `token_count`: Number of tokens
  - `embedding`: Vector embedding (1536 dimensions)
  - `created_at`: Timestamp
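The `content_hash` column makes change detection cheap: a file only needs re-embedding when its hash changes. A minimal sketch of the hash computation, assuming the standard SHA-256 hex digest:

```python
import hashlib

def content_hash(file_bytes: bytes) -> str:
    """SHA-256 hex digest of a file's raw bytes,
    as stored in code_files.content_hash."""
    return hashlib.sha256(file_bytes).hexdigest()
```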
- Vector similarity index on `code_chunks.embedding` using IVFFlat
- Indexes on foreign keys and frequently queried columns
| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql://vectorize_user:vectorize_password@localhost:5432/vectorize_db` |
| `OPENAI_API_KEY` | OpenAI API key | Required |
| `OPENAI_MODEL` | OpenAI embedding model | `text-embedding-ada-002` |
| `GITHUB_TOKEN` | GitHub token for private repos | Optional |
| `GITHUB_USERNAME` | GitHub username | Optional |
| `CHUNK_SIZE` | Maximum tokens per chunk | `1000` |
| `CHUNK_OVERLAP` | Token overlap between chunks | `200` |
| `MAX_FILE_SIZE` | Maximum file size in bytes | `1048576` (1 MB) |
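Reading these variables with the documented defaults can be sketched as follows. This mirrors the defaults in the table above but is not necessarily how `config.py` is written:

```python
import os

def load_settings():
    """Read the documented settings from the environment,
    falling back to the defaults listed in the table above."""
    return {
        "OPENAI_MODEL": os.environ.get("OPENAI_MODEL", "text-embedding-ada-002"),
        "CHUNK_SIZE": int(os.environ.get("CHUNK_SIZE", "1000")),
        "CHUNK_OVERLAP": int(os.environ.get("CHUNK_OVERLAP", "200")),
        "MAX_FILE_SIZE": int(os.environ.get("MAX_FILE_SIZE", "1048576")),
    }
```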
The tool supports a wide range of programming languages and file types:
- Programming Languages: Python, JavaScript, TypeScript, Java, C++, C#, PHP, Ruby, Go, Rust, Swift, Kotlin, Scala, Clojure, Haskell, ML, F#, R, MATLAB
- Web Technologies: HTML, CSS, SCSS, SASS, Vue, Svelte, Astro
- Configuration: YAML, JSON, XML, TOML, INI, CFG
- Documentation: Markdown, RST, TeX
- Shell Scripts: Bash, Zsh, Fish, PowerShell
- Others: SQL, Vim, Lisp, Emacs Lisp
The vectorized code can be used with LLMs for:
- Code Search: Find relevant code snippets using semantic similarity
- Code Generation: Use code context for better code generation
- Documentation: Generate documentation from code
- Refactoring: Identify similar code patterns
- Bug Detection: Find similar bug patterns
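For search, pgvector's `<=>` operator computes cosine *distance*, so the expression `1 - (a <=> b)` used in queries is cosine *similarity* (1.0 for identical directions, 0.0 for orthogonal vectors). The same computation in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two vectors, matching pgvector's 1 - (a <=> b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Chunks scoring above a threshold (0.8 in the SQL example) are treated as relevant matches.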
```sql
-- Find similar code chunks
SELECT
    cc.content,
    cf.file_path,
    cc.start_line,
    cc.end_line,
    1 - (cc.embedding <=> '[query_embedding]') AS similarity
FROM code_chunks cc
JOIN code_files cf ON cc.file_id = cf.id
WHERE 1 - (cc.embedding <=> '[query_embedding]') > 0.8
ORDER BY similarity DESC
LIMIT 10;
```

```
VectoriseCodeBase/
├── docker-compose.yml    # Complete stack (DB + API)
├── Dockerfile            # API container
├── start.sh              # One-command startup script
├── init.sql              # Database initialization
├── requirements.txt      # Python dependencies
├── config.py             # Configuration management
├── database.py           # Database models and connection
├── git_manager.py        # Git repository management
├── file_processor.py     # File discovery and processing
├── embedding_service.py  # OpenAI embedding service
├── vectorizer.py         # Main vectorization logic
├── server.py             # FastAPI server (main API)
├── main.py               # CLI application (optional)
├── search.py             # Semantic search utility (optional)
├── client_example.py     # API client example (optional)
├── test_setup.py         # Setup verification (optional)
├── Makefile              # Easy commands
├── env.example           # Environment template
├── API_DOCUMENTATION.md  # API documentation
├── PRODUCT_GUIDE.md      # Business/product guide
└── README.md             # This file
```
- New File Types: Add extensions to `Config.SUPPORTED_EXTENSIONS`
- Custom Chunking: Modify `FileProcessor.chunk_text()`
- Different Embeddings: Extend `EmbeddingService` for other providers
- Additional Metadata: Add columns to the database models
- Database Connection Failed
  - Ensure Docker containers are running: `docker-compose ps`
  - Check the database logs: `docker-compose logs postgres`
- OpenAI API Error
  - Verify the API key is correct
  - Check API quota and billing
- Git Clone Failed
  - For private repos, ensure the GitHub token has repo access
  - Check the repository URL format
- Memory Issues
  - Reduce `CHUNK_SIZE` in the configuration
  - Process smaller repositories first
Enable debug logging by changing the logging level in the Python files:

```python
logging.basicConfig(level=logging.DEBUG)
```

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Review the logs
- Open an issue on GitHub
Happy Vectorizing! 🚀