📌 Extending MiniBase from a relational DBMS to a vector-aware DBMS supporting efficient approximate nearest neighbor (ANN) queries in high-dimensional vector spaces using Locality-Sensitive Hashing (LSH).
Traditional database systems, like MiniBase, are optimized for structured relational data but struggle with modern applications involving high-dimensional vector data such as image embeddings, NLP vectors, and recommendation systems. This project integrates a disk-based Locality Sensitive Hash Forest (LSHF) Index into MiniBase to support approximate nearest-neighbor and range queries on 100D vectors using Euclidean distance.
By designing a clustered hash-based LSH index that integrates tightly with MiniBase's heapfiles and buffer manager, we significantly reduce the query space and enhance performance for high-dimensional similarity search while maintaining compatibility with MiniBase's relational query engine.
- 💾 100D Vector Data Type Support - Custom vector data type for high-dimensional data
- 🌲 Disk-Based LSH Forest Index - Efficient approximate similarity search
- 🔍 k-NN and Range Queries - Support for both nearest neighbor and range queries using Euclidean distance
- ⚙️ Efficient RID-Based Indexing - Reduced I/O operations through smart indexing
- ✅ MiniBase Integration - Fully compatible with existing MiniBase operators
- 🧪 Comprehensive Testing - Unit tests and performance benchmarks on large datasets
- ☕ Java 21+ Support - Modern Java development environment
- 💽 HeapFile-Based Storage - Leverages MiniBase's storage engine
- 🌳 LSH Forest Implementation - Optimized for Euclidean similarity
- 🔧 Bash Automation - Scripts for batch operations and queries
- Java: OpenJDK 21 or higher
- Memory: At least 156MB of RAM
- Storage: Sufficient disk space for database files
- OS: Linux/Mac (bash shell recommended)
git clone https://github.com/Meet-Modi/minibaseDB-vector-LSH.git
cd minibaseDB-vector-LSHEnsure the .jar, batchinsert.sh, and query.sh files are in the same directory. Make the shell scripts executable:
chmod +x batchinsert.sh query.shInsert vector data into the database using the batch insert script:
./batchinsert.sh <hashesPerLayer> <numLayers> <dataSpecFile> <dbName>Example:
./batchinsert.sh 6 2 sample_data_1.txt demoRunParameters:
hashesPerLayer: Number of hash functions per LSH layernumLayers: Number of LSH layersdataSpecFile: Input data specification filedbName: Name of the database to create
Execute queries on the vector database:
./query.sh <dbName> <querySpecFile> <Y|N:useIndex> <numBuffers>Example:
./query.sh demoRun nquery1.txt Y 1000Parameters:
dbName: Name of the existing databasequerySpecFile: Query specification fileuseIndex: Y to use LSH index, N for brute forcenumBuffers: Number of buffer pages to allocate
⚠️ Important: Avoid using~for file paths in.txtquery spec files. Use relative or absolute paths instead.
NN(<vectorCol>, <targetVecFile>, <topK>, <outputColNums...>)
Parameters:
vectorCol: Column containing vector datatargetVecFile: File containing the target vector (100 space-separated integers)topK: Number of nearest neighbors to returnoutputColNums: Columns to include in the output
Range(<vectorCol>, <targetVecFile>, <distance>, <outputColNums...>)
Parameters:
vectorCol: Column containing vector datatargetVecFile: File containing the target vector (100 space-separated integers)distance: Maximum Euclidean distance thresholdoutputColNums: Columns to include in the output
Target Vector File Format: The target vector file should contain a single line with 100 space-separated integers representing the query vector.
| Component | Description |
|---|---|
| LSHFIndex.java | Core implementation of LSH Forest indexing algorithm |
| BatchInsert.java | Command-line interface for data ingestion and index creation |
| Query.java | Command-line interface for executing NN and Range queries |
| NNIndexScan.java | Iterator implementation for nearest neighbor queries |
| RSIndexScan.java | Iterator implementation for range queries |
| Vector100Dtype.java | Custom 100-dimensional vector data type |
| Sort.java | Enhanced sorting with support for RID and vector-based sorting |
| tests/ | Comprehensive unit and integration test suite |
The project includes extensive testing to ensure correctness and performance:
- Index Persistence: Verification of LSH index serialization and deserialization
- Hash Function Correctness: Validation of LSH hashing and binning algorithms
- Sorting Algorithms: Testing vector distance-based sorting operations
- Query Accuracy: Correctness verification for NN and Range queries (with and without index)
- Performance Benchmarks: Evaluation under various configurations and data sizes
# Run all tests
make test
# Run specific test categories
make test-unit
make test-integration- Search Space Reduction: Achieves 89–98% reduction in search space using LSHFIndex
- Optimal Configuration: 6 hashes per layer × 2 layers provides best performance/accuracy trade-off
- Scalable Architecture: Linear trade-offs observed between layers/hashes and I/O operations
- Memory Efficient: Performs efficiently under memory constraints with ≤1500 buffers
| Use Case | Hashes/Layer | Layers | Expected Performance |
|---|---|---|---|
| High Accuracy | 8-10 | 3-4 | 95%+ accuracy, higher I/O |
| Balanced | 6 | 2 | 89-95% accuracy, optimal I/O |
| Fast Search | 4-5 | 1-2 | 80-90% accuracy, minimal I/O |
- Union Copy Optimization: Eliminate the current bottleneck in union operations
- Adaptive Bin Sizing: Dynamic bin sizing based on data distribution (standard deviation)
- RID-Only Sorting: Remove intermediate tuple copying for better performance
- Cosine Similarity Support: Extend LSH implementation for cosine similarity queries
- Multi-Threading: Parallel query execution for improved throughput
- Compression: Vector compression techniques to reduce storage overhead