A CLI tool for stress testing MindsDB's Knowledge Base feature through semantic codebase navigation. Ingests codebases and enables natural language search using MindsDB's embedding and reranking capabilities.
The Semantic Code Navigator is a CLI application that transforms codebase navigation through semantic search. It clones GitHub repositories, extracts functions and classes from multiple programming languages, and ingests them into MindsDB Knowledge Bases with rich metadata. Users can perform natural language queries with advanced filtering capabilities.
The application demonstrates MindsDB Knowledge Base capabilities including CREATE KNOWLEDGE_BASE with OpenAI embedding models, batch INSERT operations, complex SELECT queries with metadata filtering, and CREATE INDEX for performance optimization.
- Semantic code search with natural language queries
- Metadata filtering by language, file path, function name, repository
- Batch processing for large codebases
- Multiple output formats (table, JSON, compact)
- Progress tracking with rich CLI interface
- Code purpose classification
- Natural language explanations
- Automated docstring generation
- Test case suggestions
- Search result rationale
- Concurrent query testing
- Performance benchmarking
- Scalability analysis
- Error rate monitoring
- MindsDB: Install and run MindsDB locally or use MindsDB Cloud

      # Local installation
      pip install mindsdb

      # Docker
      docker-compose up

- OpenAI API Key: Required for embeddings and reranking
  - Get your API key from the OpenAI Platform

- Clone the repository:

      git clone https://github.com/Deeptanshu-sankhwar/semantic-code-navigator.git
      cd semantic-code-navigator

- Install dependencies:

      pip install -r requirements.txt

- Configure environment:

      cp env.example .env
      # Edit .env with your configuration

- Set environment variables:

      # Required
      OPENAI_API_KEY=sk-your-openai-api-key-here

      # Optional (for local MindsDB)
      MINDSDB_HOST=127.0.0.1
      MINDSDB_PORT=47334

      # Optional (for MindsDB Cloud)
      MINDSDB_USER=your-email@example.com
      MINDSDB_PASSWORD=your-password
python main.py kb:init --validate-config
Options:
- `--force`: Recreate the knowledge base if it exists
- `--validate-config`: Validate configuration before creation
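Under the hood, `kb:init` roughly corresponds to a `CREATE KNOWLEDGE_BASE` statement issued against MindsDB. The following is a minimal sketch of that call through the MindsDB Python SDK; the knowledge base name (`codebase_kb`), the column lists, and the exact `USING` parameters are illustrative and may differ from the tool's actual configuration and your MindsDB version.

```python
# Minimal sketch (assumed names/parameters): create a knowledge base with
# OpenAI embeddings and reranking via the MindsDB Python SDK.
# Assumes MindsDB can reach your OpenAI API key (e.g. via its own config).
import mindsdb_sdk

server = mindsdb_sdk.connect("http://127.0.0.1:47334")

server.query("""
CREATE KNOWLEDGE_BASE codebase_kb
USING
    embedding_model = {"provider": "openai", "model_name": "text-embedding-3-large"},
    reranking_model = {"provider": "openai", "model_name": "gpt-4o"},
    content_columns = ['content'],
    metadata_columns = ['filepath', 'language', 'function_name', 'repo'];
""").fetch()
```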
# Basic search
python main.py kb:query "authentication middleware"
# With filters
python main.py kb:query "database connection" --language python --limit 20
# With AI analysis
python main.py kb:query "error handling" --ai-all
# Different output formats
python main.py kb:query "JWT validation" --output-format json
Search Options:
- `--language, -l`: Filter by programming language
- `--filepath, -f`: Filter by file path pattern
- `--function`: Filter by function name
- `--repo, -r`: Filter by repository name
- `--limit`: Maximum number of results (default: 10)
- `--relevance-threshold`: Minimum relevance score (0.0-1.0)
- `--output-format`: Output format (table, json, compact)
- `--ai-purpose`: Add AI purpose classification
- `--ai-explain`: Add AI code explanations
- `--ai-docstring`: Add AI-generated docstrings
- `--ai-tests`: Add AI test case suggestions
- `--ai-all`: Add all AI analysis
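These flags map onto the semantic `SELECT` the tool runs against the knowledge base: the natural-language query goes into the `content` condition, and metadata filters become additional `WHERE` clauses. A rough sketch, assuming a knowledge base named `codebase_kb` and MindsDB's knowledge base query semantics:

```python
# Rough sketch (assumed KB and column names): semantic search with metadata
# filtering and a relevance cutoff, mirroring the CLI flags above.
import mindsdb_sdk

server = mindsdb_sdk.connect("http://127.0.0.1:47334")

results = server.query("""
SELECT chunk_content, metadata, relevance
FROM codebase_kb
WHERE content = 'database connection'   -- natural language query
  AND language = 'python'               -- --language python
  AND relevance >= 0.7                  -- --relevance-threshold 0.7
LIMIT 20;                               -- --limit 20
""").fetch()

print(results.head())
```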
The `kb:ingest` command clones Git repositories, parses code files to extract functions and classes, and inserts the resulting code chunks into the knowledge base with comprehensive metadata.
# Basic ingestion
python main.py kb:ingest https://github.com/org/repo-name.git
# Advanced ingestion with full options
python main.py kb:ingest https://github.com/org/repo.git \
--branch develop \
--extensions "py,js,ts,java,go" \
--exclude-dirs "node_modules,__pycache__,build" \
--batch-size 500 \
--extract-git-info \
--generate-summaries
# Preview ingestion without inserting data
python main.py kb:ingest https://github.com/org/repo.git --dry-run
- Repository Cloning: Clones the specified Git repository and branch
- File Discovery: Scans for files matching specified extensions
- Code Parsing: Extracts functions, classes, and methods using AST parsing
- Metadata Extraction: Collects file paths, languages, function names, and Git information
- Batch Processing: Inserts code chunks in configurable batch sizes for optimal performance
- Progress Reporting: Provides real-time feedback on ingestion progress
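The batch-processing stage above amounts to grouping the extracted chunks and inserting each group with a single `INSERT` statement. A simplified sketch of what that might look like, assuming the `codebase_kb` knowledge base and illustrative chunk dictionaries (not the tool's actual implementation):

```python
# Simplified sketch (assumed KB name and record shape): insert extracted
# chunks into the knowledge base in fixed-size batches.
import mindsdb_sdk

server = mindsdb_sdk.connect("http://127.0.0.1:47334")

chunks = [
    {"content": "def get(url): ...", "filepath": "requests/api.py",
     "language": "python", "function_name": "get", "repo": "psf/requests"},
    # ... one dict per extracted function/class
]

BATCH_SIZE = 500

def sql_quote(value: str) -> str:
    """Escape single quotes so values can be embedded in a SQL literal."""
    return value.replace("'", "''")

for start in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[start:start + BATCH_SIZE]
    rows = ",\n".join(
        "('{content}', '{filepath}', '{language}', '{function_name}', '{repo}')".format(
            **{key: sql_quote(val) for key, val in chunk.items()}
        )
        for chunk in batch
    )
    server.query(
        "INSERT INTO codebase_kb (content, filepath, language, function_name, repo)\n"
        f"VALUES {rows};"
    ).fetch()
```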
Option | Description | Default | Example |
---|---|---|---|
`--branch, -b` | Git branch to clone | main | `--branch develop` |
`--extensions` | Comma-separated file extensions | py,js,ts,java,go,rs,cpp,c,h | `--extensions "py,js"` |
`--exclude-dirs` | Directories to skip during ingestion | .git,node_modules,\_\_pycache\_\_,.venv,build,dist | `--exclude-dirs "tests,docs"` |
`--batch-size` | Number of records per batch insert | 500 | `--batch-size 1000` |
`--extract-git-info` | Include Git author and commit data | false | `--extract-git-info` |
`--generate-summaries` | Generate AI summaries (costs OpenAI credits) | false | `--generate-summaries` |
`--dry-run` | Preview without inserting data | false | `--dry-run` |
`--cleanup` | Remove temporary files after ingestion | true | `--no-cleanup` |
The ingestion engine supports AST-based parsing for:
- Python (.py) - Functions, classes, methods
- JavaScript/TypeScript (.js, .ts) - Functions, classes, arrow functions
- Java (.java) - Methods, classes, interfaces
- Go (.go) - Functions, methods, structs
- Rust (.rs) - Functions, implementations, traits
- C/C++ (.c, .cpp, .h) - Functions, classes, structs
For unsupported languages, the system falls back to chunk-based extraction.
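For Python sources, this kind of extraction can be done with the standard `ast` module. The snippet below is a minimal illustration of the idea, not the tool's actual parser:

```python
# Minimal illustration of AST-based extraction for Python files (not the
# tool's actual parser): collect functions and classes with their line ranges.
import ast

def extract_chunks(source: str, filepath: str) -> list[dict]:
    chunks = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "function_name": node.name,
                "filepath": filepath,
                "language": "python",
                "line_range": f"{node.lineno}-{node.end_lineno}",
                "content": ast.get_source_segment(source, node),
            })
    return chunks

if __name__ == "__main__":
    sample = "def greet(name):\n    return f'hello {name}'\n"
    print(extract_chunks(sample, "example.py"))
```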
Each ingested code chunk includes:
Field | Description | Source |
---|---|---|
`content` | The actual function/class code | AST parsing |
`filepath` | Relative path within repository | File system |
`language` | Programming language | File extension |
`function_name` | Name of function/class/method | AST parsing |
`repo` | GitHub repository URL | Git remote |
`last_modified` | Last commit timestamp | Git log |
`author` | Code author (if `--extract-git-info`) | Git blame |
`line_range` | Start-end line numbers (if `--extract-git-info`) | AST + file position |
`summary` | AI-generated summary (if `--generate-summaries`) | OpenAI API |
- Batch Size: Larger batches (500-1000) improve throughput but use more memory
- Git Info Extraction: Adds processing time but provides richer metadata
- AI Summaries: Significantly increases processing time and OpenAI costs
- Repository Size: Large repositories (1000+ files) may require 10-30 minutes
- Network: Repository cloning speed depends on internet connection
Starting repository ingestion: https://github.com/psf/requests.git
This may take a few minutes depending on repository size...
Extracted 1,247 code chunks from repository
Starting batch insertion into knowledge base...
Using batch size: 500 for stable insertion
Total records to insert: 1,247
Inserted batch 1/3: 500 records
Inserted batch 2/3: 500 records
Inserted batch 3/3: 247 records
Successfully ingested 1,247 code chunks
Language breakdown:
python: 1,198 chunks
markdown: 31 chunks
yaml: 12 chunks
shell: 6 chunks
# Initialize AI tables
python main.py ai:init
# Analyze code
python main.py ai:analyze "def authenticate_user(username, password): return username == 'admin'" --all
# List AI tables
python main.py ai:list
# Reset AI tables
python main.py ai:reset
AI Analysis Types:
- `--classify`: Code purpose classification
- `--explain`: Natural language explanation
- `--docstring`: Generate documentation
- `--tests`: Suggest test cases
- `--all`: Run all analysis types
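Each analysis type corresponds to a MindsDB AI table: a generative model created with `CREATE MODEL` and queried with a plain `SELECT`. A hedged sketch of one such table, with an illustrative prompt, assuming an OpenAI engine/API key is already configured in MindsDB:

```python
# Hedged sketch (illustrative names and prompt): create one AI table and
# query it roughly the way ai:analyze might.
import mindsdb_sdk

server = mindsdb_sdk.connect("http://127.0.0.1:47334")

server.query("""
CREATE MODEL code_explainer
PREDICT explanation
USING
    engine = 'openai',
    model_name = 'gpt-4o',
    prompt_template = 'Explain what this code does in plain English: {{code}}';
""").fetch()

answer = server.query("""
SELECT explanation
FROM code_explainer
WHERE code = 'def authenticate_user(username, password): return username == "admin"';
""").fetch()

print(answer["explanation"].iloc[0])
```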
# Create sync job
python main.py kb:sync https://github.com/org/repo-name.git
# Custom schedule
python main.py kb:sync https://github.com/org/repo.git --schedule "EVERY 12 HOURS"
# List jobs
python main.py kb:sync:list
# Delete job
python main.py kb:sync:delete sync_github_com_org_repo_git
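The sync commands are a thin wrapper over MindsDB jobs. A rough sketch of the kind of job involved, using MindsDB's `CREATE JOB ... EVERY ...` syntax; the job name mirrors the example above, and the inner statement is a placeholder rather than the tool's actual refresh logic:

```python
# Rough sketch (placeholder job body, illustrative schedule): create, list,
# and delete a recurring MindsDB job, mirroring kb:sync / kb:sync:list /
# kb:sync:delete.
import mindsdb_sdk

server = mindsdb_sdk.connect("http://127.0.0.1:47334")

server.query("""
CREATE JOB sync_github_com_org_repo_git (
    SELECT 1
)
EVERY 12 hours;
""").fetch()

print(server.query("SELECT * FROM jobs;").fetch())

server.query("DROP JOB sync_github_com_org_repo_git;").fetch()
```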
Experience the complete AI-enhanced semantic search workflow:
# Complete workflow demonstration
python demo.py workflow "decorator function" --limit 3
# Create SQL view joining KB with AI tables
python demo.py create-view
# Query integrated workflow view
python demo.py query-view --limit 5
Demo Features:
- Complete pipeline demonstration (KB search to AI analysis)
- Step-by-step workflow output
- SQL view integration
- Professional presentation format
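The `create-view` step builds a regular MindsDB view that feeds knowledge base matches into an AI table, so a single `SELECT` returns both the matched code and its AI analysis. A conceptual sketch with illustrative names; the exact join semantics between a KB query and a model depend on the MindsDB version:

```python
# Conceptual sketch (illustrative names; not the tool's actual view): pair
# KB search hits with AI-generated explanations in one view.
import mindsdb_sdk

server = mindsdb_sdk.connect("http://127.0.0.1:47334")

server.query("""
CREATE VIEW code_search_with_analysis AS (
    SELECT hits.code, hits.relevance, ai.explanation
    FROM (
        SELECT chunk_content AS code, relevance
        FROM codebase_kb
        WHERE content = 'decorator function'
        LIMIT 5
    ) AS hits
    JOIN code_explainer AS ai
);
""").fetch()

print(server.query("SELECT * FROM code_search_with_analysis LIMIT 5;").fetch())
```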
# Check status
python main.py kb:status
# View schema
python main.py kb:schema
# Create index
python main.py kb:index
# Reset knowledge base
python main.py kb:reset --force
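These maintenance commands map onto small SQL statements. A hedged sketch of the index creation, status check, and reset; the exact knowledge base DDL and metadata queries may differ between MindsDB versions:

```python
# Hedged sketch (syntax may vary by MindsDB version): optimize, inspect,
# and reset the knowledge base.
import mindsdb_sdk

server = mindsdb_sdk.connect("http://127.0.0.1:47334")

# kb:index - build/refresh the index for faster search
server.query("CREATE INDEX ON KNOWLEDGE_BASE codebase_kb;").fetch()

# kb:status / kb:schema - inspect the knowledge base
print(server.query("DESCRIBE KNOWLEDGE_BASE codebase_kb;").fetch())

# kb:reset --force - destructive removal
server.query("DROP KNOWLEDGE_BASE codebase_kb;").fetch()
```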
┌─────────────────────────────────────────────────────────────┐
│ Semantic Code Navigator │
├─────────────────────────────────────────────────────────────┤
│ CLI Interface (Click + Rich) │
│ ├── kb:* - Knowledge Base Operations │
│ ├── ai:* - AI Table Management │
│ └── demo:* - Workflow Demonstrations │
├─────────────────────────────────────────────────────────────┤
│ MindsDB Client (Python SDK) │
│ ├── Connection Management │
│ ├── Knowledge Base Operations │
│ ├── AI Table Integration │
│ └── Batch Processing │
├─────────────────────────────────────────────────────────────┤
│ MindsDB Knowledge Base │
│ ├── OpenAI Embeddings (text-embedding-3-large) │
│ ├── OpenAI Reranking (gpt-4o) │
│ ├── Vector Storage & Indexing │
│ └── Metadata Filtering │
├─────────────────────────────────────────────────────────────┤
│ AI Tables (Generative AI Models) │
│ ├── code_classifier - Purpose Classification │
│ ├── code_explainer - Natural Language Explanations │
│ ├── docstring_generator - Documentation Generation │
│ ├── test_case_outliner - Test Case Suggestions │
│ └── result_rationale - Search Match Explanations │
├─────────────────────────────────────────────────────────────┤
│ Git Repository Ingestion Pipeline │
│ ├── Repository Cloning & Discovery │
│ ├── Function/Class Extraction │
│ ├── Metadata Extraction │
│ └── Batch Processing │
└─────────────────────────────────────────────────────────────┘
Result columns:
- `chunk_content`: Code content with embedded metadata
- `chunk_id`: Unique identifier
- `metadata`: MindsDB internal metadata
- `relevance`: Semantic search relevance score
- `distance`: Vector similarity distance

Metadata filter fields:
- `filepath`: Relative path within repository
- `language`: Programming language
- `function_name`: Function/class/method name
- `repo`: GitHub repository URL
- `last_modified`: Git commit timestamp
- `author`: Git commit author (optional)
- `line_range`: Start-end line numbers (optional)
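Concretely, a single search result combining these columns might look like the following (all values are illustrative):

```python
# Illustrative example of one search result row and its metadata fields.
example_result = {
    "chunk_content": "def connect(host, port): ...",
    "chunk_id": "a1b2c3d4",
    "relevance": 0.83,
    "distance": 0.17,
    "metadata": {
        "filepath": "src/db/session.py",
        "language": "python",
        "function_name": "connect",
        "repo": "https://github.com/org/repo",
        "last_modified": "2024-05-02T14:31:00Z",
        "author": "jane@example.com",   # only with --extract-git-info
        "line_range": "12-31",          # only with --extract-git-info
    },
}
```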
- Python, JavaScript, TypeScript, Java, Go, Rust, C/C++
- Fallback chunking for unsupported languages
# Authentication code
python main.py kb:query "user authentication and login validation"
# HTTP handling
python main.py kb:query "http request" --language python --limit 5
# Error patterns
python main.py kb:query "exception handling" --relevance-threshold 0.7
# Test files
python main.py kb:query "test validation" --filepath "*/test*"
# Specific functions
python main.py kb:query "database connection" --function "*connect*"
# Recent changes
python main.py kb:query "authentication" --author "john@example.com" --since "2024-01-01"
# 1. Initialize
python main.py kb:init
# 2. Ingest repository
python main.py kb:ingest https://github.com/psf/requests.git --extract-git-info
# 3. Search
python main.py kb:query "http request handling" --limit 5
# 4. Check status
python main.py kb:status
# 5. Reset for new testing
python main.py kb:reset --force
This project is part of MindsDB Knowledge Base stress testing. Contributions welcome.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details.
The project includes a comprehensive stress testing suite that evaluates the complete workflow across 10 GitHub repositories of varying sizes, from small apps of roughly 50 files to very large projects with 3,000+ files.
- Complete Workflow Testing: Tests KB creation, data ingestion, indexing, semantic search, and AI analysis
- 10 Repository Coverage: From small Flask apps to large projects like Linux kernel and WebKit
- Serial Execution: Tests run one after another to prevent memory issues
- Memory Management: Automatic KB reset after each test to free memory
- Real-time Reporting: Beautiful markdown reports with timestamps and metrics
- Performance Analysis: Tracks ingestion speed, search response times, and success rates
- Failure Analysis: Detailed error reporting and recommendations
Size Category | File Count | Examples | Batch Size |
---|---|---|---|
Small | 50-150 | Flask, Express, Gin examples | 100-300 |
Medium | 200-450 | Django, React, Spring Boot | 200-400 |
Medium-Large | 500-700 | Angular, NestJS, FastAPI | 300-500 |
Large | 800-1000 | Kubernetes, TensorFlow.js | 400-600 |
Very Large | 1200-3000+ | VS Code, Chromium, Linux | 500-1000 |
# Full stress test (10 repositories)
python stress_test.py
# View help
python stress_test.py --help
Each repository test follows this workflow:
- KB Creation - Initialize fresh knowledge base
- AI Tables Setup - Create AI analysis models
- Data Ingestion - Clone repo and extract code chunks
- Index Creation - Optimize for search performance
- Semantic Search - Test 5 different queries
- AI Analysis - Test code classification and explanation
- Cleanup - Reset KB to free memory for next test
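A condensed sketch of that serial loop, driving the CLI commands documented in this README; the repository list and query set are illustrative, not the suite's actual configuration:

```python
# Condensed sketch (illustrative repositories and queries): run the full
# workflow serially and reset the KB between repositories to bound memory.
import subprocess

REPOS = [
    {"url": "https://github.com/pallets/flask.git", "batch_size": 200},
    # ... nine more repositories of increasing size
]

QUERIES = ["authentication", "http request", "error handling",
           "database connection", "test validation"]

def run(*args: str) -> None:
    """Invoke the CLI and fail fast if any step errors out."""
    subprocess.run(["python", "main.py", *args], check=True)

for repo in REPOS:
    run("kb:init", "--force")                            # 1. fresh KB
    run("ai:init")                                       # 2. AI tables
    run("kb:ingest", repo["url"],                        # 3. ingest code chunks
        "--batch-size", str(repo["batch_size"]))
    run("kb:index")                                      # 4. build the index
    for query in QUERIES:                                # 5. semantic search
        run("kb:query", query, "--limit", "5")
    run("ai:analyze", "def add(a, b): return a + b",     # 6. AI analysis
        "--all")
    run("kb:reset", "--force")                           # 7. cleanup
```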
The stress test generates detailed markdown reports including:
- Real-time Progress: Timestamped updates during execution
- Performance Metrics: Ingestion speed, search response times
- Success Rates: Pass/fail statistics for each workflow step
- Memory Usage: Peak memory consumption tracking
- Failure Analysis: Detailed error logs and recommendations
- Comparative Analysis: Performance across different repository sizes
### Testing Repository: django-blog
- **URL:** https://github.com/django/django
- **Estimated Files:** 250
- **Language:** Python
- **Batch Size:** 200
#### Step 1: Knowledge Base Creation
**KB Creation:** Success in 2.34s
#### Step 3: Data Ingestion
**Data Ingestion:** Success in 45.67s
- Files Processed: 247
- Chunks Extracted: 1,234
#### Step 5: Semantic Search Testing
**Semantic Search:** Success
- Queries Tested: 5
- Average Response Time: 1.23s
- Total Results: 47
- MindsDB running locally (`docker-compose up`)
- OpenAI API key configured in `.env`
- Sufficient disk space (~5GB for temporary repositories)
- Stable internet connection for repository cloning
- 8GB+ RAM recommended for large repositories
- Embedding Costs: ~$0.10-0.50 per repository (varies by size)
- AI Analysis Costs: ~$0.05-0.20 per repository
- Total Estimated Cost: $5-10 for full 10-repository test
- No Summary Generation: Disabled to reduce OpenAI costs
The stress test is designed to thoroughly validate the system's reliability, performance, and scalability across diverse codebases while providing actionable insights for optimization.
The Semantic Code Navigator includes a powerful AI agent system that provides specialized code analysis and assistance. These agents have full access to your ingested codebase and can provide expert-level insights in specific domains.
- Template-Based Creation: Pre-configured agent templates for different specializations
- Knowledge Base Integration: Agents have full access to your ingested codebase
- Natural Language Interaction: Query agents with natural language questions
- Specialized Expertise: Each agent is optimized for specific domains (code review, architecture, security)
- Rich Output Formatting: Beautiful formatted responses with structured analysis
Template | Specialization | Model | Description |
---|---|---|---|
`code-reviewer` | Code Review | gpt-4o | Expert code reviewer focusing on security, performance, and best practices |
`architect` | System Architecture | gpt-4o | Software architect for system-level analysis and design patterns |
`security-auditor` | Security Analysis | gpt-4o | Security expert for vulnerability assessment and compliance |
The agent system transforms your ingested codebase into an interactive knowledge resource, providing expert-level analysis and guidance tailored to your specific code and requirements.
- MindsDB for the Knowledge Base platform
- MindsDB Python SDK for integration
- OpenAI for embedding and reranking models