ChunkHound

Modern RAG for your codebase - semantic and regex search via MCP.

License: MIT | 100% AI Generated

Transform your codebase into a searchable knowledge base for AI assistants, combining semantic search (built on the cAST chunking algorithm) with regex search. ChunkHound integrates with AI assistants via the Model Context Protocol (MCP).

Features

  • cAST Algorithm - Research-backed semantic code chunking
  • Semantic search - Natural language queries like "find authentication code"
  • Regex search - Pattern matching without API keys
  • Local-first - Your code stays on your machine
  • 22 languages with structured parsing
    • Programming (via Tree-sitter): Python, JavaScript, TypeScript, JSX, TSX, Java, Kotlin, Groovy, C, C++, C#, Go, Rust, Bash, MATLAB, Makefile
    • Configuration (via Tree-sitter): JSON, YAML, TOML, Markdown
    • Text-based (custom parsers): Text files, PDF
  • MCP integration - Works with Claude, VS Code, Cursor, Windsurf, Zed, and other MCP clients

Documentation

Visit ofriw.github.io/chunkhound for complete guides.

Requirements

Installation

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ChunkHound
uv tool install chunkhound

Quick Start

Option 1: With Embeddings (Recommended)

  1. Create a .chunkhound.json file in the project root
{
  "embedding": {
    "provider": "openai",
    "api_key": "your-api-key-here"
  }
}
  2. Index your codebase
chunkhound index
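
If you prefer not to paste the API key by hand, the config file above can be generated by a small script. The following is a minimal sketch, not part of ChunkHound: the helper names `build_config` and `write_config` are hypothetical, and the only facts taken from this README are the file name `.chunkhound.json` and the `embedding.provider` / `embedding.api_key` fields.

```python
import json
from pathlib import Path


def build_config(api_key: str, provider: str = "openai") -> dict:
    """Return the config structure shown in the Quick Start (hypothetical helper)."""
    return {"embedding": {"provider": provider, "api_key": api_key}}


def write_config(project_root: Path, api_key: str) -> Path:
    """Write .chunkhound.json into project_root and return its path (hypothetical helper)."""
    path = project_root / ".chunkhound.json"
    path.write_text(json.dumps(build_config(api_key), indent=2) + "\n")
    return path
```

One way to use it, assuming your key lives in an environment variable of your choosing, is to call `write_config(Path.cwd(), os.environ["OPENAI_API_KEY"])` and then run `chunkhound index` as in step 2.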

Option 2: Without Embeddings (Regex Search Only)

chunkhound index --no-embeddings

For configuration, IDE setup, and advanced usage, see the documentation.

Why ChunkHound?

Research Foundation: Built on the cAST (Chunking via Abstract Syntax Trees) algorithm from Carnegie Mellon University, providing:

  • 4.3-point gain in Recall@5 on RepoEval retrieval
  • 2.67-point gain in Pass@1 on SWE-bench generation
  • Structure-aware chunking that preserves code meaning

Local-First Architecture:

  • Your code never leaves your machine
  • Works offline with Ollama local models
  • No per-token charges for large codebases

Universal Language Support:

  • Structured parsing for 22 languages (Tree-sitter + custom parsers)
  • Same semantic concepts across all programming languages

License

MIT
