Neural search for your starred GitHub repositories using txtai embeddings.
GitHub Stars Search lets you search your starred GitHub repositories by meaning rather than by exact keywords. It extracts README files and repository metadata, splits them into chunks, and builds embeddings for semantic search.
Key features:
- Fetches your starred GitHub repositories and their metadata
- Extracts README files (preferably in English)
- Processes content using intelligent chunking strategies
- Generates embeddings using BAAI/bge-small-en-v1.5 via txtai
- Provides hybrid search combining neural embeddings and BM25 keyword search
- Stores all data and embeddings on disk for reuse
- Supports incremental updates for newly starred repositories
- Configurable search parameters and embedding models
- Clone the repository:
git clone https://github.com/yourusername/github_stars.git
cd github_stars
- Install dependencies:
pip install -r requirements.txt
- Set up your GitHub API key:
Create a .env file in the project root with your GitHub API key:
GITHUB_STARS_KEY=your_github_api_key
You can generate a GitHub API key (a personal access token) from your GitHub Developer Settings.
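As an illustration of how a script could pick up this key, here is a minimal sketch that reads GITHUB_STARS_KEY from the process environment and falls back to the .env file. The helper names `load_env_file` and `get_github_token` are hypothetical and are not taken from this project, which may load the file differently (for example with a dotenv library).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a file and return them as a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_github_token(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    token = os.environ.get("GITHUB_STARS_KEY") or load_env_file(path).get("GITHUB_STARS_KEY")
    if not token:
        raise RuntimeError("GITHUB_STARS_KEY is not set")
    return token
```

Checking the environment first means the key can also be supplied by a shell export or a CI secret without creating a .env file.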
Before searching, you need to fetch and process your starred repositories:
python github_stars_search.py update
This will:
- Fetch your starred repositories from GitHub
- Extract README files and metadata
- Process content into chunks
- Generate embeddings
- Store everything on disk
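To make the first step concrete, the sketch below pages through the GitHub REST endpoint for the authenticated user's starred repositories using only the standard library. It is not this project's client: the function names are hypothetical, the `limit` parameter only mirrors the idea of the --limit flag, and retries, timeouts, and README extraction are left out.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.github.com/user/starred"


def build_starred_request(token, page=1, per_page=100):
    """Build (but do not send) a request for one page of starred repositories."""
    query = urllib.parse.urlencode({"per_page": per_page, "page": page})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def fetch_starred(token, limit=None):
    """Yield starred repositories page by page until the list is exhausted."""
    page, seen = 1, 0
    while True:
        with urllib.request.urlopen(build_starred_request(token, page)) as resp:
            repos = json.load(resp)
        if not repos:
            # An empty page means there is nothing left to fetch.
            return
        for repo in repos:
            yield repo
            seen += 1
            if limit is not None and seen >= limit:
                return
        page += 1
```

Each yielded item is the repository object returned by the API, so fields such as `full_name`, `language`, and `stargazers_count` are available for the later filtering and indexing steps.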
You can limit the number of repositories to update:
python github_stars_search.py update --limit 100
Or force an update of all repositories, even if they haven't changed:
python github_stars_search.py update --force
Once you've updated your repository data, you can search through it:
python github_stars_search.py search "machine learning for time series"
You can apply filters to your search:
python github_stars_search.py search "machine learning" --min-stars 100 --language python
And customize the search weights:
python github_stars_search.py search "machine learning" --neural-weight 0.8 --keyword-weight 0.2
For a more user-friendly experience, you can use the included web interface:
# Install Flask if you don't have it
pip install flask
# Run the web interface
python examples/web_interface.py
This will start a local web server at http://127.0.0.1:5000 where you can:
- Search your repositories with a simple form
- Adjust neural and keyword search weights
- Filter by language and minimum stars
- View nicely formatted search results
You can view and modify the configuration:
python github_stars_search.py config --show
Set a different embedding model:
python github_stars_search.py config --embedding-model "sentence-transformers/all-MiniLM-L6-v2"
Change search weights:
python github_stars_search.py config --neural-weight 0.6 --keyword-weight 0.4
View information about your data:
python github_stars_search.py info
The config.yaml file contains various settings that you can customize:
- GitHub API settings (pagination, retries, timeout)
- Content processing settings (chunk size, strategy)
- Embedding settings (model, device, batch size)
- Search settings (weights, result limits)
- Storage settings (compression, backups)
The system uses a hybrid chunking strategy:
- First attempts to chunk by semantic sections (headers)
- For large sections, applies sliding window chunking with overlap
- Preserves repository context in each chunk
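The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's chunker: it assumes Markdown `#` headers mark sections, measures size in whitespace-separated words, and preserves context by prefixing the repository name. The names `chunk_readme` and `sliding_windows` and the default sizes are hypothetical.

```python
import re


def sliding_windows(words, size, overlap):
    """Split a word list into windows of `size` words sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the section.
        if start + size >= len(words):
            break
    return windows


def chunk_readme(repo_name, text, max_words=200, overlap=40):
    """Chunk by Markdown headers, then window any section that is too long."""
    # Split just before each header line so the header stays with its section.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        if len(words) <= max_words:
            pieces = [" ".join(words)]
        else:
            pieces = sliding_windows(words, max_words, overlap)
        # Prefix the repository name so each chunk keeps its context.
        chunks.extend(f"[{repo_name}] {piece}" for piece in pieces)
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some text twice.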
The search engine combines two approaches:
- Neural search using embeddings for semantic understanding
- BM25 keyword search for traditional relevance
Results are merged with configurable weights to provide the most relevant repositories.
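One common way to merge two rankings with weights is to normalize each score set and take a weighted sum. The sketch below shows that idea; how this project or txtai actually normalizes and combines scores is not stated here, so the min-max normalization and the function names are assumptions. The default weights reuse the values from the --neural-weight and --keyword-weight example.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores are equal, so treat every document as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def merge_scores(neural, keyword, neural_weight=0.8, keyword_weight=0.2):
    """Combine two score mappings into one ranking by weighted sum."""
    neural_n = min_max_normalize(neural)
    keyword_n = min_max_normalize(keyword)
    combined = {}
    for doc_id in set(neural_n) | set(keyword_n):
        # A document missing from one ranking contributes 0 from that side.
        combined[doc_id] = (
            neural_weight * neural_n.get(doc_id, 0.0)
            + keyword_weight * keyword_n.get(doc_id, 0.0)
        )
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Normalizing first matters because BM25 scores are unbounded while embedding similarities usually fall in a narrow range, so a raw weighted sum would let the keyword side dominate regardless of the weights.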
The project includes a pytest-based test suite.
First, install the testing dependencies:
# Using the provided script
./install_test_deps.sh
# Or manually
pip install pytest pytest-cov pytest-mock
Then run the tests:
# Using the provided scripts
./run_tests.sh # Run all tests
./run_specific_test.sh tests/test_github_client.py # Run a specific test file
./run_marked_tests.sh unit # Run tests with a specific marker
# Or using pytest directly
cd github_stars
pytest
pytest --cov=src # Run with coverage
Tests are categorized using pytest markers:
- unit: Unit tests that test individual components in isolation
- integration: Integration tests that test the interaction between components
- api: Tests that interact with the GitHub API (requires a valid API key)
- slow: Tests that are slow to run
To run tests with a specific marker:
pytest -m unit
To run tests excluding a specific marker:
pytest -m "not api"
See the tests README for more information.
License: MIT