This Bash script prepares large codebases for AI analysis by consolidating files into a single text file. It is designed to optimize token usage when feeding code into AI assistants (such as Claude or GPT) for codebase review and questions.
When working with AI assistants on large code projects, token limitations can prevent uploading the entire codebase. This tool solves that problem by:
- Including all text-based files with their content (code, markdown, config files, etc.)
- Only listing media file paths without their binary content
- Creating a single, well-formatted text file that provides the AI with a complete view of your codebase structure
Typical use cases include:

- Feeding entire codebases to AI code review tools like Claude or ChatGPT
- Generating documentation snapshots of entire projects
- Code auditing and analysis across multiple files
- Creating backups of text-based project files
- Consolidating documentation spread across multiple files
- Creating searchable archives of text content
- Merging configuration files for analysis
- Collecting logs or data files for processing
- Preparing codebases for LLM-assisted refactoring or debugging
- Creating training datasets from code repositories
- Generating comprehensive project overviews for new team members
- Archiving project states at specific milestones
- Aggregating CSV, JSON, or other structured data files
- Collecting configuration files for batch processing
- Merging scattered text files into single documents
- Create a local bin directory and move the script:

  ```bash
  mkdir -p ~/bin
  mv combiner.sh ~/bin/combiner
  chmod +x ~/bin/combiner
  ```

- Add `~/bin` to your PATH (if not already there):

  ```bash
  echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
  source ~/.bashrc
  ```

- Now you can use `combiner` from anywhere:

  ```bash
  combiner /path/to/your/project
  ```
Alternatively, install system-wide:

```bash
sudo mv combiner.sh /usr/local/bin/combiner
sudo chmod +x /usr/local/bin/combiner
```

Or point an alias at wherever the script lives:

```bash
echo "alias combiner='/path/to/your/combiner.sh'" >> ~/.bashrc
source ~/.bashrc
```
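Whichever method you use, you can confirm the command resolves (a quick sanity check, not part of the script itself):

```bash
command -v combiner   # should print the installed path
```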
```bash
combiner <directory>
```

Example:

```bash
combiner ~/projects/my-web-app
```
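Because AI assistants have context limits, it can help to check the output size before uploading. A rough check (the chars-per-token ratio is a common rule of thumb, not exact):

```bash
wc -c combined_files.txt   # bytes; tokens are very roughly bytes / 4
```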
- Recursively traverses the specified directory
- Automatically excludes common directories that waste tokens (a sketch of this logic follows the list below):
  - Node.js: `node_modules`, `.npm`, debug logs
  - Python: `__pycache__`, `venv`, `build`, `dist`, `*.egg-info`
  - Laravel/PHP: `vendor`, `storage/logs`, `bootstrap/cache`
  - General: `.git`, `.vscode`, `.idea`, cache folders, build artifacts
- For code and text files: Includes both file path and complete content
- For media files (gif, jpg, jpeg, png, psd, svg, eps): Includes only the file path to save tokens
- Creates a single output file (`combined_files.txt`) containing only your actual codebase
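To make this behavior concrete, here is a minimal Bash sketch of the traverse/exclude/concatenate logic. The prune list is abbreviated and the `== path ==` header format is an assumption; the real script's exclusion list and output layout may differ:

```bash
#!/usr/bin/env bash
# Minimal sketch of the documented behavior -- not the actual combiner script.
# Walk a tree, prune dependency/build folders, dump text files in full,
# and list media files by path only.
root="${1:-.}"
out="combined_files.txt"
: > "$out"   # truncate any previous output

while IFS= read -r f; do
  case "${f##*.}" in
    gif|jpg|jpeg|png|psd|svg|eps)
      # Media file: record the path only, skip the binary content
      printf '%s (media file, content omitted)\n\n' "$f" >> "$out" ;;
    *)
      # Text/code file: record the path followed by the full contents
      printf '== %s ==\n' "$f" >> "$out"
      cat "$f" >> "$out"
      printf '\n' >> "$out" ;;
  esac
done < <(find "$root" \
  \( -name node_modules -o -name vendor -o -name __pycache__ \
     -o -name venv -o -name .git -o -name dist \) \
  -prune -o -type f ! -name "$out" -print)   # skip the output file itself
```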
- Run this script on your project's root directory
- The script automatically filters out dependencies and build artifacts for optimal token usage
- Upload the resulting `combined_files.txt` to your AI assistant
- Ask specific questions about your code structure, implementation details, or potential improvements
- Reference specific files or components in your questions
- Run: `combiner ~/projects/my-web-app`
- Upload the generated `combined_files.txt` to your AI assistant
- Ask: "Can you explain how the authentication flow works across the codebase?"
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request with a clear description
- Ensure your contributions are compatible with GPL v3
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
What this means:
- ✅ You can use, modify, and distribute this software
- ✅ You can use it commercially
- ⚠️ Any derivative works must also be licensed under GPL v3
- ⚠️ You must include the license and copyright notice
- ⚠️ You must disclose the source code of any distributed modifications
- The script intelligently excludes dependency folders (`node_modules`, `vendor`, etc.) and build artifacts to optimize token usage
- For very large projects, you might want to run this on specific subdirectories (see the example after this list)
- The script preserves file paths, making it easy for the AI to understand project structure
- Binary and media files are just listed by path to prevent token waste on non-textual content
- Output includes a summary showing how many directories were skipped for transparency
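For example, to consolidate just one part of a large project (the path is illustrative):

```bash
combiner ~/projects/my-web-app/src
```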