# Generating llms.txt for Code Documentation with DSPy

This tutorial demonstrates how to use DSPy to automatically generate an `llms.txt` file for the DSPy repository itself. The `llms.txt` standard provides LLM-friendly documentation that helps AI systems better understand codebases.

## What is llms.txt?

`llms.txt` is a proposed standard for providing structured, LLM-friendly documentation about a project. It typically includes:

- Project overview and purpose
- Key concepts and terminology
- Architecture and structure
- Usage examples
- Important files and directories
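
Concretely, the proposal (llmstxt.org) suggests a plain Markdown file: an H1 title, an optional one-line blockquote summary, and H2 sections containing annotated links. A rough skeleton (the section names and URLs here are illustrative placeholders, not part of any real project):

```
# Project Name

> One-sentence summary of what the project does.

## Docs

- [Getting started](https://example.com/start): Short description of the page

## Examples

- [Tutorials](https://example.com/tutorials): Worked end-to-end examples
```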

## Building a DSPy Program for llms.txt Generation

Let's create a DSPy program that analyzes a repository and generates comprehensive `llms.txt` documentation.

### Step 1: Define Our Signatures

First, we'll define signatures for different aspects of documentation generation:

```python
import dspy
from typing import List

class AnalyzeRepository(dspy.Signature):
    """Analyze a repository structure and identify key components."""
    repo_url: str = dspy.InputField(desc="GitHub repository URL")
    file_tree: str = dspy.InputField(desc="Repository file structure")
    readme_content: str = dspy.InputField(desc="README.md content")

    project_purpose: str = dspy.OutputField(desc="Main purpose and goals of the project")
    key_concepts: List[str] = dspy.OutputField(desc="List of important concepts and terminology")
    architecture_overview: str = dspy.OutputField(desc="High-level architecture description")

class AnalyzeCodeStructure(dspy.Signature):
    """Analyze code structure to identify important directories and files."""
    file_tree: str = dspy.InputField(desc="Repository file structure")
    package_files: str = dspy.InputField(desc="Key package and configuration files")

    important_directories: List[str] = dspy.OutputField(desc="Key directories and their purposes")
    entry_points: List[str] = dspy.OutputField(desc="Main entry points and important files")
    development_info: str = dspy.OutputField(desc="Development setup and workflow information")

class GenerateLLMsTxt(dspy.Signature):
    """Generate a comprehensive llms.txt file from analyzed repository information."""
    project_purpose: str = dspy.InputField()
    key_concepts: List[str] = dspy.InputField()
    architecture_overview: str = dspy.InputField()
    important_directories: List[str] = dspy.InputField()
    entry_points: List[str] = dspy.InputField()
    development_info: str = dspy.InputField()
    usage_examples: str = dspy.InputField(desc="Common usage patterns and examples")

    llms_txt_content: str = dspy.OutputField(desc="Complete llms.txt file content following the standard format")
```

### Step 2: Create the Repository Analyzer Module

```python
class RepositoryAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.analyze_repo = dspy.ChainOfThought(AnalyzeRepository)
        self.analyze_structure = dspy.ChainOfThought(AnalyzeCodeStructure)
        self.generate_examples = dspy.ChainOfThought("repo_info -> usage_examples")
        self.generate_llms_txt = dspy.ChainOfThought(GenerateLLMsTxt)

    def forward(self, repo_url, file_tree, readme_content, package_files):
        # Analyze repository purpose and concepts
        repo_analysis = self.analyze_repo(
            repo_url=repo_url,
            file_tree=file_tree,
            readme_content=readme_content
        )

        # Analyze code structure
        structure_analysis = self.analyze_structure(
            file_tree=file_tree,
            package_files=package_files
        )

        # Generate usage examples
        usage_examples = self.generate_examples(
            repo_info=f"Purpose: {repo_analysis.project_purpose}\nConcepts: {repo_analysis.key_concepts}"
        )

        # Generate final llms.txt
        llms_txt = self.generate_llms_txt(
            project_purpose=repo_analysis.project_purpose,
            key_concepts=repo_analysis.key_concepts,
            architecture_overview=repo_analysis.architecture_overview,
            important_directories=structure_analysis.important_directories,
            entry_points=structure_analysis.entry_points,
            development_info=structure_analysis.development_info,
            usage_examples=usage_examples.usage_examples
        )

        return dspy.Prediction(
            llms_txt_content=llms_txt.llms_txt_content,
            analysis=repo_analysis,
            structure=structure_analysis
        )
```

### Step 3: Gather Repository Information

Let's create helper functions to extract repository information:

```python
import base64
import os
import requests

def get_github_file_tree(repo_url):
    """Get the repository file structure from the GitHub API."""
    # Extract owner/repo from the URL
    parts = repo_url.rstrip('/').split('/')
    owner, repo = parts[-2], parts[-1]

    # Note: assumes the default branch is named "main"
    api_url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/main?recursive=1"
    response = requests.get(api_url)

    if response.status_code == 200:
        tree_data = response.json()
        file_paths = [item['path'] for item in tree_data['tree'] if item['type'] == 'blob']
        return '\n'.join(sorted(file_paths))
    else:
        raise Exception(f"Failed to fetch repository tree: {response.status_code}")

def get_github_file_content(repo_url, file_path):
    """Get a specific file's content from GitHub."""
    parts = repo_url.rstrip('/').split('/')
    owner, repo = parts[-2], parts[-1]

    api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{file_path}"
    response = requests.get(api_url)

    if response.status_code == 200:
        content = base64.b64decode(response.json()['content']).decode('utf-8')
        return content
    else:
        return f"Could not fetch {file_path}"

def gather_repository_info(repo_url):
    """Gather all necessary repository information."""
    file_tree = get_github_file_tree(repo_url)
    readme_content = get_github_file_content(repo_url, "README.md")

    # Collect key package and configuration files
    package_files = []
    for file_path in ["pyproject.toml", "setup.py", "requirements.txt", "package.json"]:
        try:
            content = get_github_file_content(repo_url, file_path)
            if "Could not fetch" not in content:
                package_files.append(f"=== {file_path} ===\n{content}")
        except Exception:
            continue

    package_files_content = "\n\n".join(package_files)

    return file_tree, readme_content, package_files_content
```
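
The owner/repo parsing is duplicated across both GitHub helpers above. As a small optional refactor, it can be pulled into its own function (the name `parse_repo_url` is our own, not part of the tutorial's code):

```python
def parse_repo_url(repo_url):
    """Extract (owner, repo) from a GitHub URL, tolerating trailing slashes and .git suffixes."""
    parts = repo_url.rstrip('/').split('/')
    owner, repo = parts[-2], parts[-1]
    # Handle the clone-URL form, e.g. https://github.com/owner/repo.git
    if repo.endswith('.git'):
        repo = repo[:-4]
    return owner, repo
```

Both helpers could then start with `owner, repo = parse_repo_url(repo_url)`.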

### Step 4: Configure DSPy and Generate llms.txt

```python
import os

def generate_llms_txt_for_dspy():
    # Set credentials, then configure DSPy (use your preferred LM)
    os.environ["OPENAI_API_KEY"] = "<YOUR OPENAI KEY>"
    lm = dspy.LM("openai/gpt-4o-mini")
    dspy.configure(lm=lm)

    # Initialize our analyzer
    analyzer = RepositoryAnalyzer()

    # Gather DSPy repository information
    repo_url = "https://github.com/stanfordnlp/dspy"
    file_tree, readme_content, package_files = gather_repository_info(repo_url)

    # Generate llms.txt
    result = analyzer(
        repo_url=repo_url,
        file_tree=file_tree,
        readme_content=readme_content,
        package_files=package_files
    )

    return result

# Run the generation
if __name__ == "__main__":
    result = generate_llms_txt_for_dspy()

    # Save the generated llms.txt
    with open("llms.txt", "w") as f:
        f.write(result.llms_txt_content)

    print("Generated llms.txt file!")
    print("\nPreview:")
    print(result.llms_txt_content[:500] + "...")
```
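
One practical caveat: the full DSPy file tree is large, and passing every path to the LM can crowd the context window. A simple mitigation (our own addition, not part of the tutorial's pipeline) is to cap the tree before handing it to the analyzer:

```python
def truncate_file_tree(file_tree, max_lines=500):
    """Keep at most max_lines paths, noting how many were omitted."""
    lines = file_tree.split('\n')
    if len(lines) <= max_lines:
        return file_tree
    omitted = len(lines) - max_lines
    return '\n'.join(lines[:max_lines] + [f"... ({omitted} more files omitted)"])
```

You would then call `analyzer(..., file_tree=truncate_file_tree(file_tree), ...)`; the cap is a judgment call that depends on your model's context length.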

## Expected Output Structure

The generated `llms.txt` for DSPy would follow this structure:

```
# DSPy: Programming Language Models

## Project Overview
DSPy is a framework for programming—rather than prompting—language models...

## Key Concepts
- **Modules**: Building blocks for LM programs
- **Signatures**: Input/output specifications
- **Teleprompters**: Optimization algorithms
- **Predictors**: Core reasoning components

## Architecture
- `/dspy/`: Main package directory
  - `/adapters/`: Input/output format handlers
  - `/clients/`: LM client interfaces
  - `/predict/`: Core prediction modules
  - `/teleprompt/`: Optimization algorithms

## Usage Examples
1. **Building a Classifier**: Using DSPy, a user can define a modular classifier that takes in text data and categorizes it into predefined classes. The user can specify the classification logic declaratively, allowing for easy adjustments and optimizations.
2. **Creating a RAG Pipeline**: A developer can implement a retrieval-augmented generation pipeline that first retrieves relevant documents based on a query and then generates a coherent response using those documents. DSPy facilitates the integration of retrieval and generation components seamlessly.
3. **Optimizing Prompts**: Users can leverage DSPy to create a system that automatically optimizes prompts for language models based on performance metrics, improving the quality of responses over time without manual intervention.
4. **Implementing Agent Loops**: A user can design an agent loop that continuously interacts with users, learns from feedback, and refines its responses, showcasing the self-improving capabilities of the DSPy framework.
5. **Compositional Code**: Developers can write compositional code that allows different modules of the AI system to interact with each other, enabling complex workflows that can be easily modified and extended.
```

The resulting `llms.txt` file provides a comprehensive, LLM-friendly overview of the DSPy repository that can help other AI systems better understand and work with the codebase.

## Next Steps

- Extend the program to analyze multiple repositories
- Add support for different documentation formats
- Create metrics for documentation quality assessment
- Build a web interface for interactive repository analysis