Commit 94299c7

add llms.txt generation tutorial (#8421)
1 parent 7b12fd4 commit 94299c7

2 files changed: +246 -0 lines changed
Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
# Generating llms.txt for Code Documentation with DSPy

This tutorial demonstrates how to use DSPy to automatically generate an `llms.txt` file for the DSPy repository itself. The `llms.txt` standard provides LLM-friendly documentation that helps AI systems better understand codebases.

## What is llms.txt?

`llms.txt` is a proposed standard for providing structured, LLM-friendly documentation about a project. It typically includes:

- Project overview and purpose
- Key concepts and terminology
- Architecture and structure
- Usage examples
- Important files and directories
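
As a rough illustration of the five parts listed above, the sketch below assembles them into a Markdown-style `llms.txt` string using only the standard library. The function name `render_llms_txt`, its parameters, and the section headings are hypothetical choices made for this example; they are not taken from the llms.txt proposal or from DSPy.

```python
def render_llms_txt(name, overview, concepts, architecture, examples, key_files):
    """Assemble a minimal llms.txt-style document (hypothetical layout)."""
    def bullets(items):
        # Render a list of strings as Markdown bullet points
        return "\n".join(f"- {item}" for item in items)

    sections = [
        f"# {name}",
        f"## Project Overview\n{overview}",
        f"## Key Concepts\n{bullets(concepts)}",
        f"## Architecture\n{architecture}",
        f"## Usage Examples\n{bullets(examples)}",
        f"## Important Files\n{bullets(key_files)}",
    ]
    # Separate sections with a blank line and end with a newline
    return "\n\n".join(sections) + "\n"


doc = render_llms_txt(
    name="example-project",
    overview="A placeholder project used to show the layout.",
    concepts=["Widget", "Pipeline"],
    architecture="A single package with a CLI entry point.",
    examples=["Run the CLI on a sample input."],
    key_files=["README.md", "pyproject.toml"],
)
print(doc)
```

In the tutorial itself the language model writes this text; the sketch only shows the shape of the target document.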

## Building a DSPy Program for llms.txt Generation

Let's create a DSPy program that analyzes a repository and generates comprehensive `llms.txt` documentation.

### Step 1: Define Our Signatures

First, we'll define signatures for different aspects of documentation generation:

```python
import dspy
from typing import List

class AnalyzeRepository(dspy.Signature):
    """Analyze a repository structure and identify key components."""
    repo_url: str = dspy.InputField(desc="GitHub repository URL")
    file_tree: str = dspy.InputField(desc="Repository file structure")
    readme_content: str = dspy.InputField(desc="README.md content")

    project_purpose: str = dspy.OutputField(desc="Main purpose and goals of the project")
    key_concepts: List[str] = dspy.OutputField(desc="List of important concepts and terminology")
    architecture_overview: str = dspy.OutputField(desc="High-level architecture description")

class AnalyzeCodeStructure(dspy.Signature):
    """Analyze code structure to identify important directories and files."""
    file_tree: str = dspy.InputField(desc="Repository file structure")
    package_files: str = dspy.InputField(desc="Key package and configuration files")

    important_directories: List[str] = dspy.OutputField(desc="Key directories and their purposes")
    entry_points: List[str] = dspy.OutputField(desc="Main entry points and important files")
    development_info: str = dspy.OutputField(desc="Development setup and workflow information")

class GenerateLLMsTxt(dspy.Signature):
    """Generate a comprehensive llms.txt file from analyzed repository information."""
    project_purpose: str = dspy.InputField()
    key_concepts: List[str] = dspy.InputField()
    architecture_overview: str = dspy.InputField()
    important_directories: List[str] = dspy.InputField()
    entry_points: List[str] = dspy.InputField()
    development_info: str = dspy.InputField()
    usage_examples: str = dspy.InputField(desc="Common usage patterns and examples")

    llms_txt_content: str = dspy.OutputField(desc="Complete llms.txt file content following the standard format")
```

### Step 2: Create the Repository Analyzer Module

```python
class RepositoryAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.analyze_repo = dspy.ChainOfThought(AnalyzeRepository)
        self.analyze_structure = dspy.ChainOfThought(AnalyzeCodeStructure)
        self.generate_examples = dspy.ChainOfThought("repo_info -> usage_examples")
        self.generate_llms_txt = dspy.ChainOfThought(GenerateLLMsTxt)

    def forward(self, repo_url, file_tree, readme_content, package_files):
        # Analyze repository purpose and concepts
        repo_analysis = self.analyze_repo(
            repo_url=repo_url,
            file_tree=file_tree,
            readme_content=readme_content
        )

        # Analyze code structure
        structure_analysis = self.analyze_structure(
            file_tree=file_tree,
            package_files=package_files
        )

        # Generate usage examples
        usage_examples = self.generate_examples(
            repo_info=f"Purpose: {repo_analysis.project_purpose}\nConcepts: {repo_analysis.key_concepts}"
        )

        # Generate final llms.txt
        llms_txt = self.generate_llms_txt(
            project_purpose=repo_analysis.project_purpose,
            key_concepts=repo_analysis.key_concepts,
            architecture_overview=repo_analysis.architecture_overview,
            important_directories=structure_analysis.important_directories,
            entry_points=structure_analysis.entry_points,
            development_info=structure_analysis.development_info,
            usage_examples=usage_examples.usage_examples
        )

        return dspy.Prediction(
            llms_txt_content=llms_txt.llms_txt_content,
            analysis=repo_analysis,
            structure=structure_analysis
        )
```
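
In `forward`, `key_concepts` is a list, so interpolating it directly into the f-string passes its Python `repr` (brackets and quotes) to the model. That works, but a plainer rendering may be easier to read in the prompt. The helper below is a hypothetical alternative, not part of DSPy; `format_repo_info` and its arguments are names invented for this sketch.

```python
def format_repo_info(project_purpose, key_concepts):
    """Render the purpose and concepts as plain text for the `repo_info` input."""
    # One bullet per concept instead of the list's Python repr
    concept_lines = "\n".join(f"- {concept}" for concept in key_concepts)
    return f"Purpose: {project_purpose}\nConcepts:\n{concept_lines}"


info = format_repo_info(
    "Framework for programming language models",
    ["Signatures", "Modules"],
)
print(info)
```

Inside `forward`, this could replace the f-string as `repo_info=format_repo_info(repo_analysis.project_purpose, repo_analysis.key_concepts)`.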

### Step 3: Gather Repository Information

Let's create helper functions to extract repository information:

```python
import base64
import os

import requests

def get_github_file_tree(repo_url):
    """Get repository file structure from GitHub API."""
    # Extract owner/repo from URL
    parts = repo_url.rstrip('/').split('/')
    owner, repo = parts[-2], parts[-1]

    # Note: this assumes the repository's default branch is named "main"
    api_url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/main?recursive=1"
    response = requests.get(api_url, timeout=30)

    if response.status_code == 200:
        tree_data = response.json()
        # Keep files only ("blob" entries) and skip directory entries
        file_paths = [item['path'] for item in tree_data['tree'] if item['type'] == 'blob']
        return '\n'.join(sorted(file_paths))
    else:
        raise RuntimeError(f"Failed to fetch repository tree: {response.status_code}")

def get_github_file_content(repo_url, file_path):
    """Get specific file content from GitHub."""
    parts = repo_url.rstrip('/').split('/')
    owner, repo = parts[-2], parts[-1]

    api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{file_path}"
    response = requests.get(api_url, timeout=30)

    if response.status_code == 200:
        # The contents API returns the file body base64-encoded
        content = base64.b64decode(response.json()['content']).decode('utf-8')
        return content
    else:
        return f"Could not fetch {file_path}"

def gather_repository_info(repo_url):
    """Gather all necessary repository information."""
    file_tree = get_github_file_tree(repo_url)
    readme_content = get_github_file_content(repo_url, "README.md")

    # Get key package files
    package_files = []
    for file_path in ["pyproject.toml", "setup.py", "requirements.txt", "package.json"]:
        try:
            content = get_github_file_content(repo_url, file_path)
            # Skip files that are absent (the helper returns a sentinel string)
            if not content.startswith("Could not fetch"):
                package_files.append(f"=== {file_path} ===\n{content}")
        except Exception:
            # Network or decoding problems: skip this file and keep going
            continue

    package_files_content = "\n\n".join(package_files)

    return file_tree, readme_content, package_files_content
```
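
The helpers above split the URL by hand and call the GitHub API without authentication, which GitHub rate-limits more strictly than authenticated calls. The sketch below shows one way to make both steps a little more robust. `parse_repo_slug` and `github_headers` are hypothetical helpers, and reading the token from a `GITHUB_TOKEN` environment variable is an assumption of this example.

```python
import os
from urllib.parse import urlparse


def parse_repo_slug(repo_url):
    """Return (owner, repo) from a GitHub URL, tolerating a trailing slash or '.git'."""
    path = urlparse(repo_url).path.strip("/")
    parts = path.split("/")
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"Not a repository URL: {repo_url}")
    # Use the first two path segments so URLs with extra segments still parse
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo


def github_headers(token=None):
    """Build request headers; add a bearer token when one is available."""
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


print(parse_repo_slug("https://github.com/stanfordnlp/dspy/"))
```

The headers could then be passed to the existing calls, for example `requests.get(api_url, headers=github_headers(), timeout=30)`.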

### Step 4: Configure DSPy and Generate llms.txt

```python
def generate_llms_txt_for_dspy():
    # Set the API key before creating the LM (replace the placeholder with your key)
    os.environ["OPENAI_API_KEY"] = "<YOUR OPENAI KEY>"

    # Configure DSPy (use your preferred LM)
    lm = dspy.LM(model="gpt-4o-mini")
    dspy.configure(lm=lm)

    # Initialize our analyzer
    analyzer = RepositoryAnalyzer()

    # Gather DSPy repository information
    repo_url = "https://github.com/stanfordnlp/dspy"
    file_tree, readme_content, package_files = gather_repository_info(repo_url)

    # Generate llms.txt
    result = analyzer(
        repo_url=repo_url,
        file_tree=file_tree,
        readme_content=readme_content,
        package_files=package_files
    )

    return result

# Run the generation
if __name__ == "__main__":
    result = generate_llms_txt_for_dspy()

    # Save the generated llms.txt (UTF-8, since the output may contain non-ASCII text)
    with open("llms.txt", "w", encoding="utf-8") as f:
        f.write(result.llms_txt_content)

    print("Generated llms.txt file!")
    print("\nPreview:")
    print(result.llms_txt_content[:500] + "...")
```
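
For a large repository, the full file tree can run to thousands of lines, which costs tokens and may exceed the model's context window. One possible mitigation is to filter and cap the tree before passing it to the analyzer. `trim_file_tree`, its default suffix list, and the 500-line cap are all hypothetical choices for this sketch.

```python
def trim_file_tree(file_tree, keep_suffixes=(".py", ".md", ".toml", ".yml"), max_lines=500):
    """Keep only paths with the given suffixes and cap the number of lines."""
    # str.endswith accepts a tuple, so one call checks every suffix
    paths = [p for p in file_tree.splitlines() if p.endswith(keep_suffixes)]
    omitted = len(paths) - max_lines
    if omitted > 0:
        # Tell the model that the listing was truncated
        paths = paths[:max_lines] + [f"... ({omitted} more files omitted)"]
    return "\n".join(paths)


tree = "README.md\ndspy/__init__.py\ndocs/logo.png\ndspy/predict/predict.py"
print(trim_file_tree(tree, max_lines=2))
```

It would be applied between gathering and analysis, for example `file_tree = trim_file_tree(file_tree)` before calling `analyzer(...)`.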

## Expected Output Structure

The generated `llms.txt` for DSPy should follow roughly this structure. Because the text is LM-generated, the exact wording will vary between runs:

```
# DSPy: Programming Language Models

## Project Overview
DSPy is a framework for programming—rather than prompting—language models...

## Key Concepts
- **Modules**: Building blocks for LM programs
- **Signatures**: Input/output specifications
- **Teleprompters**: Optimization algorithms
- **Predictors**: Core reasoning components

## Architecture
- `/dspy/`: Main package directory
  - `/adapters/`: Input/output format handlers
  - `/clients/`: LM client interfaces
  - `/predict/`: Core prediction modules
  - `/teleprompt/`: Optimization algorithms

## Usage Examples
1. **Building a Classifier**: Using DSPy, a user can define a modular classifier that takes in text data and categorizes it into predefined classes. The user can specify the classification logic declaratively, allowing for easy adjustments and optimizations.
2. **Creating a RAG Pipeline**: A developer can implement a retrieval-augmented generation pipeline that first retrieves relevant documents based on a query and then generates a coherent response using those documents. DSPy facilitates the integration of retrieval and generation components seamlessly.
3. **Optimizing Prompts**: Users can leverage DSPy to create a system that automatically optimizes prompts for language models based on performance metrics, improving the quality of responses over time without manual intervention.
4. **Implementing Agent Loops**: A user can design an agent loop that continuously interacts with users, learns from feedback, and refines its responses, showcasing the self-improving capabilities of the DSPy framework.
5. **Compositional Code**: Developers can write compositional code that allows different modules of the AI system to interact with each other, enabling complex workflows that can be easily modified and extended.
```

The resulting `llms.txt` file provides a comprehensive, LLM-friendly overview of the DSPy repository that can help other AI systems better understand and work with the codebase.

## Next Steps

- Extend the program to analyze multiple repositories
- Add support for different documentation formats
- Create metrics for documentation quality assessment
- Build a web interface for interactive repository analysis
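
As a starting point for the quality-metric idea above, the sketch below scores a generated file by the fraction of expected section headings it contains. The `(example, pred, trace=None)` argument order follows the convention DSPy uses for metric functions; the function name, the heading list (taken from the expected output shown earlier), and the scoring rule are assumptions made for illustration, and `SimpleNamespace` stands in for a real prediction object.

```python
from types import SimpleNamespace

# Headings assumed to be required; adjust to the layout you expect
EXPECTED_SECTIONS = (
    "## Project Overview",
    "## Key Concepts",
    "## Architecture",
    "## Usage Examples",
)


def section_coverage(example, pred, trace=None):
    """Return the fraction of expected section headings present in the output."""
    lines = {line.strip() for line in pred.llms_txt_content.splitlines()}
    found = sum(1 for heading in EXPECTED_SECTIONS if heading in lines)
    return found / len(EXPECTED_SECTIONS)


pred = SimpleNamespace(
    llms_txt_content="# Demo\n\n## Project Overview\ntext\n\n## Key Concepts\n- a\n"
)
score = section_coverage(example=None, pred=pred)
print(score)
```

A structural check like this says nothing about whether the content is accurate, so it is best treated as one signal among several.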

docs/mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -59,6 +59,8 @@ nav:
  - Tracking DSPy Optimizers: tutorials/optimizer_tracking/index.md
  - Streaming: tutorials/streaming/index.md
  - Async: tutorials/async/index.md
+ - Real-World Examples:
+     - Generating llms.txt: tutorials/llms_txt_generation/index.md
  - DSPy in Production: production/index.md
  - Community:
      - Community Resources: community/community-resources.md
