Commit 1f0e417

Viicos and pawamoy authored
refactor: Actually generate llms.txt file as per the specification
Issue-1: #1
PR-4: #4
Co-authored-by: Timothée Mazzucotelli <dev@pawamoy.fr>
1 parent 583ac1e commit 1f0e417

File tree

5 files changed

+189
-73
lines changed

README.md

+52-11
Original file line number | Diff line number | Diff line change
@@ -22,30 +22,71 @@ pip install mkdocs-llmstxt
2222
Enable the plugin in `mkdocs.yml`:
2323

2424
```yaml title="mkdocs.yml"
25+
site_name: My project
26+
site_description: Description of my project.
27+
site_url: https://myproject.com/ # Required for the llmstxt plugin to work.
28+
2529
plugins:
2630
- llmstxt:
27-
files:
28-
- output: llms.txt
29-
inputs:
31+
markdown_description: Long description of my project.
32+
sections:
33+
Usage documentation:
3034
- file1.md
31-
- folder/file2.md
35+
- file2.md
36+
```
37+
38+
The resulting `/llms.txt` file will be available at the root of your documentation. With the previous example, it will be accessible at https://myproject.com/llms.txt and will contain the following:
39+
40+
```markdown
41+
# My project
42+
43+
> Description of my project.
44+
45+
Long description of my project.
46+
47+
## Usage documentation
48+
49+
- [File1 title](https://myproject.com/file1.md)
50+
- [File2 title](https://myproject.com/file2.md)
3251
```
3352

34-
You can generate several files, each from its own set of input files.
53+
Each source file included in `sections` will have its own Markdown file available at the specified URL in the `/llms.txt`. See [Markdown generation](#markdown-generation) for more details.
3554

3655
File globbing is supported:
3756

3857
```yaml title="mkdocs.yml"
3958
plugins:
4059
- llmstxt:
41-
files:
42-
- output: llms.txt
43-
inputs:
44-
- file1.md
45-
- reference/*/*.md
60+
sections:
61+
Usage documentation:
62+
- index.md
63+
- usage/*.md
4664
```
4765

48-
The plugin will concatenate the rendered HTML of these input pages, clean it up a bit (with [BeautifulSoup](https://pypi.org/project/beautifulsoup4/)), convert it back to Markdown (with [Markdownify](https://pypi.org/project/markdownify)), and format it (with [Mdformat](https://pypi.org/project/mdformat)). By concatenating HTML instead of Markdown, we ensure that dynamically generated contents (API documentation, executed code blocks, snippets from other files, Jinja macros, etc.) are part of the generated text files. Credits to [Petyo Ivanov](https://github.com/petyosi) for the original idea ✨
66+
## Full output
67+
68+
Although not explicitly written out in the https://llmstxt.org/ guidelines, it is common to output a `llms-full.txt` file with every page content expanded. This file can be generated by setting the `full_output` configuration value:
69+
70+
```markdown
71+
plugins:
72+
- llmstxt:
73+
full_output: llms-full.txt
74+
sections:
75+
Usage documentation:
76+
- index.md
77+
- usage/*.md
78+
```
79+
80+
## Markdown generation
81+
82+
To generate a Markdown page from a source file, the plugin will:
83+
84+
- Cleanup the HTML output (with [BeautifulSoup](https://pypi.org/project/beautifulsoup4/))
85+
- Convert it back to Markdown (with [Markdownify](https://pypi.org/project/markdownify))
86+
87+
Doing so is necessary to ensure that dynamically generated contents (API documentation, executed code blocks, snippets from other files, Jinja macros, etc.) are part of the generated text files.
88+
89+
Credits to [Petyo Ivanov](https://github.com/petyosi) for the original idea ✨.
4990

5091
You can disable auto-cleaning of the HTML:
5192

mkdocs.yml

+6-4
@@ -133,11 +133,13 @@ plugins:
133133
signature_crossrefs: true
134134
summary: true
135135
- llmstxt:
136-
files:
137-
- output: llms-full.txt
138-
inputs:
136+
full_output: llms-full.txt
137+
markdown_description: This plugin automatically generates llms.txt files.
138+
sections:
139+
Usage documentation:
139140
- index.md
140-
- reference/**.md
141+
API reference:
142+
- reference/*.md
141143
- git-revision-date-localized:
142144
enabled: !ENV [DEPLOY, false]
143145
enable_creation_date: true
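
A note on the `reference/**.md` to `reference/*.md` change: the plugin imports `fnmatch`, and in Python's `fnmatch` a `*` matches any characters, path separators included, so a single star can already reach nested pages. The body of `_expand_inputs` is not shown in this diff, so the following is only a sketch of how such an expansion might behave; the `expand_inputs` helper and the sample page list are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_inputs(patterns: list[str], page_uris: list[str]) -> list[str]:
    """Expand glob patterns against known page URIs (hypothetical helper)."""
    expanded: list[str] = []
    for pattern in patterns:
        if "*" in pattern:
            # fnmatch's `*` matches any characters, including "/".
            matches = [uri for uri in page_uris if fnmatchcase(uri, pattern)]
        else:
            # Plain paths are kept as written.
            matches = [pattern]
        for uri in matches:
            if uri not in expanded:
                expanded.append(uri)
    return expanded


# Sample page URIs, invented for illustration.
pages = ["index.md", "reference/api.md", "reference/sub/deep.md", "usage/a.md"]
print(expand_inputs(["index.md", "reference/*.md"], pages))
```

With these sample pages, `reference/*.md` selects both `reference/api.md` and the nested `reference/sub/deep.md`.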

src/mkdocs_llmstxt/_internal/config.py

+3-8
@@ -6,16 +6,11 @@
66
from mkdocs.config.base import Config as BaseConfig
77

88

9-
class _FileConfig(BaseConfig):
10-
"""Sub-config for each Markdown file."""
11-
12-
output = mkconf.Type(str)
13-
inputs = mkconf.ListOfItems(mkconf.Type(str))
14-
15-
169
class _PluginConfig(BaseConfig):
1710
"""Configuration options for the plugin."""
1811

1912
autoclean = mkconf.Type(bool, default=True)
2013
preprocess = mkconf.Optional(mkconf.File(exists=True))
21-
files = mkconf.ListOfItems(mkconf.SubConfig(_FileConfig))
14+
markdown_description = mkconf.Optional(mkconf.Type(str))
15+
full_output = mkconf.Optional(mkconf.Type(str))
16+
sections = mkconf.DictOfItems(mkconf.ListOfItems(mkconf.Type(str)))

src/mkdocs_llmstxt/_internal/plugin.py

+126-49
@@ -3,18 +3,18 @@
33
from __future__ import annotations
44

55
import fnmatch
6-
from collections import defaultdict
76
from itertools import chain
87
from pathlib import Path
9-
from typing import TYPE_CHECKING
8+
from typing import TYPE_CHECKING, NamedTuple, cast
9+
from urllib.parse import urljoin
1010

1111
import mdformat
1212
from bs4 import BeautifulSoup as Soup
1313
from bs4 import Tag
1414
from markdownify import ATX, MarkdownConverter
1515
from mkdocs.config.defaults import MkDocsConfig
16-
from mkdocs.exceptions import PluginError
1716
from mkdocs.plugins import BasePlugin
17+
from mkdocs.structure.pages import Page
1818

1919
from mkdocs_llmstxt._internal.config import _PluginConfig
2020
from mkdocs_llmstxt._internal.logger import _get_logger
@@ -31,6 +31,13 @@
3131
_logger = _get_logger(__name__)
3232

3333

34+
class _MDPageInfo(NamedTuple):
35+
title: str
36+
path_md: Path
37+
md_url: str
38+
content: str
39+
40+
3441
class MkdocsLLMsTxtPlugin(BasePlugin[_PluginConfig]):
3542
"""The MkDocs plugin to generate an `llms.txt` file.
3643
@@ -46,9 +53,8 @@ class MkdocsLLMsTxtPlugin(BasePlugin[_PluginConfig]):
4653
mkdocs_config: MkDocsConfig
4754
"""The global MkDocs configuration."""
4855

49-
def __init__(self) -> None:
50-
self.html_pages: dict[str, dict[str, str]] = defaultdict(dict)
51-
"""Dictionary to store the HTML contents of pages."""
56+
md_pages: dict[str, list[_MDPageInfo]]
57+
"""Dictionary mapping section names to a list of page infos."""
5258

5359
def _expand_inputs(self, inputs: list[str], page_uris: list[str]) -> list[str]:
5460
expanded: list[str] = []
@@ -72,7 +78,12 @@ def on_config(self, config: MkDocsConfig) -> MkDocsConfig | None:
7278
Returns:
7379
The same, untouched config.
7480
"""
81+
if config.site_url is None:
82+
raise ValueError("'site_url' must be set in the MkDocs configuration to be used with the 'llmstxt' plugin")
7583
self.mkdocs_config = config
84+
# A `defaultdict` could be used, but we need to retain the same order between `config.sections` and `md_pages`
85+
# (which wouldn't be guaranteed when filling `md_pages` in `on_page_content()`).
86+
self.md_pages = {section: [] for section in self.config.sections}
7687
return config
7788

7889
def on_files(self, files: Files, *, config: MkDocsConfig) -> Files | None: # noqa: ARG002
@@ -88,64 +99,130 @@ def on_files(self, files: Files, *, config: MkDocsConfig) -> Files | None: # no
8899
Returns:
89100
Modified collection or none.
90101
"""
91-
for file in self.config.files:
92-
file["inputs"] = self._expand_inputs(file["inputs"], page_uris=list(files.src_uris.keys()))
102+
page_uris = list(files.src_uris)
103+
104+
for section_name, file_list in list(self.config.sections.items()):
105+
self.config.sections[section_name] = self._expand_inputs(file_list, page_uris=page_uris)
106+
93107
return files
94108

95109
def on_page_content(self, html: str, *, page: Page, **kwargs: Any) -> str | None: # noqa: ARG002
96-
"""Record pages contents.
110+
"""Convert page content into a Markdown file and save the result to be processed in the `on_post_build` hook.
97111
98112
Hook for the [`on_page_content` event](https://www.mkdocs.org/user-guide/plugins/#on_page_content).
99-
In this hook we simply record the HTML of the pages into a dictionary whose keys are the pages' URIs.
100113
101114
Parameters:
102115
html: The rendered HTML.
103116
page: The page object.
104117
"""
105-
for file in self.config.files:
106-
if page.file.src_uri in file["inputs"]:
107-
_logger.debug(f"Adding page {page.file.src_uri} to page {file['output']}")
108-
self.html_pages[file["output"]][page.file.src_uri] = html
118+
for section_name, file_list in self.config.sections.items():
119+
if page.file.src_uri in file_list:
120+
path_md = Path(page.file.abs_dest_path).with_suffix(".md")
121+
page_md = _generate_page_markdown(
122+
html,
123+
should_autoclean=self.config.autoclean,
124+
preprocess=self.config.preprocess,
125+
path=str(path_md),
126+
)
127+
128+
md_url = Path(page.file.dest_uri).with_suffix(".md").as_posix()
129+
# Apply the same logic as in the `Page.url` property.
130+
if md_url in (".", "./"):
131+
md_url = ""
132+
133+
# Guaranteed to exist as we require `site_url` to be configured.
134+
base = cast("str", self.mkdocs_config.site_url)
135+
if not base.endswith("/"):
136+
base += "/"
137+
md_url = urljoin(base, md_url)
138+
139+
self.md_pages[section_name].append(
140+
_MDPageInfo(
141+
title=page.title if page.title is not None else page.file.src_uri,
142+
path_md=path_md,
143+
md_url=md_url,
144+
content=page_md,
145+
),
146+
)
147+
109148
return html
110149

111-
def on_post_build(self, config: MkDocsConfig, **kwargs: Any) -> None: # noqa: ARG002
112-
"""Combine all recorded pages contents and convert it to a Markdown file with BeautifulSoup and Markdownify.
150+
def on_post_build(self, *, config: MkDocsConfig, **kwargs: Any) -> None: # noqa: ARG002
151+
"""Create the final `llms.txt` file and the MD files for all selected pages.
113152
114153
Hook for the [`on_post_build` event](https://www.mkdocs.org/user-guide/plugins/#on_post_build).
115-
In this hook we concatenate all previously recorded HTML, and convert it to Markdown using Markdownify.
116154
117155
Parameters:
118156
config: MkDocs configuration.
119157
"""
120-
121-
def language_callback(tag: Tag) -> str:
122-
for css_class in chain(tag.get("class") or (), (tag.parent.get("class") or ()) if tag.parent else ()):
123-
if css_class.startswith("language-"):
124-
return css_class[9:]
125-
return ""
126-
127-
converter = MarkdownConverter(
128-
bullets="-",
129-
code_language_callback=language_callback,
130-
escape_underscores=False,
131-
heading_style=ATX,
132-
)
133-
134-
for file in self.config.files:
135-
try:
136-
html = "\n\n".join(self.html_pages[file["output"]][input_page] for input_page in file["inputs"])
137-
except KeyError as error:
138-
raise PluginError(str(error)) from error
139-
140-
soup = Soup(html, "html.parser")
141-
if self.config.autoclean:
142-
autoclean(soup)
143-
if self.config.preprocess:
144-
_preprocess(soup, self.config.preprocess, file["output"])
145-
146-
output_file = Path(config.site_dir).joinpath(file["output"])
147-
output_file.parent.mkdir(parents=True, exist_ok=True)
148-
markdown = mdformat.text(converter.convert_soup(soup), options={"wrap": "no"})
149-
output_file.write_text(markdown, encoding="utf8")
150-
151-
_logger.info(f"Generated file /{file['output']}")
158+
output_file = Path(config.site_dir).joinpath("llms.txt")
159+
output_file.parent.mkdir(parents=True, exist_ok=True)
160+
markdown = f"# {config.site_name}\n\n"
161+
162+
if config.site_description is not None:
163+
markdown += f"> {config.site_description}\n\n"
164+
165+
if self.config.markdown_description is not None:
166+
markdown += f"{self.config.markdown_description}\n\n"
167+
168+
full_markdown = markdown
169+
170+
for section_name, file_list in self.md_pages.items():
171+
markdown += f"## {section_name}\n\n"
172+
for page_title, path_md, md_url, content in file_list:
173+
path_md.write_text(content, encoding="utf8")
174+
_logger.debug(f"Generated MD file to {path_md}")
175+
markdown += f"- [{page_title}]({md_url})\n"
176+
markdown += "\n"
177+
178+
output_file.write_text(markdown, encoding="utf8")
179+
_logger.debug("Generated file /llms.txt")
180+
181+
if self.config.full_output is not None:
182+
full_output_file = Path(config.site_dir).joinpath(self.config.full_output)
183+
for section_name, file_list in self.md_pages.items():
184+
list_content = "\n".join(info.content for info in file_list)
185+
full_markdown += f"# {section_name}\n\n{list_content}"
186+
full_output_file.write_text(full_markdown, encoding="utf8")
187+
_logger.debug(f"Generated file /{self.config.full_output}.txt")
188+
189+
190+
def _language_callback(tag: Tag) -> str:
191+
for css_class in chain(tag.get("class") or (), (tag.parent.get("class") or ()) if tag.parent else ()):
192+
if css_class.startswith("language-"):
193+
return css_class[9:]
194+
return ""
195+
196+
197+
_converter = MarkdownConverter(
198+
bullets="-",
199+
code_language_callback=_language_callback,
200+
escape_underscores=False,
201+
heading_style=ATX,
202+
)
203+
204+
205+
def _generate_page_markdown(
206+
html: str,
207+
*,
208+
should_autoclean: bool,
209+
preprocess: str | None,
210+
path: str,
211+
) -> str:
212+
"""Convert HTML to Markdown.
213+
214+
Parameters:
215+
html: The HTML content.
216+
should_autoclean: Whether to autoclean the HTML.
217+
preprocess: An optional path of a Python module containing a `preprocess` function.
218+
path: The output path of the relevant Markdown file.
219+
220+
Returns:
221+
The Markdown content.
222+
"""
223+
soup = Soup(html, "html.parser")
224+
if should_autoclean:
225+
autoclean(soup)
226+
if preprocess:
227+
_preprocess(soup, preprocess, path)
228+
return mdformat.text(_converter.convert_soup(soup), options={"wrap": "no"})
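
The trailing-slash handling in `on_page_content` matters because of how `urljoin` resolves relative references. The sketch below isolates that step using only the standard library; `markdown_url` is a hypothetical helper, `PurePosixPath` stands in for the `Path` used in the plugin, and the URLs are invented for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin


def markdown_url(site_url: str, dest_uri: str) -> str:
    """Build the public URL of a page's Markdown twin (hypothetical helper)."""
    # Swap the destination suffix (usually ".html") for ".md".
    md_path = PurePosixPath(dest_uri).with_suffix(".md").as_posix()
    # Without a trailing slash, urljoin replaces the last path segment.
    base = site_url if site_url.endswith("/") else site_url + "/"
    return urljoin(base, md_path)


print(markdown_url("https://myproject.com", "usage/index.html"))
print(markdown_url("https://example.org/docs", "file1.html"))
# For comparison, joining without the trailing slash drops "docs":
print(urljoin("https://example.org/docs", "file1.md"))
```

The last call shows the failure mode the plugin guards against: a `site_url` that includes a sub-path but no trailing slash would otherwise lose that sub-path.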

src/mkdocs_llmstxt/_internal/preprocess.py

+2-1
@@ -2,6 +2,7 @@
22

33
from __future__ import annotations
44

5+
import html
56
import sys
67
from importlib.util import module_from_spec, spec_from_file_location
78
from typing import TYPE_CHECKING
@@ -98,4 +99,4 @@ def autoclean(soup: Soup) -> None:
9899

99100
# Remove line numbers from code blocks.
100101
for element in soup.find_all("table", attrs={"class": "highlighttable"}):
101-
element.replace_with(Soup(f"<pre>{element.find('code').get_text()}</pre>", "html.parser")) # type: ignore[union-attr]
102+
element.replace_with(Soup(f"<pre>{html.escape(element.find('code').get_text())}</pre>", "html.parser")) # type: ignore[union-attr]
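
The `html.escape` call addresses a round-trip problem: `get_text()` returns unescaped text, so code containing `<` or `&` would be re-parsed as markup once placed back inside `<pre>`. The sketch below reproduces the effect with the standard library's `HTMLParser` rather than BeautifulSoup, so it illustrates the principle and is not the plugin's own code path; `_Collector`, `parse_fragment`, and the sample string are hypothetical.

```python
import html
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Record start tags and text seen while parsing (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


def parse_fragment(markup: str) -> tuple[list[str], str]:
    collector = _Collector()
    collector.feed(markup)
    collector.close()
    return collector.tags, "".join(collector.chunks)


code = "x: List<int> = []"  # sample code text, invented for illustration

# Unescaped: "<int>" is parsed as a tag and vanishes from the text.
print(parse_fragment(f"<pre>{code}</pre>"))

# Escaped: the text survives the round trip intact.
print(parse_fragment(f"<pre>{html.escape(code)}</pre>"))
```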
