Description
Describe the bug
The StandaloneHTMLBuilder always runs the prune function of IndexBuilder, regardless of whether a full or an incremental build is performed.
For larger projects, this results in a significant performance impact:
sphinx-build init 0:00:36.326489
sphinx-build read 0:00:53.654105
sphinx-build checks 0:00:01.543727
sphinx-build prepare write 0:00:00.020864
sphinx-build write 0:06:05.105934
sphinx-build copy, dump 0:10:08.679523 <-
As is evident from the function execution times, most of the dumping step is spent in prune.
This function is used to get rid of references to files that were deleted between two builds (you can find the code here).
It does so by intersecting the existing mappings of the search index with the documents that were detected in the read step.
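The intersection described above can be illustrated with a minimal sketch. The mapping contents and the helper name prune_mapping are made up for illustration; only the use of set.intersection_update mirrors the pruning code quoted later in this report.

```python
# Toy stand-in for the word -> docnames mapping kept by the search index.
# The words and document names are hypothetical.
mapping = {
    "sphinx": {"index", "usage", "removed_page"},
    "build": {"usage", "removed_page"},
    "orphan": {"removed_page"},
}


def prune_mapping(mapping, docnames):
    """Drop every docname that is not in `docnames` from each entry, in place."""
    for wordnames in mapping.values():
        wordnames.intersection_update(docnames)


# Documents detected by the read step of the current build.
current_docs = ["index", "usage"]
prune_mapping(mapping, current_docs)
print(mapping)
```

After the call, references to removed_page are gone from every entry, and a word that only occurred in the deleted document is left with an empty set.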
Calling the prune function in a full build does not seem to alter the search index mappings at all (since they were built using the info from the same run's read step).
I tested this by comparing the sizes of the index mappings before and after the intersection was performed.
Additionally I compared the sizes of the search index files of two full builds, one where the index was pruned, and one where it wasn't.
While the order of the entries differed (because of parallel reading), the sizes were exactly the same:
❯ stat -c "%n %s" searchindex_purge.js
searchindex_purge.js 53530275
❯ stat -c "%n %s" searchindex_no_purge.js
searchindex_no_purge.js 53530275
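The before/after comparison of the mapping sizes can be sketched roughly as follows. This is not the exact check that was run; mapping_size is a hypothetical helper and the data is a toy example in which, as in a full build, the mapping only refers to documents from the same read step.

```python
def mapping_size(mapping):
    """Total number of (word, docname) pairs in a word -> docnames mapping."""
    return sum(len(docnames) for docnames in mapping.values())


# Hypothetical data: every docname in the mapping comes from `docs`,
# which is what a full build produces.
docs = ["index", "usage", "api"]
mapping = {
    "sphinx": {"index", "usage"},
    "build": {"usage", "api"},
}

before = mapping_size(mapping)
for wordnames in mapping.values():
    wordnames.intersection_update(docs)
after = mapping_size(mapping)

print(before, after)  # 4 4
```

Equal sizes before and after indicate that the intersection removed nothing, which matches the observation for full builds.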
In conclusion, it seems to be safe to skip the pruning step in cold builds.
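One way such a skip could look is sketched below. Everything in it is an assumption for illustration: ToySearchIndex, finish_build, and the loaded_from_previous_build flag are hypothetical names and do not correspond to Sphinx's actual classes or builder hooks.

```python
class ToySearchIndex:
    """Hypothetical stand-in for a search index; not Sphinx's IndexBuilder."""

    def __init__(self):
        self.mapping = {}
        self.prune_calls = 0

    def feed(self, docname, words):
        """Record that each word occurs in the given document."""
        for word in words:
            self.mapping.setdefault(word, set()).add(docname)

    def prune(self, docnames):
        """Remove references to documents that are not in `docnames`."""
        self.prune_calls += 1
        keep = set(docnames)
        for wordnames in self.mapping.values():
            wordnames.intersection_update(keep)


def finish_build(index, all_docs, loaded_from_previous_build):
    """Prune only when the index may hold entries from an earlier build."""
    if loaded_from_previous_build:
        index.prune(all_docs)


# Cold build: the index was filled from this run only, so pruning is skipped.
cold = ToySearchIndex()
cold.feed("index", ["sphinx", "build"])
finish_build(cold, ["index"], loaded_from_previous_build=False)

# Incremental build: the index still references a deleted document.
warm = ToySearchIndex()
warm.feed("index", ["sphinx"])
warm.feed("deleted", ["sphinx"])
finish_build(warm, ["index"], loaded_from_previous_build=True)
```

In this sketch the cold build never calls prune, while the incremental build still drops the stale reference.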
How to Reproduce
Unfortunately, I cannot share details on the project I'm working on.
However, while the performance issue might not be evident in other projects, this behavior can be replicated by simply running a full build.
If you use the following extension, the pruning will be logged to console:
from typing import Iterable

from sphinx.search import IndexBuilder
from sphinx.util import logging  # pylint: disable=no-name-in-module

logger = logging.getLogger(__name__)


def setup(app):
    # Replace IndexBuilder.prune with a copy that logs every invocation.
    logger.info("Patching pruning routine")
    IndexBuilder.prune = _prune
    return {"parallel_read_safe": True, "parallel_write_safe": True, "version": "1.0.0"}


def _prune(self, docnames: Iterable[str]) -> None:
    """Remove data for all docnames not in the list."""
    new_titles = {}
    new_alltitles = {}
    new_filenames = {}
    for docname in docnames:
        if docname in self._titles:
            new_titles[docname] = self._titles[docname]
            new_alltitles[docname] = self._all_titles[docname]
            new_filenames[docname] = self._filenames[docname]
    self._titles = new_titles
    self._filenames = new_filenames
    self._all_titles = new_alltitles
    logger.info("Pruning search index")
    # Drop references to documents that are no longer part of the build.
    for wordnames in self._mapping.values():
        wordnames.intersection_update(docnames)
    for wordnames in self._title_mapping.values():
        wordnames.intersection_update(docnames)
Environment Information
Platform: darwin; (macOS-13.0.1-arm64-arm-64bit)
Python version: 3.10.5 (main, Jul 21 2022, 10:19:31) [Clang 13.0.0 (clang-1300.0.27.3)])
Python implementation: CPython
Sphinx version: 6.1.2
Docutils version: 0.19
Jinja2 version: 3.1.2
Pygments version: 2.13.0