HTMLStandaloneBuilder prunes search index in full builds #11120

@Rubyfi

Describe the bug

The HTMLStandaloneBuilder always runs the prune function of IndexBuilder, disregarding whether a full or an incremental build is performed.
For larger projects, this results in a significant performance impact:
[Image: 203964205-18aea334-5a1d-434f-baaf-0afbad68a3f9]

sphinx-build init            0:00:36.326489
sphinx-build read            0:00:53.654105
sphinx-build checks          0:00:01.543727
sphinx-build prepare write   0:00:00.020864
sphinx-build write           0:06:05.105934
sphinx-build copy, dump      0:10:08.679523 <-

As is evident from the function execution times, most of the dumping step is spent in prune.
This function is used to get rid of references to files that were deleted between two builds (you can find the code here).
It does so by intersecting the existing mappings of the search index with the documents that were detected in the read step.
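
To illustrate the intersection described above, here is a minimal sketch on toy data. The `mapping` dictionary and the document names are made up for illustration. It mirrors only the `intersection_update` calls of `prune`, not the whole method.

```python
# Toy word -> docnames mapping, standing in for IndexBuilder._mapping
# (hypothetical data, not taken from a real project).
mapping = {
    "install": {"index", "setup", "old_page"},
    "usage": {"index", "old_page"},
}

# Documents detected in the current read step; "old_page" was deleted.
found_docs = {"index", "setup"}

for docnames in mapping.values():
    # Keep only references to documents that still exist.
    docnames.intersection_update(found_docs)

print(mapping)
```

Every set in the mapping is visited once, so the cost grows with the number of indexed words, even when nothing has to be removed.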

Calling the prune function in a full build does not seem to alter the search index mappings at all (since they were built using the info from the same run's read step).
I tested this by comparing the sizes of the index mappings before and after the intersection was performed.
Additionally, I compared the sizes of the search index files of two full builds, one where the index was pruned, and one where it wasn't.
While the order of the entries differed (because of parallel reading), the sizes were exactly the same:

❯ stat -c "%n %s" searchindex_purge.js
searchindex_purge.js 53530275
❯ stat -c "%n %s" searchindex_no_purge.js
searchindex_no_purge.js 53530275
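
The before/after comparison of the mapping sizes can be sketched roughly like this. `mapping_size` is a hypothetical helper, not part of Sphinx, and the data is a toy stand-in for a full build in which every indexed document was also found in the read step.

```python
def mapping_size(mapping):
    """Total number of (word, docname) references in a word -> docnames mapping."""
    return sum(len(docnames) for docnames in mapping.values())


# Toy state of a full build: the index only references documents that were just read.
mapping = {"install": {"index", "setup"}, "usage": {"index"}}
found_docs = {"index", "setup"}

before = mapping_size(mapping)
for docnames in mapping.values():
    docnames.intersection_update(found_docs)
after = mapping_size(mapping)

print(before, after)
```

Equal sizes before and after mean the intersection removed nothing, which matches what I observed in the real project.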

In conclusion, it seems to be safe to skip the pruning step in cold builds.
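
One possible shape for such a skip, as a sketch only: `prune_if_stale` and its parameters are hypothetical and not Sphinx's API. It assumes the set of indexed docnames is available and only runs the expensive intersection when the index references a document that was not found.

```python
from typing import Dict, Iterable, Set


def prune_if_stale(
    mapping: Dict[str, Set[str]],
    indexed: Set[str],
    found: Iterable[str],
) -> bool:
    """Prune `mapping` only if `indexed` contains docnames missing from `found`.

    Returns True if pruning was performed, False if it was skipped.
    """
    found = set(found)
    if indexed <= found:
        # Full build (or nothing deleted): every indexed document still exists.
        return False
    for docnames in mapping.values():
        docnames.intersection_update(found)
    indexed.intersection_update(found)
    return True


# Full build: nothing to prune, so the loop over all words is skipped.
full = {"install": {"index", "setup"}}
skipped = prune_if_stale(full, {"index", "setup"}, ["index", "setup"])

# Incremental build after deleting "old_page": pruning runs.
stale = {"usage": {"index", "old_page"}}
pruned = prune_if_stale(stale, {"index", "old_page"}, ["index"])
```

The subset check scales with the number of documents, while the pruning loop scales with the number of indexed words, which is where the time goes in my project.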

How to Reproduce

Unfortunately, I cannot share details on the project I'm working on.
However, while the performance issue might not be evident in other projects, this behavior can be replicated by simply running a full build.

If you use the following extension, the pruning will be logged to console:

from typing import Iterable

from sphinx.util import logging  # pylint: disable=no-name-in-module
from sphinx.search import IndexBuilder
logger = logging.getLogger(__name__)  # module name, not the literal string "__name__"

def setup(app):
    logger.info("Patching pruning routine")
    IndexBuilder.prune = _prune
    return {"parallel_read_safe": True, "parallel_write_safe": True, "version": "1.0.0"}

def _prune(self, docnames: Iterable[str]) -> None:
    """Remove data for all docnames not in the list."""
    new_titles = {}
    new_alltitles = {}
    new_filenames = {}
    for docname in docnames:
        if docname in self._titles:
            new_titles[docname] = self._titles[docname]
            new_alltitles[docname] = self._all_titles[docname]
            new_filenames[docname] = self._filenames[docname]
    self._titles = new_titles
    self._filenames = new_filenames
    self._all_titles = new_alltitles
    logger.info("Pruning search index")
    for wordnames in self._mapping.values():
        wordnames.intersection_update(docnames)
    for wordnames in self._title_mapping.values():
        wordnames.intersection_update(docnames)

Environment Information

Platform:              darwin; (macOS-13.0.1-arm64-arm-64bit)
Python version:        3.10.5 (main, Jul 21 2022, 10:19:31) [Clang 13.0.0 (clang-1300.0.27.3)])
Python implementation: CPython
Sphinx version:        6.1.2
Docutils version:      0.19
Jinja2 version:        3.1.2
Pygments version:      2.13.0
