XWIKI-23239: Improve Solr indexing speed through parallelization and batch processing #4235

Merged
michitux merged 2 commits into xwiki:master from XWIKI-23239 on Jun 6, 2025

Conversation

michitux
Contributor

@michitux michitux commented Jun 4, 2025

Jira URL

https://jira.xwiki.org/browse/XWIKI-23239

Changes

Description

  • Collect indexed documents and submit them in batches to Solr for more
    efficient processing and less overhead due to repeated calls to Solr.
  • Move all Solr client requests into a separate executor to allow the
    next batch to be prepared while the previous one is committed.
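
The two points above can be illustrated with a minimal sketch. This is not the code from this PR: the class name `PipelinedBatchSender`, the `sender` callback (standing in for the Solr client add and commit calls), and the choice to allow at most one batch in flight are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/**
 * Hypothetical sketch: collects items into batches and sends each full batch
 * on a separate single-threaded executor, so the caller can prepare the next
 * batch while the previous one is still being sent.
 */
final class PipelinedBatchSender<T> implements AutoCloseable {
    private final int maxBatchSize;
    // Stands in for the real Solr client call; an assumption of this sketch.
    private final Consumer<List<T>> sender;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private List<T> current = new ArrayList<>();
    private Future<?> pending;

    PipelinedBatchSender(int maxBatchSize, Consumer<List<T>> sender) {
        this.maxBatchSize = maxBatchSize;
        this.sender = sender;
    }

    /** Adds one item and sends the batch once it is full. */
    void add(T item) {
        current.add(item);
        if (current.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Hands the current batch to the executor without waiting for it to finish. */
    void flush() {
        if (current.isEmpty()) {
            return;
        }
        // Wait for the previous batch first: at most one batch is in flight,
        // which bounds memory usage and surfaces errors early.
        awaitPending();
        List<T> batch = current;
        current = new ArrayList<>();
        pending = executor.submit(() -> sender.accept(batch));
    }

    private void awaitPending() {
        if (pending == null) {
            return;
        }
        try {
            pending.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Sending a batch failed", e.getCause());
        } finally {
            pending = null;
        }
    }

    /** Sends any remaining items, waits for completion, and stops the executor. */
    @Override
    public void close() {
        try {
            flush();
            awaitPending();
        } finally {
            executor.shutdown();
        }
    }
}
```

With a batch size of 3 and seven added items, this sketch sends three batches (3, 3, and 1 items) in order. The single-threaded executor keeps batches ordered. A custom queue, as mentioned in the clarifications below, would be an alternative way to get the same overlap.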

Clarifications

  • I'm not sure about the ExecutorService design. I felt that going with another custom queue would be more overhead, but it might have fit better into the existing design.
  • I fear test coverage for this code is relatively low. The change in the first commit feels quite safe, while the second one feels more dangerous. We could also decide to split them into separate Jira issues and backport the first one to LTS branches.
  • With these changes, re-indexing the whole wiki with just the flavor takes about 10 seconds. The initial queue size is 10k items. In this setup, the bottleneck seems to be adding the documents to Solr and committing (roughly half/half). Providing the data to index takes about 7 seconds of CPU time. It could be interesting to increase the batch size further, but it's not clear to me whether it is actually worth it. I've tried increasing the document limit to 10k, but it didn't change much, maybe because the character limit is actually the limiting factor. I wouldn't want to increase that limit much beyond 10M characters, to avoid excessive memory usage.
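
To make the interaction between the two batch limits concrete, here is a minimal sketch of a batcher bounded by both a document count and a character count. It is an illustration under assumptions, not the PR's implementation: the class `SizeLimitedBatcher`, the `split` method, and the rule of closing a batch before a limit would be exceeded are hypothetical, and documents are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of batching bounded by document count and total characters. */
final class SizeLimitedBatcher {
    private SizeLimitedBatcher() {
    }

    /**
     * Splits documents into batches so that no batch holds more than maxDocs
     * documents or more than maxChars characters. A single document larger
     * than maxChars still gets a batch of its own.
     */
    static List<List<String>> split(List<String> docs, int maxDocs, long maxChars) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long chars = 0;
        for (String doc : docs) {
            boolean docLimitHit = current.size() >= maxDocs;
            boolean charLimitHit = chars + doc.length() > maxChars;
            // Close the current batch before either limit would be exceeded.
            if (!current.isEmpty() && (docLimitHit || charLimitHit)) {
                batches.add(current);
                current = new ArrayList<>();
                chars = 0;
            }
            current.add(doc);
            chars += doc.length();
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

In this sketch, whichever limit is reached first closes the batch. If the character limit usually triggers first, raising only the document limit changes nothing, which would be consistent with the observation above.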

Screenshots & Video

No UI changes.

Executed Tests

LANG=C.UTF-8 mvn clean install -Pdocker,legacy,integration-tests,snapshotModules,quality -pl :xwiki-platform-search-solr-api,:xwiki-platform-search-test-docker

Manual test of re-indexing an almost empty wiki (with flavor).

Expected merging strategy

  • Prefers squash: Yes
  • Backport on branches:
    • Not clear, see above. Maybe just the first commit?

michitux added 2 commits June 4, 2025 14:35
…batch processing

* Collect indexed documents and submit them in batches to Solr for more
  efficient processing and less overhead due to repeated calls to Solr.
…batch processing

* Move all Solr client requests into a separate executor to allow the
  next batch to be prepared while the previous one is committed.
@michitux
Contributor Author

michitux commented Jun 5, 2025

@tmortagne Any comments regarding backports/splitting the changes in two parts?

@michitux michitux self-assigned this Jun 5, 2025
@tmortagne
Member

> @tmortagne Any comments regarding backports/splitting the changes in two parts?

If it was me, I would probably keep those improvements only on master, but if you feel it's safe enough, I trust you.

@michitux
Contributor Author

michitux commented Jun 6, 2025

> If it was me, I would probably keep those improvements only on master, but if you feel it's safe enough, I trust you.

As I've said, to me the first commit seems pretty safe while the second one seems a bit more dangerous. But I guess we can take the same approach as for previous performance improvements: apply them only on master, then cherry-pick them after they have been in use for a while.

@michitux michitux merged commit 6d35708 into xwiki:master Jun 6, 2025
@michitux michitux deleted the XWIKI-23239 branch June 6, 2025 08:20

2 participants