XWIKI-23239: Improve Solr indexing speed through parallelization and batch processing #4235

michitux · 2025-06-04T14:13:33Z

Jira URL

https://jira.xwiki.org/browse/XWIKI-23239

Changes

Description

Collect indexed documents and submit them in batches to Solr for more
efficient processing and less overhead due to repeated calls to Solr.
Move all Solr client requests into a separate executor to allow the
next batch to be prepared while the previous one is committed.
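
As a rough illustration of the second point (not the code in this PR), the overlap between preparing and submitting can be sketched with a single-threaded `ExecutorService`. The class name `PipelinedBatchSubmitter`, the `sink` callback standing in for "add to Solr and commit", and the fixed `batchSize` are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical sketch: hands full batches to one worker thread while the caller keeps filling the next batch. */
class PipelinedBatchSubmitter<T> implements AutoCloseable {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Consumer<List<T>> sink; // assumption: stands in for "add to Solr and commit"
    private final int batchSize;
    private List<T> current = new ArrayList<>();
    private Future<?> pending;

    PipelinedBatchSubmitter(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds one item and hands the batch over once it is full. */
    void add(T item) {
        current.add(item);
        if (current.size() >= batchSize) {
            flush();
        }
    }

    /** Waits for the previous batch, then submits the current one asynchronously. */
    void flush() {
        if (current.isEmpty()) {
            return;
        }
        awaitPending();
        List<T> batch = current;
        current = new ArrayList<>();
        pending = executor.submit(() -> sink.accept(batch));
    }

    private void awaitPending() {
        if (pending == null) {
            return;
        }
        try {
            pending.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Submitting a batch failed", e.getCause());
        } finally {
            pending = null;
        }
    }

    /** Submits the last partial batch, waits for it, and stops the worker thread. */
    @Override
    public void close() {
        try {
            flush();
            awaitPending();
        } finally {
            executor.shutdown();
        }
    }
}
```

In this sketch at most one batch is in flight at a time, which bounds memory use to roughly two batches while still letting preparation and submission overlap.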

Clarifications

I'm not sure about the ExecutorService design. I felt that adding another custom queue would be more overhead, but it might have fit better into the existing design.
I fear test coverage for this code is relatively low. The change in the first commit feels quite safe, while the second one feels more dangerous. We could also decide to split them into separate Jira issues and backport the first one to LTS branches.
With these changes, re-indexing the whole wiki with just the flavor takes about 10 seconds. The initial queue size is 10k items. In this setup, the bottleneck seems to be adding the documents to Solr and committing (roughly half the time each). Providing the data to index takes about 7 seconds of CPU time. It could be interesting to increase the batch size further, but it's not clear to me whether it is actually worth it. I've tried increasing the limit to 10k documents, but it didn't change much, maybe because the limit on the number of characters is what actually ends each batch, and I wouldn't want to increase that much beyond 10M to avoid excessive memory usage.
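
To make the interplay of the two limits concrete, here is a minimal sketch of a batcher that closes a batch when either a document-count limit or a character budget is reached. It is not the code in this PR: the class `SizeBoundedBatcher`, its method names, and the choice to close a batch before the character budget would be exceeded are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: closes a batch when the document limit or the character budget is reached. */
class SizeBoundedBatcher {
    private final int maxDocuments;
    private final long maxCharacters;
    private final List<List<String>> completed = new ArrayList<>();
    private List<String> current = new ArrayList<>();
    private long currentCharacters;

    SizeBoundedBatcher(int maxDocuments, long maxCharacters) {
        this.maxDocuments = maxDocuments;
        this.maxCharacters = maxCharacters;
    }

    /** Adds the text of one document, closing batches as limits are hit. */
    void add(String documentText) {
        // Close the current batch first if this document would exceed the character budget.
        if (!current.isEmpty() && currentCharacters + documentText.length() > maxCharacters) {
            closeBatch();
        }
        current.add(documentText);
        currentCharacters += documentText.length();
        // Close the batch as soon as the document limit is reached.
        if (current.size() >= maxDocuments) {
            closeBatch();
        }
    }

    /** Closes the last partial batch and returns all batches in order. */
    List<List<String>> finish() {
        closeBatch();
        return completed;
    }

    private void closeBatch() {
        if (current.isEmpty()) {
            return;
        }
        completed.add(current);
        current = new ArrayList<>();
        currentCharacters = 0;
    }
}
```

With large documents the character budget closes batches long before the document limit is reached, which would be consistent with raising the document limit alone having little effect.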

Screenshots & Video

No UI changes.

Executed Tests

LANG=C.UTF-8 mvn clean install -Pdocker,legacy,integration-tests,snapshotModules,quality -pl :xwiki-platform-search-solr-api,:xwiki-platform-search-test-docker

Manual test of re-indexing an almost empty wiki (with flavor).

Expected merging strategy

Prefers squash: Yes
Backport on branches:
- Not clear, see above. Maybe just the first commit?

michitux · 2025-06-05T07:44:47Z

@tmortagne Any comments regarding backports/splitting the changes in two parts?

tmortagne · 2025-06-05T10:00:58Z

> @tmortagne Any comments regarding backports/splitting the changes in two parts?

If it was me, I would probably keep those improvements only on master, but if you feel it's safe enough, I trust you.

michitux · 2025-06-06T08:18:22Z

> If it was me, I would probably keep those improvements only on master, but if you feel it's safe enough, I trust you.

As I've said, to me the first commit seems pretty safe while the second one seems a bit more dangerous. But I guess we can take the same approach as for previous performance improvements: apply them only on master and then cherry-pick them after they have been in use for a while.

michitux added 2 commits June 4, 2025 14:35

XWIKI-23239: Improve Solr indexing speed through parallelization and …

488ff7d

…batch processing * Collect indexed documents and submit them in batches to Solr for more efficient processing and less overhead due to repeated calls to Solr.

XWIKI-23239: Improve Solr indexing speed through parallelization and …

b12297a

…batch processing * Move all Solr client requests into a separate executor to allow the next batch to be prepared while the previous one is committed.

tmortagne approved these changes Jun 5, 2025

michitux self-assigned this Jun 5, 2025

michitux merged commit 6d35708 into xwiki:master Jun 6, 2025

michitux deleted the XWIKI-23239 branch June 6, 2025 08:20
