
SOLR-17711 index fetcher doesn't need timeout #3356


Open: wants to merge 3 commits into main

Conversation

kotman12 (Contributor)

https://issues.apache.org/jira/browse/SOLR-17711

Description

A default total request timeout is now applied in IndexFetcher as a result of the Jetty HTTP/2 client migration. We have already run into problems with this when replicating a large number of shards on a single host. Because shards are downloaded in parallel, we sometimes hit the machine's bandwidth limit, so the requests for each shard's next segment compete with one another and can take more than 120 seconds. That number matters because 120 seconds is the default idle timeout, and ever since this change the idle timeout doubles as the total request timeout in the HTTP/2 client. I don't think that is a good idea in the case of IndexFetcher, hence this change.
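To illustrate the failure mode with a small, hypothetical simulation (plain Java on a virtual clock; this is not Solr or Jetty code, and all names are invented for illustration): a transfer that keeps receiving segment chunks never trips the idle timeout, but a total request timeout set equal to that idle timeout kills it once the whole download exceeds 120 seconds.

```java
// Hypothetical simulation of the two timeout kinds, on a virtual clock in seconds.
public class TimeoutDemo {

    /**
     * @param totalSeconds         how long the whole transfer takes
     * @param chunkIntervalSeconds how often a chunk of data arrives
     * @param totalTimeout         overall request deadline, or null for none
     * @param idleTimeout          maximum gap allowed between chunks
     * @return true if the transfer completes without either timeout firing
     */
    public static boolean transferSucceeds(int totalSeconds, int chunkIntervalSeconds,
                                           Integer totalTimeout, int idleTimeout) {
        int lastChunkAt = 0;
        for (int now = chunkIntervalSeconds; now <= totalSeconds; now += chunkIntervalSeconds) {
            if (now - lastChunkAt > idleTimeout) {
                return false; // connection went idle for too long
            }
            if (totalTimeout != null && now > totalTimeout) {
                return false; // overall deadline exceeded, even though data kept flowing
            }
            lastChunkAt = now; // a chunk arrived, resetting the idle timer
        }
        return true;
    }

    public static void main(String[] args) {
        // 300 s transfer, one chunk every 10 s, 120 s idle timeout, no total timeout:
        System.out.println(transferSucceeds(300, 10, null, 120)); // true
        // Same transfer once the 120 s idle timeout also acts as the total timeout:
        System.out.println(transferSucceeds(300, 10, 120, 120)); // false
    }
}
```

The first case is the pre-migration behavior this PR restores; the second is the post-migration behavior described above.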

Solution

Revert to the pre-Http2SolrClient behavior of no timeout in IndexFetcher.

Tests

I'm not aware of an easy way to test this reasonably; it would effectively require accelerating the clock.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.

@dsmiley dsmiley self-requested a review May 18, 2025 00:29

@dsmiley dsmiley left a comment


Looks good, thanks.
Can you please add to CHANGES.txt in the improvement section?

@kotman12

@dsmiley Just saw your message to the mailing list, thanks. So it looks like what I propose here is incidentally a no-op as it currently stands? Should I just wait for your PR, then? It looks like I'd have to make this change to the recoveryOnlyClient, which may or may not be the right thing to do, but the question might be moot if we apply your planned changes.

kotman12 commented May 27, 2025

> Looks good, thanks. Can you please add to CHANGES.txt in the improvement section?

@dsmiley I updated CHANGES.txt under 10, but I'm wondering whether this should go out in 9.9. I'd argue this is somewhere between a bug fix and an optimization, since the request timeouts were very likely added by accident.

I've also moved the override to the recovery client in UpdateShardHandler. Maybe that is overkill, but, again, I want to stress that until we moved to Jetty's HttpClient there was no request timeout on this at all.


dsmiley commented May 27, 2025

Until the underlying timeout propagation is fixed, this PR is in limbo. I view this as a bug, not an optimization, since the user experience is exactly that of a bug: for users with large shards, Solr worked and then breaks after an upgrade. Let's try to get this fixed in 9.9 somehow. It's tempting to switch back to Apache HttpClient there, in the interest of stability and confidence. WDYT? Then, in parallel, we can improve the Jetty-based Http2SolrClient without feeling rushed for 9.9.

CC @iamsanjay


kotman12 commented May 27, 2025

@dsmiley Is the request timeout even at the HttpClient level? I didn't think it was, so I don't think we have any propagation to worry about in this particular case (I believe that applies to connect and idle timeouts). Also, I moved the setting of the request timeout to the "root" HttpClient (recoveryOnlyClient) anyway.

Also, I realized I meant improvement (as you originally suggested), not optimization, but the point about this being a bug probably still stands.


dsmiley commented May 29, 2025

I took a quick look at Jetty HttpClient and there's no request timeout there; it's on the Jetty Request.
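For context, a hedged sketch of where these knobs live in the Jetty client API (class and method names per Jetty 9.x/10.x; the URL and wiring are invented for illustration, this is not Solr's actual code): the idle timeout belongs to the shared HttpClient, while the total timeout is set, if at all, on each Request.

```java
// Jetty 9.x/10.x client API sketch (imports from org.eclipse.jetty.client omitted).
HttpClient client = new HttpClient();
client.setIdleTimeout(120_000);  // idle timeout in ms, shared by all requests on this client
client.start();

Request request = client.newRequest("http://leader.example:8983/solr");  // URL is illustrative
// request.timeout(120, TimeUnit.SECONDS);  // total deadline; leaving it unset means no overall timeout
ContentResponse response = request.send();
```

So a per-request total timeout only exists where the caller explicitly sets one on the Request, which is why the fix can be localized to how IndexFetcher's client builds its requests.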
