Add back concurrency semaphore #593
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage diff (master vs. #593):

| | master | #593 | +/- |
|---|---|---|---|
| Coverage | 96.20% | 96.22% | +0.02% |
| Files | 29 | 29 | |
| Lines | 1870 | 1880 | +10 |
| Hits | 1799 | 1809 | +10 |
| Misses | 71 | 71 | |
Pull Request Overview
This PR reintroduces concurrency semaphores to limit the number of simultaneous S3 connections and reduces connection overhead by reusing metadata and leveraging lower‐level API calls.
- Replaced boto3 resource calls with client and paginators to list bucket objects.
- Updated download_file function signatures throughout the code to include an expected_size parameter.
- Adjusted semaphore usage and TransferConfig settings to enforce connection limits (see the sketch after this list).
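As a rough sketch of how these pieces might fit together (the names, the plain boto3 client standing in for the adapter's real S3 client, and the default numbers are assumptions, not the PR's actual code):

```python
import asyncio
from typing import Optional

from boto3.s3.transfer import TransferConfig

_sem = asyncio.Semaphore(10)                         # assumed global cap on in-flight downloads
_transferconfig = TransferConfig(max_concurrency=5)  # streams within a single download

async def download_file(s3_client, bucket: str, key: str, path: str,
                        expected_size: Optional[int] = None) -> None:
    # expected_size is threaded through so metadata the caller already has can be
    # reused instead of issuing another stat/HEAD request; this sketch only accepts it.
    async with _sem:
        # Hand the blocking boto3 call to a worker thread; an async S3 client
        # would await its download method directly.
        await asyncio.to_thread(
            s3_client.download_file, bucket, key, path, Config=_transferconfig
        )
```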
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_servicex_dataset.py | Updated AsyncMock lambda signature to include expected_size. |
| tests/test_minio_adapter.py | Added a new test verifying download_file behavior with expected_size. |
| servicex/query_core.py | Updated download_file call to pass expected_size and adjusted get_signed_url accordingly. |
| servicex/minio_adapter.py | Introduced semaphores to limit concurrent S3 requests and switched to a client paginator for list_bucket. |
Comments suppressed due to low confidence (1)
servicex/minio_adapter.py:48
- Consider updating _transferconfig alongside the semaphore, or clarifying in the documentation that _transferconfig remains fixed at max_concurrency=5, to avoid confusion between the semaphore and TransferConfig limits.
global _sem; _sem = asyncio.Semaphore(concurrency)
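For context, a sketch of what that suggestion could look like if both settings live at module scope (the helper name and parameters are hypothetical, not the PR's code):

```python
import asyncio
from boto3.s3.transfer import TransferConfig

_sem = asyncio.Semaphore(10)
_transferconfig = TransferConfig(max_concurrency=5)

def init_s3_limits(concurrency: int, streams_per_download: int = 5) -> None:
    # Hypothetical helper: update the download semaphore and the per-download
    # TransferConfig together so the two limits cannot silently drift apart.
    global _sem, _transferconfig
    _sem = asyncio.Semaphore(concurrency)
    _transferconfig = TransferConfig(max_concurrency=streams_per_download)
```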
@gordonwatts I think this may be behind some of the stuttering you were seeing with the progress bars: a large number of simultaneous downloads could delay the status update requests for a very long time.
I'm not an expert on the low-level boto workings. But it looks to me like:
- Up to 5 simultaneous bucket listings
- Up to 10 simultaneous downloads
- Up to 5 streams per download.
Is that right? Could you put the default config in the PR description, just so that people going back have an idea.
I'm eager to see how this works under various circumstances. I need to try this out again on a plane to see if it is robust against broken connections. :-)
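For reference, a sketch of how the listing side of those limits might be wired, with a separate semaphore for bucket listings and a paginator instead of the resource API (names are illustrative and the numbers are the ones quoted above, not verified against the final code):

```python
import asyncio

_list_sem = asyncio.Semaphore(5)   # assumed cap on simultaneous bucket listings

async def list_bucket(s3_client, bucket: str) -> list:
    async with _list_sem:
        keys = []
        paginator = s3_client.get_paginator("list_objects_v2")

        def _collect() -> None:
            # A plain boto3 paginator is synchronous, so collect pages in a
            # worker thread; an async S3 client would iterate them directly.
            for page in paginator.paginate(Bucket=bucket):
                keys.extend(obj["Key"] for obj in page.get("Contents", []))

        await asyncio.to_thread(_collect)
        return keys
```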
Welcome back, semaphores... we missed you!
I think these important performance settings need to be highlighted a bit better with comments.
Thanks!
How did all of this RNtuple code get in here?
@BenGalewsky branch merge, I'm afraid.
Force-pushed from d8e7c85 to deacb44
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Force-pushed from 3d1fe44 to 5cb8f4d
I had misunderstood the concurrency limit in the boto3 configuration as a global limit, but it only limits the number of parallel streams used by each individual download request. Therefore we could still be spawning a huge number of simultaneous S3 connections.
This PR puts semaphores back to limit the number of distinct simultaneous connections. It also tries to reduce the total number of connections made:
The new magic numbers are
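To illustrate the distinction described above (names and numbers here are illustrative, not the adapter's actual code): TransferConfig(max_concurrency=5) only caps the streams inside one transfer, so launching many downloads at once can still open roughly downloads × 5 connections unless a shared semaphore gates them.

```python
import asyncio
from boto3.s3.transfer import TransferConfig

config = TransferConfig(max_concurrency=5)   # per-transfer limit only

async def fetch_all(s3_client, bucket, keys, out_dir, limit=10):
    sem = asyncio.Semaphore(limit)           # the global cap this PR restores

    async def fetch_one(key):
        async with sem:
            # Without the semaphore, every key would download simultaneously,
            # each transfer using up to 5 streams of its own.
            await asyncio.to_thread(
                s3_client.download_file, bucket, key, f"{out_dir}/{key}",
                Config=config,
            )

    await asyncio.gather(*(fetch_one(k) for k in keys))
```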