Skip to content

Add back concurrency semaphore #593

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

ponyisi
Copy link
Collaborator

@ponyisi ponyisi commented May 17, 2025

I had misunderstood the concurrency limit in the boto3 configuration as a global limit, but it is only a limit on the number of parallel downloads for each individual download request. Therefore we could still be spawning a huge number of simultaneous S3 connections.

This PR puts semaphores back to limit the number of distinct simultaneous connections. It also tries to reduce the total number of connections made:

  • it reuses metadata obtained in the bucket listing,
  • it uses a lower level part of the library that means that fewer requests are made.

The new magic numbers are

  • maximum 5 simultaneous bucket listings
  • maximum 10 distinct files being downloaded simultaneously (configurable)
  • maximum 5 download streams per file (internal boto configuration)

Copy link

codecov bot commented May 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.22%. Comparing base (9c1849d) to head (5cb8f4d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #593      +/-   ##
==========================================
+ Coverage   96.20%   96.22%   +0.02%     
==========================================
  Files          29       29              
  Lines        1870     1880      +10     
==========================================
+ Hits         1799     1809      +10     
  Misses         71       71              
Flag Coverage Δ
unittests 96.22% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ponyisi ponyisi requested a review from Copilot May 17, 2025 04:42
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR reintroduces concurrency semaphores to limit the number of simultaneous S3 connections and reduces connection overhead by reusing metadata and leveraging lower‐level API calls.

  • Replaced boto3 resource calls with client and paginators to list bucket objects.
  • Updated download_file function signatures throughout the code to include an expected_size parameter.
  • Adjusted semaphore usage and TransferConfig settings to enforce connection limits.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
tests/test_servicex_dataset.py Updated AsyncMock lambda signature to include expected_size.
tests/test_minio_adapter.py Added a new test verifying download_file behavior with expected_size.
servicex/query_core.py Updated download_file call to pass expected_size and adjusted get_signed_url accordingly.
servicex/minio_adapter.py Introduced semaphores to limit concurrent S3 requests and switched to client-paginator for list_bucket.
Comments suppressed due to low confidence (1)

servicex/minio_adapter.py:48

  • Consider updating _transferconfig concurrently or clarifying in the documentation that _transferconfig remains fixed at max_concurrency=5 for optimized connection usage, to avoid confusion between semaphore and TransferConfig limits.
global _sem; _sem = asyncio.Semaphore(concurrency)

@ponyisi
Copy link
Collaborator Author

ponyisi commented May 17, 2025

@gordonwatts I think this may be behind some of the stuttering you were seeing with the progress bars: a large number of simultaneous downloads could delay the status update requests for a very long time.

Copy link
Collaborator

@gordonwatts gordonwatts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert of the low-level boto workings. But it looks to me like:

  • Up to 5 simultaneous bucket listings
  • Up to 10 simultaneous downloads
  • Up to 5 streams per download.

Is that right? Could you put the default config in the PR's text - just so that people going back have an idea.

I'm eager to see how this woks under various circumstances. I need to try this out again on a plane to see if it is robust against broken connections. :-)

Copy link
Contributor

@BenGalewsky BenGalewsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome back, semaphores.. we missed you!
I think these important performance settings need to be highlighted with comments a bit better

@ponyisi ponyisi requested a review from BenGalewsky May 19, 2025 17:07
Copy link
Collaborator

@gordonwatts gordonwatts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@BenGalewsky
Copy link
Contributor

How did all of this RNtuple code get in here.?

@ponyisi
Copy link
Collaborator Author

ponyisi commented May 20, 2025

@BenGalewsky branch merge I'm afraid.

@ponyisi ponyisi force-pushed the download-concurrency branch from d8e7c85 to deacb44 Compare May 22, 2025 17:27
@ponyisi ponyisi force-pushed the download-concurrency branch from 3d1fe44 to 5cb8f4d Compare May 22, 2025 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants