remove s3 bucket polling when waiting for transformation results #587

Merged
ponyisi merged 46 commits into master from remove-s3-bucket-polling
Jun 25, 2025

Conversation

MattShirley
Collaborator

Client side work for ssl-hep/ServiceX#1049

codecov bot commented May 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.50%. Comparing base (eb0ff9c) to head (7ddfc60).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #587      +/-   ##
==========================================
+ Coverage   96.35%   96.50%   +0.14%     
==========================================
  Files          29       29              
  Lines        1892     1973      +81     
==========================================
+ Hits         1823     1904      +81     
  Misses         69       69              
Flag | Coverage Δ
unittests | 96.50% <100.00%> (+0.14%) ⬆️

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Client-side removal of Minio bucket polling in favor of ServiceX’s direct results API, with corresponding adapter support and test updates.

  • Added get_transformation_results method to the ServiceX adapter and wired it into download logic.
  • Updated core query logic (download_files) to pass a begin_at timestamp and call the new API instead of polling Minio.
  • Refactored tests to mock servicex.get_transformation_results and added unit tests covering its success and error responses.
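The flow summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `FakeAdapter`, `fetch_new_files`, the dict shape of a result, and the `created_at` key are all hypothetical. Only the names `get_transformation_results` and `begin_at` come from the overview.

```python
import asyncio
import datetime
from typing import Optional


class FakeAdapter:
    """Hypothetical stand-in for the ServiceX adapter (not the real class)."""

    def __init__(self, results):
        self._results = results

    async def get_transformation_results(
        self, request_id: str, begin_at: Optional[datetime.datetime] = None
    ):
        # Return only results created after begin_at (all results if None).
        return [
            r for r in self._results
            if begin_at is None or r["created_at"] > begin_at
        ]


async def fetch_new_files(adapter, request_id, begin_at, files_seen):
    """Ask the results API for new files instead of listing the bucket."""
    results = await adapter.get_transformation_results(request_id, begin_at)
    new = [r for r in results if r["file-path"] not in files_seen]
    files_seen.update(r["file-path"] for r in new)
    return new
```

The caller keeps a `files_seen` set, so a result returned twice by overlapping windows is only downloaded once.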

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File | Description
tests/test_servicex_dataset.py | Replaced mock_minio.list_bucket with servicex.get_transformation_results mocks and reset calls
tests/test_servicex_adapter.py | Imported datetime and added async tests for get_transformation_results handling 200/403/404/500 statuses
tests/test_dataset.py | Passed new begin_at argument to download_files and mocked the ServiceX results API
servicex/servicex_adapter.py | Implemented get_transformation_results method with status-code checks
servicex/query_core.py | Updated transform_complete and download_files to accept and forward begin_at and call the new API
Comments suppressed due to low confidence (1)

tests/test_servicex_dataset.py:209

  • The string literal for request_id has mismatched quotes and includes an extra trailing double quote; this will likely cause the test to fail. Use consistent quoting, for example "123-456-789".
servicex.submit_transform.return_value = {"request_id": '123-456-789"'}
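
As a side note, the flagged literal is syntactically valid Python: the single-quoted string simply contains a trailing double-quote character, so the stored ID differs from the intended one. A quick check, written here only as an illustration:

```python
flagged = {"request_id": '123-456-789"'}
intended = {"request_id": "123-456-789"}

# The flagged value carries one extra character, the trailing double quote.
print(flagged["request_id"] == intended["request_id"])  # False
print(len(flagged["request_id"]), len(intended["request_id"]))  # 12 11
```

Whether a test fails because of this depends on whether it compares the request ID against the clean value.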

for file in files:
    if file.filename not in files_seen:
        file_path = file.get("file-path", "").replace("/", ":")
Collaborator

This relies on being kept in line with the transformer sidecar, and I think it may fail in certain cases (like the parquet output format is selected). Should we add a new field in the transform_result table that stores the object_name as uploaded to S3 (as determined by https://github.com/ssl-hep/ServiceX/blob/1991e6b2ea00dcbd8cdb9b9ed32fd44049f0dea3/transformer_sidecar/src/transformer_sidecar/transformer.py#L349 etc.) ?
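
One way to picture the suggestion is a lookup that prefers a server-stored name and only falls back to the client-side guess. This is a sketch under assumptions: the `object_name` key is the hypothetical new field proposed in this comment, the result is modeled as a plain dict, and `bucket_object_name` is an illustrative helper, not part of the client.

```python
def bucket_object_name(result: dict) -> str:
    """Pick the S3 object name for a transform result.

    Assumption: "object_name" is a hypothetical field the server would
    populate with the name the sidecar actually uploaded.
    """
    stored = result.get("object_name")
    if stored:
        return stored
    # Fallback mirrors the client-side guess shown in the diff above.
    return result.get("file-path", "").replace("/", ":")
```

With a stored name, the client no longer has to replicate the sidecar's naming rules for each output format.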

@@ -557,15 +566,22 @@ async def get_signed_url(
if self.minio:
    # if self.minio exists, self.current_status will too
    if self.current_status.files_completed > len(files_seen):
        files = await self.minio.list_bucket()
        new_begin_at = datetime.datetime.now(tz=datetime.timezone.utc)
Contributor

This is relying on synchronization of clocks between the client and the server. Wouldn't it be better to set new_begin_at to be the latest result timestamp we see in the transform_results?
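
A minimal sketch of that alternative, assuming each result carries a server-generated timestamp. The `created_at` key and the `next_begin_at` helper are hypothetical names used only for illustration:

```python
import datetime
from typing import Iterable, Optional


def next_begin_at(
    results: Iterable[dict], previous: Optional[datetime.datetime]
) -> Optional[datetime.datetime]:
    """Advance the cursor using only server-side timestamps."""
    stamps = [r["created_at"] for r in results]
    if not stamps:
        return previous  # nothing new; keep the old cursor
    latest = max(stamps)
    return latest if previous is None else max(previous, latest)
```

Because the cursor is derived only from values the server produced, a skewed client clock cannot cause results to be skipped.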

@@ -551,21 +557,27 @@ async def get_signed_url(
    if progress:
        progress.advance(task_id=download_progress, task_type="Download")

later_than: datetime.datetime | None = None
Contributor

Why not initialize later_than to datetime.min to avoid all of the extra if later_than is None or constructs?
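
One detail worth keeping in mind with this idea: `datetime.datetime.min` is timezone-naive, and ordering comparisons between naive and aware datetimes raise `TypeError`. A small sketch, assuming the timestamps in play are UTC-aware (`FLOOR` and `seen` are illustrative names):

```python
import datetime

# datetime.datetime.min is naive; attach UTC so it compares with aware values.
FLOOR = datetime.datetime.min.replace(tzinfo=datetime.timezone.utc)

later_than = FLOOR
seen = datetime.datetime(2025, 6, 25, tzinfo=datetime.timezone.utc)
later_than = max(later_than, seen)  # no "is None" check needed

try:
    datetime.datetime.min < seen  # naive vs aware
    naive_comparison_ok = True
except TypeError:
    naive_comparison_ok = False
```

Whether the server accepts such an early timestamp as a query parameter is not something this thread confirms, so the sentinel may need to be omitted from the request itself.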

@MattShirley
Collaborator Author

Added the feature flag and updated tests. All work here is done and ready for final review.

Contributor

@BenGalewsky BenGalewsky left a comment

I assume this has been tested with a serviceX deployment that doesn't yet have the capabilities property in the info?

@MattShirley
Collaborator Author

> I assume this has been tested with a serviceX deployment that doesn't yet have the capabilities property in the info?

Yes, I've tested this against ServiceX's develop branch locally (without the feature flag enabled). There's also parameterized testing added in this PR to test both when it is and isn't present.
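
The backward-compatibility behavior discussed here can be sketched as a small gate. The capability string below is a hypothetical placeholder, since the thread does not state the real one, and `use_results_api` is an illustrative helper rather than the client's actual function:

```python
from typing import List, Optional

# Hypothetical capability name; the thread does not state the real string.
RESULTS_API = "poll_results_api"


def use_results_api(capabilities: Optional[List[str]]) -> bool:
    """True when the server advertises the results API.

    Older deployments may omit the capabilities property entirely, so a
    missing value is treated the same as an empty list.
    """
    return RESULTS_API in (capabilities or [])
```

When this returns False, the client would keep the existing bucket-polling path, which is the case exercised against a deployment without the flag.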

Collaborator

@ponyisi ponyisi left a comment

Looks pretty good, see the few comments.

    return self._servicex_info

async def get_servicex_capabilities(self) -> List[str]:
    return (await self.get_servicex_info()).capabilities
Collaborator

Add a test for this line?

Collaborator Author

Added
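
For readers following along, a test for that line could look roughly like the sketch below. It is not the test added in this PR: `AdapterSketch` is a hypothetical stand-in that reproduces only the method under discussion, and the info object is faked with `SimpleNamespace`.

```python
import asyncio
from types import SimpleNamespace
from typing import List
from unittest.mock import AsyncMock


class AdapterSketch:
    """Hypothetical stand-in holding only the method under discussion."""

    async def get_servicex_info(self):
        raise NotImplementedError  # replaced by a mock in the test

    async def get_servicex_capabilities(self) -> List[str]:
        return (await self.get_servicex_info()).capabilities


def test_get_servicex_capabilities():
    adapter = AdapterSketch()
    # Fake the info call so no server is needed.
    adapter.get_servicex_info = AsyncMock(
        return_value=SimpleNamespace(capabilities=["a", "b"])
    )
    assert asyncio.run(adapter.get_servicex_capabilities()) == ["a", "b"]
    adapter.get_servicex_info.assert_awaited_once()
```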

@ponyisi ponyisi merged commit d9511c9 into master Jun 25, 2025
35 checks passed
@ponyisi ponyisi deleted the remove-s3-bucket-polling branch June 25, 2025 23:32
3 participants