Contributor

@davidangb davidangb commented Oct 9, 2024

Introduces a new API at

POST /api/workspaces/{workspaceNamespace}/{workspaceName}/entities/{entityType}/paired-tsv

This API will:

  1. List the files in the workspace's bucket, filtered to a given bucket prefix
  2. Attempt to pair those files together based on Illumina paired-end file naming conventions as well as other well-known naming conventions supplied by Product
  3. Generate and download a TSV containing the results of those file pairings
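
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes only the standard Illumina pattern (`SampleName_S1_L001_R1_001.fastq.gz`), and the names `PairingSketch`, `Pair`, `pairFiles`, `toTsv`, and the `read1`/`read2` column headers are hypothetical. The other naming conventions supplied by Product are not covered.

```scala
object PairingSketch {
  // Illumina-style file names look like SampleName_S1_L001_R1_001.fastq.gz.
  // Group 1 captures everything before the read token, group 2 the read number.
  private val ReadToken = """^(.*)_R([12])_\d{3}\.fastq(?:\.gz)?$""".r

  // Hypothetical result type: one row per sample, either read may be missing.
  final case class Pair(id: String, read1: Option[String], read2: Option[String])

  // Pair bucket object paths by the text preceding the _R1/_R2 token.
  // Paths whose file name does not match the pattern are ignored.
  def pairFiles(paths: List[String]): List[Pair] = {
    val parsed = paths.flatMap { path =>
      val fileName = path.substring(path.lastIndexOf('/') + 1)
      fileName match {
        case ReadToken(prefix, read) => Some((prefix, read, path))
        case _                       => None
      }
    }
    parsed.groupBy(_._1).toList.sortBy(_._1).map { case (id, hits) =>
      Pair(
        id,
        hits.collectFirst { case (_, "1", p) => p },
        hits.collectFirst { case (_, "2", p) => p }
      )
    }
  }

  // Render the pairs as a TSV; the column names here are assumptions.
  def toTsv(entityType: String, pairs: List[Pair]): String = {
    val header = s"entity:${entityType}_id\tread1\tread2"
    val rows = pairs.map { p =>
      s"${p.id}\t${p.read1.getOrElse("")}\t${p.read2.getOrElse("")}"
    }
    (header :: rows).mkString("\n")
  }
}
```

Grouping on the shared prefix keeps the matching step to a single pass plus a sort, which is in line with the sub-2-second timing reported for ~100,000 files, though the real algorithm may work differently.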

The driving use case for this API is the "Data Uploader" in Terra UI, though scripters and notebook users may also want to call the API directly.

I tested this locally against a bucket of ~100,000 files; the file-matching portion of the algorithm executes in under 2 seconds. The end result is a 30 MB TSV, so the API is slow overall, but that size is unavoidable at this scale.

@davidangb davidangb changed the title from "POC: file-matching for data uploader" to "CORE-123: new file-pairing API for data uploader" Nov 1, 2024
@davidangb davidangb marked this pull request as ready for review November 1, 2024 20:29
Contributor

@kevinmarete kevinmarete left a comment

This looks good to me. I just added a comment on implementing a paging solution to handle a large number of files.
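
One possible shape for such a paging approach is sketched below. It is an assumption-laden illustration rather than anything in this PR: `PagingSketch`, `Page`, `listAll`, and the `fetchPage` callback are hypothetical names, and the only borrowed idea is the page-token style of listing that Google Cloud Storage uses.

```scala
object PagingSketch {
  // Hypothetical page shape: a batch of object names plus an optional token
  // pointing at the next page, as in page-token style listing APIs.
  final case class Page(names: List[String], nextPageToken: Option[String])

  // Lazily walk every page. fetchPage is a stand-in for the real listing call;
  // it receives None for the first page and Some(token) afterwards.
  def listAll(fetchPage: Option[String] => Page): Iterator[String] =
    Iterator
      .unfold[List[String], Option[Option[String]]](Some(None)) {
        case None => None // previous page had no token: stop
        case Some(token) =>
          val page = fetchPage(token)
          Some((page.names, page.nextPageToken.map(Some(_))))
      }
      .flatten
}
```

Because the result is an `Iterator`, a caller can stop early (for example once a file-count limit is reached) without fetching the remaining pages.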

@davidangb davidangb requested a review from dvoet November 4, 2024 19:34
val fileList: List[GcsObjectName] =
  googleServicesDao.listBucket(workspaceBucket, Option(matchingOptions.prefix), recursive)

logger.info(s"found ${fileList.length} files")
Contributor

Do we want to place a limit on this based on the number of files returned?

Contributor Author

added in 9505b6e
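
The contents of 9505b6e are not shown in this conversation, so the following is only a sketch of what a guard on the number of listed files might look like. The names `FileLimitSketch`, `TooManyFiles`, and `enforceLimit`, and the idea of returning an `Either`, are assumptions made for illustration.

```scala
object FileLimitSketch {
  // Hypothetical error value describing why the request was rejected.
  final case class TooManyFiles(found: Int, limit: Int)

  // Reject the listing when it holds more than `limit` entries;
  // otherwise pass it through unchanged for the pairing step.
  def enforceLimit[A](files: List[A], limit: Int): Either[TooManyFiles, List[A]] =
    if (files.lengthCompare(limit) > 0) Left(TooManyFiles(files.length, limit))
    else Right(files)
}
```

Checking the count right after the bucket listing, before any pairing work, would let the API fail fast instead of producing a very large TSV.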

@davidangb davidangb requested a review from dvoet November 5, 2024 14:55
@davidangb davidangb merged commit 611acf8 into develop Nov 5, 2024
13 checks passed
@davidangb davidangb deleted the da_AJ-2025_fileMatchingPOC branch November 5, 2024 15:11

3 participants