Local Pipeline api and worker #876
Conversation
```python
raise web.HTTPInternalServerError(reason=error_reason) from e


async def loading_pipeline_enqueue(request: web.Request) -> web.Response:
```
The idea here is for this process to accept requests and manage a queue of length one (just a file on the filesystem). It did not seem feasible for this process itself to run the luigi tasks, since they are likely to take many hours.
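A minimal sketch of how a length-one, file-backed queue could gate the endpoint, assuming aiohttp and a hypothetical queue-file path (illustrative only, not the PR's actual handler):

```python
import aiofiles
import aiofiles.os
from aiohttp import web

QUEUE_FILE = '/tmp/loading_pipeline_request.json'  # hypothetical path

async def loading_pipeline_enqueue(request: web.Request) -> web.Response:
    # the presence of the file *is* the queue of length one
    if await aiofiles.os.path.exists(QUEUE_FILE):
        raise web.HTTPConflict(reason='a pipeline run is already queued')
    async with aiofiles.open(QUEUE_FILE, 'w') as f:
        await f.write(await request.text())
    return web.json_response({'queued': True}, status=202)
```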
```diff
@@ -5,3 +5,5 @@ luigi>=3.4.0
 gnomad==0.6.4
 google-cloud-storage>=2.14.0
 google-cloud-secret-manager>=2.20.0
+aiofiles==24.1.0
+pydantic==2.8.2
```
`aiofiles` for async-aware file io and `pydantic` to help with the POST body validation/typing.
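For example, the POST body validation might look roughly like this (`projects_to_run` and the `lpr` variable appear in the diffs below; the model name, the other field, and the helper are assumptions):

```python
import pydantic
from aiohttp import web

class LoadingPipelineRequest(pydantic.BaseModel):
    callset_path: str           # hypothetical field
    projects_to_run: list[str]  # referenced in the diffs below

async def parse_body(request: web.Request) -> LoadingPipelineRequest:
    try:
        return LoadingPipelineRequest.model_validate(await request.json())
    except pydantic.ValidationError as e:
        raise web.HTTPBadRequest(reason=str(e)) from e
```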
```python
logger = get_logger(__name__)


def main():
```
The idea is for this process to run as a sidecar in the same pod, basically looping indefinitely until a request shows up, then blocking on doing the work, then releasing the request once it succeeds or finishes.
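A sketch of that loop, reusing the hypothetical queue file and `LoadingPipelineRequest` model from the sketches above; the polling interval and the `run_loading_pipeline` helper are also assumptions:

```python
import os
import time

QUEUE_FILE = '/tmp/loading_pipeline_request.json'  # hypothetical path

def main():
    while True:
        if not os.path.exists(QUEUE_FILE):
            time.sleep(5)  # assumed polling interval
            continue
        with open(QUEUE_FILE) as f:
            lpr = LoadingPipelineRequest.model_validate_json(f.read())
        run_loading_pipeline(lpr)  # hypothetical: builds and runs the luigi tasks
        os.remove(QUEUE_FILE)      # release the queue slot only after success
```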
```python
        for i in range(len(lpr.projects_to_run))
    ],
]
luigi.build(tasks)
```
There are two options for scheduling here:

- Run with `--local-scheduler`, which means this process will handle scheduling itself. This is what we're doing now in dataproc and it works fine.
- Stand up a second process in the pod running the luigi central scheduler (literally just the `luigid` command), which will also expose the luigi UI. The centralized scheduler would be doing more for us if we had multiple workers, but with only one it isn't strictly necessary.

My preference is to at least try to get the centralized scheduler working (see the sketch below), since it appears lightweight and the UI is passable. It was super easy to get working locally (I went down the rabbit hole for about an hour last week when trying to run tasks inside of aiohttp and came up pretty quickly). The main unknown is how to expose the `luigid` endpoint in helm correctly, but I think that is at most a day of futzing.
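For reference, the two modes differ only in how `luigi.build` is invoked (a sketch; the host and port are assumptions, 8082 being `luigid`'s default):

```python
import luigi

# option 1: in-process scheduling, as on dataproc today
luigi.build(tasks, local_scheduler=True)

# option 2: a sidecar in the pod runs `luigid`, which also serves the UI,
# and this worker points at it
luigi.build(tasks, scheduler_host='localhost', scheduler_port=8082)
```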
Agreed that if we can get this done in under about a week it would be worth it; having a UI exposed to users so they can see how pipeline runs are progressing would be a very valuable feature.
```diff
@@ -13,7 +13,7 @@
 )


-class CachedReferenceDatasetQuery(Enum):
+class CachedReferenceDatasetQuery(str, Enum):
```
`StrEnum` is available in python3.11 only, so we use the `str` mixin here instead. It's necessary for the json serialization in the endpoint.
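A quick illustration of why the mixin matters for `json.dumps` (the member shown is illustrative):

```python
import json
from enum import Enum

class CachedReferenceDatasetQuery(str, Enum):
    GNOMAD_QC = 'gnomad_qc'  # illustrative member

json.dumps({'query': CachedReferenceDatasetQuery.GNOMAD_QC})
# -> '{"query": "gnomad_qc"}'
# without the str mixin, json.dumps raises:
# TypeError: Object of type CachedReferenceDatasetQuery is not JSON serializable
```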
```python
    k: v for k, v in lpr.model_dump().items() if k != 'projects_to_run'
}
tasks = [
    UpdateCachedReferenceDatasetQueries(
```
These are the three tasks necessary to get everything to run correctly. They do not have any dependency structure between them, though they share upstream dependencies (like `UpdatedReferenceDataset` or `WriteImportedCallset`).

The blockers to condensing this to a single task are mostly around how we have split tasks apart in airflow. We have a `for project_guid in projects_to_run` loop there so that each project task can be viewed in isolation.
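A hedged sketch of the resulting task list shape; the two task names marked as stand-ins (and the `project_i` parameter) are not the PR's actual identifiers:

```python
kwargs = {
    k: v for k, v in lpr.model_dump().items() if k != 'projects_to_run'
}
tasks = [
    UpdateCachedReferenceDatasetQueries(**kwargs),
    UpdateAnnotationsTask(**kwargs),              # stand-in name
    *(
        UpdateProjectTask(project_i=i, **kwargs)  # stand-in name
        for i in range(len(lpr.projects_to_run))
    ),
]
luigi.build(tasks)
```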
Exciting stuff!
```python
return web.json_response(
    {'Successfully queued': lpr.model_dump()},
    status=web_exceptions.HTTPAccepted.status_code,
)


async def status(_: web.Request) -> web.Response:
    return web.json_response({'success': True})


async def init_web_app():
```
Why use an async web server here as opposed to a normal synchronous one?
Good question! This is basically a consistency choice: this web server comes embedded in the hail docker image and is what the hail search service uses. It is, however, overkill, especially if it requires aiofiles just to touch the filesystem 🤷