[25.0] Improve performance of job cache query #20319
It's fast but broken 😆 😅
Delete
Once it actually works! I'm afraid this only works if all outputs are deleted.
I think this should be working now. This is now running the any outputs deleted logic against the job ids returned by the inner query. Speedup is about the same. This is the SQL (from the

    SELECT
        filtered_jobs_subquery.job_id,
        filtered_jobs_subquery.input1_4,
        filtered_jobs_subquery."queries_0|input2_5"
    FROM
        (
            SELECT
                job.id AS job_id,
                job_to_input_dataset_1.dataset_id AS input1_4,
                job_to_input_dataset_2.dataset_id AS "queries_0|input2_5"
            FROM
                job
                JOIN history ON job.history_id = history.id
                JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
                JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
                JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
                JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
                JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
                JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
                JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
                JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
            WHERE
                job.tool_id = 'cat1'
                AND (
                    job.user_id = 1
                    OR history.published = true
                )
                AND job.copied_from_job_id IS NULL
                AND job.tool_version = '1.0.0'
                AND job.state IN ('ok')
                AND job.id = job_parameter_1.job_id
                AND job_parameter_1.name = 'input1'
                AND job_parameter_1.value LIKE '{"values": [{"id": %, "src": "hda"}]}'
                AND job.id = job_parameter_2.job_id
                AND job_parameter_2.name = 'queries'
                AND job_parameter_2.value LIKE '[{"__index__": 0, "input2": {"values": [{"id": %, "src": "hda"}]}}]'
                AND job_to_input_dataset_1.name IN ('input1', 'input1')
                AND history_dataset_association_2.id = 4
                AND (
                    (
                        job_to_input_dataset_1.dataset_version IN (
                            0, history_dataset_association_1.version
                        )
                        OR history_dataset_association_1.update_time < job.create_time
                    )
                    AND history_dataset_association_1.extension = history_dataset_association_2.extension
                    AND history_dataset_association_1.name = history_dataset_association_2.name
                    OR history_dataset_association_1.id IN (
                        SELECT
                            history_dataset_association.id
                        FROM
                            history_dataset_association,
                            history_dataset_association_history AS history_dataset_association_history_1
                        WHERE
                            history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
                            AND history_dataset_association_history_1.name = history_dataset_association_2.name
                            AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
                            AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
                            AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
                    )
                )
                AND (
                    history_dataset_association_1.deleted = false
                    OR history_dataset_association_2.deleted = false
                )
                AND job_to_input_dataset_2.name IN ('queries_0|input2', 'input2')
                AND history_dataset_association_4.id = 5
                AND (
                    (
                        job_to_input_dataset_2.dataset_version IN (
                            0, history_dataset_association_3.version
                        )
                        OR history_dataset_association_3.update_time < job.create_time
                    )
                    AND history_dataset_association_3.extension = history_dataset_association_4.extension
                    AND history_dataset_association_3.name = history_dataset_association_4.name
                    OR history_dataset_association_3.id IN (
                        SELECT
                            history_dataset_association.id
                        FROM
                            history_dataset_association,
                            history_dataset_association_history AS history_dataset_association_history_2
                        WHERE
                            history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
                            AND history_dataset_association_history_2.name = history_dataset_association_4.name
                            AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
                            AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
                            AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
                    )
                )
                AND (
                    history_dataset_association_3.deleted = false
                    OR history_dataset_association_4.deleted = false
                )
            GROUP BY
                job.id,
                job_to_input_dataset_1.dataset_id,
                job_to_input_dataset_2.dataset_id
        ) AS filtered_jobs_subquery
    WHERE
        NOT (
            EXISTS (
                SELECT
                    *
                FROM
                    job_to_output_dataset_collection,
                    history_dataset_collection_association
                WHERE
                    job_to_output_dataset_collection.job_id = filtered_jobs_subquery.job_id
                    AND job_to_output_dataset_collection.dataset_collection_id = history_dataset_collection_association.id
                    AND history_dataset_collection_association.deleted = true
            )
        )
        AND NOT (
            EXISTS (
                SELECT
                    *
                FROM
                    job_to_output_dataset,
                    history_dataset_association
                WHERE
                    job_to_output_dataset.job_id = filtered_jobs_subquery.job_id
                    AND job_to_output_dataset.dataset_id = history_dataset_association.id
                    AND history_dataset_association.deleted = true
            )
        )
    ORDER BY
        filtered_jobs_subquery.job_id DESC
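The two-stage shape of the query above (an inner query that selects candidate jobs, then an outer filter that rejects any job with a deleted output) can be sketched in isolation. This is only an illustration under loud assumptions: it uses SQLite through Python's standard library rather than PostgreSQL and SQLAlchemy, the three tables are a hypothetical, heavily reduced stand-in for the Galaxy schema (`hda` abbreviates `history_dataset_association`), and `cacheable_job_ids` is a made-up helper name. It says nothing about performance.

```python
import sqlite3

# Hypothetical, reduced schema; not the real Galaxy tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY, tool_id TEXT, state TEXT);
CREATE TABLE hda (id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL);
CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);

INSERT INTO job VALUES (1, 'cat1', 'ok'), (2, 'cat1', 'ok'),
                       (3, 'cat1', 'ok'), (4, 'other', 'ok');
-- job 1: no deleted outputs, job 2: one of two deleted, job 3: all deleted
INSERT INTO hda VALUES (10, 0), (11, 0), (20, 0), (21, 1), (30, 1), (40, 0);
INSERT INTO job_to_output_dataset VALUES
    (1, 10), (1, 11), (2, 20), (2, 21), (3, 30), (4, 40);
""")

QUERY = """
SELECT filtered_jobs.job_id
FROM (
    -- Stage 1: cheap, selective filters pick the candidate jobs.
    SELECT job.id AS job_id
    FROM job
    WHERE job.tool_id = ? AND job.state = 'ok'
) AS filtered_jobs
-- Stage 2: the deleted-output check only runs for the candidates.
WHERE NOT EXISTS (
    SELECT 1
    FROM job_to_output_dataset AS jtod
    JOIN hda ON hda.id = jtod.dataset_id
    WHERE jtod.job_id = filtered_jobs.job_id
      AND hda.deleted = 1
)
ORDER BY filtered_jobs.job_id DESC
"""


def cacheable_job_ids(tool_id):
    """Return ids of candidate jobs that have no deleted output dataset."""
    return [row[0] for row in conn.execute(QUERY, (tool_id,))]


print(cacheable_job_ids("cat1"))
```

In this toy data set only job 1 survives for `cat1`: job 2 has one deleted output and job 3 has only deleted outputs, so both are rejected by the outer filter.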
Deployed this to main and it's working nicely.
Managed to get this down to "Planning Time": 65.306, "Execution Time": 27.935, which is a ~3500x speedup, with https://gist.github.com/mvdbeek/2d4e235bfd9531de7c87de0f0365ffe6. It's gonna be a challenge to translate that back to SQLAlchemy though 😅
https://gist.githubusercontent.com/mvdbeek/becf93f9df6f3a764b878eae4f31fc3a/raw/2fead2a6fcc65d90c48c4c22d89fdd5b9323d741/optimized_input_subquery.sql is the
081f63e to 5508a40
Well, you optimize one thing and you degrade another. I've been going in the direction of doing the input equivalence check first, before comparing jobs, which is a much more stringent way to filter candidate jobs. But that revealed how inefficient the HDCA / DCE equivalence search is. I have another version that improves that in https://gist.github.com/mvdbeek/881f75162e0f457e8a66af112f06a6a9#file-ctes_to_prefilter_hdca-sql-L1-L56 ... again, it's going to be tricky to turn this back into SQLAlchemy.
e3b3751 to 6c4f905
I've deployed this in its current state on usegalaxy.org and the performance is pretty good:
That's the workflow handler logs for the IWC atacseq workflow. multiqc has many collection inputs, and those are a little slower, but I think this is still fine and a big improvement, and it's rare that you'd consume many independent collection inputs.
by turning the correlated subquery that checks for deleted job outputs into an outerjoin.
Brings the query time down from 374937.263 to 19373.252, so roughly a 20-fold improvement. Still a little slow but more manageable. The remainder can likely be improved by an additional compound index.

SQL Before:
```
SELECT job.id, job_to_input_dataset_1.dataset_id, job_to_input_dataset_2.dataset_id AS dataset_id_1
FROM job
JOIN history ON job.history_id = history.id
JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
JOIN job_parameter AS job_parameter_3 ON job.id = job_parameter_3.job_id
JOIN job_parameter AS job_parameter_4 ON job.id = job_parameter_4.job_id
JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
WHERE job.tool_id = 'toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1'
  AND ( job.user_id = 392010 OR history.published = true )
  AND job.copied_from_job_id IS NULL
  AND job.tool_version = '2.2.1+galaxy1'
  AND job.state IN ('ok')
  AND ( EXISTS (
        SELECT history_dataset_collection_association.id
        FROM history_dataset_collection_association, job_to_output_dataset_collection
        WHERE job.id = job_to_output_dataset_collection.job_id
          AND history_dataset_collection_association.id = job_to_output_dataset_collection.dataset_collection_id
          AND history_dataset_collection_association.deleted = true ) ) = false
  AND ( EXISTS (
        SELECT history_dataset_association.id
        FROM history_dataset_association, job_to_output_dataset
        WHERE job.id = job_to_output_dataset.job_id
          AND history_dataset_association.id = job_to_output_dataset.dataset_id
          AND history_dataset_association.deleted = true ) ) = false
  AND job.id = job_parameter_1.job_id
  AND job_parameter_1.name = 'reference_genome'
  AND job_parameter_1.value LIKE '{"__current_case__": 1, "history_item": {"values": [{"id": %, "src": "hda"}]}, "source": "history"}'
  AND job.id = job_parameter_2.job_id
  AND job_parameter_2.name = 'library'
  AND job_parameter_2.value LIKE '{"__current_case__": 0, "input_1": {"values": [{"id": %, "src": "hda"}]}, "rna_strandness": "", "type": "single"}'
  AND job.id = job_parameter_3.job_id
  AND job_parameter_3.name = 'sum'
  AND job_parameter_3.value = '{"new_summary": false, "summary_file": false}'
  AND job.id = job_parameter_4.job_id
  AND job_parameter_4.name = 'adv'
  AND job_parameter_4.value = '{"alignment_options": {"__current_case__": 0, "alignment_options_selector": "defaults"}, "input_options": {"__current_case__": 0, "input_options_selector": "defaults"}, "other_options": {"__current_case__": 0, "other_options_selector": "defaults"}, "output_options": {"__current_case__": 0, "output_options_selector": "defaults"}, "reporting_options": {"__current_case__": 0, "reporting_options_selector": "defaults"}, "sam_options": {"__current_case__": 0, "sam_options_selector": "defaults"}, "scoring_options": {"__current_case__": 0, "scoring_options_selector": "defaults"}, "spliced_options": {"__current_case__": 0, "spliced_options_selector": "defaults"}}'
  AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item')
  AND history_dataset_association_2.id = 152775960
  AND (
        ( job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version)
          OR history_dataset_association_1.update_time < job.create_time )
        AND history_dataset_association_1.extension = history_dataset_association_2.extension
        AND history_dataset_association_1.name = history_dataset_association_2.name
        OR history_dataset_association_1.id IN (
            SELECT history_dataset_association.id
            FROM history_dataset_association, history_dataset_association_history AS history_dataset_association_history_1
            WHERE history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
              AND history_dataset_association_history_1.name = history_dataset_association_2.name
              AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
              AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
              AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata )
      )
  AND ( history_dataset_association_1.deleted = false OR history_dataset_association_2.deleted = false )
  AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1')
  AND history_dataset_association_4.id = 152726579
  AND (
        ( job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version)
          OR history_dataset_association_3.update_time < job.create_time )
        AND history_dataset_association_3.extension = history_dataset_association_4.extension
        AND history_dataset_association_3.name = history_dataset_association_4.name
        OR history_dataset_association_3.id IN (
            SELECT history_dataset_association.id
            FROM history_dataset_association, history_dataset_association_history AS history_dataset_association_history_2
            WHERE history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
              AND history_dataset_association_history_2.name = history_dataset_association_4.name
              AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
              AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
              AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata )
      )
  AND ( history_dataset_association_3.deleted = false OR history_dataset_association_4.deleted = false )
GROUP BY job.id, job_to_input_dataset_1.dataset_id, job_to_input_dataset_2.dataset_id
ORDER BY job.id DESC;
```
after (changed the aliases manually):
```
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
SELECT job.id, job_to_input_dataset_1.dataset_id, job_to_input_dataset_2.dataset_id AS dataset_id_1
FROM job
JOIN history ON job.history_id = history.id
LEFT OUTER JOIN job_to_output_dataset_collection AS job_to_output_dataset_collection_1 ON job.id = job_to_output_dataset_collection_1.job_id
LEFT OUTER JOIN history_dataset_collection_association AS history_dataset_collection_association_1_deleted ON history_dataset_collection_association_1_deleted.id = job_to_output_dataset_collection_1.dataset_collection_id AND history_dataset_collection_association_1_deleted.deleted = true
LEFT OUTER JOIN job_to_output_dataset AS job_to_output_dataset_1 ON job.id = job_to_output_dataset_1.job_id
LEFT OUTER JOIN history_dataset_association AS history_dataset_association_1_deleted ON history_dataset_association_1_deleted.id = job_to_output_dataset_1.dataset_id AND history_dataset_association_1_deleted.deleted = true
JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
JOIN job_parameter AS job_parameter_3 ON job.id = job_parameter_3.job_id
JOIN job_parameter AS job_parameter_4 ON job.id = job_parameter_4.job_id
JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
WHERE job.tool_id = 'toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1'
  AND ( job.user_id = 392010 OR history.published = true )
  AND job.copied_from_job_id IS NULL
  AND job.tool_version = '2.2.1+galaxy1'
  AND job.state IN ('ok')
  AND job_to_output_dataset_collection_1.job_id IS NULL
  AND job_to_output_dataset_1.job_id IS NULL
  AND job.id = job_parameter_1.job_id
  AND job_parameter_1.name = 'reference_genome'
  AND job_parameter_1.value LIKE '{"__current_case__": 1, "history_item": {"values": [{"id": %, "src": "hda"}]}, "source": "history"}'
  AND job.id = job_parameter_2.job_id
  AND job_parameter_2.name = 'library'
  AND job_parameter_2.value LIKE '{"__current_case__": 0, "input_1": {"values": [{"id": %, "src": "hda"}]}, "rna_strandness": "", "type": "single"}'
  AND job.id = job_parameter_3.job_id
  AND job_parameter_3.name = 'sum'
  AND job_parameter_3.value = '{"new_summary": false, "summary_file": false}'
  AND job.id = job_parameter_4.job_id
  AND job_parameter_4.name = 'adv'
  AND job_parameter_4.value = '{"alignment_options": {"__current_case__": 0, "alignment_options_selector": "defaults"}, "input_options": {"__current_case__": 0, "input_options_selector": "defaults"}, "other_options": {"__current_case__": 0, "other_options_selector": "defaults"}, "output_options": {"__current_case__": 0, "output_options_selector": "defaults"}, "reporting_options": {"__current_case__": 0, "reporting_options_selector": "defaults"}, "sam_options": {"__current_case__": 0, "sam_options_selector": "defaults"}, "scoring_options": {"__current_case__": 0, "scoring_options_selector": "defaults"}, "spliced_options": {"__current_case__": 0, "spliced_options_selector": "defaults"}}'
  AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item')
  AND history_dataset_association_2.id = 152775960
  AND (
        ( job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version)
          OR history_dataset_association_1.update_time < job.create_time )
        AND history_dataset_association_1.extension = history_dataset_association_2.extension
        AND history_dataset_association_1.name = history_dataset_association_2.name
        OR history_dataset_association_1.id IN (
            SELECT history_dataset_association.id
            FROM history_dataset_association, history_dataset_association_history AS history_dataset_association_history_1
            WHERE history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
              AND history_dataset_association_history_1.name = history_dataset_association_2.name
              AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
              AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
              AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata )
      )
  AND ( history_dataset_association_1.deleted = false OR history_dataset_association_2.deleted = false )
  AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1')
  AND history_dataset_association_4.id = 152726579
  AND (
        ( job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version)
          OR history_dataset_association_3.update_time < job.create_time )
        AND history_dataset_association_3.extension = history_dataset_association_4.extension
        AND history_dataset_association_3.name = history_dataset_association_4.name
        OR history_dataset_association_3.id IN (
            SELECT history_dataset_association.id
            FROM history_dataset_association, history_dataset_association_history AS history_dataset_association_history_2
            WHERE history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
              AND history_dataset_association_history_2.name = history_dataset_association_4.name
              AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
              AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
              AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata )
      )
  AND ( history_dataset_association_3.deleted = false OR history_dataset_association_4.deleted = false )
GROUP BY job.id, job_to_input_dataset_1.dataset_id, job_to_input_dataset_2.dataset_id
ORDER BY job.id DESC;
```
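As a general illustration of replacing a correlated `NOT EXISTS` check with outer joins, here is a minimal sketch. It is not the query from this PR: it assumes a hypothetical three-table schema (`hda` stands in for `history_dataset_association`), runs on SQLite via Python's standard library, and the helper `ids` is made up for the example. The sketch also shows one pitfall of the rewrite that follows from plain SQL semantics: filtering on `IS NULL` per row keeps a job as soon as any one of its outputs is not deleted, so the sketch aggregates per job instead.

```python
import sqlite3

# Hypothetical, reduced schema; not the real Galaxy tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY);
CREATE TABLE hda (id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL);
CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
INSERT INTO job VALUES (1), (2), (3), (4);
-- job 1: two live outputs, job 2: one live + one deleted,
-- job 3: one deleted output, job 4: no outputs at all
INSERT INTO hda VALUES (10, 0), (11, 0), (20, 0), (21, 1), (30, 1);
INSERT INTO job_to_output_dataset VALUES (1, 10), (1, 11), (2, 20), (2, 21), (3, 30);
""")

# Reference semantics: keep jobs for which no deleted output exists.
NOT_EXISTS = """
SELECT job.id FROM job
WHERE NOT EXISTS (
    SELECT 1 FROM job_to_output_dataset AS jtod
    JOIN hda ON hda.id = jtod.dataset_id
    WHERE jtod.job_id = job.id AND hda.deleted = 1
)
ORDER BY job.id
"""

# Outer-join form: deleted_hda.id is non-NULL only for deleted outputs,
# so a job qualifies when it has zero such rows after grouping.
OUTER_JOIN = """
SELECT job.id FROM job
LEFT OUTER JOIN job_to_output_dataset AS jtod ON jtod.job_id = job.id
LEFT OUTER JOIN hda AS deleted_hda
    ON deleted_hda.id = jtod.dataset_id AND deleted_hda.deleted = 1
GROUP BY job.id
HAVING COUNT(deleted_hda.id) = 0
ORDER BY job.id
"""

# Row-wise IS NULL filter: job 2 slips through via its live output.
NAIVE_IS_NULL = """
SELECT DISTINCT job.id FROM job
LEFT OUTER JOIN job_to_output_dataset AS jtod ON jtod.job_id = job.id
LEFT OUTER JOIN hda AS deleted_hda
    ON deleted_hda.id = jtod.dataset_id AND deleted_hda.deleted = 1
WHERE deleted_hda.id IS NULL
ORDER BY job.id
"""


def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


print(ids(NOT_EXISTS), ids(OUTER_JOIN), ids(NAIVE_IS_NULL))
```

Whether the outer-join form is actually faster depends on the planner and the data; the sketch only checks that the aggregated form returns the same job ids as `NOT EXISTS` on this toy data.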
This should be roughly equivalent to the CTE in https://gist.github.com/mvdbeek/2d4e235bfd9531de7c87de0f0365ffe6
Previously, the dataset_id comparison was applied late in queries involving nested collections, leading to large intermediate result sets. This change introduces CTEs to pre-filter the target dataset_ids on the right-hand side of the comparison. By using IN (SELECT dataset_id FROM cte_name), the database can prune non-matching rows earlier in the query execution plan, significantly reducing execution time and resource consumption.
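The pre-filtering idea described above can be sketched as follows. This is a toy under stated assumptions, not the PR's query: `collection_element` is a hypothetical flat table standing in for nested dataset collection elements, the CTE names `target` and `candidates` and the helper names are invented for the example, and it runs on SQLite via Python's standard library, so it makes no claim about PostgreSQL plans.

```python
import sqlite3

# Hypothetical flat stand-in for dataset collection elements.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collection_element (
    collection_id INTEGER, element_index INTEGER, dataset_id INTEGER
);
INSERT INTO collection_element VALUES
    (1, 0, 100), (1, 1, 101),   -- target collection
    (2, 0, 100), (2, 1, 101),   -- same datasets in the same order
    (3, 0, 100), (3, 1, 999),   -- shares only one dataset
    (4, 0, 500), (4, 1, 501);   -- shares nothing
""")

# The CTE collects the target's dataset_ids once; the IN (...) filter then
# discards unrelated rows before any element-by-element comparison.
CTES = """
WITH target AS (
    SELECT element_index, dataset_id
    FROM collection_element
    WHERE collection_id = :target
),
candidates AS (
    SELECT ce.collection_id, ce.element_index, ce.dataset_id
    FROM collection_element AS ce
    WHERE ce.dataset_id IN (SELECT dataset_id FROM target)
      AND ce.collection_id != :target
)
"""

PREFILTERED = CTES + """
SELECT DISTINCT collection_id FROM candidates ORDER BY collection_id
"""

# Element-wise comparison runs only on the pre-filtered candidate rows.
MATCHING = CTES + """
SELECT c.collection_id
FROM candidates AS c
JOIN target AS t
    ON t.element_index = c.element_index AND t.dataset_id = c.dataset_id
GROUP BY c.collection_id
HAVING COUNT(*) = (SELECT COUNT(*) FROM target)
ORDER BY c.collection_id
"""


def prefiltered_collections(target_id):
    """Collections sharing at least one dataset_id with the target."""
    return [r[0] for r in conn.execute(PREFILTERED, {"target": target_id})]


def matching_collections(target_id):
    """Candidates that contain every target element at the same index."""
    return [r[0] for r in conn.execute(MATCHING, {"target": target_id})]


print(prefiltered_collections(1), matching_collections(1))
```

Collection 4 never reaches the comparison step because it shares no dataset_id with the target. A full equivalence check would need more than this (for example, rejecting candidates with extra elements), which the sketch deliberately leaves out.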
by using computed signatures of collections.
down to only those that share dataset_ids with original input DCE collection.
This is way more efficient.
Otherwise the query planner decides on a merge join on job_ids_cte.job_id = job.id.
6c4f905 to 32914f2
With the last added test I'm confident this works well. It has been on usegalaxy.org for the last 10 days or so.
This is remarkable! Thank you!
193affd into galaxyproject:release_25.0