[25.0] Improve performance of job cache query #20319

mvdbeek · 2025-05-21T10:02:27Z

by turning the correlated subquery that checks for deleted job outputs into an outer join.
This brings the query time down from 374937.263 ms to 19373.252 ms, roughly a 20-fold improvement. That is still a little slow, but more manageable. The remainder can likely be improved by an additional compound index.

SQL Before:

SELECT
	job.id,
	job_to_input_dataset_1.dataset_id,
	job_to_input_dataset_2.dataset_id AS dataset_id_1
FROM
	job
	JOIN history ON job.history_id = history.id
	JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
	JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
	JOIN job_parameter AS job_parameter_3 ON job.id = job_parameter_3.job_id
	JOIN job_parameter AS job_parameter_4 ON job.id = job_parameter_4.job_id
	JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
	JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
	JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
	JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
	JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
	JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
WHERE
	job.tool_id = 'toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1'
	AND (
		job.user_id = 392010
		OR history.published = true
	)
	AND job.copied_from_job_id IS NULL
	AND job.tool_version = '2.2.1+galaxy1'
	AND job.state IN ('ok')
	AND (
		EXISTS (
			SELECT
				history_dataset_collection_association.id
			FROM
				history_dataset_collection_association,
				job_to_output_dataset_collection
			WHERE
				job.id = job_to_output_dataset_collection.job_id
				AND history_dataset_collection_association.id = job_to_output_dataset_collection.dataset_collection_id
				AND history_dataset_collection_association.deleted = true
		)
	) = false
	AND (
		EXISTS (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				job_to_output_dataset
			WHERE
				job.id = job_to_output_dataset.job_id
				AND history_dataset_association.id = job_to_output_dataset.dataset_id
				AND history_dataset_association.deleted = true
		)
	) = false
	AND job.id = job_parameter_1.job_id
	AND job_parameter_1.name = 'reference_genome'
	AND job_parameter_1.value LIKE '{""__current_case__"": 1, ""history_item"": {""values"": [{""id"": %, ""src"": ""hda""}]}, ""source"": ""history""}'
	AND job.id = job_parameter_2.job_id
	AND job_parameter_2.name = 'library'
	AND job_parameter_2.value LIKE '{""__current_case__"": 0, ""input_1"": {""values"": [{""id"": %, ""src"": ""hda""}]}, ""rna_strandness"": """", ""type"": ""single""}'
	AND job.id = job_parameter_3.job_id
	AND job_parameter_3.name = 'sum'
	AND job_parameter_3.value = '{""new_summary"": false, ""summary_file"": false}'
	AND job.id = job_parameter_4.job_id
	AND job_parameter_4.name = 'adv'
	AND job_parameter_4.value = '{""alignment_options"": {""__current_case__"": 0, ""alignment_options_selector"": ""defaults""}, ""input_options"": {""__current_case__"": 0, ""input_options_selector"": ""defaults""}, ""other_options"": {""__current_case__"": 0, ""other_options_selector"": ""defaults""}, ""output_options"": {""__current_case__"": 0, ""output_options_selector"": ""defaults""}, ""reporting_options"": {""__current_case__"": 0, ""reporting_options_selector"": ""defaults""}, ""sam_options"": {""__current_case__"": 0, ""sam_options_selector"": ""defaults""}, ""scoring_options"": {""__current_case__"": 0, ""scoring_options_selector"": ""defaults""}, ""spliced_options"": {""__current_case__"": 0, ""spliced_options_selector"": ""defaults""}}'
	AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item')
	AND history_dataset_association_2.id = 152775960
	AND (
		(
			job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version)
			OR history_dataset_association_1.update_time < job.create_time
		)
		AND history_dataset_association_1.extension = history_dataset_association_2.extension
		AND history_dataset_association_1.name = history_dataset_association_2.name
		OR history_dataset_association_1.id IN (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				history_dataset_association_history AS history_dataset_association_history_1
			WHERE
				history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
				AND history_dataset_association_history_1.name = history_dataset_association_2.name
				AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
				AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
				AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
		)
	)
	AND (
		history_dataset_association_1.deleted = false
		OR history_dataset_association_2.deleted = false
	)
	AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1')
	AND history_dataset_association_4.id = 152726579
	AND (
		(
			job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version)
			OR history_dataset_association_3.update_time < job.create_time
		)
		AND history_dataset_association_3.extension = history_dataset_association_4.extension
		AND history_dataset_association_3.name = history_dataset_association_4.name
		OR history_dataset_association_3.id IN (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				history_dataset_association_history AS history_dataset_association_history_2
			WHERE
				history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
				AND history_dataset_association_history_2.name = history_dataset_association_4.name
				AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
				AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
				AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
		)
	)
	AND (
		history_dataset_association_3.deleted = false
		OR history_dataset_association_4.deleted = false
	)
GROUP BY
	job.id,
	job_to_input_dataset_1.dataset_id,
	job_to_input_dataset_2.dataset_id
ORDER BY
	job.id DESC;

SQL after (aliases changed manually):

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
SELECT
    job.id,
    job_to_input_dataset_1.dataset_id,
    job_to_input_dataset_2.dataset_id AS dataset_id_1
FROM
    job
    JOIN history ON job.history_id = history.id
    LEFT OUTER JOIN job_to_output_dataset_collection AS job_to_output_dataset_collection_1 ON job.id = job_to_output_dataset_collection_1.job_id
    LEFT OUTER JOIN history_dataset_collection_association AS history_dataset_collection_association_1_deleted ON history_dataset_collection_association_1_deleted.id = job_to_output_dataset_collection_1.dataset_collection_id
    AND history_dataset_collection_association_1_deleted.deleted = true
    LEFT OUTER JOIN job_to_output_dataset AS job_to_output_dataset_1 ON job.id = job_to_output_dataset_1.job_id
    LEFT OUTER JOIN history_dataset_association AS history_dataset_association_1_deleted ON history_dataset_association_1_deleted.id = job_to_output_dataset_1.dataset_id
    AND history_dataset_association_1_deleted.deleted = true
    JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
    JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
    JOIN job_parameter AS job_parameter_3 ON job.id = job_parameter_3.job_id
    JOIN job_parameter AS job_parameter_4 ON job.id = job_parameter_4.job_id
    JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
    JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
    JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
    JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
    JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
    JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
WHERE
    job.tool_id = 'toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1'
    AND (
        job.user_id = 392010
        OR history.published = true
    )
    AND job.copied_from_job_id IS NULL
    AND job.tool_version = '2.2.1+galaxy1'
    AND job.state IN ('ok')
    AND job_to_output_dataset_collection_1.job_id IS NULL
    AND job_to_output_dataset_1.job_id IS NULL
    AND job.id = job_parameter_1.job_id
    AND job_parameter_1.name = 'reference_genome'
    AND job_parameter_1.value LIKE '{""__current_case__"": 1, ""history_item"": {""values"": [{""id"": %, ""src"": ""hda""}]}, ""source"": ""history""}'
    AND job.id = job_parameter_2.job_id
    AND job_parameter_2.name = 'library'
    AND job_parameter_2.value LIKE '{""__current_case__"": 0, ""input_1"": {""values"": [{""id"": %, ""src"": ""hda""}]}, ""rna_strandness"": """", ""type"": ""single""}'
    AND job.id = job_parameter_3.job_id
    AND job_parameter_3.name = 'sum'
    AND job_parameter_3.value = '{""new_summary"": false, ""summary_file"": false}'
    AND job.id = job_parameter_4.job_id
    AND job_parameter_4.name = 'adv'
    AND job_parameter_4.value = '{""alignment_options"": {""__current_case__"": 0, ""alignment_options_selector"": ""defaults""}, ""input_options"": {""__current_case__"": 0, ""input_options_selector"": ""defaults""}, ""other_options"": {""__current_case__"": 0, ""other_options_selector"": ""defaults""}, ""output_options"": {""__current_case__"": 0, ""output_options_selector"": ""defaults""}, ""reporting_options"": {""__current_case__"": 0, ""reporting_options_selector"": ""defaults""}, ""sam_options"": {""__current_case__"": 0, ""sam_options_selector"": ""defaults""}, ""scoring_options"": {""__current_case__"": 0, ""scoring_options_selector"": ""defaults""}, ""spliced_options"": {""__current_case__"": 0, ""spliced_options_selector"": ""defaults""}}'
    AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item')
    AND history_dataset_association_2.id = 152775960
    AND (
        (
            job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version)
            OR history_dataset_association_1.update_time < job.create_time
        )
        AND history_dataset_association_1.extension = history_dataset_association_2.extension
        AND history_dataset_association_1.name = history_dataset_association_2.name
        OR history_dataset_association_1.id IN (
            SELECT
                history_dataset_association.id
            FROM
                history_dataset_association,
                history_dataset_association_history AS history_dataset_association_history_1
            WHERE
                history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
                AND history_dataset_association_history_1.name = history_dataset_association_2.name
                AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
                AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
                AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
        )
    )
    AND (
        history_dataset_association_1.deleted = false
        OR history_dataset_association_2.deleted = false
    )
    AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1')
    AND history_dataset_association_4.id = 152726579
    AND (
        (
            job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version)
            OR history_dataset_association_3.update_time < job.create_time
        )
        AND history_dataset_association_3.extension = history_dataset_association_4.extension
        AND history_dataset_association_3.name = history_dataset_association_4.name
        OR history_dataset_association_3.id IN (
            SELECT
                history_dataset_association.id
            FROM
                history_dataset_association,
                history_dataset_association_history AS history_dataset_association_history_2
            WHERE
                history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
                AND history_dataset_association_history_2.name = history_dataset_association_4.name
                AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
                AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
                AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
        )
    )
    AND (
        history_dataset_association_3.deleted = false
        OR history_dataset_association_4.deleted = false
    )
GROUP BY
    job.id,
    job_to_input_dataset_1.dataset_id,
    job_to_input_dataset_2.dataset_id
ORDER BY
    job.id DESC;

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mvdbeek · 2025-05-21T10:36:26Z

It's fast but broken 😆 😅

nsoranzo · 2025-05-21T15:00:08Z

Delete any_output_dataset_collection_instances_deleted and any_output_dataset_deleted from lib/galaxy/model/__init__.py ?

mvdbeek · 2025-05-21T15:04:22Z

Once it actually works! I'm afraid this only works if all outputs are deleted.
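
To illustrate how an outer-join rewrite can end up excluding a job only when all of its outputs are deleted, here is a minimal sketch on a toy schema. It uses Python's built-in `sqlite3` rather than PostgreSQL, and the table `hda` and alias `jtod` are shortened stand-ins for `history_dataset_association` and `job_to_output_dataset`. The row-wise query is an assumed example of this failure mode, not necessarily the code that was in the PR at this point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY);
CREATE TABLE hda (id INTEGER PRIMARY KEY, deleted BOOLEAN NOT NULL);
CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
-- job 1: no deleted outputs, job 2: one of two deleted,
-- job 3: all outputs deleted, job 4: no outputs at all
INSERT INTO job VALUES (1), (2), (3), (4);
INSERT INTO hda VALUES (10, 0), (11, 0), (20, 0), (21, 1), (30, 1), (31, 1);
INSERT INTO job_to_output_dataset VALUES
    (1, 10), (1, 11), (2, 20), (2, 21), (3, 30), (3, 31);
""")

def job_ids(sql):
    return [row[0] for row in conn.execute(sql)]

# Reference semantics: keep jobs for which no deleted output exists.
not_exists = job_ids("""
SELECT job.id FROM job
WHERE NOT EXISTS (
    SELECT 1 FROM job_to_output_dataset AS jtod
    JOIN hda ON hda.id = jtod.dataset_id
    WHERE jtod.job_id = job.id AND hda.deleted = 1
)
ORDER BY job.id
""")

# Row-wise outer join: a job survives if ANY of its output rows has no
# deleted match, so job 2 slips through via its non-deleted output.
row_wise = job_ids("""
SELECT DISTINCT job.id FROM job
LEFT OUTER JOIN job_to_output_dataset AS jtod ON jtod.job_id = job.id
LEFT OUTER JOIN hda AS hda_deleted
    ON hda_deleted.id = jtod.dataset_id AND hda_deleted.deleted = 1
WHERE hda_deleted.id IS NULL
ORDER BY job.id
""")

# Aggregating per job restores the NOT EXISTS semantics.
grouped = job_ids("""
SELECT job.id FROM job
LEFT OUTER JOIN job_to_output_dataset AS jtod ON jtod.job_id = job.id
LEFT OUTER JOIN hda AS hda_deleted
    ON hda_deleted.id = jtod.dataset_id AND hda_deleted.deleted = 1
GROUP BY job.id
HAVING COUNT(hda_deleted.id) = 0
ORDER BY job.id
""")

print(not_exists, row_wise, grouped)
```

In this toy data the row-wise filter drops only job 3, whose outputs are all deleted, while the `NOT EXISTS` and grouped forms also drop job 2.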

mvdbeek · 2025-05-21T17:50:18Z

I think this should be working now. It now runs the any-outputs-deleted check against the job ids returned by the inner query. The speedup is about the same. This is the SQL (from the test_workflow_rerun_with_use_cached_job test case):

SELECT 
  filtered_jobs_subquery.job_id, 
  filtered_jobs_subquery.input1_4, 
  filtered_jobs_subquery."queries_0|input2_5" 
FROM 
  (
    SELECT 
      job.id AS job_id, 
      job_to_input_dataset_1.dataset_id AS input1_4, 
      job_to_input_dataset_2.dataset_id AS "queries_0|input2_5" 
    FROM 
      job 
      JOIN history ON job.history_id = history.id 
      JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id 
      JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id 
      JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id 
      JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id 
      JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id 
      JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id 
      JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id 
      JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id 
    WHERE 
      job.tool_id = 'cat1' 
      AND (
        job.user_id = 1 
        OR history.published = true
      ) 
      AND job.copied_from_job_id IS NULL 
      AND job.tool_version = '1.0.0' 
      AND job.state IN ('ok') 
      AND job.id = job_parameter_1.job_id 
      AND job_parameter_1.name = 'input1' 
      AND job_parameter_1.value LIKE '{"values": [{"id": %, "src": "hda"}]}' 
      AND job.id = job_parameter_2.job_id 
      AND job_parameter_2.name = 'queries' 
      AND job_parameter_2.value LIKE '[{"__index__": 0, "input2": {"values": [{"id": %, "src": "hda"}]}}]' 
      AND job_to_input_dataset_1.name IN ('input1', 'input1') 
      AND history_dataset_association_2.id = 4 
      AND (
        (
          job_to_input_dataset_1.dataset_version IN (
            0, history_dataset_association_1.version
          ) 
          OR history_dataset_association_1.update_time < job.create_time
        ) 
        AND history_dataset_association_1.extension = history_dataset_association_2.extension 
        AND history_dataset_association_1.name = history_dataset_association_2.name 
        OR history_dataset_association_1.id IN (
          SELECT 
            history_dataset_association.id 
          FROM 
            history_dataset_association, 
            history_dataset_association_history AS history_dataset_association_history_1 
          WHERE 
            history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id 
            AND history_dataset_association_history_1.name = history_dataset_association_2.name 
            AND history_dataset_association_history_1.extension = history_dataset_association_2.extension 
            AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version 
            AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
        )
      ) 
      AND (
        history_dataset_association_1.deleted = false 
        OR history_dataset_association_2.deleted = false
      ) 
      AND job_to_input_dataset_2.name IN ('queries_0|input2', 'input2') 
      AND history_dataset_association_4.id = 5 
      AND (
        (
          job_to_input_dataset_2.dataset_version IN (
            0, history_dataset_association_3.version
          ) 
          OR history_dataset_association_3.update_time < job.create_time
        ) 
        AND history_dataset_association_3.extension = history_dataset_association_4.extension 
        AND history_dataset_association_3.name = history_dataset_association_4.name 
        OR history_dataset_association_3.id IN (
          SELECT 
            history_dataset_association.id 
          FROM 
            history_dataset_association, 
            history_dataset_association_history AS history_dataset_association_history_2 
          WHERE 
            history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id 
            AND history_dataset_association_history_2.name = history_dataset_association_4.name 
            AND history_dataset_association_history_2.extension = history_dataset_association_4.extension 
            AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version 
            AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
        )
      ) 
      AND (
        history_dataset_association_3.deleted = false 
        OR history_dataset_association_4.deleted = false
      ) 
    GROUP BY 
      job.id, 
      job_to_input_dataset_1.dataset_id, 
      job_to_input_dataset_2.dataset_id
  ) AS filtered_jobs_subquery 
WHERE 
  NOT (
    EXISTS (
      SELECT 
        * 
      FROM 
        job_to_output_dataset_collection, 
        history_dataset_collection_association 
      WHERE 
        job_to_output_dataset_collection.job_id = filtered_jobs_subquery.job_id 
        AND job_to_output_dataset_collection.dataset_collection_id = history_dataset_collection_association.id 
        AND history_dataset_collection_association.deleted = true
    )
  ) 
  AND NOT (
    EXISTS (
      SELECT 
        * 
      FROM 
        job_to_output_dataset, 
        history_dataset_association 
      WHERE 
        job_to_output_dataset.job_id = filtered_jobs_subquery.job_id 
        AND job_to_output_dataset.dataset_id = history_dataset_association.id 
        AND history_dataset_association.deleted = true
    )
  ) 
ORDER BY 
  filtered_jobs_subquery.job_id DESC
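
The shape of the query above (filter candidate jobs in a derived table, then apply the deleted-output check only to those ids) can be sketched on a toy schema as follows. This is an illustration using Python's built-in `sqlite3`, not Galaxy's schema or PostgreSQL: the tables are cut down to a few columns, `hda`, `jtod` and `filtered_jobs` are made-up short names, and the performance effect reported in the thread is not reproduced here, only the result semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY, tool_id TEXT, state TEXT);
CREATE TABLE hda (id INTEGER PRIMARY KEY, deleted BOOLEAN NOT NULL);
CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
INSERT INTO job VALUES
    (1, 'cat1', 'ok'), (2, 'cat1', 'ok'), (3, 'cat1', 'ok'),
    (4, 'cat1', 'error'), (5, 'other', 'ok');
INSERT INTO hda VALUES (10, 0), (20, 0), (21, 1), (40, 0), (50, 0);
INSERT INTO job_to_output_dataset VALUES
    (1, 10), (2, 20), (2, 21), (4, 40), (5, 50);
""")

# Stage 1 (derived table): cheap filters that narrow down candidate jobs.
# Stage 2 (outer WHERE): the correlated deleted-output check, evaluated
# against the candidate ids only.
two_stage_sql = """
SELECT filtered_jobs.job_id
FROM (
    SELECT job.id AS job_id
    FROM job
    WHERE job.tool_id = ? AND job.state = 'ok'
) AS filtered_jobs
WHERE NOT EXISTS (
    SELECT 1
    FROM job_to_output_dataset AS jtod
    JOIN hda ON hda.id = jtod.dataset_id
    WHERE jtod.job_id = filtered_jobs.job_id AND hda.deleted = 1
)
ORDER BY filtered_jobs.job_id DESC
"""

reusable_jobs = [row[0] for row in conn.execute(two_stage_sql, ("cat1",))]
print(reusable_jobs)
```

Job 2 is dropped because one of its outputs is deleted, jobs 4 and 5 never become candidates, and job 3 (no outputs) is kept, which matches what the `NOT EXISTS` form returns when written inline.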

mvdbeek · 2025-05-22T08:46:20Z

Deployed this to main and it's working nicely.

mvdbeek · 2025-05-23T08:32:42Z

Managed to get this down to

    "Planning Time": 65.306,
    "Execution Time": 27.935

which is a ~3500X speedup with https://gist.github.com/mvdbeek/2d4e235bfd9531de7c87de0f0365ffe6. It's gonna be a challenge to translate that back to SQLAlchemy though 😅

mvdbeek · 2025-05-23T10:15:32Z

https://gist.githubusercontent.com/mvdbeek/becf93f9df6f3a764b878eae4f31fc3a/raw/2fead2a6fcc65d90c48c4c22d89fdd5b9323d741/optimized_input_subquery.sql is the ~~final~~ query, 35 ms planning + 35 ms execution = 5356X speedup (for this particular query).
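
The speedup figures in this thread can be checked with a few lines of arithmetic. The sketch below assumes all timings are in milliseconds (which is consistent with the 5356X figure) and that the baseline is the 374937.263 from the PR description; the variable and function names are illustrative.

```python
# Timings quoted in this thread, assumed to be milliseconds.
baseline_ms = 374937.263     # original query
outer_join_ms = 19373.252    # first outer-join rewrite
optimized_ms = 35 + 35       # planning + execution of the optimized query

def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times faster `after_ms` is than `before_ms`."""
    return before_ms / after_ms

first_rewrite = speedup(baseline_ms, outer_join_ms)   # roughly 19x
optimized = speedup(baseline_ms, optimized_ms)        # roughly 5356x

print(f"first rewrite: {first_rewrite:.1f}x, optimized: {optimized:.0f}x")
```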

mvdbeek · 2025-05-23T16:43:26Z

Well, you optimize one thing and you degrade another. I've been moving toward doing the input equivalence check first, before comparing jobs, which is a much more stringent way to filter candidate jobs. But that revealed how inefficient the HDCA / DCE equivalence search is.

I have another version that improves that in https://gist.github.com/mvdbeek/881f75162e0f457e8a66af112f06a6a9#file-ctes_to_prefilter_hdca-sql-L1-L56 ... again, it's going to be tricky to turn this back into SQLAlchemy.

mvdbeek · 2025-06-02T12:35:32Z

I've deployed this in its current state on usegalaxy.org and the performance is pretty good:

Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,072 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (202.187 ms)
Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,519 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (99.797 ms)
Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,808 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (75.654 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,048 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (52.037 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,311 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (128.012 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,508 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (42.464 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,690 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (59.633 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,857 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (39.209 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,027 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (37.393 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,245 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (81.697 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,548 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (91.697 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,718 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (41.128 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,870 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (33.748 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,043 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (50.559 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,253 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (71.050 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,416 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (35.949 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,629 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (83.257 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,813 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (54.070 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,038 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] No equivalent jobs found (77.386 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,360 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (142.855 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,578 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (41.876 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,750 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (34.481 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,952 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (38.632 ms)
Jun 02 07:29:53 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:53,158 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (32.990 ms)
Jun 02 07:29:53 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:53,296 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (27.677 ms)
Jun 02 07:29:54 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:54,525 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] No equivalent jobs found (1059.227 ms)
Jun 02 07:29:55 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:55,034 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (349.786 ms)

Those are the workflow handler logs for the IWC atacseq workflow. multiqc has many collection inputs, and those are a little slower, but I think this is still fine and a big improvement, and it's rare that you'd consume many independent collection inputs.
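
For anyone wanting to summarize such timings from their own logs, here is a small sketch that pulls the durations out of lines in the format shown above. The regular expression and variable names are assumptions made for illustration, and the sample is four lines copied from the log excerpt in this comment.

```python
import re
import statistics

sample_log = """\
Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,072 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (202.187 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,038 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] No equivalent jobs found (77.386 ms)
Jun 02 07:29:53 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:53,296 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (27.677 ms)
Jun 02 07:29:54 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:54,525 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] No equivalent jobs found (1059.227 ms)
"""

# Matches the two message variants seen in the excerpt and captures the time.
pattern = re.compile(
    r"(Found equivalent job|No equivalent jobs found) \((\d+(?:\.\d+)?) ms\)"
)

hits, misses = [], []
for line in sample_log.splitlines():
    match = pattern.search(line)
    if match is None:
        continue  # not a job cache timing line
    duration_ms = float(match.group(2))
    if match.group(1) == "Found equivalent job":
        hits.append(duration_ms)
    else:
        misses.append(duration_ms)

all_durations = hits + misses
print(f"hits={len(hits)} misses={len(misses)} "
      f"median={statistics.median(all_durations):.3f} ms "
      f"max={max(all_durations):.3f} ms")
```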

""reporting_options_selector"": ""defaults""}, ""sam_options"": {""__current_case__"": 0, ""sam_options_selector"": ""defaults""}, ""scoring_options"": {""__current_case__"": 0, ""scoring_options_selector"": ""defaults""}, ""spliced_options"": {""__current_case__"": 0, ""spliced_options_selector"": ""defaults""}}' AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item') AND history_dataset_association_2.id = 152775960 AND ( ( job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version) OR history_dataset_association_1.update_time < job.create_time ) AND history_dataset_association_1.extension = history_dataset_association_2.extension AND history_dataset_association_1.name = history_dataset_association_2.name OR history_dataset_association_1.id IN ( SELECT history_dataset_association.id FROM history_dataset_association, history_dataset_association_history AS history_dataset_association_history_1 WHERE history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id AND history_dataset_association_history_1.name = history_dataset_association_2.name AND history_dataset_association_history_1.extension = history_dataset_association_2.extension AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata ) ) AND ( history_dataset_association_1.deleted = false OR history_dataset_association_2.deleted = false ) AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1') AND history_dataset_association_4.id = 152726579 AND ( ( job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version) OR history_dataset_association_3.update_time < job.create_time ) AND history_dataset_association_3.extension = history_dataset_association_4.extension AND history_dataset_association_3.name = history_dataset_association_4.name OR 
history_dataset_association_3.id IN ( SELECT history_dataset_association.id FROM history_dataset_association, history_dataset_association_history AS history_dataset_association_history_2 WHERE history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id AND history_dataset_association_history_2.name = history_dataset_association_4.name AND history_dataset_association_history_2.extension = history_dataset_association_4.extension AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata ) ) AND ( history_dataset_association_3.deleted = false OR history_dataset_association_4.deleted = false ) GROUP BY job.id, job_to_input_dataset_1.dataset_id, job_to_input_dataset_2.dataset_id ORDER BY job.id DESC; ```
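The general idea of swapping a correlated "no deleted outputs" subquery for outer joins can be tried on a toy schema. This is only a sketch and not the Galaxy query: the tables `job`, `job_output` and `dataset` are simplified stand-ins invented here, and it runs on SQLite through Python's `sqlite3` module. It also illustrates one caveat of the join form: an outer join yields one row per output, so the check has to be aggregated per job (here with `GROUP BY ... HAVING`), otherwise a job with one live and one deleted output would still match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, heavily simplified tables (not Galaxy's schema).
    CREATE TABLE job (id INTEGER PRIMARY KEY);
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL);
    CREATE TABLE job_output (job_id INTEGER, dataset_id INTEGER);

    INSERT INTO job (id) VALUES (1), (2), (3), (4);
    INSERT INTO dataset (id, deleted) VALUES (10, 0), (11, 0), (20, 0), (21, 1), (30, 1);
    -- job 1: all outputs live; job 2: one of two outputs deleted;
    -- job 3: only output deleted; job 4: no outputs at all
    INSERT INTO job_output (job_id, dataset_id)
    VALUES (1, 10), (1, 11), (2, 20), (2, 21), (3, 30);
    """
)

# Correlated form: the subquery is re-evaluated for each candidate job.
NOT_EXISTS_SQL = """
SELECT job.id FROM job
WHERE NOT EXISTS (
    SELECT 1 FROM job_output
    JOIN dataset ON dataset.id = job_output.dataset_id
    WHERE job_output.job_id = job.id AND dataset.deleted = 1
)
ORDER BY job.id
"""

# Join form: deleted outputs are joined once, then counted per job.
OUTER_JOIN_SQL = """
SELECT job.id FROM job
LEFT OUTER JOIN job_output ON job_output.job_id = job.id
LEFT OUTER JOIN dataset AS deleted_output
    ON deleted_output.id = job_output.dataset_id AND deleted_output.deleted = 1
GROUP BY job.id
HAVING COUNT(deleted_output.id) = 0
ORDER BY job.id
"""

not_exists_ids = [row[0] for row in conn.execute(NOT_EXISTS_SQL)]
outer_join_ids = [row[0] for row in conn.execute(OUTER_JOIN_SQL)]
print(not_exists_ids, outer_join_ids)
```

Both statements select the same jobs on this data. Which form is faster depends on the database, indexes and statistics, so `EXPLAIN ANALYZE` on real data remains the deciding measurement.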

Previously, the dataset_id comparison was applied late in queries involving nested collections, leading to large intermediate result sets. This change introduces CTEs to pre-filter the target dataset_ids on the right-hand side of the comparison. By using IN (SELECT dataset_id FROM cte_name), the database can prune non-matching rows earlier in the query execution plan, significantly reducing execution time and resource consumption.
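A minimal sketch of this pre-filtering pattern, using SQLite via Python's `sqlite3` module. The tables `collection_element` and `job_input`, the CTE name `target_dataset_ids` and the helper `candidate_jobs` are invented for illustration and are far simpler than Galaxy's schema; only the shape `IN (SELECT dataset_id FROM cte_name)` is taken from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables standing in for collection elements and job inputs.
    CREATE TABLE collection_element (collection_id INTEGER, element_index INTEGER, dataset_id INTEGER);
    CREATE TABLE job_input (job_id INTEGER, dataset_id INTEGER);

    INSERT INTO collection_element VALUES (1, 0, 100), (1, 1, 101), (2, 0, 200);
    INSERT INTO job_input VALUES (7, 100), (7, 101), (8, 200), (9, 999);
    """
)

# The CTE collects the dataset ids of the collection being searched for;
# the main query only considers job inputs whose dataset_id is in that set.
PREFILTER_SQL = """
WITH target_dataset_ids AS (
    SELECT dataset_id FROM collection_element WHERE collection_id = :collection_id
)
SELECT DISTINCT job_input.job_id
FROM job_input
WHERE job_input.dataset_id IN (SELECT dataset_id FROM target_dataset_ids)
ORDER BY job_input.job_id
"""


def candidate_jobs(connection, collection_id):
    """Return ids of jobs that consumed at least one dataset of the collection."""
    rows = connection.execute(PREFILTER_SQL, {"collection_id": collection_id})
    return [row[0] for row in rows]


print(candidate_jobs(conn, 1))
```

Whether the planner really prunes rows earlier with this shape depends on the database and its statistics, so the effect should be confirmed with `EXPLAIN` on representative data.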

mvdbeek · 2025-06-11T18:55:03Z

With the last added test I'm confident this works well. It has been on usegalaxy.org for the last 10 days or so.

ahmedhamidawan

This is remarkable! Thank you!

mvdbeek added area/performance area/job-caching labels May 21, 2025

github-actions bot added this to the 25.1 milestone May 21, 2025

mvdbeek marked this pull request as draft May 23, 2025 12:50

mvdbeek force-pushed the performance_fix_deleted_output_check branch from 081f63e to 5508a40 Compare May 23, 2025 15:06

mvdbeek force-pushed the performance_fix_deleted_output_check branch 7 times, most recently from e3b3751 to 6c4f905 Compare June 2, 2025 12:05

ahmedhamidawan modified the milestones: 25.1, 25.0 Jun 2, 2025

mvdbeek added 7 commits June 11, 2025 20:54

Fix where clause

015a01f

Restore not exists logic but runs against inner subquery

904a23e

Limit jobs based on input datasets in inner subquery

fad9c91

This should be roughly equivalent to the CTE in https://gist.github.com/mvdbeek/2d4e235bfd9531de7c87de0f0365ffe6

JobParmeter is now always fully prefixed

283c854

Use inner join for leaf HDAs inside HDCA / DCE search

a43b4f9

mvdbeek added 14 commits June 11, 2025 20:54

Improve performance of equivalent collection matching

d9e2dd5

by using computed signatures of collections.

Implement candidate signature logic for DCEs as well

6f2d620

Join to hdca / hda history to limit candidate search space

e756983

Add pre-filtering step that narrows DCEs

e62691a

down to only those that share dataset_ids with original input DCE collection.

Add pre-filtering step for search by hdca

d4982f4

Force job_id materialization

27cbd67

Replace history_dataset_association_history subquery with outer join

575e7f7

This is way more efficient.

Materialize unordered results, then order by job id

8e8d899

Otherwise the query planner decides on a merge join on job_ids_cte.job_id = job.id.

Ensure query columns are unique

9d3268e

Limit materialized hint to postgresql

2d98fbf

Make search work on sqlite and postgres

615d1c7

Drop unnecessary materialize hint

829f566

Restore compatibility with postgresql < 12

7fb58fd

Add test that confirms element sorting must match

32914f2
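Several of the commits above concern CTE materialization. As background, PostgreSQL 12 introduced the `MATERIALIZED` / `NOT MATERIALIZED` keywords for CTEs, while older PostgreSQL versions always materialize a CTE and reject the keyword. The sketch below shows one way such a hint could be gated on the backend, with the ordering applied only to the outer result. It is not the code from this PR: the helpers `cte_keyword` and `build_query`, the CTE name `unordered_results` and the toy `job` table are assumptions made for illustration.

```python
import sqlite3


def cte_keyword(dialect, server_version=None):
    """Return 'AS MATERIALIZED' only for PostgreSQL 12 or newer, else plain 'AS'.

    server_version is a tuple such as (12, 3); None means unknown.
    """
    if dialect == "postgresql" and server_version and server_version[0] >= 12:
        return "AS MATERIALIZED"
    return "AS"


def build_query(dialect, server_version=None):
    """Collect matching job ids without ordering, then order the small result."""
    return f"""
    WITH unordered_results {cte_keyword(dialect, server_version)} (
        SELECT id AS job_id FROM job WHERE state = 'ok'
    )
    SELECT job_id FROM unordered_results ORDER BY job_id DESC
    """


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical toy table, not Galaxy's job table.
    CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT);
    INSERT INTO job (id, state) VALUES (1, 'ok'), (2, 'error'), (3, 'ok'), (5, 'ok');
    """
)

# On SQLite the helper emits a plain CTE, so the same query text runs unchanged.
ordered_ids = [row[0] for row in conn.execute(build_query("sqlite"))]
print(ordered_ids)
```

Emitting the plain `AS` form everywhere except PostgreSQL 12+ keeps the statement portable, at the cost of leaving the materialization decision to the planner on other backends.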

mvdbeek force-pushed the performance_fix_deleted_output_check branch from 6c4f905 to 32914f2 Compare June 11, 2025 18:54

mvdbeek marked this pull request as ready for review June 11, 2025 18:54

ahmedhamidawan approved these changes Jun 11, 2025

ahmedhamidawan merged commit 193affd into galaxyproject:release_25.0 Jun 11, 2025
52 of 56 checks passed

ahmedhamidawan added the kind/enhancement label Jun 11, 2025

nsoranzo deleted the performance_fix_deleted_output_check branch June 11, 2025 23:00
