[25.0] Improve performance of job cache query #20319

mvdbeek
Member

@mvdbeek mvdbeek commented May 21, 2025

Improves the job cache query by turning the correlated subqueries that check for deleted job outputs into outer joins.
This brings the query time down from 374937.263 ms to 19373.252 ms, roughly a 20-fold improvement. Still a little slow but more manageable. The remainder can likely be improved by an additional compound index.

SQL Before:

SELECT
	job.id,
	job_to_input_dataset_1.dataset_id,
	job_to_input_dataset_2.dataset_id AS dataset_id_1
FROM
	job
	JOIN history ON job.history_id = history.id
	JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
	JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
	JOIN job_parameter AS job_parameter_3 ON job.id = job_parameter_3.job_id
	JOIN job_parameter AS job_parameter_4 ON job.id = job_parameter_4.job_id
	JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
	JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
	JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
	JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
	JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
	JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
WHERE
	job.tool_id = 'toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1'
	AND (
		job.user_id = 392010
		OR history.published = true
	)
	AND job.copied_from_job_id IS NULL
	AND job.tool_version = '2.2.1+galaxy1'
	AND job.state IN ('ok')
	AND (
		EXISTS (
			SELECT
				history_dataset_collection_association.id
			FROM
				history_dataset_collection_association,
				job_to_output_dataset_collection
			WHERE
				job.id = job_to_output_dataset_collection.job_id
				AND history_dataset_collection_association.id = job_to_output_dataset_collection.dataset_collection_id
				AND history_dataset_collection_association.deleted = true
		)
	) = false
	AND (
		EXISTS (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				job_to_output_dataset
			WHERE
				job.id = job_to_output_dataset.job_id
				AND history_dataset_association.id = job_to_output_dataset.dataset_id
				AND history_dataset_association.deleted = true
		)
	) = false
	AND job.id = job_parameter_1.job_id
	AND job_parameter_1.name = 'reference_genome'
	AND job_parameter_1.value LIKE '{"__current_case__": 1, "history_item": {"values": [{"id": %, "src": "hda"}]}, "source": "history"}'
	AND job.id = job_parameter_2.job_id
	AND job_parameter_2.name = 'library'
	AND job_parameter_2.value LIKE '{"__current_case__": 0, "input_1": {"values": [{"id": %, "src": "hda"}]}, "rna_strandness": "", "type": "single"}'
	AND job.id = job_parameter_3.job_id
	AND job_parameter_3.name = 'sum'
	AND job_parameter_3.value = '{"new_summary": false, "summary_file": false}'
	AND job.id = job_parameter_4.job_id
	AND job_parameter_4.name = 'adv'
	AND job_parameter_4.value = '{"alignment_options": {"__current_case__": 0, "alignment_options_selector": "defaults"}, "input_options": {"__current_case__": 0, "input_options_selector": "defaults"}, "other_options": {"__current_case__": 0, "other_options_selector": "defaults"}, "output_options": {"__current_case__": 0, "output_options_selector": "defaults"}, "reporting_options": {"__current_case__": 0, "reporting_options_selector": "defaults"}, "sam_options": {"__current_case__": 0, "sam_options_selector": "defaults"}, "scoring_options": {"__current_case__": 0, "scoring_options_selector": "defaults"}, "spliced_options": {"__current_case__": 0, "spliced_options_selector": "defaults"}}'
	AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item')
	AND history_dataset_association_2.id = 152775960
	AND (
		(
			job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version)
			OR history_dataset_association_1.update_time < job.create_time
		)
		AND history_dataset_association_1.extension = history_dataset_association_2.extension
		AND history_dataset_association_1.name = history_dataset_association_2.name
		OR history_dataset_association_1.id IN (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				history_dataset_association_history AS history_dataset_association_history_1
			WHERE
				history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
				AND history_dataset_association_history_1.name = history_dataset_association_2.name
				AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
				AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
				AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
		)
	)
	AND (
		history_dataset_association_1.deleted = false
		OR history_dataset_association_2.deleted = false
	)
	AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1')
	AND history_dataset_association_4.id = 152726579
	AND (
		(
			job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version)
			OR history_dataset_association_3.update_time < job.create_time
		)
		AND history_dataset_association_3.extension = history_dataset_association_4.extension
		AND history_dataset_association_3.name = history_dataset_association_4.name
		OR history_dataset_association_3.id IN (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				history_dataset_association_history AS history_dataset_association_history_2
			WHERE
				history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
				AND history_dataset_association_history_2.name = history_dataset_association_4.name
				AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
				AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
				AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
		)
	)
	AND (
		history_dataset_association_3.deleted = false
		OR history_dataset_association_4.deleted = false
	)
GROUP BY
	job.id,
	job_to_input_dataset_1.dataset_id,
	job_to_input_dataset_2.dataset_id
ORDER BY
	job.id DESC;

SQL After (aliases changed manually):

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
SELECT
    job.id,
    job_to_input_dataset_1.dataset_id,
    job_to_input_dataset_2.dataset_id AS dataset_id_1
FROM
    job
    JOIN history ON job.history_id = history.id
    LEFT OUTER JOIN job_to_output_dataset_collection AS job_to_output_dataset_collection_1 ON job.id = job_to_output_dataset_collection_1.job_id
    LEFT OUTER JOIN history_dataset_collection_association AS history_dataset_collection_association_1_deleted ON history_dataset_collection_association_1_deleted.id = job_to_output_dataset_collection_1.dataset_collection_id
    AND history_dataset_collection_association_1_deleted.deleted = true
    LEFT OUTER JOIN job_to_output_dataset AS job_to_output_dataset_1 ON job.id = job_to_output_dataset_1.job_id
    LEFT OUTER JOIN history_dataset_association AS history_dataset_association_1_deleted ON history_dataset_association_1_deleted.id = job_to_output_dataset_1.dataset_id
    AND history_dataset_association_1_deleted.deleted = true
    JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
    JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
    JOIN job_parameter AS job_parameter_3 ON job.id = job_parameter_3.job_id
    JOIN job_parameter AS job_parameter_4 ON job.id = job_parameter_4.job_id
    JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
    JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
    JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
    JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
    JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
    JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
WHERE
    job.tool_id = 'toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1'
    AND (
        job.user_id = 392010
        OR history.published = true
    )
    AND job.copied_from_job_id IS NULL
    AND job.tool_version = '2.2.1+galaxy1'
    AND job.state IN ('ok')
    AND job_to_output_dataset_collection_1.job_id IS NULL
    AND job_to_output_dataset_1.job_id IS NULL
    AND job.id = job_parameter_1.job_id
    AND job_parameter_1.name = 'reference_genome'
    AND job_parameter_1.value LIKE '{"__current_case__": 1, "history_item": {"values": [{"id": %, "src": "hda"}]}, "source": "history"}'
    AND job.id = job_parameter_2.job_id
    AND job_parameter_2.name = 'library'
    AND job_parameter_2.value LIKE '{"__current_case__": 0, "input_1": {"values": [{"id": %, "src": "hda"}]}, "rna_strandness": "", "type": "single"}'
    AND job.id = job_parameter_3.job_id
    AND job_parameter_3.name = 'sum'
    AND job_parameter_3.value = '{"new_summary": false, "summary_file": false}'
    AND job.id = job_parameter_4.job_id
    AND job_parameter_4.name = 'adv'
    AND job_parameter_4.value = '{"alignment_options": {"__current_case__": 0, "alignment_options_selector": "defaults"}, "input_options": {"__current_case__": 0, "input_options_selector": "defaults"}, "other_options": {"__current_case__": 0, "other_options_selector": "defaults"}, "output_options": {"__current_case__": 0, "output_options_selector": "defaults"}, "reporting_options": {"__current_case__": 0, "reporting_options_selector": "defaults"}, "sam_options": {"__current_case__": 0, "sam_options_selector": "defaults"}, "scoring_options": {"__current_case__": 0, "scoring_options_selector": "defaults"}, "spliced_options": {"__current_case__": 0, "spliced_options_selector": "defaults"}}'
    AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item')
    AND history_dataset_association_2.id = 152775960
    AND (
        (
            job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version)
            OR history_dataset_association_1.update_time < job.create_time
        )
        AND history_dataset_association_1.extension = history_dataset_association_2.extension
        AND history_dataset_association_1.name = history_dataset_association_2.name
        OR history_dataset_association_1.id IN (
            SELECT
                history_dataset_association.id
            FROM
                history_dataset_association,
                history_dataset_association_history AS history_dataset_association_history_1
            WHERE
                history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
                AND history_dataset_association_history_1.name = history_dataset_association_2.name
                AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
                AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
                AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
        )
    )
    AND (
        history_dataset_association_1.deleted = false
        OR history_dataset_association_2.deleted = false
    )
    AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1')
    AND history_dataset_association_4.id = 152726579
    AND (
        (
            job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version)
            OR history_dataset_association_3.update_time < job.create_time
        )
        AND history_dataset_association_3.extension = history_dataset_association_4.extension
        AND history_dataset_association_3.name = history_dataset_association_4.name
        OR history_dataset_association_3.id IN (
            SELECT
                history_dataset_association.id
            FROM
                history_dataset_association,
                history_dataset_association_history AS history_dataset_association_history_2
            WHERE
                history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
                AND history_dataset_association_history_2.name = history_dataset_association_4.name
                AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
                AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
                AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
        )
    )
    AND (
        history_dataset_association_3.deleted = false
        OR history_dataset_association_4.deleted = false
    )
GROUP BY
    job.id,
    job_to_input_dataset_1.dataset_id,
    job_to_input_dataset_2.dataset_id
ORDER BY
    job.id DESC;

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@mvdbeek
Member Author

mvdbeek commented May 21, 2025

It's fast but broken 😆 😅

@nsoranzo
Member

Delete any_output_dataset_collection_instances_deleted and any_output_dataset_deleted from lib/galaxy/model/__init__.py?

@mvdbeek
Member Author

mvdbeek commented May 21, 2025

Once it actually works! I'm afraid this only works if all outputs are deleted.
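
A minimal sketch of that failure mode, using a toy SQLite schema. Only the table names come from the queries above; the columns, the rows, and the NULL check on the deleted-side alias are assumptions for illustration. An outer-join anti-join keeps a job as soon as one of its outputs is not deleted, so it only rejects jobs whose outputs are all deleted, whereas NOT EXISTS rejects a job with any deleted output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE job (id INTEGER PRIMARY KEY);
    CREATE TABLE history_dataset_association (id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL);
    CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
    -- job 1: no deleted outputs, job 2: one of two deleted, job 3: all deleted
    INSERT INTO job VALUES (1), (2), (3);
    INSERT INTO history_dataset_association VALUES (10, 0), (11, 0), (20, 1), (21, 0), (30, 1), (31, 1);
    INSERT INTO job_to_output_dataset VALUES (1, 10), (1, 11), (2, 20), (2, 21), (3, 30), (3, 31);
    """
)

# Anti-join via outer join: a row survives whenever an output has no deleted match.
OUTER_JOIN_SQL = """
SELECT DISTINCT job.id
FROM job
LEFT OUTER JOIN job_to_output_dataset AS jtod ON job.id = jtod.job_id
LEFT OUTER JOIN history_dataset_association AS hda_deleted
    ON hda_deleted.id = jtod.dataset_id AND hda_deleted.deleted = 1
WHERE hda_deleted.id IS NULL
ORDER BY job.id
"""

# Correlated NOT EXISTS: the job is dropped if any output is deleted.
NOT_EXISTS_SQL = """
SELECT job.id
FROM job
WHERE NOT EXISTS (
    SELECT 1
    FROM job_to_output_dataset AS jtod, history_dataset_association AS hda
    WHERE jtod.job_id = job.id AND hda.id = jtod.dataset_id AND hda.deleted = 1
)
ORDER BY job.id
"""


def job_ids(sql):
    return [row[0] for row in conn.execute(sql)]


outer_join_ids = job_ids(OUTER_JOIN_SQL)
not_exists_ids = job_ids(NOT_EXISTS_SQL)
print(outer_join_ids)  # [1, 2]
print(not_exists_ids)  # [1]
```

Job 2 has one deleted and one live output: the outer-join form still returns it, the NOT EXISTS form does not.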

@mvdbeek
Member Author

mvdbeek commented May 21, 2025

I think this should be working now. The any-outputs-deleted check now runs only against the job ids returned by the inner query. Speedup is about the same. This is the SQL (from the test_workflow_rerun_with_use_cached_job test case):

SELECT 
  filtered_jobs_subquery.job_id, 
  filtered_jobs_subquery.input1_4, 
  filtered_jobs_subquery."queries_0|input2_5" 
FROM 
  (
    SELECT 
      job.id AS job_id, 
      job_to_input_dataset_1.dataset_id AS input1_4, 
      job_to_input_dataset_2.dataset_id AS "queries_0|input2_5" 
    FROM 
      job 
      JOIN history ON job.history_id = history.id 
      JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id 
      JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id 
      JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id 
      JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id 
      JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id 
      JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id 
      JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id 
      JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id 
    WHERE 
      job.tool_id = 'cat1' 
      AND (
        job.user_id = 1 
        OR history.published = true
      ) 
      AND job.copied_from_job_id IS NULL 
      AND job.tool_version = '1.0.0' 
      AND job.state IN ('ok') 
      AND job.id = job_parameter_1.job_id 
      AND job_parameter_1.name = 'input1' 
      AND job_parameter_1.value LIKE '{"values": [{"id": %, "src": "hda"}]}' 
      AND job.id = job_parameter_2.job_id 
      AND job_parameter_2.name = 'queries' 
      AND job_parameter_2.value LIKE '[{"__index__": 0, "input2": {"values": [{"id": %, "src": "hda"}]}}]' 
      AND job_to_input_dataset_1.name IN ('input1', 'input1') 
      AND history_dataset_association_2.id = 4 
      AND (
        (
          job_to_input_dataset_1.dataset_version IN (
            0, history_dataset_association_1.version
          ) 
          OR history_dataset_association_1.update_time < job.create_time
        ) 
        AND history_dataset_association_1.extension = history_dataset_association_2.extension 
        AND history_dataset_association_1.name = history_dataset_association_2.name 
        OR history_dataset_association_1.id IN (
          SELECT 
            history_dataset_association.id 
          FROM 
            history_dataset_association, 
            history_dataset_association_history AS history_dataset_association_history_1 
          WHERE 
            history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id 
            AND history_dataset_association_history_1.name = history_dataset_association_2.name 
            AND history_dataset_association_history_1.extension = history_dataset_association_2.extension 
            AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version 
            AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
        )
      ) 
      AND (
        history_dataset_association_1.deleted = false 
        OR history_dataset_association_2.deleted = false
      ) 
      AND job_to_input_dataset_2.name IN ('queries_0|input2', 'input2') 
      AND history_dataset_association_4.id = 5 
      AND (
        (
          job_to_input_dataset_2.dataset_version IN (
            0, history_dataset_association_3.version
          ) 
          OR history_dataset_association_3.update_time < job.create_time
        ) 
        AND history_dataset_association_3.extension = history_dataset_association_4.extension 
        AND history_dataset_association_3.name = history_dataset_association_4.name 
        OR history_dataset_association_3.id IN (
          SELECT 
            history_dataset_association.id 
          FROM 
            history_dataset_association, 
            history_dataset_association_history AS history_dataset_association_history_2 
          WHERE 
            history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id 
            AND history_dataset_association_history_2.name = history_dataset_association_4.name 
            AND history_dataset_association_history_2.extension = history_dataset_association_4.extension 
            AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version 
            AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
        )
      ) 
      AND (
        history_dataset_association_3.deleted = false 
        OR history_dataset_association_4.deleted = false
      ) 
    GROUP BY 
      job.id, 
      job_to_input_dataset_1.dataset_id, 
      job_to_input_dataset_2.dataset_id
  ) AS filtered_jobs_subquery 
WHERE 
  NOT (
    EXISTS (
      SELECT 
        * 
      FROM 
        job_to_output_dataset_collection, 
        history_dataset_collection_association 
      WHERE 
        job_to_output_dataset_collection.job_id = filtered_jobs_subquery.job_id 
        AND job_to_output_dataset_collection.dataset_collection_id = history_dataset_collection_association.id 
        AND history_dataset_collection_association.deleted = true
    )
  ) 
  AND NOT (
    EXISTS (
      SELECT 
        * 
      FROM 
        job_to_output_dataset, 
        history_dataset_association 
      WHERE 
        job_to_output_dataset.job_id = filtered_jobs_subquery.job_id 
        AND job_to_output_dataset.dataset_id = history_dataset_association.id 
        AND history_dataset_association.deleted = true
    )
  ) 
ORDER BY 
  filtered_jobs_subquery.job_id DESC
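
The shape of this query can be reproduced on a toy SQLite schema. This is a sketch under assumptions: the columns and rows are made up, and only the table names, the filtered_jobs_subquery alias, and the NOT EXISTS check mirror the SQL above. The inner query narrows the candidates first, and the deleted-output check is then correlated only to those job ids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE job (id INTEGER PRIMARY KEY, tool_id TEXT, state TEXT);
    CREATE TABLE history_dataset_association (id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL);
    CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
    INSERT INTO job VALUES
        (1, 'cat1', 'ok'), (2, 'cat1', 'ok'), (3, 'cat1', 'error'),
        (4, 'other_tool', 'ok'), (5, 'cat1', 'ok');
    INSERT INTO history_dataset_association VALUES
        (10, 0), (20, 1), (21, 0), (30, 0), (40, 0), (50, 0);
    INSERT INTO job_to_output_dataset VALUES
        (1, 10), (2, 20), (2, 21), (3, 30), (4, 40), (5, 50);
    """
)

TWO_STAGE_SQL = """
SELECT filtered_jobs_subquery.job_id
FROM (
    -- Stage 1: cheap, selective filters pick the candidate jobs.
    SELECT job.id AS job_id
    FROM job
    WHERE job.tool_id = ? AND job.state IN ('ok')
) AS filtered_jobs_subquery
WHERE NOT EXISTS (
    -- Stage 2: the deleted-output check is evaluated per candidate only.
    SELECT 1
    FROM job_to_output_dataset, history_dataset_association
    WHERE job_to_output_dataset.job_id = filtered_jobs_subquery.job_id
      AND job_to_output_dataset.dataset_id = history_dataset_association.id
      AND history_dataset_association.deleted = 1
)
ORDER BY filtered_jobs_subquery.job_id DESC
"""

cached_job_ids = [row[0] for row in conn.execute(TWO_STAGE_SQL, ("cat1",))]
print(cached_job_ids)  # [5, 1]
```

Jobs 3 and 4 never reach the NOT EXISTS check, and job 2 is rejected because one of its outputs is deleted.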

@mvdbeek
Member Author

mvdbeek commented May 22, 2025

Deployed this to main and it's working nicely.

@mvdbeek
Member Author

mvdbeek commented May 23, 2025

Managed to get this down to

    "Planning Time": 65.306,
    "Execution Time": 27.935

which is ~ 3500X speedup with https://gist.github.com/mvdbeek/2d4e235bfd9531de7c87de0f0365ffe6. It's gonna be a challenge to translate that back to sqlalchemy though 😅

@mvdbeek
Member Author

mvdbeek commented May 23, 2025

https://gist.githubusercontent.com/mvdbeek/becf93f9df6f3a764b878eae4f31fc3a/raw/2fead2a6fcc65d90c48c4c22d89fdd5b9323d741/optimized_input_subquery.sql is the final query: 35 ms planning + 35 ms execution, a 5356X speedup (for this particular query).
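
For reference, the speedup figures in this thread can be recomputed from the quoted timings. This is a sketch that assumes every timing is in milliseconds, that the original 374937.263 figure is the baseline throughout, and that the final "35 ms" values are exact:

```python
# Timings quoted in this thread, in milliseconds.
BASELINE_MS = 374937.263     # original query
OUTER_JOIN_MS = 19373.252    # first outer-join rewrite
GIST_PLANNING_MS = 65.306    # "Planning Time" reported on May 23
GIST_EXECUTION_MS = 27.935   # "Execution Time" reported on May 23
FINAL_PLANNING_MS = 35.0     # assumed exact, quoted as "35ms"
FINAL_EXECUTION_MS = 35.0    # assumed exact, quoted as "35 ms"


def speedup(baseline_ms, *parts_ms):
    """Ratio of the baseline time to the sum of the given time components."""
    return baseline_ms / sum(parts_ms)


outer_join_speedup = speedup(BASELINE_MS, OUTER_JOIN_MS)
gist_speedup = speedup(BASELINE_MS, GIST_PLANNING_MS, GIST_EXECUTION_MS)
final_speedup = speedup(BASELINE_MS, FINAL_PLANNING_MS, FINAL_EXECUTION_MS)

print(f"{outer_join_speedup:.1f}x")  # 19.4x
print(f"{gist_speedup:.0f}x")        # 4021x
print(f"{final_speedup:.0f}x")       # 5356x
```

The 5356X figure matches the baseline divided by 70 ms. Under the same assumptions, the intermediate gist works out closer to 4000X than the ~3500X estimate, which may have used a different rounding or baseline.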

@mvdbeek mvdbeek marked this pull request as draft May 23, 2025 12:50
@mvdbeek mvdbeek force-pushed the performance_fix_deleted_output_check branch from 081f63e to 5508a40 Compare May 23, 2025 15:06
@mvdbeek
Member Author

mvdbeek commented May 23, 2025

Well, you optimize one thing and you degrade another. I've been moving toward checking input equivalence first, before comparing jobs, which is a much more stringent way to filter candidate jobs. But that revealed how inefficient the HDCA / DCE equivalence search is.

I have another version that improves that in https://gist.github.com/mvdbeek/881f75162e0f457e8a66af112f06a6a9#file-ctes_to_prefilter_hdca-sql-L1-L56 ... again going to be tricky to turn this back into sqlalchemy

@mvdbeek mvdbeek force-pushed the performance_fix_deleted_output_check branch 7 times, most recently from e3b3751 to 6c4f905 Compare June 2, 2025 12:05
@mvdbeek
Member Author

mvdbeek commented Jun 2, 2025

I've deployed this in its current state on usegalaxy.org and the performance is pretty good:

Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,072 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (202.187 ms)
Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,519 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (99.797 ms)
Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,808 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (75.654 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,048 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (52.037 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,311 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (128.012 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,508 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (42.464 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,690 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (59.633 ms)
Jun 02 07:29:49 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:49,857 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (39.209 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,027 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (37.393 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,245 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (81.697 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,548 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (91.697 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,718 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (41.128 ms)
Jun 02 07:29:50 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:50,870 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (33.748 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,043 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (50.559 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,253 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (71.050 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,416 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (35.949 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,629 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (83.257 ms)
Jun 02 07:29:51 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:51,813 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (54.070 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,038 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] No equivalent jobs found (77.386 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,360 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (142.855 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,578 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (41.876 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,750 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (34.481 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,952 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (38.632 ms)
Jun 02 07:29:53 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:53,158 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (32.990 ms)
Jun 02 07:29:53 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:53,296 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (27.677 ms)
Jun 02 07:29:54 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:54,525 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] No equivalent jobs found (1059.227 ms)
Jun 02 07:29:55 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:55,034 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (349.786 ms)

Those are the workflow handler logs for the IWC atacseq workflow. multiqc has many collection inputs, so those lookups are a little slower, but I think this is still fine and a big improvement, and it's rare that you'd consume many independent collection inputs.
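
To summarize timings like these, something along the following lines could be used. It is only a sketch: the regular expression is inferred from the log lines above rather than from Galaxy's logging code, the summarize helper is hypothetical, and the sample is four of the lines quoted above.

```python
import re
import statistics

# Four lines copied from the log excerpt above.
SAMPLE_LOG = """\
Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,072 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (202.187 ms)
Jun 02 07:29:48 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:48,519 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] Found equivalent job (99.797 ms)
Jun 02 07:29:52 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:52,038 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] No equivalent jobs found (77.386 ms)
Jun 02 07:29:54 galaxy-main3 galaxyctl[1202979]: galaxy.managers.jobs INFO 2025-06-02 07:29:54,525 [pN:workflow_scheduler0,p:1202979,tN:WorkflowRequestMonitor.monitor_thread] No equivalent jobs found (1059.227 ms)
"""

# Assumed message format: "<outcome> (<duration> ms)" at the end of the line.
PATTERN = re.compile(
    r"(Found equivalent job|No equivalent jobs found) \((\d+(?:\.\d+)?) ms\)"
)


def summarize(log_text):
    """Split lookup durations (ms) into cache hits and cache misses."""
    hits, misses = [], []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue  # not a job cache lookup line
        duration_ms = float(match.group(2))
        if match.group(1).startswith("Found"):
            hits.append(duration_ms)
        else:
            misses.append(duration_ms)
    return hits, misses


hits, misses = summarize(SAMPLE_LOG)
all_ms = hits + misses
slowest_ms = max(all_ms)
median_ms = statistics.median(all_ms)
print(f"hits={len(hits)} misses={len(misses)} slowest={slowest_ms} ms")
```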

@ahmedhamidawan ahmedhamidawan modified the milestones: 25.1, 25.0 Jun 2, 2025
mvdbeek added 7 commits June 11, 2025 20:54
by turning the correlated subquery that checks for deleted job outputs
into an outer join.
Brings the query time down from 374937.263 ms to 19373.252 ms, so
roughly a 20-fold improvement. Still a little slow but more manageable.
The remainder can likely be improved by an additional compound index.

SQL Before:
```
SELECT
	job.id,
	job_to_input_dataset_1.dataset_id,
	job_to_input_dataset_2.dataset_id AS dataset_id_1
FROM
	job
	JOIN history ON job.history_id = history.id
	JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
	JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
	JOIN job_parameter AS job_parameter_3 ON job.id = job_parameter_3.job_id
	JOIN job_parameter AS job_parameter_4 ON job.id = job_parameter_4.job_id
	JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
	JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
	JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
	JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
	JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
	JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
WHERE
	job.tool_id = 'toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1'
	AND (
		job.user_id = 392010
		OR history.published = true
	)
	AND job.copied_from_job_id IS NULL
	AND job.tool_version = '2.2.1+galaxy1'
	AND job.state IN ('ok')
	AND (
		EXISTS (
			SELECT
				history_dataset_collection_association.id
			FROM
				history_dataset_collection_association,
				job_to_output_dataset_collection
			WHERE
				job.id = job_to_output_dataset_collection.job_id
				AND history_dataset_collection_association.id = job_to_output_dataset_collection.dataset_collection_id
				AND history_dataset_collection_association.deleted = true
		)
	) = false
	AND (
		EXISTS (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				job_to_output_dataset
			WHERE
				job.id = job_to_output_dataset.job_id
				AND history_dataset_association.id = job_to_output_dataset.dataset_id
				AND history_dataset_association.deleted = true
		)
	) = false
	AND job.id = job_parameter_1.job_id
	AND job_parameter_1.name = 'reference_genome'
	AND job_parameter_1.value LIKE '{"__current_case__": 1, "history_item": {"values": [{"id": %, "src": "hda"}]}, "source": "history"}'
	AND job.id = job_parameter_2.job_id
	AND job_parameter_2.name = 'library'
	AND job_parameter_2.value LIKE '{"__current_case__": 0, "input_1": {"values": [{"id": %, "src": "hda"}]}, "rna_strandness": "", "type": "single"}'
	AND job.id = job_parameter_3.job_id
	AND job_parameter_3.name = 'sum'
	AND job_parameter_3.value = '{"new_summary": false, "summary_file": false}'
	AND job.id = job_parameter_4.job_id
	AND job_parameter_4.name = 'adv'
	AND job_parameter_4.value = '{"alignment_options": {"__current_case__": 0, "alignment_options_selector": "defaults"}, "input_options": {"__current_case__": 0, "input_options_selector": "defaults"}, "other_options": {"__current_case__": 0, "other_options_selector": "defaults"}, "output_options": {"__current_case__": 0, "output_options_selector": "defaults"}, "reporting_options": {"__current_case__": 0, "reporting_options_selector": "defaults"}, "sam_options": {"__current_case__": 0, "sam_options_selector": "defaults"}, "scoring_options": {"__current_case__": 0, "scoring_options_selector": "defaults"}, "spliced_options": {"__current_case__": 0, "spliced_options_selector": "defaults"}}'
	AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item')
	AND history_dataset_association_2.id = 152775960
	AND (
		(
			job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version)
			OR history_dataset_association_1.update_time < job.create_time
		)
		AND history_dataset_association_1.extension = history_dataset_association_2.extension
		AND history_dataset_association_1.name = history_dataset_association_2.name
		OR history_dataset_association_1.id IN (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				history_dataset_association_history AS history_dataset_association_history_1
			WHERE
				history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
				AND history_dataset_association_history_1.name = history_dataset_association_2.name
				AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
				AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
				AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
		)
	)
	AND (
		history_dataset_association_1.deleted = false
		OR history_dataset_association_2.deleted = false
	)
	AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1')
	AND history_dataset_association_4.id = 152726579
	AND (
		(
			job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version)
			OR history_dataset_association_3.update_time < job.create_time
		)
		AND history_dataset_association_3.extension = history_dataset_association_4.extension
		AND history_dataset_association_3.name = history_dataset_association_4.name
		OR history_dataset_association_3.id IN (
			SELECT
				history_dataset_association.id
			FROM
				history_dataset_association,
				history_dataset_association_history AS history_dataset_association_history_2
			WHERE
				history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
				AND history_dataset_association_history_2.name = history_dataset_association_4.name
				AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
				AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
				AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
		)
	)
	AND (
		history_dataset_association_3.deleted = false
		OR history_dataset_association_4.deleted = false
	)
GROUP BY
	job.id,
	job_to_input_dataset_1.dataset_id,
	job_to_input_dataset_2.dataset_id
ORDER BY
	job.id DESC;
```

SQL After (aliases changed manually):
```
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
SELECT
    job.id,
    job_to_input_dataset_1.dataset_id,
    job_to_input_dataset_2.dataset_id AS dataset_id_1
FROM
    job
    JOIN history ON job.history_id = history.id
    LEFT OUTER JOIN job_to_output_dataset_collection AS job_to_output_dataset_collection_1 ON job.id = job_to_output_dataset_collection_1.job_id
    LEFT OUTER JOIN history_dataset_collection_association AS history_dataset_collection_association_1_deleted ON history_dataset_collection_association_1_deleted.id = job_to_output_dataset_collection_1.dataset_collection_id
    AND history_dataset_collection_association_1_deleted.deleted = true
    LEFT OUTER JOIN job_to_output_dataset AS job_to_output_dataset_1 ON job.id = job_to_output_dataset_1.job_id
    LEFT OUTER JOIN history_dataset_association AS history_dataset_association_1_deleted ON history_dataset_association_1_deleted.id = job_to_output_dataset_1.dataset_id
    AND history_dataset_association_1_deleted.deleted = true
    JOIN job_parameter AS job_parameter_1 ON job.id = job_parameter_1.job_id
    JOIN job_parameter AS job_parameter_2 ON job.id = job_parameter_2.job_id
    JOIN job_parameter AS job_parameter_3 ON job.id = job_parameter_3.job_id
    JOIN job_parameter AS job_parameter_4 ON job.id = job_parameter_4.job_id
    JOIN job_to_input_dataset AS job_to_input_dataset_1 ON job_to_input_dataset_1.job_id = job.id
    JOIN history_dataset_association AS history_dataset_association_1 ON job_to_input_dataset_1.dataset_id = history_dataset_association_1.id
    JOIN history_dataset_association AS history_dataset_association_2 ON history_dataset_association_2.dataset_id = history_dataset_association_1.dataset_id
    JOIN job_to_input_dataset AS job_to_input_dataset_2 ON job_to_input_dataset_2.job_id = job.id
    JOIN history_dataset_association AS history_dataset_association_3 ON job_to_input_dataset_2.dataset_id = history_dataset_association_3.id
    JOIN history_dataset_association AS history_dataset_association_4 ON history_dataset_association_4.dataset_id = history_dataset_association_3.dataset_id
WHERE
    job.tool_id = 'toolshed.g2.bx.psu.edu/repos/iuc/hisat2/hisat2/2.2.1+galaxy1'
    AND (
        job.user_id = 392010
        OR history.published = true
    )
    AND job.copied_from_job_id IS NULL
    AND job.tool_version = '2.2.1+galaxy1'
    AND job.state IN ('ok')
    AND job_to_output_dataset_collection_1.job_id IS NULL
    AND job_to_output_dataset_1.job_id IS NULL
    AND job.id = job_parameter_1.job_id
    AND job_parameter_1.name = 'reference_genome'
    AND job_parameter_1.value LIKE '{""__current_case__"": 1, ""history_item"": {""values"": [{""id"": %, ""src"": ""hda""}]}, ""source"": ""history""}'
    AND job.id = job_parameter_2.job_id
    AND job_parameter_2.name = 'library'
    AND job_parameter_2.value LIKE '{""__current_case__"": 0, ""input_1"": {""values"": [{""id"": %, ""src"": ""hda""}]}, ""rna_strandness"": """", ""type"": ""single""}'
    AND job.id = job_parameter_3.job_id
    AND job_parameter_3.name = 'sum'
    AND job_parameter_3.value = '{""new_summary"": false, ""summary_file"": false}'
    AND job.id = job_parameter_4.job_id
    AND job_parameter_4.name = 'adv'
    AND job_parameter_4.value = '{""alignment_options"": {""__current_case__"": 0, ""alignment_options_selector"": ""defaults""}, ""input_options"": {""__current_case__"": 0, ""input_options_selector"": ""defaults""}, ""other_options"": {""__current_case__"": 0, ""other_options_selector"": ""defaults""}, ""output_options"": {""__current_case__"": 0, ""output_options_selector"": ""defaults""}, ""reporting_options"": {""__current_case__"": 0, ""reporting_options_selector"": ""defaults""}, ""sam_options"": {""__current_case__"": 0, ""sam_options_selector"": ""defaults""}, ""scoring_options"": {""__current_case__"": 0, ""scoring_options_selector"": ""defaults""}, ""spliced_options"": {""__current_case__"": 0, ""spliced_options_selector"": ""defaults""}}'
    AND job_to_input_dataset_1.name IN ('reference_genome|history_item', 'history_item')
    AND history_dataset_association_2.id = 152775960
    AND (
        (
            job_to_input_dataset_1.dataset_version IN (0, history_dataset_association_1.version)
            OR history_dataset_association_1.update_time < job.create_time
        )
        AND history_dataset_association_1.extension = history_dataset_association_2.extension
        AND history_dataset_association_1.name = history_dataset_association_2.name
        OR history_dataset_association_1.id IN (
            SELECT
                history_dataset_association.id
            FROM
                history_dataset_association,
                history_dataset_association_history AS history_dataset_association_history_1
            WHERE
                history_dataset_association.id = history_dataset_association_history_1.history_dataset_association_id
                AND history_dataset_association_history_1.name = history_dataset_association_2.name
                AND history_dataset_association_history_1.extension = history_dataset_association_2.extension
                AND job_to_input_dataset_1.dataset_version = history_dataset_association_history_1.version
                AND history_dataset_association_history_1.metadata = history_dataset_association_2.metadata
        )
    )
    AND (
        history_dataset_association_1.deleted = false
        OR history_dataset_association_2.deleted = false
    )
    AND job_to_input_dataset_2.name IN ('input_1', 'library|input_1')
    AND history_dataset_association_4.id = 152726579
    AND (
        (
            job_to_input_dataset_2.dataset_version IN (0, history_dataset_association_3.version)
            OR history_dataset_association_3.update_time < job.create_time
        )
        AND history_dataset_association_3.extension = history_dataset_association_4.extension
        AND history_dataset_association_3.name = history_dataset_association_4.name
        OR history_dataset_association_3.id IN (
            SELECT
                history_dataset_association.id
            FROM
                history_dataset_association,
                history_dataset_association_history AS history_dataset_association_history_2
            WHERE
                history_dataset_association.id = history_dataset_association_history_2.history_dataset_association_id
                AND history_dataset_association_history_2.name = history_dataset_association_4.name
                AND history_dataset_association_history_2.extension = history_dataset_association_4.extension
                AND job_to_input_dataset_2.dataset_version = history_dataset_association_history_2.version
                AND history_dataset_association_history_2.metadata = history_dataset_association_4.metadata
        )
    )
    AND (
        history_dataset_association_3.deleted = false
        OR history_dataset_association_4.deleted = false
    )
GROUP BY
    job.id,
    job_to_input_dataset_1.dataset_id,
    job_to_input_dataset_2.dataset_id
ORDER BY
    job.id DESC;

```
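The general shape of this rewrite, replacing a correlated "has a deleted output" check with an outer join plus an `IS NULL` filter, can be tried on a toy database. The following is a minimal sketch only: the three-table schema, the short `hda` table name, and the `deleted_outputs` alias are hypothetical simplifications, it runs on SQLite rather than PostgreSQL, and it is not the SQL that Galaxy generates.

```python
import sqlite3

# Hypothetical, heavily simplified stand-ins for job, job_to_output_dataset and
# history_dataset_association (shortened to "hda"); not the real Galaxy schema.
SCHEMA = """
CREATE TABLE job (id INTEGER PRIMARY KEY);
CREATE TABLE hda (id INTEGER PRIMARY KEY, deleted BOOLEAN NOT NULL);
CREATE TABLE job_to_output_dataset (job_id INTEGER, dataset_id INTEGER);
INSERT INTO job (id) VALUES (1), (2), (3);
INSERT INTO hda (id, deleted) VALUES (10, 0), (11, 1), (12, 0), (13, 0);
-- job 1: one live output; job 2: one deleted and one live output; job 3: one live output
INSERT INTO job_to_output_dataset (job_id, dataset_id)
VALUES (1, 10), (2, 11), (2, 12), (3, 13);
"""

# Correlated form: the subquery is evaluated per candidate job row.
CORRELATED = """
SELECT job.id FROM job
WHERE NOT EXISTS (
    SELECT 1
    FROM job_to_output_dataset AS jtod
    JOIN hda ON hda.id = jtod.dataset_id
    WHERE jtod.job_id = job.id AND hda.deleted = 1
)
ORDER BY job.id DESC
"""

# Outer-join form: join against the set of jobs that have a deleted output and
# keep only the jobs for which no such row was found (an anti-join).
OUTER_JOIN = """
SELECT job.id FROM job
LEFT OUTER JOIN (
    SELECT DISTINCT jtod.job_id
    FROM job_to_output_dataset AS jtod
    JOIN hda ON hda.id = jtod.dataset_id
    WHERE hda.deleted = 1
) AS deleted_outputs ON deleted_outputs.job_id = job.id
WHERE deleted_outputs.job_id IS NULL
ORDER BY job.id DESC
"""


def job_ids(conn, sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
correlated_ids = job_ids(conn, CORRELATED)
outer_join_ids = job_ids(conn, OUTER_JOIN)
print(correlated_ids, outer_join_ids)
```

In this sketch both forms exclude job 2, the only job with a deleted output. Whether the planner handles one form better than the other depends on the database, the data, and the available indexes, so `EXPLAIN (ANALYZE, BUFFERS)` output on the real tables remains the deciding evidence.
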
Previously, the dataset_id comparison was applied late in queries
involving nested collections, leading to large intermediate result sets.

This change introduces CTEs to pre-filter the target dataset_ids on the
right-hand side of the comparison. By using IN (SELECT dataset_id FROM
cte_name), the database can prune non-matching rows earlier in the query
execution plan, significantly reducing execution time and resource
consumption.
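
The pre-filtering idea described in this commit message can be illustrated with a small, self-contained sketch. The schema, the `hda_id` column, the `target_datasets` CTE name, and the `requested_hda_id` parameter are all hypothetical, the example runs on SQLite rather than PostgreSQL, and it does not reproduce the query Galaxy builds for nested collections.

```python
import sqlite3

# Hypothetical miniature schema: several HDAs can point at the same dataset,
# and each job input references one HDA. Not the real Galaxy tables.
SCHEMA = """
CREATE TABLE hda (id INTEGER PRIMARY KEY, dataset_id INTEGER NOT NULL);
CREATE TABLE job_to_input_dataset (job_id INTEGER, hda_id INTEGER);
INSERT INTO hda (id, dataset_id) VALUES (100, 1), (101, 1), (102, 2), (103, 3);
INSERT INTO job_to_input_dataset (job_id, hda_id) VALUES (7, 100), (8, 102), (9, 101);
"""

# The CTE resolves the requested HDA to its dataset_id once, up front.
# The main query then filters job inputs with IN (SELECT dataset_id FROM cte).
QUERY = """
WITH target_datasets AS (
    SELECT dataset_id FROM hda WHERE id = :requested_hda_id
)
SELECT jtid.job_id
FROM job_to_input_dataset AS jtid
JOIN hda AS input_hda ON input_hda.id = jtid.hda_id
WHERE input_hda.dataset_id IN (SELECT dataset_id FROM target_datasets)
ORDER BY jtid.job_id
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# HDA 101 points at dataset 1, which is also the dataset behind HDA 100.
matching_jobs = [row[0] for row in conn.execute(QUERY, {"requested_hda_id": 101})]
print(matching_jobs)
```

Here jobs 7 and 9 match because their input HDAs share the dataset of the requested HDA, while job 8 is filtered out by the `IN` condition.
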
@mvdbeek mvdbeek force-pushed the performance_fix_deleted_output_check branch from 6c4f905 to 32914f2 Compare June 11, 2025 18:54
@mvdbeek mvdbeek marked this pull request as ready for review June 11, 2025 18:54

mvdbeek commented Jun 11, 2025

With the last added test I'm confident this works well. It has been on usegalaxy.org for the last 10 days or so.

@ahmedhamidawan ahmedhamidawan left a comment

This is remarkable! Thank you!

@ahmedhamidawan ahmedhamidawan merged commit 193affd into galaxyproject:release_25.0 Jun 11, 2025
52 of 56 checks passed

@nsoranzo nsoranzo deleted the performance_fix_deleted_output_check branch June 11, 2025 23:00