Conversation

@alisman alisman commented Oct 8, 2025

No description provided.

@alisman alisman changed the base branch from master to demo-clickhouse-only-db October 8, 2025 21:08
JOIN sample_derived sd ON sd.internal_id = subquery.sample_id;


-- Adds primary key to the sample_cna_event table for Clickhouse-only
@sheridancbio what do you think about fixing the primary keys of the "slung" tables as part of derivation? We create a new table with the appropriate keys, copy the data into it, and then switch the tables out.
For the genetic_alteration table of the public portal, this takes 5 minutes.

The alternative is that, as you suggested, we put the table definitions in the sling process, which I don't love.
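The create-copy-swap idea above could look roughly like the following in ClickHouse. This is a hypothetical sketch, not code from the PR: the target table name, engine, and ORDER BY columns are illustrative assumptions about the schema.

```sql
-- Sketch of the derive-and-swap approach for one table.
-- Table and column names are illustrative, not the actual schema.

-- Create the derived table with the desired sorting key and copy the data in.
CREATE TABLE genetic_alteration_new
    ENGINE = MergeTree
    ORDER BY (genetic_profile_id, genetic_entity_id)
    AS SELECT * FROM genetic_alteration;

-- Atomically swap the two tables (requires the Atomic database engine),
-- then drop the old copy, which now lives under the _new name.
EXCHANGE TABLES genetic_alteration_new AND genetic_alteration;
DROP TABLE genetic_alteration_new;
```

Using `EXCHANGE TABLES` rather than a `RENAME` pair avoids a window in which neither table exists, so readers never see a missing table during the swap.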

@alisman alisman requested a review from sheridancbio October 8, 2025 21:13

@sheridancbio sheridancbio left a comment


preliminary comment:

This approach would work ... it basically "derives" tables with the desired primary key from the raw table with the default ordering. The main weakness of this approach is runtime efficiency: we would be waiting for the duplication of large tables ... in particular genetic_alteration. This would further slow down the table derivation stage of import. Also, there may be issues with doing a single "all at once" copy of genetic_alteration ... such as memory exhaustion when constructing the dataset to be stored. We should test this approach on a heavyweight database before adopting it.

There is a proposal underway (RFC100) which would (if adopted) replace the entire import process that feeds data through MySQL into ClickHouse. So it may be that we don't really need to overthink this derived-table SQL file if the process will (sometime in 2026) be based on Parquet files generated locally and uploaded directly into ClickHouse. So perhaps any workable interim solution would be acceptable, as long as the added overhead doesn't make the overnight cycle of imports impractical for our databases.

One detail to look into would be alternatives to the format of the "INSERT INTO ..." statements. The subordinate SELECT-statement approach might use more memory than a different format that would also copy over all the data. I think a statement like INSERT INTO TABLE A TABLE B provides this functionality in MySQL. Maybe something similar is available in ClickHouse and would be more performant.
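For what it's worth, ClickHouse executes `INSERT ... SELECT` as a streaming pipeline in blocks rather than materializing the whole result set, and the copy can be tuned per query. A hedged sketch (table names are illustrative; the settings shown are real ClickHouse query settings, but the values are guesses to be benchmarked):

```sql
-- Illustrative per-query tuning of a large table copy in ClickHouse.
INSERT INTO genetic_alteration_new
SELECT * FROM genetic_alteration
SETTINGS
    max_insert_threads = 8,          -- parallelize the insert pipeline
    max_memory_usage = 10000000000;  -- cap this query's memory (~10 GB)
```

Whether this beats the plain statement would need to be measured on a heavyweight database, as suggested above.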

alisman commented Oct 8, 2025

@sheridancbio agreed, it would be better not to do this. For the public database the table is 65 GB, and this takes only five minutes.

@dippindots dippindots marked this pull request as draft October 14, 2025 15:14