Conversation

@alisman alisman commented Oct 8, 2025

No description provided.

@alisman alisman changed the base branch from master to demo-clickhouse-only-db October 8, 2025 21:08
JOIN sample_derived sd ON sd.internal_id = subquery.sample_id;


-- Adds primary key to the sample_cna_event table for Clickhouse-only
@sheridancbio what do you think about fixing the primary keys of the "slung" tables as part of derivation? We create a new table with the appropriate keys, copy the data into it, and then switch the tables out.
For the genetic_alteration table of the public portal, this takes 5 minutes.

The alternative is that, as you suggested, we put the table definitions in the sling process, which I don't love.
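The create-copy-swap idea above could look roughly like the following in ClickHouse. This is a hypothetical sketch, not code from the PR: the target table name, engine, and ORDER BY columns are illustrative assumptions about the schema.

```sql
-- Sketch of the derive-and-swap approach for one table.
-- Table and column names are illustrative, not the actual schema.

-- Create the derived table with the desired sorting key and copy the data in.
CREATE TABLE genetic_alteration_new
    ENGINE = MergeTree
    ORDER BY (genetic_profile_id, genetic_entity_id)
    AS SELECT * FROM genetic_alteration;

-- Atomically swap the two tables (requires the Atomic database engine),
-- then drop the old copy, which now lives under the _new name.
EXCHANGE TABLES genetic_alteration_new AND genetic_alteration;
DROP TABLE genetic_alteration_new;
```

Using `EXCHANGE TABLES` rather than a `RENAME` pair avoids a window in which neither table exists, so readers never see a missing table during the swap.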

@alisman alisman requested a review from sheridancbio October 8, 2025 21:13

@sheridancbio sheridancbio left a comment


preliminary comment:

This approach would work ... it basically "derives" tables with the desired primary key from the raw table with the default ordering. The main weakness of this approach is runtime efficiency: we would be waiting for the duplication of large tables ... in particular genetic_alteration. This would further slow down the table derivation stage of import. Also, there may be issues with doing a single "all at once" copy of genetic_alteration ... such as memory exhaustion when constructing the dataset to be stored. We should test this approach on a heavyweight database before adopting it.

There is a proposal underway (RFC100) which would (if adopted) replace the entire import process that feeds data through MySQL into ClickHouse. So it may be that we don't really need to overthink this derived-table SQL file if the process will (sometime in 2026) be based on Parquet files generated locally and uploaded directly into ClickHouse. So perhaps any workable interim solution would be acceptable, as long as the added overhead doesn't make the overnight cycle of imports impractical for our databases.

One detail to look into would be alternatives to the format of the "INSERT INTO ..." statements. The subordinate SELECT-statement approach might use more memory than a different format that would also copy over all the data. I think a statement like INSERT INTO TABLE A TABLE B provides this functionality in MySQL. Maybe something similar is available in ClickHouse and would be more performant.
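For what it's worth, ClickHouse executes `INSERT ... SELECT` as a streaming pipeline in blocks rather than materializing the whole result set, and the copy can be tuned per query. A hedged sketch (table names are illustrative; the settings shown are real ClickHouse query settings, but the values are guesses to be benchmarked):

```sql
-- Illustrative per-query tuning of a large table copy in ClickHouse.
INSERT INTO genetic_alteration_new
SELECT * FROM genetic_alteration
SETTINGS
    max_insert_threads = 8,          -- parallelize the insert pipeline
    max_memory_usage = 10000000000;  -- cap this query's memory (~10 GB)
```

Whether this beats the plain statement would need to be measured on a heavyweight database, as suggested above.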

alisman commented Oct 8, 2025

@sheridancbio agreed, it would be better not to do this. For the public database the table is 65 GB, and this takes only five minutes.

@dippindots dippindots marked this pull request as draft October 14, 2025 15:14