Add statements to clickhouse.sql that fix primary keys of underlying MySQL tables #11738
base: demo-clickhouse-only-db
Conversation
```sql
JOIN sample_derived sd ON sd.internal_id = subquery.sample_id;

--Adds primary key to the sample_cna_event table for Clickhouse-only
```
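For reference, a statement of this shape would add such a key in MySQL; the key columns below are an assumption based on the general cBioPortal schema, not taken from this diff:

```sql
-- Hypothetical sketch; the actual key columns come from the PR, not from here.
ALTER TABLE sample_cna_event
  ADD PRIMARY KEY (CNA_EVENT_ID, SAMPLE_ID, GENETIC_PROFILE_ID);
```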
@sheridancbio what do you think about fixing the primary keys of the "slung" tables as part of derivation? We create a new table with the appropriate keys, copy the data into it, and then switch the tables out (see the sketch below).
For the genetic_alteration table of the public portal, this takes 5 minutes.
The alternative is that, as you suggested, we put the table definitions in the sling process, which I don't love.
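For illustration, a minimal sketch of the create/copy/swap pattern in Clickhouse could look like this. The table name comes from the discussion, but the columns, types, and ordering key are illustrative assumptions, and EXCHANGE TABLES requires the Atomic database engine:

```sql
-- Sketch of the create/copy/swap approach; columns and key are assumptions.
-- 1. Create a replacement table with the desired primary key / ordering.
CREATE TABLE genetic_alteration_new
(
    genetic_profile_id Int32,
    genetic_entity_id  Int32,
    `values`           String
)
ENGINE = MergeTree
ORDER BY (genetic_profile_id, genetic_entity_id);

-- 2. Copy the data out of the raw, default-ordered table.
INSERT INTO genetic_alteration_new
SELECT genetic_profile_id, genetic_entity_id, `values`
FROM genetic_alteration;

-- 3. Atomically switch the tables out, then drop the old copy.
EXCHANGE TABLES genetic_alteration AND genetic_alteration_new;
DROP TABLE genetic_alteration_new;
```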
preliminary comment:
This approach would work: it basically "derives" tables with the desired primary key from the raw table with the default ordering. The main weakness of this approach is runtime efficiency; we would be waiting for the duplication of large tables, in particular genetic_alteration, which would further slow down the table derivation stage of import. There may also be issues with doing a single "all at once" copy of genetic_alteration, such as memory exhaustion when constructing the dataset to be stored. We should test this approach on a heavyweight database before adopting it.
There is a proposal underway (RFC100) which would, if adopted, replace the entire import process that feeds data through mysql into clickhouse. So we may not need to overthink this derived-table SQL file if the process will (sometime in 2026) be based on parquet files generated locally and uploaded directly into clickhouse. Any workable interim solution would be acceptable, provided the current approach's overhead still leaves the overnight cycle of imports practical for our databases.
One detail to look into would be alternatives to the format of the "INSERT INTO ..." statements. The subordinate SELECT statement approach might use more memory than a different format that would also copy over all the data. I think a statement like INSERT INTO a TABLE b provides this functionality in MySQL; something similar may be available in Clickhouse and could be more performant (sketched below).
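For comparison, the two formats would be as follows; this is only a sketch — the TABLE form exists in MySQL 8.0.19+, while Clickhouse, as far as I know, offers only the standard INSERT ... SELECT, so any performance difference would have to be measured:

```sql
-- MySQL 8.0.19+ shorthand for copying an entire table:
INSERT INTO a TABLE b;

-- Clickhouse equivalent (standard INSERT ... SELECT):
INSERT INTO a SELECT * FROM b;
```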
@sheridancbio agreed, it would be better not to do this. For the public database the table is 65 GB, and this takes only five minutes.
No description provided.