Enable liquid clustering for delta lakes #316

mikix · 2024-05-21T18:28:01Z

In practice, for most of our big raw tables, this means clustering on a single column: id. In that case, Delta Lake falls back to a very simple ZORDER algorithm (making sure that the data files are ordered by id, so that lookups by id are easy) -- which is still good, just not as sleek and cool as the full "liquid clustering" algorithm.
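To illustrate why ordering data files by id helps lookups, here is a minimal pure-Python sketch of file-level data skipping. It is not Delta Lake's implementation: `DataFile`, `write_files`, and `files_to_scan` are hypothetical names made up for this example, and the only assumption is that each file carries min/max statistics for the id column.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for one data file plus its min/max id statistics."""

    name: str
    min_id: str
    max_id: str


def write_files(ids: list[str], rows_per_file: int, *, ordered: bool) -> list[DataFile]:
    """Split ids into fixed-size files, optionally sorting by id first."""
    if ordered:
        ids = sorted(ids)
    files = []
    for start in range(0, len(ids), rows_per_file):
        chunk = ids[start:start + rows_per_file]
        files.append(DataFile(f"part-{len(files)}", min(chunk), max(chunk)))
    return files


def files_to_scan(files: list[DataFile], wanted_id: str) -> list[str]:
    """Keep only the files whose [min_id, max_id] range could contain wanted_id."""
    return [f.name for f in files if f.min_id <= wanted_id <= f.max_id]


ids = ["g", "a", "e", "c", "h", "b", "f", "d"]
unordered = write_files(ids, 2, ordered=False)
ordered = write_files(ids, 2, ordered=True)

# Without ordering, every file's id range overlaps "d", so nothing can be skipped.
print(files_to_scan(unordered, "d"))
# With files ordered by id, only one file can contain "d".
print(files_to_scan(ordered, "d"))
```

With ordered files the id ranges do not overlap, so a lookup by id only has to open the one file whose range covers it.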

But it's still good to do for two reasons:

- This new paradigm is apparently the future of Delta Lake optimization, replacing normal ZORDER and partitioning. And you can't convert an existing table to it. So if we start creating clustered tables now, even if it's secretly just a ZORDER, we can benefit from future improvements to the algorithm and/or add more clustering columns if we want to.
- There are a few tables that do cluster by multiple columns and thus benefit from the cool liquid clustering logic - namely the completion tables. Those tables don't get crazy large anyway, but still - one of them scales with the number of encounters, so that's not chump change.
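For reference, clustering has to be declared when a table is created, which in Delta Lake SQL is done with a `CLUSTER BY` clause. The sketch below only builds that statement as a string; it is not the code in this PR. The helper name, table names, and column names (other than id) are hypothetical, and actually running the statement would need a Spark session configured for Delta Lake.

```python
def create_clustered_table_sql(table: str, columns: dict[str, str], cluster_by: list[str]) -> str:
    """Build a Delta Lake CREATE TABLE statement with a CLUSTER BY clause.

    Hypothetical helper for illustration; names are not validated or escaped.
    """
    unknown = [name for name in cluster_by if name not in columns]
    if unknown:
        raise ValueError(f"Cannot cluster by undefined columns: {unknown}")
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    cluster_sql = ", ".join(cluster_by)
    return f"CREATE TABLE {table} ({column_sql}) USING DELTA CLUSTER BY ({cluster_sql})"


# Single clustering column, like the big raw tables (table name is made up).
single = create_clustered_table_sql(
    "raw_table", {"id": "STRING", "payload": "STRING"}, ["id"]
)

# Multiple clustering columns (made-up table and column names).
multi = create_clustered_table_sql(
    "example_table", {"col_a": "STRING", "col_b": "STRING"}, ["col_a", "col_b"]
)

print(single)
print(multi)
```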

Delta Lake Protocol Notes:

This bumps the Delta Lake writer protocol to 7 but keeps the reader protocol at 1, which means Athena can keep reading these tables (I tested that). But it does mean that very old versions of the ETL might not be able to write to these tables. I'm not worried about this, for three reasons: we don't expect multiple ETL installations to be pointing at the same delta lakes often (usually just one engineer is doing that), you have to create a table first before this becomes a problem, and you'd need a truly ancient ETL to hit it.
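The reader/writer split above can be sketched as a simple version comparison. This is a simplified model, not the full protocol: at writer version 7 a real client must also support each table feature the table lists, which is left out here. The class names and the client version numbers are hypothetical; only the table's reader 1 / writer 7 values come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableProtocol:
    """The minimum protocol versions a table demands of its clients."""

    min_reader_version: int
    min_writer_version: int


@dataclass(frozen=True)
class Client:
    """Hypothetical client described by the highest versions it supports."""

    name: str
    max_reader_version: int
    max_writer_version: int


def can_read(client: Client, table: TableProtocol) -> bool:
    """A client can read if it supports at least the table's reader version."""
    return client.max_reader_version >= table.min_reader_version


def can_write(client: Client, table: TableProtocol) -> bool:
    """A client can write if it supports at least the table's writer version."""
    return client.max_writer_version >= table.min_writer_version


# Protocol of a newly created clustered table, per the notes above.
clustered_table = TableProtocol(min_reader_version=1, min_writer_version=7)

# Made-up capabilities, for illustration only.
old_client = Client("old-etl", max_reader_version=1, max_writer_version=2)
new_client = Client("new-etl", max_reader_version=3, max_writer_version=7)

for client in (old_client, new_client):
    print(client.name, can_read(client, clustered_table), can_write(client, clustered_table))
```

In this model the old client can still read the table but can no longer write to it, which matches the trade-off described above.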

Checklist

- Consider if documentation (like in docs/) needs to be updated
- Consider if tests should be added

mikix · 2024-10-03T13:38:15Z

pyproject.toml

    "ctakesclient >= 5.1, < 6",
    "cumulus-fhir-support >= 1.2, < 2",
-    "delta-spark >= 3, < 4",
+    "delta-spark >= 3.2.1, < 4",


delta-spark 3.2.1 had a specific fix for "clustering by a single column" that we were waiting for, since that's our primary use case here.

github-actions · 2024-10-03T14:02:32Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
3448	3386	98%	98%	🟢

New Files

No new covered files...

Modified Files

File	Coverage	Status
cumulus_etl/__init__.py	100%	🟢
cumulus_etl/formats/deltalake.py	100%	🟢
TOTAL	100%	🟢

updated for commit: 1194196 by action🐍

mikix force-pushed the mikix/clustering branch from b13924e to 1788dd8 Compare May 21, 2024 18:35

mikix force-pushed the mikix/clustering branch from 1788dd8 to b55a263 Compare May 30, 2024 20:27

mikix force-pushed the mikix/clustering branch 2 times, most recently from 79cc5be to 9a61bec Compare October 3, 2024 13:32

deltalake: enable liquid clustering for new tables

1194196

mikix force-pushed the mikix/clustering branch from 9a61bec to 1194196 Compare October 3, 2024 13:44

mikix changed the title ~~WIP: enable liquid clustering for delta lakes~~ Enable liquid clustering for delta lakes Oct 3, 2024

mikix marked this pull request as ready for review October 3, 2024 13:45

dogversioning approved these changes Oct 3, 2024

mikix merged commit 6c62a53 into main Oct 3, 2024
3 checks passed

mikix deleted the mikix/clustering branch October 3, 2024 19:34
