@mikix mikix commented May 21, 2024

In practice, for most of our big raw tables, enabling liquid clustering means clustering on a single column: id. In that case Delta Lake falls back to a very simple ZORDER algorithm (keeping the data files ordered by id, so that lookups by id are easy), which is still good, just not as sleek and cool as the full "liquid clustering" algorithm.
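
As a rough illustration of why ordering data files by id helps lookups, here is a toy model in plain Python. It is only a sketch under stated assumptions: the `FileStats` type, the helper names, and the file sizes are made up for this example, and real Delta Lake data skipping relies on per-file statistics in the transaction log rather than anything like this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Min/max of the id column for one hypothetical data file."""

    min_id: int
    max_id: int


def stats_for(files: list[list[int]]) -> list[FileStats]:
    """Compute per-file min/max statistics for the id column."""
    return [FileStats(min(ids), max(ids)) for ids in files]


def files_to_scan(stats: list[FileStats], target: int) -> int:
    """Count the files whose [min, max] range could contain the target id."""
    return sum(1 for s in stats if s.min_id <= target <= s.max_id)


ids = list(range(1000))
file_count = 10

# Ordered layout: each file holds one contiguous run of 100 ids.
ordered = [ids[i * 100:(i + 1) * 100] for i in range(file_count)]

# Interleaved layout: ids are striped round-robin across the files,
# so every file spans almost the whole id range.
interleaved = [ids[i::file_count] for i in range(file_count)]

ordered_hits = files_to_scan(stats_for(ordered), 437)
interleaved_hits = files_to_scan(stats_for(interleaved), 437)
print(ordered_hits, interleaved_hits)  # prints "1 10"
```

With the ordered layout a lookup for one id only has to open a single file, while the interleaved layout cannot rule out any file from its statistics alone.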

Even so, enabling clustering is worth doing for two reasons:

  • This new paradigm is apparently the future of Delta Lake optimization, replacing normal ZORDER and partitioning. And you can't convert an existing table to it. So if we start creating clustered tables now, even if it's secretly just a ZORDER, we can benefit from future improvements to the algorithm and/or add more clustering columns if we want to.
  • There are a few tables that do cluster by multiple columns and thus benefit from the cool liquid clustering logic - namely the completion tables. Those tables don't get crazy large anyway, but still - one of them scales with the number of encounters, so that's not chump change.
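
For reference, a clustered table is declared with a `CLUSTER BY` clause at creation time, and the clustering columns can later be changed with `ALTER TABLE ... CLUSTER BY`. The sketch below only builds those SQL strings; the helper names, table names, and column types are hypothetical and are not taken from the ETL code. In a real session the strings would be passed to something like `spark.sql(...)`.

```python
def clustered_table_ddl(table: str, columns: dict[str, str], cluster_by: list[str]) -> str:
    """Build a CREATE TABLE statement for a Delta table with liquid clustering.

    Hypothetical helper for illustration only.
    """
    if not cluster_by:
        raise ValueError("at least one clustering column is required")
    unknown = [name for name in cluster_by if name not in columns]
    if unknown:
        raise ValueError(f"clustering columns not in schema: {unknown}")
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    cluster_sql = ", ".join(cluster_by)
    return f"CREATE TABLE {table} ({column_sql}) USING DELTA CLUSTER BY ({cluster_sql})"


def alter_clustering_ddl(table: str, cluster_by: list[str]) -> str:
    """Build an ALTER TABLE statement that changes the clustering columns."""
    if not cluster_by:
        raise ValueError("at least one clustering column is required")
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(cluster_by)})"


# Single-column case, as for the big raw tables (column types are assumed).
single = clustered_table_ddl("patient", {"id": "STRING", "gender": "STRING"}, ["id"])

# Multi-column case with made-up column names.
multi = clustered_table_ddl(
    "example_table",
    {"col_a": "STRING", "col_b": "STRING", "col_c": "INT"},
    ["col_a", "col_b"],
)

# Adding a clustering column later on an already-clustered table.
widened = alter_clustering_ddl("patient", ["id", "gender"])

print(single)
print(multi)
print(widened)
```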

Delta Lake Protocol Notes:

  • This bumps the Delta Lake writer protocol to 7 but keeps the reader protocol at 1, which means Athena can keep reading these tables (I tested that). It does mean that very old versions of the ETL might not be able to write to these tables. I'm not worried about this, for three reasons: we don't expect multiple ETL installations to point at the same delta lakes often (usually just one engineer is doing that), a clustered table has to be created first before this becomes a problem, and you'd need a truly ancient ETL.
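
The reader/writer split above can be sketched as a small compatibility check. This is a simplified model, not Delta Lake's actual implementation: the `TableProtocol` and `Client` types are hypothetical, the client version numbers are invented for illustration, and the feature names (`clustering`, `domainMetadata`) are my reading of the Delta protocol spec and should be treated as an assumption. Reader features are ignored because the reader version stays at 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableProtocol:
    """Protocol requirements recorded on a table (simplified)."""

    min_reader_version: int
    min_writer_version: int
    writer_features: frozenset[str] = frozenset()


@dataclass(frozen=True)
class Client:
    """What a hypothetical client is able to handle."""

    max_reader_version: int
    max_writer_version: int
    writer_features: frozenset[str] = frozenset()


def can_read(client: Client, table: TableProtocol) -> bool:
    """Reading only depends on the reader protocol version."""
    return client.max_reader_version >= table.min_reader_version


def can_write(client: Client, table: TableProtocol) -> bool:
    """Writing needs the reader version, the writer version, and every writer feature."""
    if not can_read(client, table):
        return False
    if client.max_writer_version < table.min_writer_version:
        return False
    return table.writer_features <= client.writer_features


# A clustered table: reader protocol 1, writer protocol 7 (from the notes above).
clustered = TableProtocol(1, 7, frozenset({"clustering", "domainMetadata"}))

# Illustrative clients; these version numbers are assumptions, not measured values.
read_only_engine = Client(max_reader_version=1, max_writer_version=0)
ancient_etl = Client(max_reader_version=1, max_writer_version=2)
current_etl = Client(3, 7, frozenset({"clustering", "domainMetadata"}))

print(can_read(read_only_engine, clustered))  # True
print(can_write(ancient_etl, clustered))      # False
print(can_write(current_etl, clustered))      # True
```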

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@mikix mikix force-pushed the mikix/clustering branch from b13924e to 1788dd8 Compare May 21, 2024 18:35
@mikix mikix force-pushed the mikix/clustering branch from 1788dd8 to b55a263 Compare May 30, 2024 20:27
@mikix mikix force-pushed the mikix/clustering branch 2 times, most recently from 79cc5be to 9a61bec Compare October 3, 2024 13:32
  "ctakesclient >= 5.1, < 6",
  "cumulus-fhir-support >= 1.2, < 2",
- "delta-spark >= 3, < 4",
+ "delta-spark >= 3.2.1, < 4",

3.2.1 had a specific fix for "clustering by a single column" that we were waiting for, since that's our primary use case here

@mikix mikix force-pushed the mikix/clustering branch from 9a61bec to 1194196 Compare October 3, 2024 13:44
@mikix mikix changed the title WIP: enable liquid clustering for delta lakes Enable liquid clustering for delta lakes Oct 3, 2024
@mikix mikix marked this pull request as ready for review October 3, 2024 13:45
github-actions bot commented Oct 3, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines   Covered   Coverage   Threshold   Status
3448    3386      98%        98%         🟢

New Files

No new covered files...

Modified Files

File                               Coverage   Status
cumulus_etl/__init__.py            100%       🟢
cumulus_etl/formats/deltalake.py   100%       🟢
TOTAL                              100%       🟢

updated for commit: 1194196 by action🐍

@mikix mikix merged commit 6c62a53 into main Oct 3, 2024
3 checks passed
@mikix mikix deleted the mikix/clustering branch October 3, 2024 19:34