
feat: cleaner & more efficient direct insert logic #1125


Merged
bpblanken merged 17 commits into main from benb/cleaner_direct_insert_logic on Jul 11, 2025

Conversation

bpblanken
Collaborator

No description provided.

CREATE OR REPLACE TABLE {STAGING_CLICKHOUSE_DATABASE}._tmp_loadable_keys ENGINE = Set AS (
SELECT {clickhouse_table.key_field}
FROM {src_table} src
LEFT ANTI JOIN {dst_table} dst
Collaborator Author

  • The first pass at this used a more efficient mechanism for determining which variants to load (if the max variant id in the loading pipeline run is already present, don't load anything).

  • This didn't work for the data migrations, so we opted to just re-load all variants and let rocksdb compaction handle the de-duplication. This causes serious memory pressure under certain situations.

  • The proposed solution here duplicates the anti join that happens upstream in the pipeline so that it also happens here. This is potentially unnecessary in the long run, but it is theoretically correct behavior.

  • We need a temp table because the expected INSERT INTO table1 SELECT * FROM table2 LEFT ANTI JOIN table1 runs into trouble: table1 would be both joined against and inserted into in the same query. See the sketch below.
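
A minimal sketch of that pattern, assuming the placeholder names from the quoted snippet ({STAGING_CLICKHOUSE_DATABASE}, {src_table}, {dst_table}, {clickhouse_table.key_field}); the join condition and the follow-up INSERT are illustrative and may not match the PR's exact queries:

-- 1. Materialize the keys missing from the destination into a Set-engine table,
--    so the destination is not read and written in the same statement.
CREATE OR REPLACE TABLE {STAGING_CLICKHOUSE_DATABASE}._tmp_loadable_keys ENGINE = Set AS (
    SELECT {clickhouse_table.key_field}
    FROM {src_table} src
    LEFT ANTI JOIN {dst_table} dst
    ON src.{clickhouse_table.key_field} = dst.{clickhouse_table.key_field}
);

-- 2. Insert only the rows whose key landed in the Set; Set-engine tables are usable
--    on the right-hand side of IN.
INSERT INTO {dst_table}
SELECT *
FROM {src_table}
WHERE {clickhouse_table.key_field} IN {STAGING_CLICKHOUSE_DATABASE}._tmp_loadable_keys;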

Collaborator Author

Had a second thought that we might see memory pressure with the Set implementation, but validated that we should not:

CREATE TABLE T Engine=Set AS SELECT randomString(500) FROM numbers(20000000);

SELECT
    database,
    name,
    formatReadableSize(total_bytes)
FROM system.tables
WHERE engine IN ('Memory', 'Set', 'Join')
    AND name = 'T';

1. │ default  │ T                                              │ 1.50 GiB                 │

@bpblanken bpblanken marked this pull request as ready for review July 11, 2025 12:23
@bpblanken bpblanken requested a review from a team as a code owner July 11, 2025 12:23
@bpblanken bpblanken changed the title from "cleaner direct insert logic" to "feat: cleaner & more efficient direct insert logic" on Jul 11, 2025
drop_staging_db()
logged_query(
f"""
CREATE DATABASE {STAGING_CLICKHOUSE_DATABASE}
Collaborator

should this be a CREATE OR REPLACE?

Collaborator Author

Not in this case. This is the database, not a table, and one line above we delete the database.

Collaborator

I think I find it confusing that we CREATE a database and then on the next line CREATE OR REPLACE a table in that db - if the db is new every time, then shouldn't we always create the table and never replace it? And if the db can persist, shouldn't we fail gracefully if it does exist, which CREATE won't do?

Collaborator Author

👍 that's fair! I think the best course of action is removing the REPLACE on the table creation. It's sort of an artifact of the testing I've been doing, but I explicitly left it because the table name is hardcoded and there are scenarios where the staging environment doesn't get deleted if the pod is killed. If the staging environment is wiped, that won't be an issue.

I wasn't 100% sure, while implementing this, whether the staging environment should be used for this table or whether an entirely different "database" should be created.

Collaborator Author

Even better, there's no reason to use a hardcoded table name here. There's a utility already present in this file that gives a proper staging prefix, which includes the run id & dataset type.
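
A sketch of that final shape, assuming drop_staging_db() amounts to a DROP DATABASE IF EXISTS and using {staging_prefix} as a stand-in for whatever the existing prefix utility returns (illustrative names, not the PR's exact identifiers):

DROP DATABASE IF EXISTS {STAGING_CLICKHOUSE_DATABASE};
CREATE DATABASE {STAGING_CLICKHOUSE_DATABASE};

-- Plain CREATE, no OR REPLACE: the database is freshly created above, and the table
-- name carries the run id & dataset type via the staging prefix.
CREATE TABLE {STAGING_CLICKHOUSE_DATABASE}.{staging_prefix}_loadable_keys ENGINE = Set AS (
    SELECT {clickhouse_table.key_field}
    FROM {src_table} src
    LEFT ANTI JOIN {dst_table} dst
    ON src.{clickhouse_table.key_field} = dst.{clickhouse_table.key_field}
);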

@bpblanken bpblanken merged commit d4c75c9 into main Jul 11, 2025
4 checks passed
@bpblanken bpblanken deleted the benb/cleaner_direct_insert_logic branch July 11, 2025 19:50