
Conversation

@ian-r-rose
Member

Follow-on work from #477

I've been looking at using the "microbatch" incremental strategy for our large incremental models, and think it would be a nice simplification to a lot of the patterns here (such as being able to delete make_model_incremental) as well as enabling easier backfills of missing data. More thoughts on this are in #477. In this PR I've been taking an initial look at implementation, and wanted to try to get some thoughts from @thehanggit before he leaves Caltrans.

For most of our incremental models converting to the microbatch strategy is pretty straightforward (see, e.g., here). But the outlier removal model required a bit more thought, because it always compares with the last seven days' data to identify outliers. I don't really like this approach for a few reasons:

  • It's not idempotent, in that if I run the model on different days, I'll get different results.
  • It won't work if a station has been decommissioned, or otherwise hasn't produced any data for the last week.
  • It doesn't really fit well into the microbatch execution model, where we want to be able to run different days' data independently of each other (this is more of a conceptual cleanliness issue than a technical one).

So, the bulk of this PR is actually doing some refactoring of some of the early data processing for the clearinghouse data. Specifically, I do the following:

  1. Compute detector statistics at regular (quarterly) intervals for use in outlier removal. This is similar to what we do for the regression coefficients. When identifying outliers, we use an asof join to get the most recent statistics for a detector (see the sketch after this list).
  2. Consolidate several of the data cleaning/preparation steps into a single model, including
    • Removing outliers
    • Filling out missing timestamps using the timestamp_spine macro that @JamesSLogan created
    • Attaching station metadata (still WIP)
  3. Move the computation of the "g factor" speed to after this data preparation step (still WIP). Previously this was done after outlier removal, but before the timestamp spine. Conceptually, it made more sense to me to compute this derived data column after the data prep, but perhaps there is some reason to do it before? @thehanggit I would appreciate any thoughts you have here.
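
For reference, the asof join in step 1 would look roughly like this (a sketch only; the table aliases and the calculated_date column name are assumptions, not the actual schema):

-- sketch: attach the most recent quarterly statistics to each five-minute row
select
    agg.detector_id,
    agg.sample_date,
    agg.volume_sum,
    thresholds.volume_95th,
    thresholds.occupancy_95th
from five_minute_agg as agg
asof join detector_outlier_thresholds as thresholds
    match_condition (agg.sample_date >= thresholds.calculated_date)
    on agg.detector_id = thresholds.detector_id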

1. Converts the missing_rows model to microbatch
2. Removes an unnecessary join to add station metadata to the model.
    This data already exists on the active_stations model, and can be
    brought along in the spine computation.
3. Removes an unnecessary join to the g_factor speed model. I'm pretty
   sure we can compute the g factor speed downstream of this model
   without any joins.
4. Removes unnecessary coalesce statements: the metadata we are
   coalescing comes from the same place. If there are any discrepancies,
   it would be due to a bad join, which we would want to fix instead of
   coalescing.
Member Author

This model is consolidated into int_vds__detector_agg_five_minutes_normalized

Comment on lines -53 to -64
-- calculate the statistics
weekly_stats as (
    select
        detector_id,
        avg(volume_sum) as volume_mean,
        stddev(volume_sum) as volume_stddev,
        -- consider using max_capacity
        percentile_cont(0.95) within group (order by volume_sum) as volume_95th,
        percentile_cont(0.95) within group (order by occupancy_avg) as occupancy_95th
    from filtered_five_minute_agg_lastweek
    group by detector_id
),
Member Author

This (and above) is what I've tried to re-implement in an idempotent way in int_diagnostic__detector_outlier_thresholds
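
A rough idea of the idempotent re-implementation: compute the same statistics per detector for each fixed quarterly window, keyed by the window start, so the result doesn't depend on when the model runs (a simplified sketch, not the actual model):

-- sketch: quarterly detector statistics, keyed by the quarter start date
quarterly_stats as (
    select
        detector_id,
        date_trunc('quarter', sample_date) as calculated_date,
        avg(volume_sum) as volume_mean,
        stddev(volume_sum) as volume_stddev,
        percentile_cont(0.95) within group (order by volume_sum) as volume_95th,
        percentile_cont(0.95) within group (order by occupancy_avg) as occupancy_95th
    from filtered_five_minute_agg
    group by detector_id, date_trunc('quarter', sample_date)
),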

),

-- impute detected outliers
outlier_removed_data as (
Member Author

This is consolidated into int_vds__detector_agg_five_minutes_normalized

Member Author

Nice example of the simple case for microbatch
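
For anyone following along, the simple case amounts to replacing the incremental boilerplate with a config block along these lines (a sketch; the column name, begin date, and upstream model name are placeholders, not the actual file):

{{
    config(
        materialized="incremental",
        incremental_strategy="microbatch",
        event_time="sample_date",
        batch_size="day",
        begin="2023-01-01",
    )
}}

-- dbt filters any ref() that has an event_time config down to the current
-- batch's window, so no is_incremental() or lookback logic is needed here
select *
from {{ ref("upstream_model") }}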

Contributor

@JamesSLogan left a comment

🙌 Very nice, @ian-r-rose! I don't see any glaring issues in the models.

Comment on lines 80 to 83
inner join good_detectors
    on
        agg.detector_id = good_detectors.detector_id
        and agg.sample_date = good_detectors.sample_date
Contributor

This doesn't necessarily need to be changed (unless performance is improved?), but doing this filter as WHERE EXISTS would show the query's intent more explicitly.

Member Author

Thanks for the suggestion, @JamesSLogan! I'm not familiar with this particular style of filter join; can you give an example?

Contributor

I think something like the below would perform the filter in a similar manner to the current join to good_detectors. The idea is that the database doesn't have to build a large intermediate result set from the join and can instead simply search good_detectors, stopping as soon as a match is found. This, of course, depends on how Snowflake optimizes the query...

select
    ...
from agg
inner join agg_dates_to_evaluate
    on
        agg.sample_date >= agg_dates_to_evaluate.agg_date
        and agg.sample_date
        < dateadd(day, {{ var("outlier_agg_time_window") }}, agg_dates_to_evaluate.agg_date)
where
    agg.station_type in ('ML', 'HV')
    and exists (
        select 1
        from good_detectors gd
        where
            gd.detector_id = agg.detector_id
            and gd.sample_date = agg.sample_date
    )

There could be a difference in the output rows if the agg -> good_detectors relationship is one-to-many, which I don't think is the case here.

Member Author

Interesting...would this be considered a correlated subquery? I've never quite felt smart enough for those...

I'm going to test this to see if I can find a difference in the query plan and performance, thanks!

Contributor

Yes, and I feel the same! They're typically harder to read but this case would better convey intent (slightly) and maybe even improve performance 🤞

Member Author

I took a look at this, and just documenting what I saw here:

There is almost no difference in performance between these two approaches: the query plans look almost identical, and both involve effectively the same join. This particular join is a small fraction of the total runtime of the query. There are two minor differences I noticed in the query plan:

  • First, the join type in the query plan for the correlated subquery is "semi" instead of "inner". From what I understand from searching around, a subquery used in this way is effectively synonymous with a "semi-join".
  • Second, there is a pre-aggregation in the correlated subquery to ensure there is a single detector_id and sample_date for the existence check. If I understand your comment correctly, this is what you were referring to above with the one->many join.

Here is a snapshot of the inner join query plan:

[screenshot: inner join query plan]

And here is the correlated subquery:

[screenshot: correlated subquery query plan]

Happy to take this suggestion to convey the intent of the join, thanks for the knowledge sharing @JamesSLogan!

@thehanggit
Contributor


@ian-r-rose Hi Ian, for the outlier detection data processing, it's a great idea to consolidate it into a single data processing model!

For the gspeed, I talked with Ken today, and the only concern we brought up is performance (please check PR #502). 1. In the 5-minute dataset with missing rows filled in, the table would be much larger than the dataset without them; I'm not sure whether that would cause a significant delay. 2. The dataset with missing rows generates a large number of NULL values in flow and occ. I checked the gfactor speed calculation and it should already exclude them, but it's still unclear how they would affect the gfactor result, so I would add a QC check to verify the difference. Other than that, there are no other reasons.
One thing worth mentioning: the outlier detection only generates a new column labeling rows as "observed outlier" or "observed data" for each detector; it doesn't filter out the rows classified as outliers. But it's pretty easy to remove them in downstream models with a WHERE clause (see the example below).
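
For example, something like this in a downstream model (the model and status column names here are just placeholders):

-- keep only rows that were not classified as outliers
select *
from {{ ref("detector_agg_with_outlier_labels") }}
where detector_status = 'observed data'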

@ian-r-rose
Member Author

My suspicion is that the extra cost of computing gspeed for the missing rows is less than the cost of joining two large tables. In general, joins are more expensive and less parallelizable than operations that can happen on a single row, and filling missing rows seems to mostly increase the table sizes by ~10%.

@thehanggit
Contributor

I agree with your claim. Please feel free to move the gspeed model after data prep.

incremental_model_look_back variables. We don't need them anymore as
they are replaced by microbatch configuration!
Comment on lines +48 to +49
config:
  event_time: SAMPLE_DATE
Member Author

This is pretty crucial! The raw view here is not an incremental model, but we still set an event_time config on it so that downstream microbatch models know how to filter on it. So, all microbatch models have an event_time config, and some non-microbatch ones need it as well.
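
Concretely, when a downstream microbatch model selects from this view, dbt wraps the ref in a per-batch filter on the configured event_time column, roughly like the following (illustrative only; the relation name is a placeholder and the dates are example batch bounds):

select * from (
    select *
    from analytics.staging.raw_clearinghouse_view  -- placeholder relation
    -- dbt injects this filter for each batch using the event_time column
    where sample_date >= '2025-06-01'
        and sample_date < '2025-06-02'
)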

ian-r-rose added 11 commits June 6, 2025 14:26
macros like get_snowflake_refresh_warehouse(). Disable
concurrent_batches because DELETE DML statements lock the table, so
batch concurrency is pretty useless unless Snowflake starts being more
partition-aware.
imputation layer. By organizing all of the models that look back in time
together, we can handle them specially during backfills.
@ian-r-rose marked this pull request as ready for review August 5, 2025 20:35