This repository was archived by the owner on Mar 24, 2023. It is now read-only.
dbt test + alert on streaming data #6
jwills started this conversation in Show and tell
Replies: 1 comment, 1 reply
-
Thanks again for sharing your project and going as far as presenting it, @jwills! This hits too close to home. 📟 Regarding running something like …
-
The repo for my hackday exercise is here: https://github.com/jwills/mz-hack-day-2022
Mostly copy-pasting from the Slack message where I described what I wanted to do:
"So one of the first things we do in our daily dbt run for our DWH at WeaveGrid is the standard staging work of renaming columns, cleaning up types, etc. followed by a bunch of sanity check dbt tests on that lightly processed staging data to ensure that there aren't any red flags in there that would mess up downstream table materializations. In my dream world, I would like to move that staging/testing work upstream- out of Snowflake and into dbt+materialize- so that I could run those tests "continuously" (read: with a cron job that executed like every 15 minutes-to-an-hour) to catch upstream data quality issues earlier, during business hours, and not after midnight UTC when a *&!^?# data quality issue is going to ruin my evening.
My plan is to write some basic tests for the hackday project, verify they work correctly in dbt+materialize, and then wire up a cron-ish thing in docker-compose (not sure how to do that, so open to suggestions here) that would fire an alert at me if it detects a data quality issue (which I will then deliberately introduce upstream in the opensky data). Bonus points here would be if I could be streaming the cleaned/staged data I'm generating via dbt-materialize back into Redpanda so that it can propagate to my DWH directly and (even better/super-bonus points) if I could stop the data from streaming into Redpanda if/when the dbt tests fire and detect a data quality issue."
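The staging-and-tests setup described at the top of that message translates fairly directly into dbt files. Here is a minimal sketch, assuming a hypothetical `opensky.flights` source and made-up column names (nothing below is copied from the actual repo):

```sql
-- models/staging/stg_flights.sql: the "rename columns, clean up types" step.
-- Under dbt-materialize this compiles to a view that Materialize keeps
-- incrementally up to date, rather than a nightly batch table.
select
    icao24                              as aircraft_id,
    to_timestamp(time_position)         as position_at,
    cast(longitude as double precision) as longitude,
    cast(latitude  as double precision) as latitude
from {{ source('opensky', 'flights') }}
where icao24 is not null

-- tests/assert_coordinates_in_range.sql: a singular dbt test. Any rows this
-- query returns count as failures when `dbt test` runs.
select *
from {{ ref('stg_flights') }}
where longitude not between -180 and 180
   or latitude  not between -90 and 90
```

Running `dbt test` from a scheduled job and treating a nonzero failure count as the alert trigger would cover the cron-ish piece of the plan.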
The reason I want to do this upstream data quality filtering/checking in a streaming system instead of in e.g. Snowflake is that the DWH is the first place all of the data I care about comes together, which is great for analytics, but it also means that I can't make those integrations available to upstream systems that would benefit from having that data in near-real-time for decision support. I think the standard data quality checking that we do in the DWH via dbt tests is the foundation for the proverbial "data mesh": a streaming system that allows all of my different services to provide data for integration/remixing in order to support new use cases. That system needs to be built on technology that is available/accessible to everyone at the company (hence dbt and SQL on Materialize), while still giving developers the power and control to fix bugs and make changes using the programming languages they love best (hence WASM + Redpanda). 🎉
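As for the "continuously" part and the Redpanda bonus points, one speculative way to wire it up with dbt-materialize: express the test as a materialized view of violating rows, so the alerting job only has to poll a count, and express the write-back as a Materialize sink over the Kafka protocol that Redpanda speaks. The materialization name and CREATE SINK syntax below follow the Materialize/adapter versions of early 2022, and every object name, broker address, and topic is made up for illustration:

```sql
-- models/marts/flight_quality_failures.sql (hypothetical): rather than a
-- point-in-time `dbt test`, materialize the failing rows themselves so the
-- "test result" is always current. A cron-ish alerter just runs
--   select count(*) from flight_quality_failures;
-- over Materialize's pgwire interface (plain psql works) and pages on nonzero.
{{ config(materialized='materializedview') }}

select *
from {{ ref('stg_flights') }}
where longitude not between -180 and 180
   or latitude  not between -90 and 90
```

```sql
-- Raw Materialize DDL (circa early 2022) for the bonus points: stream the
-- cleaned/staged view back out to a Redpanda topic. Broker, topic, and
-- schema registry URL are illustrative.
CREATE SINK staged_flights_sink
FROM stg_flights
INTO KAFKA BROKER 'redpanda:9092' TOPIC 'staged_flights'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```

For the super-bonus "stop the stream" behavior, the same alerting job could (again speculatively) issue `DROP SINK staged_flights_sink` when the failure count goes nonzero and recreate the sink once the upstream issue is fixed.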