This repository was archived by the owner on Mar 24, 2023. It is now read-only.
dbt test + alert on streaming data #6
jwills started this conversation in Show and tell
Replies: 1 comment, 1 reply
-
Thanks again for sharing your project and going as far as presenting it, @jwills! This hits too close to home. 📟 Regarding running something like …
-
The repo for my hackday exercise is here: https://github.com/jwills/mz-hack-day-2022
Mostly copy-pasting from the Slack message where I described what I wanted to do:
"So one of the first things we do in our daily dbt run for our DWH at WeaveGrid is the standard staging work of renaming columns, cleaning up types, etc. followed by a bunch of sanity check dbt tests on that lightly processed staging data to ensure that there aren't any red flags in there that would mess up downstream table materializations. In my dream world, I would like to move that staging/testing work upstream- out of Snowflake and into dbt+materialize- so that I could run those tests "continuously" (read: with a cron job that executed like every 15 minutes-to-an-hour) to catch upstream data quality issues earlier, during business hours, and not after midnight UTC when a *&!^?# data quality issue is going to ruin my evening.
My plan is to write some basic tests for the hackday project, verify they work correctly in dbt+materialize, and then wire up a cron-ish thing in docker-compose (not sure how to do that, so open to suggestions here) that would fire an alert at me if it detects a data quality issue (which I will then deliberately introduce upstream in the opensky data). Bonus points here would be if I could be streaming the cleaned/staged data I'm generating via dbt-materialize back into Redpanda so that it can propagate to my DWH directly and (even better/super-bonus points) if I could stop the data from streaming into Redpanda if/when the dbt tests fire and detect a data quality issue."
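The staging-and-tests setup described at the top of that message translates fairly directly into dbt files. Here is a minimal sketch, assuming a hypothetical `opensky.flights` source and made-up column names (nothing below is copied from the actual repo):

```sql
-- models/staging/stg_flights.sql: the "rename columns, clean up types" step.
-- Under dbt-materialize this compiles to a view that Materialize keeps
-- incrementally up to date, rather than a nightly batch table.
select
    icao24                              as aircraft_id,
    to_timestamp(time_position)         as position_at,
    cast(longitude as double precision) as longitude,
    cast(latitude  as double precision) as latitude
from {{ source('opensky', 'flights') }}
where icao24 is not null

-- tests/assert_coordinates_in_range.sql: a singular dbt test. Any rows this
-- query returns count as failures when `dbt test` runs.
select *
from {{ ref('stg_flights') }}
where longitude not between -180 and 180
   or latitude  not between -90 and 90
```

Running `dbt test` from a scheduled job and treating a nonzero failure count as the alert trigger would cover the cron-ish piece of the plan.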
The reason I want to do this upstream data quality filtering/checking in a streaming system instead of in e.g. Snowflake is that the DWH is the first place all of the data I care about comes together, which is great for analytics, but it also means that I can't make those integrations available to upstream systems that would benefit from having that data in near-real-time for decision support. I think the standard data quality checking that we do in the DWH via dbt tests is the foundation for the proverbial "data mesh": a streaming system that allows all of my different services to provide data for integration/remixing in order to support new use cases. That system needs to be built on technology that is available/accessible to everyone at the company (hence dbt and SQL on Materialize), while still giving developers the power and control to fix bugs and make changes using the programming languages they love best (hence WASM + Redpanda). 🎉
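As for the "continuously" part and the Redpanda bonus points, one speculative way to wire it up with dbt-materialize: express the test as a materialized view of violating rows, so the alerting job only has to poll a count, and express the write-back as a Materialize sink over the Kafka protocol that Redpanda speaks. The materialization name and CREATE SINK syntax below follow the Materialize/adapter versions of early 2022, and every object name, broker address, and topic is made up for illustration:

```sql
-- models/marts/flight_quality_failures.sql (hypothetical): rather than a
-- point-in-time `dbt test`, materialize the failing rows themselves so the
-- "test result" is always current. A cron-ish alerter just runs
--   select count(*) from flight_quality_failures;
-- over Materialize's pgwire interface (plain psql works) and pages on nonzero.
{{ config(materialized='materializedview') }}

select *
from {{ ref('stg_flights') }}
where longitude not between -180 and 180
   or latitude  not between -90 and 90
```

```sql
-- Raw Materialize DDL (circa early 2022) for the bonus points: stream the
-- cleaned/staged view back out to a Redpanda topic. Broker, topic, and
-- schema registry URL are illustrative.
CREATE SINK staged_flights_sink
FROM stg_flights
INTO KAFKA BROKER 'redpanda:9092' TOPIC 'staged_flights'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```

For the super-bonus "stop the stream" behavior, the same alerting job could (again speculatively) issue `DROP SINK staged_flights_sink` when the failure count goes nonzero and recreate the sink once the upstream issue is fixed.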