-
Notifications
You must be signed in to change notification settings - Fork 535
Description
Describe the feature
The equality test supports passing a compare_columns
setting, which allows one to limit the columns which will be compared from each relation. Critically, this setting assumes that column names are identical in both models being compared. In some scenarios, they might not be!
What I would like to do instead is be able to provide different lists of columns for each model, if the column names are different. It would be assumed that the columns are listed in order of comparison, i.e. first column compared with first, second column compared with second, etc.
Describe alternatives you've considered
I could skip using the equality
test completely, and write a custom one-off singular data test. Of course, this loses the advantages of using generic tests.
I could also copy/paste the equality
test into my project and then add the above functionality in my fork. But I'd prefer submitting a PR upstream, because most of the code would be the same. equality
is a long macro, and to implement this feature would require only minor tweaks... thus it's a significant violation of DRY.
Additional context
For example, suppose we have two models:
old_businesses
: contains columnsbusiness_id
,introduction
,business_name
,random_field_we_do_not_care_about
new_businesses
: contains columnsbusiness_pk
,name
,introduction
,something_else
I would like to compare the first three columns in each model. The data should be identical - it is just that the column names and order are sometimes slightly different and I need to tell equality
how to match it all up.
Ideally, I should be able to write a test like:
models:
# only compare some of the columns
- name: old_businesses
tests:
- dbt_utils.equality:
compare_model: ref('new_businesses')
model_columns:
- business_id
- business_name
- introduction
compare_columns:
- business_pk
- name
- introduction
which says to pick columns business_id, business_name, introduction
from old_businesses
, and pick columns business_pk
, name
, introduction
from new_businesses
, and compare them. If model_columns
is not provided, then we assume a default equal to compare_columns
.
Alternatively, it could be structured like this:
models:
# only compare some of the columns
- name: old_businesses
tests:
- dbt_utils.equality:
compare_model: ref('new_businesses')
compare_columns:
- [business_id, business_pk]
- [business_name, name]
- introduction
which means exactly the same as the first example, except that a list is optionally used in the case when the column names are different.
(Any other ideas?)
Who will this benefit?
In my case, the two tables being compared are undergoing a migration in the underlying system. There is a double-writing process taking place: when a write is made to one table, then a similar write is supposed to be made to the other table.
Of course, mistakes can be made when double-writing. That's where this dbt equality test could be helpful. We can monitor the two tables and use the equality test to flag whenever there is a difference.
However, the new table has different column names than the old one.
Are you interested in contributing this feature?
Yes, I can submit a PR. Just would like to know if the feature would be welcomed.