This is a template to create connectors for The Tuva Project on version 0.14.x.
This template includes a general project structure for connectors for The Tuva Project, including the data models that The Tuva Project Package expects as inputs, some data tests and column-level documentation, and some notes on how to build on this connector in structuring your own projects.
In simple terms, we view a connector as an on-ramp to the Tuva Data Model for a single data source. In practice, that means a connector maps raw data for a single source to the Tuva Input layer, bridging the gap between the incoming raw data schema and the schema that the Tuva Data Model expects.
Folding multiple different data sources (and multiple raw data schemas) into a single connector can get ugly pretty quickly. When data quality problems arise across two sources, pinpointing the source of the issue(s) can become a lot more difficult when data source-specific logic is not cleanly separated.
The way we have handled this in the past is by mapping all of our data sources in separate connectors, and then unioning data once it has been mapped to the Tuva Project Input Layer. That allows for source-specific logic to handle the mapping piece, and then the data can be unioned when all sources are aligned with the Tuva Data Model.
The typical workflow and project structure for mapping raw data to the Tuva Data Model within a connector is organized into 3 stages (found within the `models` directory): `staging`, `intermediate`, and `final`.
Staging: Model(s) with little transformation.
- Convert non-varchar/string columns to the appropriate data types (e.g. dates, numbers)
- Map `source()` raw data to the Tuva Data Model
Intermediate: Model(s) that perform major transformations on the source data.
- Deduplication of Adjustments, Denials, and Reversals (ADR) for claims
- Any other consequential transformations
Final: Model(s) that are used by the Tuva Project (i.e. the Input Layer).
- Convert all columns to the required data types
- Models in this layer are expected by The Tuva Project Package Input Layer
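For orientation, the layout below is an illustrative sketch of how those three stages might sit inside the connector; directory and file names other than the ones referenced in this guide are placeholders:

```text
models/
├── staging/          # light typing, source() calls (e.g. stg_medical_claim.sql)
├── intermediate/     # mapping, deduplication, ADR logic (e.g. int_medical_claim.sql)
├── final/            # Input Layer models (e.g. medical_claim.sql, eligibility.sql)
└── _sources.yml      # declares the raw tables loaded in your warehouse
```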
This guide assumes:
- that you have at least one raw data source loaded into a data warehouse.
- that you have access to your data warehouse and you've configured a `profiles.yml` file.
Create a repository based on this template and name it appropriately. Let's say you named it `my_connector`.
Clone the `my_connector` repository you just created from our template repo. If you used the template, you should have all of the files we ship with the connector template, in a repository with its own Git history.
Set your directory:

```bash
cd my_connector
```
- Inside the project folder, create the virtual environment:

  ```bash
  # use python3 if python defaults to Python 2
  python -m venv venv
  ```

  This will create a `venv` directory.
- Activate the virtual environment:
  - macOS / Linux (bash/zsh): `source venv/bin/activate`
  - Windows (Command Prompt): `venv\Scripts\activate.bat`
  - Windows (PowerShell): `.\venv\Scripts\Activate.ps1`
  - Windows (Git Bash): `source venv/Scripts/activate`
- Install the requirements:

  ```bash
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
This will install the Python dependencies that ship with this connector template, including `dbt-core`, `sqlfluff` for linting, and the `dbt-snowflake` adapter. You may need to slightly modify these requirements depending on your setup.
Depending on the connector you're building, you can clean up your project by getting rid of some models that are not relevant to your use case.
If you are creating a connector for a claims data source, you can delete all models from the `models/final/` directory except for `medical_claim.sql`, `pharmacy_claim.sql`, and `eligibility.sql`.
If you are creating a connector for a clinical data source, you can delete `medical_claim.sql`, `pharmacy_claim.sql`, and `eligibility.sql` from `models/final/`.
In the `dbt_project.yml` file, uncomment `claims_enabled` if you're building a claims connector, or `clinical_enabled` if you're building a clinical connector. These variables tell the Tuva Project what kind of data to expect.
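For a claims connector, the relevant part of `dbt_project.yml` might end up looking roughly like this after uncommenting; the exact surrounding configuration comes from the template itself, so treat this as a sketch rather than the definitive file:

```yaml
# dbt_project.yml (sketch) -- enable the claims pipeline for a claims connector
vars:
  claims_enabled: true
  # clinical_enabled: true   # uncomment this one instead for a clinical connector
```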
The first step in mapping your data is making sure that your dbt project can find the raw data tables you've loaded into your data warehouse. Raw data enters the project at the `staging` layer, which relies on `source()` calls to your data warehouse to retrieve the raw data. In order for `source()` calls to work correctly, you must update your `_sources.yml` file to add the raw tables you'll be mapping to the Input Layer.
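A minimal `_sources.yml` entry might look something like the following; the source, database, schema, and table names here are placeholders for wherever your raw data actually lives:

```yaml
# _sources.yml (sketch) -- placeholder names, adjust to your warehouse
version: 2

sources:
  - name: raw_claims          # hypothetical source name used in source() calls
    database: raw             # adjust to your warehouse's database
    schema: claims            # adjust to the schema holding your raw tables
    tables:
      - name: eligibility
      - name: medical_claim
      - name: pharmacy_claim
```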
In the `models/staging/` directory, create your staging models. Typically, these models are very simple and only handle light data typing. If you were building a claims connector and had eligibility, medical claims, and pharmacy claims data, you might create three models: `stg_eligibility.sql`, `stg_medical_claim.sql`, and `stg_pharmacy_claim.sql`.
We recommend that the ETL process dumps all of the fields as strings into the data warehouse. These columns can then be trimmed and nullified by adding this CTE at the top of your staging models:
```sql
with base as (

    select
        {% for column in adapter.get_columns_in_relation(source('<schema>', '<table name>')) %}
        cast(nullif(trim({{ column.name }}), '') as {{ dbt.type_string() }}) as {{ column.name }}
        {%- if not loop.last -%},{%- endif %}
        {% endfor %}
    from {{ source('<schema>', '<table name>') }}

)
```
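Building on that CTE, a staging model for medical claims might then select and lightly type the columns it needs. The column names below (`claim_id`, `claim_start_date`, `paid_amount`) are hypothetical stand-ins for whatever your raw schema contains:

```sql
-- models/staging/stg_medical_claim.sql (sketch) -- hypothetical raw column names
with base as (

    -- the trim/nullify loop from the CTE above goes here
    select * from {{ source('<schema>', '<table name>') }}

)

select
    claim_id,
    cast(claim_start_date as date) as claim_start_date,
    cast(paid_amount as {{ dbt.type_numeric() }}) as paid_amount
from base
```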
Note that some data sources may represent null values differently, and these different representations should be handled in the intermediate layer.
The bulk of mapping happens in the `intermediate` layer. Mapping involves:
- making sure that source data column names are mapped to the Tuva Data Model
- making sure that data types are what the Tuva Data Model expects
- making sure that source data is deduplicated, and Adjustments, Denials and Reversals are applied (if applicable)
If, for example, you're building a claims connector and you have eligibility, medical claims, and pharmacy claims data, you'll want to create models named something along the lines of `int_eligibility`, `int_medical_claim`, and `int_pharmacy_claim`; these models will make dbt `ref()` calls to the `staging` models you created before.
Depending on the complexity of your medical or pharmacy claims data sources, you might want to have multiple intermediate models (e.g. not just `int_medical_claim`, but perhaps one model where you map column names and handle Adjustments, Denials, and Reversals, and another where you deduplicate). We've done versions of this in different connectors, including one for Medicare CCLF and another for Medicare LDS.
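As a rough sketch of what the deduplication step can look like, the query below keeps one row per claim line using `row_number()`. The column names and ordering logic are placeholders; your actual ADR handling will depend on how your source represents adjustments, denials, and reversals:

```sql
-- models/intermediate/int_medical_claim.sql (sketch) -- hypothetical columns and dedup rule
with ranked as (

    select
        *,
        row_number() over (
            partition by claim_id, claim_line_number
            order by file_date desc      -- keep the most recent version of each claim line
        ) as row_num
    from {{ ref('stg_medical_claim') }}

)

select *
from ranked
where row_num = 1
```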
We've written a Claims Mapping Guide that we hope is helpful in bridging the gap between your raw data and the Tuva Data Model.
And as you apply ADR logic to your claims data, we've written about some of the different ways to handle ADR.
This template has already taken care of creating the models for you in `models/final/`. Once your `intermediate` models are ready, add `ref()` calls to your final models: for a claims connector, these would be the `eligibility`, `medical_claim`, and `pharmacy_claim` models. Replace the `null` values with the column names to which you've mapped your source data.
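To make that last step concrete, a filled-in final model might look roughly like the sketch below. The source column names on the right-hand side are hypothetical, and any Input Layer columns you have no data for stay as `null`:

```sql
-- models/final/medical_claim.sql (sketch) -- replace nulls with your mapped columns
select
    claim_id,
    claim_line_number,
    cast(claim_start_date as date) as claim_start_date,
    cast(paid_amount as {{ dbt.type_numeric() }}) as paid_amount,
    null as admit_type_code      -- left null when the source has no equivalent field
from {{ ref('int_medical_claim') }}
```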
Once you have your final models ready, you are ready to run your dbt project! The first run of a connector through the Tuva Project will tell you how well you've mapped your data and highlight any remaining data quality problems, correctable or not.
Tuva's Data Quality Intelligence (DQI) contains over 600 tests built into the project. As you run your final models, tests may warn or error with varying levels of severity. Based on these warnings and errors, you may need to correct your mappings and re-run some (or all) of your project.
To run the models you've created for your connector, you can run:

```bash
dbt build --select tag:input_layer
```
And then, to run the whole project, you can run:

```bash
dbt build --select the_tuva_project
```
The output from the run above will raise any DQI warnings or errors that signal mapping logic may need an update. Once you've made the necessary changes, you may want to re-run the model(s) where you've corrected the mapping, and then any associated DQI testing.
For example, if you were correcting your mapping in `int_medical_claim` and you wanted to re-run the models and tests corresponding to only that part of the project, you could run:

```bash
dbt build --select int_medical_claim medical_claim+1
```
- Learn more about dbt in the docs
- Check out Discourse for commonly asked questions and answers
- Join the chat on Slack for live discussions and support
- Find dbt events near you
- Check out the blog for the latest news on dbt's development and best practices
- Create a virtual environment:

  ```bash
  python3 -m venv .venv
  ```

- Activate the virtual environment:

  ```bash
  source .venv/bin/activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Create a `profiles.yml` file in `~/.dbt`. The `profiles.yml.tmpl` file is there as a reference (see the sketch below).
- Run SQLFluff on your target file (e.g. medical claims):

  ```bash
  sqlfluff fix models/final/medical_claim.sql
  ```
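Since the template ships with the `dbt-snowflake` adapter, a `profiles.yml` along these lines could work as a starting point. The profile name, credentials, and target settings below are placeholders, and `profiles.yml.tmpl` remains the authoritative reference:

```yaml
# ~/.dbt/profiles.yml (sketch) -- placeholder values, see profiles.yml.tmpl
my_connector:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account>
      user: <your_user>
      password: <your_password>
      role: <your_role>
      database: <your_database>
      warehouse: <your_warehouse>
      schema: <your_schema>
      threads: 4
```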