
Welcome to the snowplow-snowflake-loader wiki!

Overview

This is a project to load Snowplow enriched events into the Snowflake cloud data warehouse.

Technical architecture

This application consists of two independent apps:

  1. Snowplow Snowflake Transformer, a Spark job which reads Snowplow enriched events from Amazon S3 and writes them back into S3 in a format ready for Snowflake
  2. Snowplow Snowflake Loader, a Scala app which takes the Snowflake-ready data from S3 and loads it into Snowflake

Snowplow Snowflake Transformer

Overview

This is a Spark job which reads Snowplow enriched events from Amazon S3 and writes them back into S3 in a format ready for Snowflake.

This Spark job is written in Scala, and makes use of the Snowplow Scala Analytics SDK.

Algorithm

The Transformer:

  • Reads Snowplow enriched events from S3
  • Uses the JSON transformer from the Snowplow Scala Analytics SDK to convert those enriched events into JSONs
  • Writes those enriched event JSONs back to S3 as newline-delimited gzipped files
  • Keeps track of which folders it has processed using the Snowplow Scala Analytics SDK's DynamoDB manifest functionality
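To make the flow concrete, here is a minimal sketch of such a Spark job. The bucket paths are hypothetical, the exact EventTransformer signature is an assumption about the Analytics SDK, and failure handling plus the DynamoDB manifest bookkeeping are omitted:

```scala
// Minimal sketch of the Transformer's core flow. The bucket paths are
// hypothetical, the exact EventTransformer signature is an assumption
// about the Analytics SDK, and failure handling plus the DynamoDB
// manifest bookkeeping are omitted.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

object TransformerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("snowflake-transformer"))

    sc.textFile("s3://acme-enriched/run=2017-06-27-12-00-00/")  // enriched TSV lines
      .flatMap { line =>
        // Convert one enriched event TSV line into a JSON string,
        // dropping lines that fail to transform (for brevity only)
        EventTransformer.transform(line).fold(_ => None, json => Some(json))
      }
      // Newline-delimited JSONs, gzip-compressed, ready for Snowflake
      .saveAsTextFile("s3://acme-transformed/run=2017-06-27-12-00-00/", classOf[GzipCodec])
  }
}
```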

Snowplow Snowflake Loader

Overview

This is a Scala app which takes the Snowflake-ready data from S3 and loads it into Snowflake.

Algorithm

The algorithm is as follows:

  • Check for new transformed folders in S3, generated by Snowplow Snowflake Transformer, which haven't yet been loaded into Snowflake
  • Check for any new JSON properties (unstruct_events, contexts or derived_contexts) in the transformed files which are not yet present in the Snowflake database table
  • Add these new JSON properties to the table as large VARCHAR fields
  • Perform the load from S3 into Snowflake
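As a rough illustration of the last two steps, the statements could be issued over the Snowflake JDBC driver along these lines. All identifiers, credentials and paths are hypothetical, and a real load of JSON into a multi-column table would need a VARIANT staging step or column transformations that this sketch glosses over:

```scala
// Illustrative sketch of steps 3 and 4 over the Snowflake JDBC driver.
// All identifiers, credentials and paths are hypothetical placeholders.
import java.sql.DriverManager

object LoaderSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:snowflake://acme.snowflakecomputing.com/",
      "loader",
      sys.env("SNOWFLAKE_PASSWORD"))
    val stmt = conn.createStatement()

    // Step 3: widen the events table with a newly discovered JSON property
    stmt.execute(
      "ALTER TABLE atomic.events ADD COLUMN contexts_com_acme_widget_1 VARCHAR(10000)")

    // Step 4: load one transformed folder from S3 into Snowflake
    stmt.execute(
      """COPY INTO atomic.events
        |FROM 's3://acme-transformed/run=2017-06-27-12-00-00/'
        |CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
        |FILE_FORMAT = (TYPE = JSON)""".stripMargin)

    conn.close()
  }
}
```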

Outstanding questions

  1. How do we check whether a given folder in S3 has already been loaded into Snowflake - some kind of manifest? In DynamoDB or in Snowflake? (Snowflake would be ideal, as the manifest entry could then be written as part of the load transaction - see the sketch after this list.)
  2. How do we make sure that a new transformed folder in S3 is finalized, both in terms of S3's eventual consistency and in terms of checking that the Snowplow Snowflake Transformer is not still running?
  3. How should we check for new JSON properties in the transformed files which are not yet present in the Snowflake database table? Perhaps the Snowplow Snowflake Transformer writes out an inventory of all the JSON properties it found, and the Snowplow Snowflake Loader cross-checks this inventory against the properties already present in the Snowflake enriched events table.
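Sketching the Snowflake-side manifest mooted in question 1: if the manifest is a Snowflake table, the COPY and the manifest insert can share one explicit transaction, so a folder is only marked as loaded if its load commits. This is one possible answer, not a settled design; it reuses the stmt from the Loader sketch above, and table and folder names are invented:

```scala
// Hypothetical Snowflake-side manifest, written atomically with the load so a
// folder is only marked as loaded if its COPY commits. `stmt` is the JDBC
// Statement from the Loader sketch above; table and folder names are invented.
stmt.execute("BEGIN")
stmt.execute(
  """COPY INTO atomic.events
    |FROM 's3://acme-transformed/run=2017-06-27-12-00-00/'
    |CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    |FILE_FORMAT = (TYPE = JSON)""".stripMargin)
stmt.execute(
  "INSERT INTO atomic.manifest (run_id, loaded_at) " +
    "VALUES ('run=2017-06-27-12-00-00', CURRENT_TIMESTAMP())")
stmt.execute("COMMIT")
```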

Other resources

We will need to do an initial port of the Snowplow enriched event TSV format to a Snowflake table, similar to the representation of atomic.events in Redshift.
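To indicate the shape such a port might take, here are a few representative columns; the enriched event has 100+ fields, and the types below are guesses rather than the final DDL. Again this reuses the JDBC stmt from the Loader sketch:

```scala
// A guess at the shape of the ported table, not the final DDL: a handful of
// representative columns out of the 100+ fields in the enriched event TSV.
// Reuses the JDBC `stmt` from the Loader sketch above.
stmt.execute(
  """CREATE TABLE IF NOT EXISTS atomic.events (
    |  app_id           VARCHAR(255),
    |  platform         VARCHAR(255),
    |  etl_tstamp       TIMESTAMP,
    |  collector_tstamp TIMESTAMP NOT NULL,
    |  event_id         CHAR(36)  NOT NULL
    |  -- ...plus the remaining enriched event fields, ported from the Redshift DDL
    |)""".stripMargin)
```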
