
Welcome to the snowplow-snowflake-loader wiki!

Overview

This is a project to load Snowplow enriched events into the Snowflake cloud data warehouse.

Technical architecture

This application consists of two independent apps:

  1. Snowplow Snowflake Transformer, a Spark job which reads Snowplow enriched events from Amazon S3 and writes them back into S3 in a format ready for Snowflake
  2. Snowplow Snowflake Loader, a Scala app which takes the Snowflake-ready data from S3 and loads it into Snowflake

Snowplow Snowflake Transformer

Overview

This is a Spark job which reads Snowplow enriched events from Amazon S3 and writes them back into S3 in a format ready for Snowflake.

This Spark job is written in Scala, and makes use of the Snowplow Scala Analytics SDK.

Algorithm

The Transformer:

  • Reads Snowplow enriched events from S3
  • Uses the JSON transformer from the Snowplow Scala Analytics SDK to convert those enriched events into JSONs
  • Writes those enriched event JSONs back to S3 as newline-delimited gzipped files
  • Keeps track of which folders it has processed using the Snowplow Scala Analytics SDK's DynamoDB manifest functionality
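To make the flow concrete, here is a minimal sketch of such a Spark job. The bucket paths are hypothetical, the exact EventTransformer signature is an assumption about the Analytics SDK, and failure handling plus the DynamoDB manifest bookkeeping are omitted:

```scala
// Minimal sketch of the Transformer's core flow. The bucket paths are
// hypothetical, the exact EventTransformer signature is an assumption
// about the Analytics SDK, and failure handling plus the DynamoDB
// manifest bookkeeping are omitted.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

object TransformerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("snowflake-transformer"))

    sc.textFile("s3://acme-enriched/run=2017-06-27-12-00-00/")  // enriched TSV lines
      .flatMap { line =>
        // Convert one enriched event TSV line into a JSON string,
        // dropping lines that fail to transform (for brevity only)
        EventTransformer.transform(line).fold(_ => None, json => Some(json))
      }
      // Newline-delimited JSONs, gzip-compressed, ready for Snowflake
      .saveAsTextFile("s3://acme-transformed/run=2017-06-27-12-00-00/", classOf[GzipCodec])
  }
}
```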

Snowplow Snowflake Loader

Overview

This is a Scala app which takes the Snowflake-ready data from S3 and loads it into Snowflake.

Algorithm

The algorithm is as follows:

  • Check for new transformed folders in S3, generated by Snowplow Snowflake Transformer, which haven't yet been loaded into Snowflake
  • Check for any new JSON properties (unstruct_events, contexts or derived_contexts) in the transformed files which are not yet present in the Snowflake database table
  • Add these new JSON properties to the table as large VARCHAR fields
  • Perform the load from S3 into Snowflake
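As a rough illustration of the last two steps, the statements could be issued over the Snowflake JDBC driver along these lines. All identifiers, credentials and paths are hypothetical, and a real load of JSON into a multi-column table would need a VARIANT staging step or column transformations that this sketch glosses over:

```scala
// Illustrative sketch of steps 3 and 4 over the Snowflake JDBC driver.
// All identifiers, credentials and paths are hypothetical placeholders.
import java.sql.DriverManager

object LoaderSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:snowflake://acme.snowflakecomputing.com/",
      "loader",
      sys.env("SNOWFLAKE_PASSWORD"))
    val stmt = conn.createStatement()

    // Step 3: widen the events table with a newly discovered JSON property
    stmt.execute(
      "ALTER TABLE atomic.events ADD COLUMN contexts_com_acme_widget_1 VARCHAR(10000)")

    // Step 4: load one transformed folder from S3 into Snowflake
    stmt.execute(
      """COPY INTO atomic.events
        |FROM 's3://acme-transformed/run=2017-06-27-12-00-00/'
        |CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
        |FILE_FORMAT = (TYPE = JSON)""".stripMargin)

    conn.close()
  }
}
```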

Outstanding questions

  1. How do we check whether a given folder in S3 has already been loaded into Snowflake - some kind of manifest? In DynamoDB or in Snowflake? (Snowflake would be ideal, as the manifest entry could then be written as part of the load transaction - see the sketch after this list.)
  2. How do we make sure that a new transformed folder in S3 is finalized, both in terms of S3's eventual consistency and in terms of checking that the Snowplow Snowflake Transformer is not still running?
  3. How should we check for new JSON properties in the transformed files which are not yet present in the Snowflake database table? Perhaps the Snowplow Snowflake Transformer writes out an inventory of all the JSON properties it found, and the Snowplow Snowflake Loader cross-checks this inventory against the properties already present in the Snowflake enriched events table.
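Sketching the Snowflake-side manifest mooted in question 1: if the manifest is a Snowflake table, the COPY and the manifest insert can share one explicit transaction, so a folder is only marked as loaded if its load commits. This is one possible answer, not a settled design; it reuses the stmt from the Loader sketch above, and table and folder names are invented:

```scala
// Hypothetical Snowflake-side manifest, written atomically with the load so a
// folder is only marked as loaded if its COPY commits. `stmt` is the JDBC
// Statement from the Loader sketch above; table and folder names are invented.
stmt.execute("BEGIN")
stmt.execute(
  """COPY INTO atomic.events
    |FROM 's3://acme-transformed/run=2017-06-27-12-00-00/'
    |CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    |FILE_FORMAT = (TYPE = JSON)""".stripMargin)
stmt.execute(
  "INSERT INTO atomic.manifest (run_id, loaded_at) " +
    "VALUES ('run=2017-06-27-12-00-00', CURRENT_TIMESTAMP())")
stmt.execute("COMMIT")
```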

Other resources

We will need to do an initial port of the Snowplow enriched event TSV format to a Snowflake table, similar to the representation of atomic.events in Redshift.
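To indicate the shape such a port might take, here are a few representative columns; the enriched event has 100+ fields, and the types below are guesses rather than the final DDL. Again this reuses the JDBC stmt from the Loader sketch:

```scala
// A guess at the shape of the ported table, not the final DDL: a handful of
// representative columns out of the 100+ fields in the enriched event TSV.
// Reuses the JDBC `stmt` from the Loader sketch above.
stmt.execute(
  """CREATE TABLE IF NOT EXISTS atomic.events (
    |  app_id           VARCHAR(255),
    |  platform         VARCHAR(255),
    |  etl_tstamp       TIMESTAMP,
    |  collector_tstamp TIMESTAMP NOT NULL,
    |  event_id         CHAR(36)  NOT NULL
    |  -- ...plus the remaining enriched event fields, ported from the Redshift DDL
    |)""".stripMargin)
```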
