This is my submission for the FedEx Analytics Engineering Assignment.
It features a self-contained environment with a data pipeline that ingests, cleans, and enriches the Amazon E-Commerce Sales Dataset from Kaggle, and makes the results available for BI.
flowchart LR
raw["`Raw data
(.csv file)`"]
clean["Clean models
(dbt)"]
enriched["Enriched models
(dbt)"]
kimball["Kimball models
(dbt)"]
SemanticLayer["Semantic Layer
(Cube.dev)"]
BI["BI layer
(Apache Superset)"]
raw --> clean --> enriched --> kimball --> SemanticLayer --> BI
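To make the layering in the flow above concrete, here is a minimal sketch of the raw, clean, enriched, and Kimball stages in plain Python. In this repo the stages are dbt models on DuckDB, so this is only an illustration and not the project's implementation. The sample rows, column names (`order_id`, `qty`, `amount`, and so on), and cleaning rules are assumptions made for the example and may not match the real dataset.

```python
import csv
import io

# Hypothetical sample rows; the real dataset's columns may differ.
RAW_CSV = """order_id,date,status,category,qty,amount
A-1,2022-04-30,Shipped,Kurta,1,406.00
A-2,2022-04-30,Cancelled,Set,0,
A-3,2022-05-01,Shipped,kurta ,2,812.50
"""


def load_raw(text):
    """Raw layer: read the CSV as-is, every value still a string."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean layer: trim text, normalise casing, cast types, drop rows without an amount."""
    cleaned = []
    for r in rows:
        if not r["amount"]:
            continue  # assumed rule: rows with no amount are not usable
        cleaned.append({
            "order_id": r["order_id"].strip(),
            "date": r["date"].strip(),
            "status": r["status"].strip().lower(),
            "category": r["category"].strip().title(),
            "qty": int(r["qty"]),
            "amount": float(r["amount"]),
        })
    return cleaned


def enrich(rows):
    """Enriched layer: add derived columns on top of the clean rows."""
    return [
        {
            **r,
            "month": r["date"][:7],
            "unit_price": r["amount"] / r["qty"] if r["qty"] else None,
        }
        for r in rows
    ]


def to_kimball(rows):
    """Kimball layer: split into a category dimension and an order fact table."""
    categories = sorted({r["category"] for r in rows})
    dim_category = {name: key for key, name in enumerate(categories, start=1)}
    fact_orders = [
        {
            "order_id": r["order_id"],
            "category_key": dim_category[r["category"]],
            "month": r["month"],
            "qty": r["qty"],
            "amount": r["amount"],
        }
        for r in rows
    ]
    return dim_category, fact_orders


dim_category, fact_orders = to_kimball(enrich(clean(load_raw(RAW_CSV))))
```

Each function consumes only the output of the previous layer, which mirrors the one-directional dependency between the model layers in the diagram.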
This project includes a workflow with:
- Data transformations using dbt
- Data storage using DuckDB
- Semantic Layer models using Cube.dev
- BI dashboards using Superset
- A basic data catalog using dbt docs
- A local development environment using a VS Code devcontainer, linters, and Docker Compose.
Due to time constraints, the following areas are incomplete/out of scope:
- Proper security handling for production, such as not committing the .env file and using secrets. (The .env file is committed for demo purposes.)
- Superset works and has a connection to Cube, so it can be used to create dashboards, but no ready-made dashboards are included in this repo.
- Devcontainer linters are not configured.
- Limited data cleansing and testing.
- The PySpark part of this exercise was agreed to be skipped.
- REQUIREMENTS.md: Original requirements.
- transform/models: Data transformation models (dbt).
- cube/schema: Semantic Layer models, to be used by BI dashboard apps (Cube.dev).
- superset: Superset (BI dashboards).
- docker-compose.yml: Local environment definition.
- taskfile.yml: Available actions, to be used by maintainers and eventually the CI/CD.
- Visual Studio Code
- Docker
- Open this repo in VS Code. Open the command palette (Shift+Cmd+P on macOS) and select "Dev Containers: Rebuild and Reopen in Container". This will spin up the environment, including a devcontainer, Cube, and Superset.
- Open a terminal in the devcontainer and run:

      task demo:run-full-demo

- Then:
  - To see an overview of the data transformation models and their metadata and lineage, access the local dbt docs instance by navigating to http://localhost:8080.
  - To view and manage the semantic model data cubes and views, open the local Cube instance by navigating to http://localhost:4000/.
  - To view and manage BI dashboards, open the local Superset instance by navigating to http://localhost:8088/login/ and log in with username admin and password admin. It has a connection to Cube and you can create your own dashboards, but at the moment there are no ready-made dashboards included in this repo.
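Once the demo task has finished, a quick way to confirm that the three local services listed above are answering is a small reachability check. The following is a minimal sketch using only the Python standard library. It is not part of this repo, the helper names `is_up` and `report` are hypothetical, and it assumes the ports are exposed on localhost exactly as listed above.

```python
import urllib.error
import urllib.request

# URLs as listed in the steps above.
SERVICES = {
    "dbt docs": "http://localhost:8080",
    "Cube": "http://localhost:4000/",
    "Superset": "http://localhost:8088/login/",
}


def is_up(url, opener=urllib.request.urlopen, timeout=3):
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def report(services=SERVICES, opener=urllib.request.urlopen):
    """Map each service name to True (reachable) or False (not reachable)."""
    return {name: is_up(url, opener) for name, url in services.items()}
```

Calling `report()` from a terminal on the host returns a dictionary keyed by service name. The `opener` parameter exists only so the check can be exercised without a running environment.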