This project is an end-to-end data pipeline. It processes invoice data and creates a dashboard where it's possible to:
- fetch all invoices related to a customer
- fetch all items related to an invoice
- This project uses an e-commerce dataset that you can download here.
- It consists of invoices from a UK retailer between December 2010 and December 2011 (~540k rows and ~26k unique invoices).
- Fields in the datasets are: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.
- InvoiceNo is the invoice ID.
- One invoice can have one or more stock codes.
- Description, Quantity, and UnitPrice relate to stock codes.
- InvoiceDate, CustomerID, and Country relate to invoice numbers.
- Some invoices don't have a CustomerID and Quantity can be a negative number.
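Before running the pipeline, you can take a quick look at the dataset with pandas. A minimal sketch, assuming the CSV was saved as `data/data.csv` (the file name and encoding are assumptions and may differ for your download):

```python
import pandas as pd

# File name and encoding are assumptions; adjust to your download.
df = pd.read_csv("data/data.csv", encoding="ISO-8859-1")

print(df.shape)                       # roughly 540k rows
print(df["InvoiceNo"].nunique())      # roughly 26k unique invoices
print(df["CustomerID"].isna().sum())  # rows without a CustomerID
print((df["Quantity"] < 0).sum())     # rows with a negative Quantity
```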
This was built on a Linux machine; if you are on a Mac, you can probably omit `sudo` in front of most commands.
- Step 1. Clone the repository to your local machine
- Step 2. Create a virtual env with `sudo python3 -m venv ./venv` and activate it with `source venv/bin/activate`. Once this is done, install the requirements with `pip install -r requirements.txt`.
- Step 3. Create a folder `mongodb-data` to save data locally.
- Step 4. Download the data from here and save it under the `data` folder.
- Step 5. Convert the CSV data into JSON: `sudo python data/csv_to_json.py`. This will transform the data and save it as `output.txt` (a sketch of the conversion follows below).
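The conversion itself boils down to reading each CSV row and writing it out as one JSON document per line. A minimal sketch of that idea, not necessarily the exact logic of `csv_to_json.py` (file paths and encoding are assumptions):

```python
import csv
import json

# Paths and encoding are assumptions; the repository's script may differ.
with open("data/data.csv", newline="", encoding="ISO-8859-1") as src, \
     open("data/output.txt", "w") as dst:
    for row in csv.DictReader(src):
        # One JSON document per line, one line per invoice item
        dst.write(json.dumps(row) + "\n")
```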
- Step 6. Create the `data-api` image:
  - Navigate to the data API folder: `cd data-api`
  - Run `sudo docker build -t data-api .`
  - This creates a new image called `data-api` on your machine. To check all available images, run `sudo docker images`. What the API does is sketched below.
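The README doesn't spell out what the data API does, but from the rest of the setup it sits between the client (Step 11) and the Kafka topic (Step 8). A rough sketch of that role, assuming a FastAPI app and a kafka-python producer; the framework, endpoint name, and broker address are assumptions, not the repository's actual code:

```python
# Conceptual sketch only - the real service lives in the data-api folder and may differ.
import json

from fastapi import FastAPI, Request
from kafka import KafkaProducer

app = FastAPI()
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # assumed Kafka service name inside docker-compose
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.post("/invoiceitem")  # endpoint name is an assumption
async def ingest(request: Request):
    doc = await request.json()
    producer.send("ingestion-topic", doc)  # topic created in Step 8
    return {"status": "queued"}
```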
- Step 7. Start the Docker containers: `sudo docker-compose -f docker-compose-kafka-spark-mongodb.yml up`. Verify that all containers are running with `sudo docker ps`.
- Step 8. Set up the Kafka topic:
  - Check the name of the container: `sudo docker ps`
  - Connect to its bash: `sudo docker exec -it <container_name> /bin/bash`
  - List all topics: `/opt/bitnami/kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092`
  - Create a topic: `/opt/bitnami/kafka/bin/kafka-topics.sh --create --topic ingestion-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1`
- Step 9. Open the PySpark Jupyter notebook. This should be on `localhost:8888` and might ask you for a password/token that you can find in the logs generated when you started the containers. Leave this open (it will process messages as they are ingested by Kafka) and run all cells of the PySpark notebook. What the notebook does is sketched below.
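In essence, the notebook subscribes to the Kafka topic, parses each message, and appends it to MongoDB. A condensed sketch of that flow; the broker and Mongo host names and the Mongo connector options depend on the images in the docker-compose file, so treat this as illustrative rather than a copy of the notebook:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("invoice-streaming").getOrCreate()

# Schema matching the dataset fields (all kept as strings for simplicity)
fields = ["InvoiceNo", "StockCode", "Description", "Quantity",
          "InvoiceDate", "UnitPrice", "CustomerID", "Country"]
schema = StructType([StructField(f, StringType()) for f in fields])

# Subscribe to the topic created in Step 8 (broker address is an assumption)
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "ingestion-topic")
          .load())

# Kafka delivers the payload as bytes; cast and parse it into columns
parsed = (stream.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("doc"))
          .select("doc.*"))

# Append every micro-batch to docstreaming.invoices (connector options are assumptions)
def write_batch(df, batch_id):
    (df.write
       .format("mongo")
       .mode("append")
       .option("uri", "mongodb://mongo:27017/docstreaming.invoices")
       .save())

query = parsed.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```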
- Step 10. Open the MongoDB UI. This should be on `localhost:8081`, with the password set in the docker-compose file. Then:
  - Create a database called `docstreaming`
  - Inside `docstreaming`, create a collection called `invoices`
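If you prefer to do this from Python instead of the UI, the equivalent is a couple of pymongo calls (host and port are assumptions; match the port mapping in the docker-compose file):

```python
from pymongo import MongoClient

# Host/port are assumptions; match the docker-compose port mapping.
client = MongoClient("mongodb://localhost:27017")
client["docstreaming"].create_collection("invoices")
```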
- Step 11. Go to the `data` directory and run `python data-api-client.py` to send the data to MongoDB. If everything is running correctly, you should see documents populating the MongoDB interface (a sketch of the client follows below).
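Conceptually the client just replays `output.txt` line by line against the data API, which in turn pushes each document into Kafka and, via the notebook, into MongoDB. A rough sketch of that idea; the URL and endpoint are assumptions and the repository's client may differ:

```python
import requests

# URL and endpoint are assumptions; use whatever the data-api container exposes.
API_URL = "http://localhost:80/invoiceitem"

with open("output.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        response = requests.post(API_URL, data=line,
                                 headers={"Content-Type": "application/json"})
        response.raise_for_status()
```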
- Step 12. Run `streamlit run invoices_interfaces.py`. This will open a window in your browser (`localhost:8501`) from which you can query invoices by customer ID and invoice items by invoice ID.
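The interface boils down to two text inputs backed by the two queries described at the top of this README. A stripped-down sketch of what `invoices_interfaces.py` might look like (connection details and exact field handling are assumptions):

```python
import streamlit as st
from pymongo import MongoClient

# Connection details are assumptions; match your docker-compose setup.
collection = MongoClient("mongodb://localhost:27017")["docstreaming"]["invoices"]

st.title("Invoices dashboard")

customer_id = st.text_input("Customer ID")
if customer_id:
    # All invoices related to this customer
    st.write(list(collection.find({"CustomerID": customer_id}, {"_id": 0})))

invoice_no = st.text_input("Invoice number")
if invoice_no:
    # All items related to this invoice
    st.write(list(collection.find({"InvoiceNo": invoice_no}, {"_id": 0})))
```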
I hope you enjoyed it!