In this project, I build a simple data pipeline following the ETL (Extract - Transform - Load) model on the YouTube Trending Video dataset, performing data processing, transformation, and calculation with the Apache Spark big data framework to serve a video search and recommendation system.
- Data Source: This project uses two main data sources: YouTube Trending Video data and the YouTube API. The YouTube Trending Video data is downloaded from Kaggle.com in .csv format, then loaded into MySQL, which serves as the primary data source.
- Using the Video ID and Category ID from the YouTube Trending Video data, we collect additional fields from the YouTube API, such as the Video Link and Video Category (see the sketch below).
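The enrichment could look roughly like the following minimal sketch, which calls the public YouTube Data API v3 `videos` and `videoCategories` endpoints. The helper names, the `YOUTUBE_API_KEY` environment variable, and the returned field names are illustrative assumptions, not the project's actual code.

```python
import os
import requests

# Assumed to be set in the environment; not part of the project's real config.
YOUTUBE_API_KEY = os.environ["YOUTUBE_API_KEY"]

def fetch_video_info(video_id: str) -> dict:
    """Fetch the snippet for one video and derive its link and category ID."""
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"part": "snippet", "id": video_id, "key": YOUTUBE_API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    if not items:
        return {}
    snippet = items[0]["snippet"]
    return {
        "video_id": video_id,
        # The watch URL can be reconstructed directly from the video ID.
        "video_link": f"https://www.youtube.com/watch?v={video_id}",
        "category_id": snippet.get("categoryId"),
        "title": snippet.get("title"),
    }

def fetch_category_name(category_id: str) -> str:
    """Resolve a numeric category ID to its human-readable name."""
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/videoCategories",
        params={"part": "snippet", "id": category_id, "key": YOUTUBE_API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return items[0]["snippet"]["title"] if items else "Unknown"
```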
 
- Extract Data: Extract the above data sources using a Polars DataFrame; this gives us the raw layer, which is then loaded into the MinIO data lake.
- Transform Data: From MinIO, we use Apache Spark, specifically PySpark, converting the Polars DataFrame to a PySpark DataFrame for processing and calculation; this produces the silver and gold layers.
- Data stored in MinIO is in .parquet format, which provides better processing performance (a sketch of this hand-off follows below).
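To make the extract/transform hand-off concrete, here is a minimal sketch, assuming a local MinIO instance with a bucket named `lakehouse`, illustrative credentials, the `connectorx` driver for Polars database reads, and the `hadoop-aws` package on the Spark classpath. All table, bucket, and object names are hypothetical.

```python
import io

import polars as pl
from minio import Minio
from pyspark.sql import SparkSession

# --- Extract: MySQL -> Polars -> MinIO (raw layer). All names are illustrative.
MYSQL_URI = "mysql://admin:admin123@localhost:3306/youtube"
raw_df = pl.read_database_uri("SELECT * FROM trending_videos", MYSQL_URI)

buf = io.BytesIO()
raw_df.write_parquet(buf)  # Parquet preserves types and compresses well
buf.seek(0)

client = Minio("localhost:9000", access_key="minio", secret_key="minio123", secure=False)
client.put_object("lakehouse", "raw/trending_videos.parquet", buf,
                  length=buf.getbuffer().nbytes)

# --- Transform: Spark reads the raw layer back from MinIO over the s3a protocol.
spark = (
    SparkSession.builder.appName("youtube-etl")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
silver_df = spark.read.parquet("s3a://lakehouse/raw/trending_videos.parquet")
# Alternatively, a Polars frame can be handed to Spark directly:
# spark.createDataFrame(raw_df.to_pandas())
```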
 
- Load Data: Load the gold layer into the PostgreSQL data warehouse (as sketched after this list) and perform additional transformation with dbt to create an index, making video search faster.
- Serving: The data is used for visualization in Metabase and for a video recommendation application built with Streamlit.
- Packaging and Orchestration: Use Docker to containerize and package the project, and Dagster to coordinate the assets across different tasks.
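A minimal sketch of how the gold layer could be written into PostgreSQL from PySpark over JDBC; the database name, table, credentials, and the presence of the PostgreSQL JDBC driver on the classpath are assumptions, not the project's actual configuration. In the project itself, dbt then adds an index on top of the warehouse tables to speed up search.

```python
from pyspark.sql import DataFrame

def load_gold_to_postgres(gold_df: DataFrame) -> None:
    """Write a gold-layer DataFrame into the PostgreSQL warehouse over JDBC."""
    (
        gold_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/warehouse")  # illustrative
        .option("dbtable", "gold.videos")                             # illustrative
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("overwrite")
        .save()
    )
```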
- MySQL
- Youtube API
- Polars
- MinIO
- Apache Spark
- PostgreSQL
- dbt
- Metabase
- Streamlit
- Dagster
- Docker
- Apache Superset
- Unittest
- Pytest
Here's what you can do with this project:
- You can completely change the logic or create new assets in the data pipeline as you wish, performing aggregate calculations on the assets in the pipeline according to your purposes (see the asset sketch after this list).
- You can also create new data charts, or change the existing charts as you like, using the extremely diverse chart types in Metabase and Apache Superset.
- You can also create new dashboards or change my existing ones as you like.
- Search videos quickly with any keyword in the Video Recommendation app.
- Search in many different languages, not just English, such as: Japanese, Canadian, German, Indian, Russian.
- Recommend videos based on the video's category and tags.
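For example, a new aggregate asset could look like the following sketch. The upstream asset name `gold_videos`, the column names, and an IO manager that handles Polars DataFrames are all assumptions for illustration, not the pipeline's real asset graph.

```python
import polars as pl
from dagster import asset

@asset
def views_by_category(gold_videos: pl.DataFrame) -> pl.DataFrame:
    """Total views per video category: a typical aggregate you could add.

    Dagster wires the upstream dependency by parameter name, so this asset
    runs after the (hypothetical) `gold_videos` asset has materialized.
    """
    return (
        gold_videos.group_by("category_name")
        .agg(pl.col("view_count").sum().alias("total_views"))
        .sort("total_views", descending=True)
    )
```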
During this project, I learned important skills, understood complex ideas, and learned how to install and set up popular, useful tools, all of which brought me closer to becoming a Data Engineer.
- Logical thinking: I learned how to think like a data person: find the cause of a data problem and then come up with the most reasonable solution to achieve high data accuracy.
- Architecture: I understand and grasp the ideas and architecture of Apache Spark, one of today's most popular big data processing tools.
- Installation: I learned how to install popular data processing, visualization, and storage tools such as Metabase, Streamlit, MinIO, ... with Docker.
- Setup: I know how to set up a Spark Standalone Cluster using Docker with three Worker Nodes on my local machine (sketched below).
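As an illustration of the driver side of that setup, here is a minimal PySpark sketch, assuming the standalone master is reachable as `spark-master:7077` inside the Docker network; the hostname, port, and resource settings are illustrative assumptions.

```python
from pyspark.sql import SparkSession

# Connect to a Dockerized Spark Standalone Cluster rather than local mode.
spark = (
    SparkSession.builder
    .appName("youtube-etl")
    .master("spark://spark-master:7077")    # standalone master in the Docker network
    .config("spark.executor.memory", "1g")  # spread work across the three workers
    .getOrCreate()
)
print(spark.sparkContext.master)  # should report the standalone master URL
```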
Each part of this project has helped me better understand how to build a data engineering and data management project, gain new knowledge, and improve my skills for future work.
- Add more data sources to increase data richness.
- Consider other data warehouses besides PostgreSQL, such as Amazon Redshift or Snowflake.
- Perform more cleaning and optimization processing of the data.
- Perform more advanced statistics, analysis, and calculations with Apache Spark.
- Check out other popular data orchestration tools like Apache Airflow.
- Separate dbt into its own service (a separate container) in Docker when the project expands.
- Set up the Spark Cluster on cloud platforms instead of on local machines.
- Consider cloud computing services if the project grows larger.
- Learn about dbt packages like dbt-labs/dbt_utils to help make the transformation process faster and more optimal.
To run the project in your local environment, follow these steps:
- Run the following command to clone the repository to your local machine.
   git clone https://github.com/longNguyen010203/Youtube-ETL-Pipeline.git
- Run the following commands to build the images from the Dockerfile, pull images from Docker Hub, and launch the services.
   make build
   make up
- Run the following commands to access the SQL editor in the terminal and check whether local_infile is turned on.
   make to_mysql_root
   SET GLOBAL local_infile=TRUE;
   SHOW VARIABLES LIKE "local_infile";
   exit
- Run the following commands to create tables with schemas for MySQL, load data from the CSV files into MySQL, and create tables with schemas for PostgreSQL.
   make mysql_create
   make mysql_load
   make psql_create
- Open http://localhost:3001 to view the Dagster UI and click the Materialize all button to run the pipeline.
- Open http://localhost:9001 to view the MinIO UI and check that the data has been loaded.
- Open http://localhost:8080 to view the Spark UI and confirm the three workers are running.
- Open http://localhost:3030 to see the charts and dashboards on Metabase.
- Open http://localhost:8501 to try out the video recommendation app on Streamlit.

