Instructor: Mr. Lorenzo Sta. Maria, Data Scientist at Globe Telecom
This course explores data optimization using PySpark, covering big data concepts, data pipelines, machine learning, and data transformation techniques. The goal is to improve data processing efficiency and model performance.
- Notebooks Folder: Contains Jupyter notebooks, organized by week, for the topics covered in data optimization:
  - Week 1 - Spark Setup: Understanding Big Data, Hadoop, and the Spark Ecosystem
  - Week 2 - Data Sources: HDFS Basics, Hive, PySpark Essentials
  - Week 3 - Basic Statistics: Measures of Central Tendency, Hypothesis Testing
  - Week 4 - Pipelines: Data Engineering Pipelines, ETL vs. ELT
  - Week 5 - Extracting, Transforming, and Selecting Features: Data Preparation and Feature Engineering
  - Week 7 - Supervised Learning (Regression): Linear Regression, Model Evaluation
  - Week 8 - Supervised Learning (Classification): Logistic Regression, Classification Metrics