This repository documents my hands-on journey through the Big Data Summer Training Program organized by NTI in collaboration with ITIDA.
The program dives deep into Big Data tools, platforms, and real-world applications.
This repo contains:
- Jupyter Notebooks & Labs from each topic
- Technical notes & key takeaways
- Practice examples and use-case simulations
- Commands and setups used in the virtual environment
Topics covered:
- Big Data Era & Kunpeng Architecture
- HDFS + ZooKeeper – Distributed storage and cluster coordination
- HBase + Hive – NoSQL + distributed data warehouse (SQL-like)
- ClickHouse – OLAP database for real-time analytics
- MapReduce + YARN – Distributed processing engine and resource manager
- Spark + Flink – Batch + Stream processing with in-memory computing
- Flume + Kafka – Data ingestion and real-time messaging pipelines
- Elasticsearch – Distributed search engine and analytics
Tools and technologies:
| Tool/Tech | Use Case |
|---|---|
| Linux, SQL, Python | Foundations for scripting & querying |
| HDFS | Distributed data storage |
| Hive | SQL-style querying |
| HBase | NoSQL for large-scale datasets |
| Kafka | Real-time messaging system |
| Spark & Flink | Data processing engines |
| ClickHouse | High-performance analytics |
| Flume, Sqoop | Data ingestion from logs & DBs |
| Elasticsearch | Search and analytics |
| ZooKeeper | Cluster coordination |
This repo serves as:
- A personal reference and knowledge base
- A full recap of my learning journey
- A practical showcase for Big Data skills
Feel free to explore the notebooks, or reach out to me on LinkedIn if you'd like to collaborate or discuss Big Data topics!