A sandbox environment designed to simulate a pseudo-distributed Hadoop cluster with integrated Apache Spark and Kafka components. It allows developers to prototype and experiment with big data workflows, test distributed computing patterns, and explore cluster behavior in a contained virtual setup.

imjuliengaupin/sparkler

Prerequisites

  1. Clone the repo

    git clone https://github.com/imjuliengaupin/sparkler.git

Usage

⚙️ Features

  • Modular and configurable; works with a locally installed, pseudo-distributed Apache Hadoop cluster

  • Apache Spark structured event streaming with Apache Kafka

  • Distributed Extract-Transform-Load (ETL) data processing with Apache Spark

    • A custom Suite class, built on object-oriented abstraction, for creating independent, modular objects that share common functionality. Each object connects to a different database and extracts data into a DataFrame, which is then transformed using the DataFrame API

    • A custom Suite class, built on object-oriented abstraction, for creating independent, modular objects that share common functionality. Each object extracts the content of a different file type into a DataFrame, which is then transformed using the DataFrame API

      Supported formats: .json, .xml, .yaml
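
The Suite abstraction described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names FileSuite, JsonSuite, XmlSuite, and suite_for are hypothetical, and the sketch returns plain Python rows rather than a Spark DataFrame so that it runs without a cluster. YAML is left out because it needs a third-party parser.

```python
import json
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from pathlib import Path


class FileSuite(ABC):
    """Hypothetical base class: shared logic here, format parsing in subclasses."""

    def __init__(self, path):
        self.path = Path(path)

    @abstractmethod
    def parse(self, text):
        """Turn raw file text into a list of row dictionaries."""

    def extract(self):
        # Common functionality shared by every format: read, parse, validate.
        rows = self.parse(self.path.read_text(encoding="utf-8"))
        if not all(isinstance(row, dict) for row in rows):
            raise ValueError(f"{self.path.name}: every row must be a mapping")
        return rows


class JsonSuite(FileSuite):
    def parse(self, text):
        data = json.loads(text)
        # Accept either a single object or an array of objects.
        return data if isinstance(data, list) else [data]


class XmlSuite(FileSuite):
    def parse(self, text):
        root = ET.fromstring(text)
        # Each child of the root is one row; its own children are the columns.
        return [{field.tag: field.text for field in record} for record in root]


def suite_for(path):
    """Pick the suite class that matches the file extension."""
    suites = {".json": JsonSuite, ".xml": XmlSuite}
    return suites[Path(path).suffix.lower()](path)
```

With a running SparkSession, the rows returned by extract() could presumably be handed to spark.createDataFrame(rows) before applying transformations, although a real implementation may prefer Spark's own file readers.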

See the open issues for a full list of proposed features (and known issues).
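
The Kafka streaming feature listed above might be wired up along these lines. This is a sketch under stated assumptions: the helper names kafka_source_options and read_kafka_stream are hypothetical, the broker address and topic are placeholders, and the Spark session is assumed to have the Kafka connector package available.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    frame = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers key and value as binary, so cast both to strings.
    return frame.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )


# Placeholder values: a local broker and a topic named "events".
options = kafka_source_options("localhost:9092", "events")
```

Keeping the options in a plain dictionary lets the same reader function target different brokers or topics without code changes.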

🔁 CI/CD


💻 Demo

If this project interests you and you want to share insights, enhancements, or bug fixes, please feel free to contribute!

  1. Fork the project
  2. Create your feature branch: git checkout -b feature/branchname
  3. Commit your changes: git commit -m 'description'
  4. Push your feature branch: git push origin feature/branchname
  5. Open a pull request

📝 License

Distributed under the MIT License. See LICENSE for more information.
