Pipeline Engine

Overview

Pipeline Engine is a Java-based program built on the Spring Boot ecosystem that allows you to stream files from directory data sources to Cassandra as a sink data source. The engine supports two types of processing: data processing for large files that should be processed in distributed services using Spark for better scalability and performance, and event processing for small files.

Architecture

Generated with ChatGpt :-)

                                    +---------------------+
                                    |     Directory       |
                                    |     Data Source     |
                                    +----------^----------+
                                               |
                                               |
                                               v
                +--------------------------------------------------------------+
                |                     File Observer Service                    |
                |  +------------------------+      +-------------------------+ |
                |  |    Producer Worker     |      |      Observer Worker    | |
                |  +------------------------+      +-------------------------+ |
                |                                                              |
                +--------------------------------------------------------------+
                                               |
                                               |                      
                                               v                       
                              +----------------------------------+    
                              |              Kafka               |    
                              |          Message Broker          |    
                              +----------------------------------+    
                                   |                       |
                                   |                       |
                                   v                       v
                          +------------------+    +------------------+
                          |      event       |    |      Data        |
                          |    Processing    |    |    Processing    |
                          +------------------+    +------------------+
                                   |                       |
                                   |                       |
                                   v                       v
                              +----------------------------------+    
                              |              Kafka               |    
                              |          Message Broker          |    
                              +----------------------------------+    
                                               |
                                               |                      
                                               v    
                              +----------------------------------+    
                              |        Data Persistence          |    
                              |           Cassandra              |    
                              +----------------------------------+

File Observer Service

The file observer component consists of workers that utilize the Java WatchService to register relevant directories as data sources. They create jobs via an internal queue for each worker, which reads the job containing the file name and sends it to the relevant topic based on the configuration. It's important to note that at this stage, only the file name is passed and not the file itself. Each processing unit reads the file and performs the necessary processing.

File Data Processing Service

File Event Processing Service

Data Persists Service

Persistence

cassandra
scheme migration as part of docker file location n
schema erd

Requirements

Java 8 or above
Docker Engine

Improvements

Storage type can be local but to support better scalability and fault tolerance implementation should change to one of the follows:
- distribute file system (HDFS)
- cloud object storage
Large content file
- Consider to send pointer to file as event header (and not file content)
- use Kafka compress mechanizm
- compression / columnar storage apache parquet
Producer
- consider to bu

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.idea		.idea
event-processing		event-processing
file-data-processing		file-data-processing
file-observer		file-observer
schema/cassandra		schema/cassandra
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Pipeline Engine

Table of Contents

Overview

Architecture

File Observer Service

File Data Processing Service

File Event Processing Service

Data Persists Service

Persistence

Requirements

About

Uh oh!

Releases

Packages

Uh oh!

Languages

TzachiStrugo/pipeline-engine

Folders and files

Latest commit

History

Repository files navigation

Pipeline Engine

Table of Contents

Overview

Architecture

File Observer Service

File Data Processing Service

File Event Processing Service

Data Persists Service

Persistence

Requirements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages