This repository showcases an end-to-end pipeline for performing image super-resolution at scale using FSRCNN and Apache Spark. The architecture is split into three main clusters—Ingestion, Processing, and Notifications—to handle data ingestion, distributed image processing, and final notifications or downstream consumption.
- Architecture Overview
- Ingestion Cluster
- Processing Cluster
- Notifications Cluster
- FSRCNN Model
- Pipeline Steps
- Prerequisites
- Running the Pipeline
- Automated Entry Point
- Contributing
- License
Below is a high-level view of the complete architecture, illustrating how images progress from ingestion through processing to final consumption.
- Ingestion Cluster
- Fetches image data (e.g., from an external API or user uploads).
- Publishes image references or metadata to Kafka.
- Stores images in an object storage system (e.g., S3) partitioned by ingestion date or image ID.
- Processing Cluster
- Uses Apache Spark for distributed processing.
- Splits images into patches, applies super-resolution, and reconstructs final images.
- Writes processed images back to object storage.
- Notifications Cluster
- Publishes an event via REST callback once images are processed.
- Notifies downstream consumers or microservices that the new, high-resolution images are available.
- The notification service is an independent Flask app that visualizes the processed images and displays the evaluation metrics.
- Producer Service: Retrieves images from an external API (e.g., Picsum or any custom data source).
- Kafka Topics: Act as a buffer and reliable transport mechanism for high volumes of image references.
- Consumer Service: Reads messages from Kafka, downloads the images, and stores them in an object storage bucket (such as S3) for later processing.
- Threading: The Kafka producer and consumer run in separate threads to allow concurrent processing and high throughput.
This decouples the image acquisition rate from the downstream processing speed.
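The sketch below illustrates this threaded producer/consumer pattern. It is a minimal example, not the repository's actual services: the broker address (localhost:9092), the Picsum URL, the message schema, and the local data/raw target directory are all assumptions.

```python
# Minimal sketch of the threaded Kafka producer/consumer pattern described above.
# Broker address, topic payload layout, and storage path are assumptions.
import json
import threading
import requests
from confluent_kafka import Producer, Consumer

TOPIC = "images-topic"
BROKER = "localhost:9092"

def produce_image_refs(n_images: int = 10) -> None:
    """Publish image references (URL + metadata) to Kafka."""
    producer = Producer({"bootstrap.servers": BROKER})
    for i in range(n_images):
        ref = {"image_id": i, "url": f"https://picsum.photos/256?random={i}"}
        producer.produce(TOPIC, key=str(i), value=json.dumps(ref))
    producer.flush()

def consume_and_store(storage_dir: str = "data/raw") -> None:
    """Read references from Kafka, download the images, and store them locally (or push to S3)."""
    consumer = Consumer({
        "bootstrap.servers": BROKER,
        "group.id": "ingestion-consumers",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([TOPIC])
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        ref = json.loads(msg.value())
        image_bytes = requests.get(ref["url"], timeout=10).content
        with open(f"{storage_dir}/{ref['image_id']}.jpg", "wb") as f:
            f.write(image_bytes)

# Run both ends concurrently so ingestion is decoupled from download speed.
threading.Thread(target=consume_and_store, daemon=True).start()
threading.Thread(target=produce_image_refs, daemon=True).start()
```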
Prerequisite: We run PySpark through a Jupyter notebook and assume that the SparkContext and SQLContext are already defined; in a real-world deployment we would prefer to create a SparkSession explicitly and have full control over it. The processing cluster orchestrates the Spark ETL pipeline for super-resolution:
- Fetch Batch: Reads raw images from the S3 input bucket (or local storage, for a proof of concept).
- Patch Extraction (mass_split.py):
  - Reads each image, converts it to grayscale (if desired), and normalizes pixel values to [0, 1].
  - Splits the image into patches of size patch_size × patch_size.
  - Writes the patches as a Parquet dataset.
- Model Inference (batch_inference.py) — a minimal sketch of this step appears after this list:
  - Reads the patches dataset.
  - Broadcasts the FSRCNN model to all Spark executors.
  - Runs super-resolution on each patch, converting the values back to [0, 255].
  - Writes the super-resolved patches to another Parquet dataset.
- Reconstruction (mass_reconstruct.py):
  - Groups super-resolved patches by image_id and stitches them together.
  - Saves the reconstructed high-resolution images (e.g., PNG) to the output bucket.
- Write Final Output: The final images are stored in the S3 output bucket (or local directory).
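As a rough illustration of the inference step, here is a minimal sketch. It assumes a hypothetical fsrcnn module exposing the FSRCNN class, a checkpoint at models/fsrcnn_x2.pth containing a plain state_dict, 32×32 float32 patches stored as binary in a patch column, and ×2 upscaling; the repository's batch_inference.py may differ in detail.

```python
# Sketch of the broadcast-and-infer step (batch_inference.py style).
# Model class, checkpoint path, patch size, and column names are assumptions.
import numpy as np
import torch
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

spark = SparkSession.builder.appName("fsrcnn-batch-inference").getOrCreate()

# Load the weights once on the driver and broadcast them to all executors.
state_dict = torch.load("models/fsrcnn_x2.pth", map_location="cpu")
bc_weights = spark.sparkContext.broadcast(state_dict)

def super_resolve(patch_bytes):
    """Rebuild the model from the broadcast weights and upscale one normalized patch."""
    from fsrcnn import FSRCNN  # hypothetical module providing the architecture
    model = FSRCNN(scale_factor=2)
    model.load_state_dict(bc_weights.value)
    model.eval()
    patch = np.frombuffer(patch_bytes, dtype=np.float32).reshape(1, 1, 32, 32).copy()
    with torch.no_grad():
        out = model(torch.from_numpy(patch)).numpy()
    # Convert back from [0, 1] to [0, 255] before writing the Parquet output.
    return np.clip(out * 255.0, 0, 255).astype(np.uint8).tobytes()

# Row-at-a-time UDF for readability; real code would load the model once per
# partition (e.g., mapPartitions or a pandas UDF) to avoid repeated setup cost.
sr_udf = udf(super_resolve, BinaryType())

patches = spark.read.parquet("data/patches")
result = patches.withColumn("sr_patch", sr_udf("patch"))
result.write.mode("overwrite").parquet("data/inference_result")
```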
Once the super-resolved images are produced, the Notifications Cluster reads them via a REST API and displays the images that are ready; each such event triggers an update of the visualized images. The notification service shows the last 10 images (changeable in code) together with the evaluated metrics, and refreshes every 90 seconds (also modifiable).
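A bare-bones version of such a service, run in its own thread so it does not block the rest of the notebook, could look like the sketch below. The output directory, route, and the use of an HTML meta-refresh for the 90-second update are assumptions; the actual app also renders the images and their metrics.

```python
# Minimal sketch of a Flask notification service running in a separate thread.
# Directory, port, and page layout are assumptions, not the repo's exact code.
import os
import threading
from flask import Flask

app = Flask(__name__)
OUTPUT_DIR = "data/reconstructed"

@app.route("/")
def latest_images():
    # List the 10 most recently reconstructed images; the page auto-refreshes
    # every 90 seconds via a meta-refresh tag.
    files = sorted(
        os.listdir(OUTPUT_DIR),
        key=lambda f: os.path.getmtime(os.path.join(OUTPUT_DIR, f)),
        reverse=True,
    )[:10]
    items = "".join(f"<li>{name}</li>" for name in files)
    return (
        '<html><head><meta http-equiv="refresh" content="90"></head>'
        f"<body><ul>{items}</ul></body></html>"
    )

# Run Flask in a daemon thread so the main notebook process keeps going.
threading.Thread(target=lambda: app.run(port=5000), daemon=True).start()
```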
We use FSRCNN (Fast Super-Resolution Convolutional Neural Network), a lightweight and efficient model for image super-resolution. Key benefits include:
- Speed: Faster inference compared to many other super-resolution architectures.
- Quality: Significant improvements in visual clarity and detail.
- Simplicity: Easy integration with PyTorch and PySpark for distributed inference.
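For reference, below is a compact PyTorch sketch of the FSRCNN architecture (feature extraction, shrinking, mapping, expanding, and a final deconvolution). The hyperparameters follow the original paper's defaults (d=56, s=12, m=4) and are not necessarily those of the checkpoint shipped with this repository.

```python
# Compact FSRCNN sketch in PyTorch; layer sizes follow the original paper's defaults.
import torch
import torch.nn as nn

class FSRCNN(nn.Module):
    def __init__(self, scale_factor: int = 2, d: int = 56, s: int = 12, m: int = 4):
        super().__init__()
        # Feature extraction: 5x5 conv on the single-channel (grayscale) input.
        self.feature_extraction = nn.Sequential(nn.Conv2d(1, d, kernel_size=5, padding=2), nn.PReLU(d))
        # Shrinking: 1x1 conv reduces the channel count before the mapping layers.
        self.shrinking = nn.Sequential(nn.Conv2d(d, s, kernel_size=1), nn.PReLU(s))
        # Non-linear mapping: m stacked 3x3 convs in the shrunken feature space.
        mapping = []
        for _ in range(m):
            mapping += [nn.Conv2d(s, s, kernel_size=3, padding=1), nn.PReLU(s)]
        self.mapping = nn.Sequential(*mapping)
        # Expanding: 1x1 conv restores the channel count.
        self.expanding = nn.Sequential(nn.Conv2d(s, d, kernel_size=1), nn.PReLU(d))
        # Deconvolution performs the actual upscaling by scale_factor.
        self.deconvolution = nn.ConvTranspose2d(
            d, 1, kernel_size=9, stride=scale_factor, padding=4, output_padding=scale_factor - 1
        )

    def forward(self, x):
        x = self.feature_extraction(x)
        x = self.shrinking(x)
        x = self.mapping(x)
        x = self.expanding(x)
        return self.deconvolution(x)

# Single-patch smoke test: a 32x32 grayscale patch upscaled by 2x.
model = FSRCNN(scale_factor=2)
out = model(torch.rand(1, 1, 32, 32))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```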
- Ingestion
- Pull images from an external API or user input.
- Store them in S3 (or a local folder) for Spark consumption.
- Patch Extraction (mass_split.py)
  - Convert images to grayscale and normalize to [0, 1].
  - Split images into patches.
  - Save patches as a Parquet dataset.
- Model Inference (batch_inference.py)
  - Read the Parquet dataset.
  - Apply FSRCNN super-resolution to each patch.
  - Rescale patch values to [0, 255] after inference.
  - Save the super-resolved patches as a Parquet dataset.
- Reconstruction (mass_reconstruct.py) — see the split/stitch sketch after this list.
  - Group patches by image_id and reconstruct high-resolution images.
  - Save final images (PNG format) to the output bucket.
- Notification
- Display the created super-resolution images alongside the high-resolution images, together with the evaluated metrics.
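The split and stitch logic referenced above boils down to simple array slicing. Here is a minimal sketch, assuming image dimensions divisible by patch_size and ignoring padding and border handling; mass_split.py and mass_reconstruct.py may handle these cases differently.

```python
# Sketch of the split/stitch logic behind mass_split.py and mass_reconstruct.py.
# Assumes image dimensions are a multiple of patch_size (padding/cropping omitted).
import numpy as np
from PIL import Image

def split_into_patches(path: str, patch_size: int = 32):
    """Grayscale, normalize to [0, 1], and cut the image into patch_size x patch_size tiles."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
    h, w = img.shape
    patches = []
    for row in range(0, h, patch_size):
        for col in range(0, w, patch_size):
            patches.append(((row, col), img[row:row + patch_size, col:col + patch_size]))
    return (h, w), patches

def reconstruct(shape, sr_patches, upscale_factor: int = 2):
    """Place each super-resolved uint8 patch back at its (scaled) position."""
    h, w = shape
    canvas = np.zeros((h * upscale_factor, w * upscale_factor), dtype=np.uint8)
    for (row, col), patch in sr_patches:
        r, c = row * upscale_factor, col * upscale_factor
        canvas[r:r + patch.shape[0], c:c + patch.shape[1]] = patch
    return Image.fromarray(canvas)
```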
- Python: Version 3.9 (avoid newer versions for compatibility).
- Apache Spark: Tested on Spark 3.3.0.
- PyTorch: Required for FSRCNN.
- PySpark: For running Spark jobs in Python.
- Pillow (PIL): For image manipulation.
- Jupyter Notebook: PySpark currently runs through Jupyter Notebook, and we assume that the SparkContext and SQLContext are created automatically.
- Flask: Must run in a separate thread, otherwise it blocks the rest of the pipeline.
- Confluent Kafka: Must be installed for the Kafka producer and consumer services.
Ensure that all required Python libraries are listed in your requirements.txt.
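If you need a starting point, a minimal requirements.txt could look like the following. Only Spark 3.3.0 and Python 3.9 are stated explicitly in this README; the remaining entries are unpinned, and numpy, requests, and pyyaml are assumptions rather than confirmed dependencies.

```text
# Illustrative requirements.txt; pin versions to match your environment.
pyspark==3.3.0
torch
pillow
flask
confluent-kafka
jupyter
numpy
requests
pyyaml
```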
Run everything from the project root directory.
- Run Kafka Start Script (recommended): The script starts the Zookeeper and Kafka services and creates a new topic, "images-topic". You might need to change the paths in the script to your desired Kafka directory location.

  ./scripts/kafka_start.sh
- Configure the Paths in config.yaml (Optional): The config is created dynamically in main; an illustrative layout is shown below.
  - dataset.low_resolution_dir: Directory (or S3 path) with raw images.
  - dataset.patches_dir: Output location for extracted patches.
  - dataset.inference_result_dir: Output location for super-resolved patches.
  - dataset.reconstructed_images_dir: Final directory for high-resolution images.
  - processing.model_path: Path to your FSRCNN model checkpoint.
  - processing.patch_size: Patch size used for splitting and reconstruction.
  - processing.upscale_factor: The super-resolution scale factor (e.g., 2 or 4).
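An illustrative config.yaml with placeholder paths and values (the actual defaults are generated dynamically in main):

```yaml
# Example layout only; adjust paths and values to your environment.
dataset:
  low_resolution_dir: data/raw
  patches_dir: data/patches
  inference_result_dir: data/inference_result
  reconstructed_images_dir: data/reconstructed
processing:
  model_path: models/fsrcnn_x2.pth
  patch_size: 32
  upscale_factor: 2
```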
- Run the Jupyter notebook main.ipynb: The notebook runs the entire program, including the ingestion cluster, the processing cluster, and the notification service.
- Stop the Kafka Services: The script stops the Zookeeper and Kafka services. You might need to change the paths in the script to your desired Kafka directory location. Note that the topic will NOT be removed when the services are stopped.

  ./scripts/kafka_stop.sh
Contributions and feedback are welcome! If you have suggestions for improvements, new features, or bug fixes, please open a pull request or create an issue.
This project is provided under an open-source license. See LICENSE for details.
Enjoy scaling your images with FSRCNN, Apache Kafka, and Spark in a fully distributed environment.


