This repository contains an implementation of the Mean Shift algorithm using CUDA for the course "Parallel and Distributed Systems." The project aims to evaluate the performance improvements achieved through parallelizing the algorithm. Three versions have been developed: a serial implementation, and two parallel versions—one using shared memory and one without.
- GCC for compiling C code
- NVIDIA CUDA Toolkit for compiling CUDA code
- Docker (optional, for running in a container)
- serial.c - Serial implementation of the Mean Shift algorithm, serving as the baseline for comparison.
- nonshared.cu - Parallel implementation without the utilization of shared memory.
- shared.cu - Parallel implementation that utilizes shared memory to increase speed.
- input.txt - Input data for the algorithm.
- output_reference.txt - Reference data used to verify the algorithm's output.
- output.txt - Contains the results from the most recent run of any implementation.
- graph.png - Graph depicting the results of a test run in a distributed system.
- Makefile - Build rules and run targets with GPU flags.
- Dockerfile - Multi-stage build for a minimal CUDA runtime image.
Build all programs with:
make all
This will compile the following binaries:
- serial - For the serial implementation.
- nonshared - For the CUDA implementation without shared memory.
- shared - For the CUDA implementation with shared memory.
Run all the programs with:
make run_all
Or run a specific version with:
make run_serial
Specify the convergence criterion `e` for the Mean Shift algorithm by passing it as a parameter:
make run_all INPUT=0.05
Or for a specific version:
make run_shared INPUT=0.05
If not specified, the default value of 0.5 is used.
Build the Docker image:
docker build -t mean-shift-cuda .
Run with default parameters:
docker run --gpus all mean-shift-cuda
- This runs `make run_all` inside the container, using `INPUT=0.5`.
Example of overriding the command:
docker run --gpus all mean-shift-cuda make run_shared INPUT=0.01
- Arrays: Instead of using multi-dimensional arrays, the data is represented using one-dimensional arrays. This approach simplifies memory access and enhances performance in CUDA. Arrays are sized as `rows * columns`, with each data point occupying two consecutive elements (see the indexing sketch after this list).
- Dynamic Arrays: The `arrayStatic`, `arrayYk`, and `arrayYkplus1` arrays are dynamically allocated and initialized with the data from `input.txt`. At the start of the Mean Shift algorithm, `arrayYk` is set to the initial data points, i.e., a copy of `arrayStatic` (see the setup sketch after this list).
- Convergence Criterion (`e`): The convergence parameter `e` is converted to a float using `atof()`. Execution times are recorded from the start to the finish of the algorithm for each data point, to assess the impact of different values of `e`; the setup sketch below also shows this pattern.
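As a concrete illustration of the flat layout, here is a minimal C sketch; the `COLUMNS` constant and the sample values are assumptions for illustration, not the repository's identifiers:

```c
#include <stdio.h>

#define COLUMNS 2  /* each data point has two coordinates */

int main(void) {
    int rows = 3;                    /* number of data points */
    float data[] = {1.0f, 2.0f,      /* point 0 */
                    3.0f, 4.0f,      /* point 1 */
                    5.0f, 6.0f};     /* point 2 */

    /* Point i occupies the two consecutive elements
       data[i * COLUMNS] and data[i * COLUMNS + 1]. */
    for (int i = 0; i < rows; i++)
        printf("point %d: (%.1f, %.1f)\n",
               i, data[i * COLUMNS], data[i * COLUMNS + 1]);
    return 0;
}
```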
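And a hedged sketch of the host-side setup described in the last two items; the fixed row count, the file-reading loop, and the `clock()`-based timing are assumptions for illustration, not necessarily what the sources do:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COLUMNS 2

int main(int argc, char **argv) {
    /* Convergence criterion e, defaulting to 0.5 as described above. */
    float e = (argc > 1) ? (float)atof(argv[1]) : 0.5f;

    int rows = 600;  /* assumed point count; the real code derives it from input.txt */
    size_t n = (size_t)rows * COLUMNS;

    float *arrayStatic  = malloc(n * sizeof(float));
    float *arrayYk      = malloc(n * sizeof(float));
    float *arrayYkplus1 = malloc(n * sizeof(float));

    FILE *fp = fopen("input.txt", "r");
    if (!fp) { perror("input.txt"); return 1; }
    for (size_t i = 0; i < n; i++)
        if (fscanf(fp, "%f", &arrayStatic[i]) != 1) break;
    fclose(fp);

    /* y_0 starts at the original points: arrayYk begins as a copy of arrayStatic. */
    memcpy(arrayYk, arrayStatic, n * sizeof(float));

    clock_t start = clock();
    /* ... iterate Mean Shift, writing each step into arrayYkplus1 and
       swapping, until every point moves less than e ... */
    clock_t end = clock();
    printf("e = %g, elapsed = %.3f ms\n",
           e, 1000.0 * (end - start) / CLOCKS_PER_SEC);

    free(arrayStatic); free(arrayYk); free(arrayYkplus1);
    return 0;
}
```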
Performance tests were conducted on a distributed system, comparing the shared and non-shared memory implementations across a range of convergence thresholds `e`.
The shared memory implementation benefits most when the convergence criterion `e` is small: a tighter threshold requires more iterations, which amplifies the gains from faster memory access. A graph of the results is shown below:

![Execution times of the shared and non-shared implementations for different values of e](graph.png)

The graph visualizes the execution times, demonstrating the advantage of the shared memory implementation, particularly for smaller values of `e`.
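To illustrate why shared memory helps here, the sketch below shows one tiled Mean Shift step in CUDA. The kernel name, the tile size, and the flat kernel of radius `h` are assumptions for illustration and do not necessarily match the code in `shared.cu`:

```cuda
#define COLUMNS 2
#define TILE 256  /* threads per block; also the shared-memory tile width */

/* One Mean Shift step with a flat kernel of radius h. Each thread moves one
 * point yk[i] to the mean of all original points within distance h of it.
 * Tiles of the static points are staged in shared memory, so each tile is
 * read from global memory once per block instead of once per thread.
 * Launch as: meanShiftStep<<<(rows + TILE - 1) / TILE, TILE>>>(...) */
__global__ void meanShiftStep(const float *staticPts, const float *yk,
                              float *ykNext, int rows, float h) {
    __shared__ float tile[TILE * COLUMNS];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float yx = (i < rows) ? yk[i * COLUMNS]     : 0.0f;
    float yy = (i < rows) ? yk[i * COLUMNS + 1] : 0.0f;
    float sumX = 0.0f, sumY = 0.0f;
    int count = 0;

    for (int base = 0; base < rows; base += TILE) {
        /* Cooperatively load one tile of the static points. */
        int j = base + threadIdx.x;
        if (j < rows) {
            tile[threadIdx.x * COLUMNS]     = staticPts[j * COLUMNS];
            tile[threadIdx.x * COLUMNS + 1] = staticPts[j * COLUMNS + 1];
        }
        __syncthreads();

        /* Accumulate neighbors from the tile (reads hit shared memory). */
        int limit = min(TILE, rows - base);
        for (int t = 0; t < limit; t++) {
            float dx = tile[t * COLUMNS]     - yx;
            float dy = tile[t * COLUMNS + 1] - yy;
            if (dx * dx + dy * dy <= h * h) {
                sumX += tile[t * COLUMNS];
                sumY += tile[t * COLUMNS + 1];
                count++;
            }
        }
        __syncthreads();
    }

    if (i < rows && count > 0) {
        ykNext[i * COLUMNS]     = sumX / count;
        ykNext[i * COLUMNS + 1] = sumY / count;
    }
}
```

With this staging, each element is fetched from global memory once per block rather than once per thread, and the saving repeats every iteration, which is consistent with the advantage growing as smaller values of `e` force more iterations.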
Moreover, performance tests were conducted on two different computers:
- System A (Local): Intel Core i5-8300H @ 2.30 GHz + GeForce GTX 1060
- System B (Docker): Intel Core i9-14900HX @ 2.20 GHz + GeForce RTX 5060 Ti
The results produced on the two systems for `e = 0.005` are shown in the table below:
| Implementation | System A (GTX 1060) | System B (RTX 5060 Ti) | Speedup (A→B) |
|---|---|---|---|
| Serial | 20.11 ms | 9.16 ms | 2.20× |
| Non-shared | 2.04 ms | 0.75 ms | 2.72× |
| Shared | 1.06 ms | 0.61 ms | 1.74× |
Speedups of the CUDA implementations over the serial baseline on each system:
| Implementation | System A Speedup (Serial → Parallel) | System B Speedup (Serial → Parallel) |
|---|---|---|
| Non-shared | 20.11 / 2.04 ≈ 9.9× | 9.16 / 0.75 ≈ 12.2× |
| Shared | 20.11 / 1.06 ≈ 19.0× | 9.16 / 0.61 ≈ 15.0× |
- Shared Memory Utilization - Using shared memory is particularly beneficial for smaller values of the convergence criterion `e`, as demonstrated by the faster execution times for `e = 0.1`, `0.05`, and `0.01`.
- Performance Enhancement - Parallelization with CUDA significantly reduces the execution time compared to the serial implementation, highlighting the effectiveness of CUDA in enhancing algorithmic performance in distributed systems.
Further development could include adding tests at each stage of the algorithm and experimenting with CUDA optimizations for specific GPU architectures. Profiling could also provide insights into potential performance bottlenecks.