- Objective: Accelerate large-scale vector search and clustering using GPUs.
- Implementations:
  - Distance Functions: L2, Cosine, Dot Product, and L1 distances implemented in:
    - NumPy: CPU baseline with vectorized operations.
    - CuPy: GPU-accelerated with CUDA kernels (e.g., `cp.linalg.norm`).
    - PyTorch: GPU tensors with native CUDA ops (e.g., `torch.norm`).
    - Triton: Custom block-wise kernels for scalability.
  - kNN: Exact nearest neighbor search across all frameworks, supporting all four distance metrics (a minimal sketch follows the Key Features list below).
  - K-Means: Clustering implemented in CuPy, PyTorch, and NumPy, using L2 distance.
  - ANN: Approximate search using K-Means clustering to reduce the search space, implemented in CuPy with L2 and Cosine distances.
- Key Features:
  - Batching and streaming to handle large datasets.
  - Parallel distance computations and top-K selection on GPU.
  - Adaptive memory management based on input size.
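A minimal sketch of the distance and exact-kNN building blocks described above (the shared NumPy/CuPy array API makes the CPU/GPU comparison direct; function names here are illustrative, not the repository's exact interfaces):

```python
import numpy as np
import cupy as cp

def l2_distance(xp, X, q):
    """Batched L2 distance from query q to every row of X; xp is numpy or cupy."""
    return xp.linalg.norm(X - q, axis=1)

def cosine_distance(xp, X, q):
    """Cosine distance 1 - cos(theta) between q and every row of X."""
    num = X @ q
    den = xp.linalg.norm(X, axis=1) * xp.linalg.norm(q)
    return 1.0 - num / den

def knn(xp, X, q, k=10):
    """Exact kNN: full distance computation followed by top-K selection."""
    d = l2_distance(xp, X, q)
    idx = xp.argpartition(d, k)[:k]      # unordered top-K in O(N)
    return idx[xp.argsort(d[idx])]       # sort only the K survivors

X = np.random.rand(4000, 1024).astype(np.float32)
q = np.random.rand(1024).astype(np.float32)
cpu_idx = knn(np, X, q)                             # CPU baseline
gpu_idx = knn(cp, cp.asarray(X), cp.asarray(q))     # same code path on GPU
```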
- Objective: Build an efficient RAG serving system for LLM inference.
- Implementations:
  - Baseline: Sequential request processing without queuing or batching.
  - Queued-Batched: Adds request queuing and dynamic batching.
  - Scaled-Balanced: Extends with round-robin load balancing and RPS-based autoscaling.
- Components:
  - Embedding with `multilingual-e5-large-instruct`.
  - Text generation with `OPT-125M`.
  - FastAPI-based server with asynchronous request handling.
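A minimal sketch of the baseline request path (the model names come from the report; the endpoint shape, the `retrieve` hook, and the loader choices are assumptions, not the project's API):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from transformers import pipeline

app = FastAPI()
embedder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
generator = pipeline("text-generation", model="facebook/opt-125m")

class Query(BaseModel):
    text: str

def retrieve(embedding, k=2):
    # Hypothetical hook into the Task 1 kNN index; returns the top-k documents.
    return "..."

@app.post("/rag")
async def rag(query: Query):
    emb = embedder.encode([query.text])     # embed the incoming query
    context = retrieve(emb)                 # nearest documents from the index
    prompt = f"Context: {context}\nQuestion: {query.text}\nAnswer:"
    out = generator(prompt, max_new_tokens=50)[0]["generated_text"]
    return {"answer": out}
```

The v0 baseline processes each request like this, end to end, before touching the next one; v1 and v2 change only what happens around this path.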
The repository is structured as follows, with emphasis on key files in `task-1` and `task-2`:
```
edin-mls-25-spring/
├── .gitignore
├── LICENSE
├── main.py
├── pyproject.toml
├── README.md
├── results.txt
├── uv.lock
├── misc-src/
│   └── memory-test.py
├── resources/                        # Coursework template examples
│   ├── 1-pytorch-demo/
│   └── 2-gpu-programming/
├── task-1/                           # Core Task 1 implementations
│   ├── task.py                       # Main script with distance, kNN, K-Means, and ANN functions
│   ├── chart_nb_knn.ipynb            # Notebook for kNN performance visualization
│   ├── chart_notebook_distance.ipynb # Notebook for distance function analysis
│   ├── compare_knn.py                # kNN benchmarking script
│   ├── compare_kmeans.py             # K-Means benchmarking script
│   ├── elbow_plot/                   # Elbow plot generation for K-Means
│   └── ...                           # Other work-in-progress files (e.g., torch_task.py)
├── task-2/                           # Task 2 serving system
│   ├── autoscaler.py                 # Autoscaling logic
│   ├── load_balancer_round_robin.py  # Round-robin load balancer
│   ├── request_queue.py              # Request queue implementation
│   ├── serving_rag_v0.py             # Baseline serving system
│   ├── serving_rag_v1.py             # Queued-Batched serving system
│   ├── tests/
│   │   └── test_end_to_end.py        # End-to-end benchmarking script
│   └── test_results/                 # Performance results and plots
└── ...                               # Additional files
```
- Key Files:
  - `task-1/task.py`: Central implementation of Task 1 algorithms.
  - `task-1/chart_nb_knn.ipynb`: Visualizes kNN performance across frameworks.
  - `task-2/serving_rag_v1.py`: Queued-Batched serving system.
  - `task-2/tests/test_end_to_end.py`: Benchmarking script for Task 2.
- Batching and Streaming: Uses CUDA streams (e.g., 2 streams) to overlap data transfer and computation, with batch sizes dynamically calculated (e.g., `optimum_knn_batch_size`) based on GPU memory (20 GB assumed).
- Memory Management: Preallocated buffers prevent OOM errors; adaptive batching adjusts to input size (e.g., full GPU load for datasets < 8 GB).
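A rough illustration of the stream-overlap pattern (two streams, as above; the fixed `batch` argument stands in for the report's `optimum_knn_batch_size` calculation, and true copy/compute overlap additionally requires pinned host memory):

```python
import cupy as cp

streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]

def batched_l2(X_host, q_dev, batch):
    """Alternate batches across two CUDA streams so the host-to-device copy
    of one batch can overlap with the distance computation of the other."""
    partials = []
    for i, start in enumerate(range(0, len(X_host), batch)):
        with streams[i % 2]:                                  # round-robin streams
            chunk = cp.asarray(X_host[start:start + batch])   # H2D copy on this stream
            partials.append(cp.linalg.norm(chunk - q_dev, axis=1))
    for s in streams:
        s.synchronize()                                       # wait for all batches
    return cp.concatenate(partials)
```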
- Custom Kernels: CuPy's `RawKernel` for L2 kNN uses shared-memory tiling and warp optimization (block size a multiple of 32). Triton kernels process data in 512/1024-element blocks.
- Parallelism: Distance computations and top-K selection are parallelized across GPU cores.
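A minimal Triton sketch of the block-wise pattern (one program per database vector, walking the feature dimension in 1024-element tiles; the kernel name and launch shape are illustrative, not the repository's kernels):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def l2_kernel(x_ptr, q_ptr, out_ptr, D, BLOCK: tl.constexpr):
    row = tl.program_id(0)                  # one program per database vector
    acc = 0.0
    for start in range(0, D, BLOCK):        # walk the vector in BLOCK-sized tiles
        offs = start + tl.arange(0, BLOCK)
        mask = offs < D                     # guard the ragged final tile
        x = tl.load(x_ptr + row * D + offs, mask=mask, other=0.0)
        q = tl.load(q_ptr + offs, mask=mask, other=0.0)
        diff = x - q
        acc += tl.sum(diff * diff, axis=0)
    tl.store(out_ptr + row, tl.sqrt(acc))

x = torch.randn(4096, 1024, device="cuda")
q = torch.randn(1024, device="cuda")
out = torch.empty(4096, device="cuda")
l2_kernel[(x.shape[0],)](x, q, out, x.shape[1], BLOCK=1024)
```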
- Queuing: A thread-safe FIFO queue (`collections.deque`) with locking smooths traffic bursts.
- Batching: Dynamic batching with `MAX_BATCH_SIZE=10` and `MAX_WAIT_TIME=0.5s` balances throughput and latency.
- Load Balancing: Round-robin distribution across instances.
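A sketch of the queue-plus-batching loop under those two knobs (`process_batch` is a hypothetical stand-in for the batched embed-and-generate step):

```python
import threading
import time
from collections import deque

MAX_BATCH_SIZE = 10      # values from the report
MAX_WAIT_TIME = 0.5      # seconds

request_queue = deque()  # FIFO; the lock makes pops safe across threads
lock = threading.Lock()

def batch_worker(process_batch):
    """Collect up to MAX_BATCH_SIZE requests, or as many as arrive within
    MAX_WAIT_TIME, then hand the whole batch to the model at once."""
    while True:
        deadline = time.time() + MAX_WAIT_TIME
        batch = []
        while len(batch) < MAX_BATCH_SIZE and time.time() < deadline:
            with lock:
                if request_queue:
                    batch.append(request_queue.popleft())
                    continue
            time.sleep(0.005)        # back off briefly when the queue is empty
        if batch:
            process_batch(batch)     # hypothetical batched embed + generate
```

The deadline caps the collection window, so batching adds at most roughly `MAX_WAIT_TIME` of queuing delay per request.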
- Autoscaling: RPS-based scaling (thresholds: 29 up, 25 down) with warm instances to reduce cold starts.
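A sketch of the threshold logic with hysteresis (the 29/25 RPS thresholds are from the report; `n_max` and the warm-pool mechanics are assumptions):

```python
SCALE_UP_RPS = 29        # thresholds from the report
SCALE_DOWN_RPS = 25

def autoscale(current_rps, n_instances, n_max=4):
    """Return the new instance count. The gap between the two thresholds
    (hysteresis) keeps the scaler from flapping around a single set point."""
    if current_rps > SCALE_UP_RPS and n_instances < n_max:
        return n_instances + 1   # activate a pre-warmed instance (no cold start)
    if current_rps < SCALE_DOWN_RPS and n_instances > 1:
        return n_instances - 1   # return an instance to the warm pool
    return n_instances
```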
- Distance Functions:
  - Small Dimensions (D=2, N_calcs=1): CPU (0.02 s) outperforms GPU (Torch: 0.12×, CuPy: 0.03× relative speedup) due to transfer overhead.
  - Large Dimensions (D=2^15, N_calcs=1000): GPU excels, with CuPy (1.79×) and Torch (1.66×) speedups over the CPU baseline (453.23 s).
  - Full results in LaTeX report Table 1.
- kNN:
  - Tested with N=[4K, 40K, 400K, 4M], D=1024, K=10:
    - Small N (4K): CPU and CuPy fastest due to low overhead.
    - Large N (4M): GPU frameworks 2-4× faster than CPU (except dot product, where CPU excels due to BLAS optimization).
    - CuPy and Triton lead, with minor variations.
- K-Means:
  - Tested with N=[4K, 40K, 400K, 1M, 2M, 4M], D=[2, 1024], K=10:
    - GPU (CuPy, Torch) scales better than CPU, with significant speedups at higher N.
    - Optimal scaling factors: CuPy (0.025), Torch (0.08).
- ANN:
  - Reduces the search space via clustering, achieving high recall with 5-10× faster query times than exact kNN (see the sketch below).
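As a rough illustration of that strategy, an IVF-style query in CuPy (the function name and `n_probe` parameter are illustrative; the repository's ANN uses K-Means clusters with L2 and Cosine distances):

```python
import cupy as cp

def ann_query(query, data, centroids, assignments, k=10, n_probe=2):
    """Scan only the n_probe K-Means clusters nearest the query, then run
    exact kNN on that reduced candidate set."""
    cd = cp.linalg.norm(centroids - query, axis=1)       # query-to-centroid L2
    probe = cp.argsort(cd)[:n_probe]                     # cluster ids to search
    mask = (assignments[:, None] == probe[None, :]).any(axis=1)
    candidates = cp.where(mask)[0]                       # rows in probed clusters
    d = cp.linalg.norm(data[candidates] - query, axis=1)
    top = cp.argsort(d)[:k]                              # exact search on the subset
    return candidates[top], d[top]
```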
- RAG Serving:
  - Baseline: Saturates at 13 RPS, with mean latency >28 s and P99 >40 s at 35 RPS.
  - Queued-Batched: Scales to 30 RPS, with mean latency <1 s and P99 <2 s up to saturation.
  - Scaled-Balanced: Reaches 52 RPS, with mean latency <1.5 s and P99 <1.5 s under 60 RPS load.
  - Figures in LaTeX report (e.g., Figure 3).
- Task 1: GPU acceleration offers 5-10× speedups for large-scale vector search over CPU, with CuPy providing the best performance-efficiency balance. Dot product is an outlier where CPU excels at scale.
- Task 2: Queuing and batching boost throughput (30 RPS vs. 13 RPS), while load balancing and autoscaling further scale to 52 RPS, maintaining low latency.
- Synergy: Task 1 speedups accelerate Task 2 retrieval, but the system-level optimizations of Task 2 are what turn raw kernel performance into end-to-end throughput.