Fast matrix multiplication in Python using Intel MKL
Because sometimes NumPy just isn't fast enough.
Ever found yourself waiting around for large matrix multiplications to finish? Yeah, me too. That's why I built Extreme MatMul - a custom Python extension that leverages Intel's Math Kernel Library (MKL) to make matrix multiplication blazingly fast.
This project started as a deep dive into understanding how high-performance computing libraries work under the hood. What I discovered was pretty eye-opening: with the right approach, we can achieve some serious performance gains over standard NumPy operations.
Here's what really matters - the numbers don't lie:
Large matrices (seconds, lower is better):
===============================================
extreme_matmul.fast_matmul : 1.74 seconds ⚡
NumPy @ operator           : 7.14 seconds 🐌
torch optim                : 1.97 seconds
===============================================
Small matrices (milliseconds, lower is better):
===============================================
extreme_matmul.fast_matmul : 0.35 ms ⚡
NumPy @ operator           : 19.25 ms
torch optim                : 9.96 ms
===============================================
(Tests were run on an Intel(R) Xeon(R) CPU @ 2.00GHz.)
That's roughly 4x faster than NumPy for large matrices and 55x faster for smaller ones!
The magic happens through a few key optimizations:
- Intelligent Algorithm Selection: For tiny matrices (≤32×32), we use a simple triple-loop implementation that's actually faster due to reduced overhead. For everything else, we call Intel MKL's optimized BLAS routines.
- Direct MKL Integration: Instead of going through NumPy's abstraction layers, we talk directly to Intel's Math Kernel Library - the same engine that powers many scientific computing applications (see the ctypes sketch just after this list).
- Memory Layout Optimization: The code ensures data is contiguous in memory and properly aligned for SIMD operations.
- Flexible Broadcasting: Supports 1D, 2D, and 3D arrays with proper broadcasting rules, just like NumPy.
You'll need Intel MKL installed on your system. Here's how to get everything set up:
# Update your system
sudo apt update
sudo apt install -y gpg-agent wget
# Add Intel's repository
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
# Install Intel MKL
sudo apt update
sudo apt install intel-oneapi-mkl-devel
# Set up the environment and install
source /opt/intel/oneapi/setvars.sh
pip install git+https://github.com/dellano54/extreme-matmul.git
Prefer to build from source? Clone the repo and build in place:
git clone https://github.com/dellano54/extreme-matmul.git
cd extreme-matmul
source /opt/intel/oneapi/setvars.sh # Make sure MKL is in your environment
python setup.py build_ext --inplace
It's designed to be a drop-in replacement for NumPy's matrix multiplication:
import numpy as np
import extreme_matmul
# Create some test matrices
A = np.random.rand(1000, 1000).astype(np.float32)
B = np.random.rand(1000, 1000).astype(np.float32)
# Use extreme_matmul instead of np.matmul
result = extreme_matmul.matmul(A, B)
# That's it! Same API, much faster performance
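Since float32 accumulation order differs between BLAS implementations, a quick sanity check against NumPy should use a tolerance rather than exact equality. Something like this (the matrix size is arbitrary):

import numpy as np
import extreme_matmul

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)

# allow a small tolerance for float32 rounding differences between code paths
assert np.allclose(extreme_matmul.matmul(A, B), A @ B, rtol=1e-3, atol=1e-4)
print("matches NumPy within tolerance")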
It handles the same shape combinations as np.matmul:
- Vector × Vector: Dot product
- Matrix × Vector: Matrix-vector multiplication
- Matrix × Matrix: Standard matrix multiplication
- Batch Operations: 3D arrays with batch dimensions
- Mixed Dimensions: Flexible broadcasting like NumPy
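Here's what those combinations look like in practice, assuming the matmul entry point from the usage example and output shapes mirroring np.matmul's rules:

import numpy as np
import extreme_matmul

v = np.random.rand(512).astype(np.float32)
M = np.random.rand(256, 512).astype(np.float32)
batch_a = np.random.rand(8, 64, 32).astype(np.float32)
batch_b = np.random.rand(8, 32, 16).astype(np.float32)

extreme_matmul.matmul(v, v)              # vector x vector -> scalar dot product
extreme_matmul.matmul(M, v)              # matrix x vector -> shape (256,)
extreme_matmul.matmul(batch_a, batch_b)  # batched matmul  -> shape (8, 64, 16)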
A few constraints to keep in mind:
- Float32 Only: Currently optimized for 32-bit floating point operations
- Memory Requirements: Arrays are converted to contiguous format if needed
- Array Limits: Supports 1D to 3D arrays (batch processing for 3D)
Want to see the performance difference yourself?
python benchmark.py
This will run the same tests I used to generate the performance numbers above. The benchmark tests both large and small matrix scenarios to show how the algorithm selection works.
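If you'd rather roll a quick measurement by hand, a minimal version looks like this - note the matrix sizes are my guess for illustration, not necessarily what benchmark.py uses:

import time
import numpy as np
import extreme_matmul

A = np.random.rand(4096, 4096).astype(np.float32)
B = np.random.rand(4096, 4096).astype(np.float32)

for name, fn in [("extreme_matmul", lambda: extreme_matmul.matmul(A, B)),
                 ("numpy @       ", lambda: A @ B)]:
    fn()  # warm-up run so one-time initialization isn't timed
    t0 = time.perf_counter()
    fn()
    print(f"{name} : {time.perf_counter() - t0:.3f}s")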
The code uses a size-based heuristic to choose between algorithms:
- Tiny matrices (≤32×32): Simple triple-loop implementation
- Larger matrices: Intel MKL's cblas_sgemm with full optimizations
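In Python terms, the selection logic amounts to something like the sketch below - an illustration of the idea, not the extension's C code, with np.matmul standing in for the cblas_sgemm path:

import numpy as np

TINY = 32  # size threshold from the heuristic above

def naive_matmul(a, b):
    # plain triple loop: high per-element cost, near-zero setup cost
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(m):
        for j in range(n):
            acc = np.float32(0.0)
            for p in range(k):
                acc += a[i, p] * b[p, j]
            c[i, j] = acc
    return c

def matmul_dispatch(a, b):
    # for tiny inputs, the loop's lack of call overhead wins; otherwise use BLAS
    if max(a.shape + b.shape) <= TINY:
        return naive_matmul(a, b)
    return np.matmul(a, b)  # stand-in for the cblas_sgemm path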
Memory handling includes:
- Automatic conversion to contiguous arrays when needed
- Proper reference counting to prevent memory leaks
- Aligned memory access for optimal SIMD performance
There's also comprehensive input validation, including:
- Type checking (float32 requirement)
- Dimension compatibility verification
- Memory allocation error handling
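For example, passing float64 arrays should be rejected rather than silently miscomputed. The exact exception type is the extension's choice, so this snippet catches broadly:

import numpy as np
import extreme_matmul

A64 = np.random.rand(4, 4)  # np.random.rand returns float64
try:
    extreme_matmul.matmul(A64, A64)
except Exception as exc:  # the specific exception raised is up to the extension
    print("non-float32 input rejected:", exc)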
This project demonstrates several important concepts:
- The Power of Specialized Libraries: MKL is heavily optimized for Intel processors with years of engineering behind it
- Algorithm Selection: Sometimes simpler algorithms win for small inputs due to reduced overhead
- C Extensions: How to write efficient Python extensions that rival compiled languages
- Memory Layout: The importance of cache-friendly data structures
It's not without limitations, though:
- Platform Dependency: Currently requires Intel MKL (Linux/Intel processors)
- Data Type Limitation: Only supports float32 (could be extended)
- GPU Support: No CUDA implementation yet (interesting future direction)
Found a bug or have an idea for improvement? Feel free to open an issue or submit a pull request. This started as a learning project, but I'm always interested in making it better!
MIT License - feel free to use this in your own projects.
Built with curiosity and a need for speed. If you find this useful, consider giving it a star! ⭐