https://github.com/pytorch/workshops/tree/master/ASPLOS_2024
The frontend of regular compilers (lexing/parsing) is very different from the frontend of ML compilers (graph capture, whether through tracing, bytecode analysis, etc.)
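To make graph capture concrete, here is a minimal sketch using torch.fx symbolic tracing (the TinyMLP module is made up purely for illustration):

```python
import torch
import torch.nn as nn
import torch.fx

# A toy module, purely for illustration.
class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Tracing-based graph capture: torch.fx records the ops into a graph IR
# instead of lexing/parsing source text like a traditional compiler frontend.
gm = torch.fx.symbolic_trace(TinyMLP())
print(gm.graph)  # placeholder -> call_module / call_function -> output nodes
```

TorchDynamo does the same job via bytecode analysis instead of tracing, but the captured artifact is still a graph of tensor ops rather than an AST.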
Similarly, the backend ends up looking fairly different as well. For one, ML compilers typically start with much more "semantic" information than traditional compilers. For example, they might do optimizations like "merge two matmuls into a single matmul". Another difference is that the overall structure ends up much "simpler": they usually support very limited forms of control flow, so much of the work in traditional compiler passes for handling CFGs doesn't matter much.
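A sketch of the algebra behind that kind of "semantic" rewrite: two matmuls sharing the same left operand can be merged into one larger matmul by concatenating the right operands (plain NumPy, just to show what a graph-level pass relies on; shapes are arbitrary):

```python
import numpy as np

A = np.random.randn(64, 128)
B1 = np.random.randn(128, 256)
B2 = np.random.randn(128, 256)

# Unfused: two separate matmuls over the same input A.
Y1, Y2 = A @ B1, A @ B2

# Fused: one matmul against the concatenated weights, then split the result.
# A graph-level pass can do this rewrite because it knows these nodes are matmuls.
Y = A @ np.concatenate([B1, B2], axis=1)
Y1_fused, Y2_fused = Y[:, :256], Y[:, 256:]

assert np.allclose(Y1, Y1_fused) and np.allclose(Y2, Y2_fused)
```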
Finally, most traditional compilers are focused on optimizations for CPUs, but almost all ML is done on GPUs or other accelerators.
Frameworks: PyTorch, TensorFlow. Compiler stacks and kernel libraries: MLIR dialects, TVM, XLA, Glow, cuDNN.
- IRs are generated by compilers. In ML compilers, the IR is a computation graph.
- To generate machine code from the IR, the compiler uses codegen; LLVM is a common example. This process is called lowering (toy sketch after this list).
- TensorFlow XLA, NVCC, and TVM all use LLVM for codegen.
- Domain-specific compilers: NVCC, XLA; PyTorch uses XLA for TPUs and Glow for other hardware.
- Third-party compilers: TVM, for building custom compilation targets.
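A toy illustration of "IR as a computation graph" plus "lowering to machine-level code". Everything here is invented for illustration; real compilers lower through LLVM IR or PTX, not pseudo-assembly strings:

```python
# Toy computation-graph IR: each node is (op, inputs).
graph = {
    "x":   ("input", []),
    "w":   ("input", []),
    "mm":  ("matmul", ["x", "w"]),
    "out": ("relu",   ["mm"]),
}

# Toy "lowering": walk the graph (insertion order is already topological here)
# and emit pseudo-instructions, standing in for what LLVM/PTX codegen produces.
def lower(graph):
    code = []
    for name, (op, inputs) in graph.items():
        if op == "input":
            code.append(f"LOAD  {name}")
        else:
            code.append(f"{op.upper():6s} {name} <- {', '.join(inputs)}")
    return code

print("\n".join(lower(graph)))
```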
MLIR helps you build your own compiler.
Instead of compiling to run on specific hardware, you can compile to run in the browser (the WASM format, which can be run with JavaScript). Emscripten is a WASM compiler (it also uses LLVM codegen), but it only compiles from C and C++ into WASM. Scailable is supposed to convert scikit-learn models into WASM. TVM also compiles to WASM.
GCC compiles C/C++ code to machine code. LLVM is good for CPU and GPU, but MLIR is a general framework for any hardware; LLVM IR can be treated as just one MLIR dialect, which is the sense in which LLVM is a subset of MLIR. MLIR is a meta-compiler used to build other compilers.
- Multi-GPU
- Multi-node
Tech Stack
- Triton + PyTorch + MLIR (minimal kernel sketch after the links below)
- Pallas + JAX + XLA
- https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/
- https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals-2/
- https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals-3/
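A minimal Triton vector-add kernel, as a sketch of what the Triton + PyTorch layer of the stack looks like (the standard tutorial-style example, with an arbitrary block size of 1024; assumes a CUDA-capable GPU):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```

Triton lowers this Python through its MLIR-based pipeline down to PTX, which is what the internals posts above walk through.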
IR to machine code: x86 for CPU, PTX for GPU.
Polyhedral model: The polyhedral model in compilers is a mathematical approach used for optimizing loop nests in high-level programming. In this model, loops are represented as geometric shapes (polyhedra) in a high-dimensional space, where each point in the shape corresponds to an individual iteration of the loop. The edges and faces of the polyhedron represent the relationships and dependencies between different iterations. This representation allows the compiler to perform sophisticated transformations on the loops, such as tiling, fusion, or parallelization, by manipulating the shapes in this abstract space.
These transformations can significantly improve the performance of the program, particularly for applications with complex loop structures and large amounts of data processing (like deep learning!). The polyhedral model excels at capturing and optimizing the parallelism and locality in loop nests, making it a powerful tool for optimizing the core operations found in a neural network, such as matrix multiplication.
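A rough sketch of the kind of transformation the polyhedral model reasons about: the same matmul iteration space, first as a naive loop nest, then tiled so blocks of the iteration space (and the data they touch) are reused while still in cache. Pure Python with an arbitrary tile size, just to show the loop structure:

```python
import numpy as np

N, TILE = 64, 16
A = np.random.randn(N, N)
B = np.random.randn(N, N)

# Naive loop nest: walks the (i, j, k) iteration space row by row.
C_naive = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        for k in range(N):
            C_naive[i, j] += A[i, k] * B[k, j]

# Tiled loop nest: the same iteration space, partitioned into TILE-sized blocks.
# Partitioning the polyhedron like this is exactly the transformation a
# polyhedral optimizer performs to improve locality.
C_tiled = np.zeros((N, N))
for i0 in range(0, N, TILE):
    for j0 in range(0, N, TILE):
        for k0 in range(0, N, TILE):
            for i in range(i0, i0 + TILE):
                for j in range(j0, j0 + TILE):
                    for k in range(k0, k0 + TILE):
                        C_tiled[i, j] += A[i, k] * B[k, j]

assert np.allclose(C_naive, C_tiled) and np.allclose(C_naive, A @ B)
```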
cuBLAS for linear algebra on GPU, cuDNN for DL primitives on GPU, Eigen for DL on CPU.
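These libraries sit underneath the frameworks: a PyTorch matmul or convolution on a CUDA tensor dispatches to cuBLAS/cuDNN without any explicit call. A minimal sketch, assuming a CUDA-enabled PyTorch build:

```python
import torch
import torch.nn.functional as F

torch.backends.cudnn.benchmark = True   # let cuDNN autotune conv algorithms

x = torch.randn(8, 3, 224, 224, device="cuda")
w = torch.randn(64, 3, 7, 7, device="cuda")
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

y = F.conv2d(x, w, stride=2, padding=3)  # routed to cuDNN under the hood
c = a @ b                                # routed to cuBLAS under the hood
```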
- Computational Performance: https://d2l.ai/chapter_computational-performance/index.html
- MLC Course (TVM): https://mlc.ai/summer22/schedule
- Triton Internals (MLIR): https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/
- XLA Deep Dive: https://medium.com/@muhammedashraf2661/demystifying-xla-unlocking-the-power-of-accelerated-linear-algebra-9b62f8180dbd
- Torch Dynamo Deep Dive: https://pytorch.org/docs/main/torch.compiler_dynamo_deepdive.html
- Performance tuning by Paul Bridger: https://paulbridger.com/
- Ultra-Scale book: from-scratch implementations of all the distributed training algorithms.
- Model Optimization (Distillation, Quantization, Pruning) - TBD Source
- 100 Days of CUDA
- ML Compilation: https://mlc.ai/summer22/schedule
- DL Systems: https://dlsyscourse.org/lectures/