PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation #10344
Overview
Introducing PipeInfer, a novel multi-node speculative inference technique that improves the overall generation speed of transformer-architecture LLMs, surpassing traditional SpecInfer-style speculative inference in almost all tested cases. PipeInfer uses an asynchronous pipeline-parallel design to keep multiple micro-batches of speculation verification in flight at once, while also launching normal autoregressive inference of a single token after every speculation acceptance. As a result, PipeInfer suffers near-zero latency and performance regression when speculation accuracy is poor. The asynchronous design boosts node utilization and naturally improves tolerance to interconnect bandwidth and latency constraints. Additionally, PipeInfer adapts in real time to pipeline conditions, dynamically adjusting the confidence threshold of the speculative model and thereby the number of speculative micro-batches being launched.
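As a rough illustration of that control flow (not the actual implementation; every type and helper below is a hypothetical stand-in), the head node can be pictured as a loop that keeps speculative verification micro-batches in flight, issues a single-token autoregressive run after each acceptance, and tunes the draft model's confidence threshold to the pipeline's current load:

```cpp
// Hypothetical sketch of PipeInfer's high-level control flow; all names here
// are illustrative stand-ins, not the real PipeInfer API.
#include <cstdio>
#include <deque>
#include <vector>

struct SpecRun {
    int seq_id;                     // KV-cache sequence this draft was run under
    std::vector<int> draft_tokens;  // tokens proposed by the speculative model
};

// --- stubbed pipeline hooks (placeholders for the MPI-backed pipelines) -----
static std::deque<SpecRun> finished;                       // verified micro-batches
static SpecRun draft_next(float /*threshold*/)             { return {1, {42, 43}}; }
static void    launch_verification(const SpecRun & r)      { finished.push_back(r); }
static int     accepted_count(const SpecRun & r)           { return (int) r.draft_tokens.size(); }
static void    launch_autoregressive(int tok)              { std::printf("accept -> %d\n", tok); }
static float   adjust_threshold(float t, size_t in_flight) { return in_flight > 4 ? t + 0.05f : t; }

int main() {
    float threshold = 0.5f;   // draft-model confidence cutoff, tuned at run time
    std::deque<SpecRun> in_flight;

    for (int step = 0; step < 8; ++step) {
        // Keep the target pipeline saturated with speculative micro-batches.
        SpecRun run = draft_next(threshold);
        in_flight.push_back(run);
        launch_verification(run);

        // Asynchronously drain whatever verifications have completed.
        while (!finished.empty()) {
            SpecRun done = finished.front(); finished.pop_front();
            in_flight.pop_front();
            int n = accepted_count(done);
            if (n > 0) {
                // A single-token autoregressive run follows every acceptance,
                // so a fully rejected draft costs little beyond the draft itself.
                launch_autoregressive(done.draft_tokens[n - 1]);
            }
        }
        // Adapt the confidence threshold (and thus the number of
        // speculative micro-batches) to current pipeline conditions.
        threshold = adjust_threshold(threshold, in_flight.size());
    }
}
```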
PipeInfer is built on top of llama.cpp, using a custom MPI backend, and will be presented at SC'24 on Wednesday, November 24th in Atlanta, Georgia.
Details
PipeInfer's design is based around a dual asynchronous inference pipeline: a longer primary pipeline dedicated to inference of the target model, and a shorter secondary pipeline dedicated to inference of the speculative model. The cluster PipeInfer runs on is split into three MPI communicators (three separate subgroups of nodes): one for each of the two inference pipelines and a third containing the two pipeline head nodes. Each head node controls its respective inference pipeline, and the two head nodes synchronize their pipelines' sampled tokens with each other at specified synchronization points.
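A minimal sketch of how such a communicator split can be expressed with plain MPI is shown below; the rank layout (the first ranks forming the target pipeline, the last two forming the speculative pipeline, rank 0 and rank N-2 acting as heads) is an assumption for illustration, not PipeInfer's actual assignment.

```cpp
// Sketch: split a cluster into the three communicators described above.
// Assumes at least 3 ranks; the group boundaries here are illustrative only.
#include <mpi.h>
#include <cstdio>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Assume the first (size - 2) ranks form the longer target pipeline and
    // the last 2 ranks form the shorter speculative pipeline.
    const int target_head = 0;
    const int spec_head   = size - 2;
    const int in_target   = rank < spec_head;

    // Communicators 1 and 2: the inference pipeline this rank belongs to.
    MPI_Comm pipeline_comm;
    MPI_Comm_split(MPI_COMM_WORLD, in_target ? 0 : 1, rank, &pipeline_comm);

    // Communicator 3: just the two head nodes, used to sync sampled tokens.
    const int is_head = (rank == target_head || rank == spec_head);
    MPI_Comm head_comm;
    MPI_Comm_split(MPI_COMM_WORLD, is_head ? 0 : MPI_UNDEFINED, rank, &head_comm);

    if (is_head) {
        std::printf("rank %d is a pipeline head\n", rank);
    }

    if (head_comm != MPI_COMM_NULL) MPI_Comm_free(&head_comm);
    MPI_Comm_free(&pipeline_comm);
    MPI_Finalize();
}
```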
The layers of each model are split across the nodes of its pipeline, dedicating an integer number of full layers to each node; for example, node 3 may be responsible for layers 32-48. Besides control messages, only activation tensors are transferred between nodes within the same pipeline. Almost all MPI communication uses non-blocking point-to-point sends and receives.
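The following is an illustrative sketch of forwarding an activation tensor to the next node in a pipeline with non-blocking point-to-point MPI; tensor shapes, tags, and the neighbor layout are assumptions for the example, and in PipeInfer the actual exchanges are driven by the transaction protocol described next.

```cpp
// Sketch: overlap sending this node's output activations to its successor
// with posting the receive for the next micro-batch from its predecessor.
#include <mpi.h>
#include <vector>

void exchange_activations(MPI_Comm pipeline_comm, int rank, int size,
                          const std::vector<float> & out_act,
                          std::vector<float>       & next_in_act,
                          int tag) {
    MPI_Request reqs[2];
    int n_reqs = 0;

    if (rank + 1 < size) {   // not the last node: send to successor
        MPI_Isend(out_act.data(), (int) out_act.size(), MPI_FLOAT,
                  rank + 1, tag, pipeline_comm, &reqs[n_reqs++]);
    }
    if (rank > 0) {          // not the first node: receive from predecessor
        MPI_Irecv(next_in_act.data(), (int) next_in_act.size(), MPI_FLOAT,
                  rank - 1, tag, pipeline_comm, &reqs[n_reqs++]);
    }

    // Other work (e.g. computing another micro-batch) could overlap here.
    MPI_Waitall(n_reqs, reqs, MPI_STATUSES_IGNORE);
}
```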
Correctness is maintained through a transaction-style pipelined communication protocol, exploiting the fact that MPI messages with the same tag (between the same pair of ranks) are non-overtaking at the point of reception. Worker nodes receive a transaction_start message that contains the ID of the desired operation. This ID is used to select a handler function, which performs all of its MPI communication using a dedicated tag, ensuring only messages designated for that operation are processed. Because the protocol is designed for pipelined communication, each node is guaranteed to have at most one predecessor and one successor, allowing nodes to listen for messages from only a single node. This fact can be used to optimize interconnect bandwidth by inserting dedicated point-to-point networks between adjacent nodes in a pipeline.
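A sketch of a worker's dispatch loop under such a protocol is given below; the operation IDs, tag values, and handler names are hypothetical and only meant to show how a transaction_start message selects a handler that then communicates on its own tag.

```cpp
// Sketch: worker-side transaction dispatch. All op IDs and handlers are
// illustrative placeholders, not PipeInfer's real protocol constants.
#include <mpi.h>
#include <cstdint>

enum TransactionOp : int32_t {
    OP_INFERENCE   = 1,   // run assigned layers on an incoming activation batch
    OP_KV_SEQ_COPY = 2,   // copy KV-cache entries between sequences
    OP_KV_SEQ_RM   = 3,   // remove KV-cache entries for a rejected speculation
};

static const int TAG_CONTROL = 0;   // transaction_start messages only

// Each handler performs its own sends/receives on a tag dedicated to that
// operation, so messages from different in-flight transactions never mix.
void handle_inference(MPI_Comm, int, int) { /* receive activations, compute, forward (stubbed) */ }
void handle_kv_copy  (MPI_Comm, int, int) { /* apply a sequence copy to the local KV cache (stubbed) */ }
void handle_kv_rm    (MPI_Comm, int, int) { /* drop a rejected sequence's entries (stubbed) */ }

void worker_loop(MPI_Comm pipeline_comm, int predecessor) {
    while (true) {
        // Only listen to our single predecessor in the pipeline.
        int32_t op = 0;
        MPI_Recv(&op, 1, MPI_INT32_T, predecessor, TAG_CONTROL,
                 pipeline_comm, MPI_STATUS_IGNORE);

        switch (op) {
            case OP_INFERENCE:   handle_inference(pipeline_comm, predecessor, /*tag=*/OP_INFERENCE);   break;
            case OP_KV_SEQ_COPY: handle_kv_copy  (pipeline_comm, predecessor, /*tag=*/OP_KV_SEQ_COPY); break;
            case OP_KV_SEQ_RM:   handle_kv_rm    (pipeline_comm, predecessor, /*tag=*/OP_KV_SEQ_RM);   break;
            default:             return;   // unknown op: shut down
        }
    }
}
```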
The KV cache is managed through this transaction system, allowing a layer's entries to remain resident on the allocated node while also ensuring global consistency. PipeInfer makes liberal use of the KV cache sequences present in llama.cpp's implementation: each inference launch is allocated a sequence ID from a pool, except for non-speculative inferences, which are given the special sequence ID of 0. Upon acceptance or rejection of a given speculation, pipeline transactions are sent to modify the cache of each node accordingly, copying entries to the zeroth sequence if accepted and simply removing them otherwise. Since non-speculative inference runs are guaranteed to have the correct KV cache entries, the first speculative inference run after the launch of a non-speculative run can skip inference of the first token in a sequence and instead copy the canonical sequence's entry for that position.
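On a single node, the accept/reject bookkeeping can be sketched with the KV cache sequence helpers as they appeared in llama.cpp around the time of our implementation (llama_kv_cache_seq_cp / llama_kv_cache_seq_rm; newer llama.cpp revisions may name these differently). In PipeInfer the call is driven by a pipeline transaction so every node applies the same update.

```cpp
// Sketch: apply the verdict for a speculative run that covered positions
// [p0, p1) under sequence `spec_seq`. Function and constant names beyond the
// llama.cpp API are illustrative.
#include "llama.h"

static const llama_seq_id CANONICAL_SEQ = 0;   // non-speculative sequence

void apply_speculation_verdict(llama_context * ctx,
                               llama_seq_id    spec_seq,
                               llama_pos       p0,
                               llama_pos       p1,
                               bool            accepted) {
    if (accepted) {
        // Accepted: promote the speculative entries into the canonical
        // sequence so later runs see them as committed context.
        llama_kv_cache_seq_cp(ctx, spec_seq, CANONICAL_SEQ, p0, p1);
    }
    // In both cases the speculative sequence itself is cleared, returning its
    // sequence ID to the pool for reuse; accepted cells survive because they
    // now also belong to the canonical sequence.
    llama_kv_cache_seq_rm(ctx, spec_seq, p0, p1);
}
```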
Further details are given in our paper and our public code repo.
Resources
Paper: https://arxiv.org/abs/2407.11798
Implementation Repository: https://github.com/AutonomicPerfectionist/PipeInfer
Our initial implementation was built before the refactoring of the backend framework in llama.cpp; that initial implementation is the one on the main branch and the one used in the CPU performance evaluations, for reproducibility.
We also built a preliminary implementation using the newer backend-v2 framework, aiming to add support for other backends by wrapping them in the MPI backend. That implementation is available on a separate branch of the code repository and was used for the GPU performance evaluations. However, the backend implementation is not complete and suffers from a design flaw that results in running the first and last layers of each node on the CPU backend; therefore, the given GPU performance metrics are not comparable with other results. This design flaw can be rectified in the future.
Future Work
We believe PipeInfer can be extended to support other speculative inference strategies, such as Lookahead Decoding. We also believe that PipeInfer can be scaled down to a single multi-GPU machine by treating each GPU as a separate node in the pipeline.
Authors
Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari