Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue (#21005)
Summary:
Pull Request resolved: #21005
# Optimizations
A common trick in doubly linked list implementations is to introduce fake (sentinel) head and tail nodes. This significantly reduces the implementation overhead and makes it easy to get rid of dataclass.__eq__ comparisons.
- No dataclass.__eq__ invocation
- Shorter code
- Branchless
Combined, these should yield a significant perf improvement for this piece of code.
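The fake head/tail idea can be sketched as follows. This is a minimal illustration, not the code in this PR: `Block`, `FreeQueue`, and their field names are hypothetical stand-ins for KVCacheBlock and FreeKVCacheBlockQueue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    # Hypothetical stand-in for KVCacheBlock; field names are assumptions.
    block_id: int
    prev: Optional["Block"] = None
    next: Optional["Block"] = None


class FreeQueue:
    # Hypothetical stand-in for FreeKVCacheBlockQueue.
    def __init__(self, blocks):
        # Fake head and tail: every real block always has non-None neighbors.
        self.head = Block(block_id=-1)
        self.tail = Block(block_id=-1)
        self.head.next = self.tail
        self.tail.prev = self.head
        self.num_free = 0
        for block in blocks:
            self.append(block)

    def append(self, block):
        # Link the block just before the fake tail; no None checks needed.
        last = self.tail.prev
        last.next = block
        block.prev = last
        block.next = self.tail
        self.tail.prev = block
        self.num_free += 1

    def remove(self, block):
        # Unlink without comparing blocks, so __eq__ is never invoked.
        block.prev.next = block.next
        block.next.prev = block.prev
        block.prev = block.next = None
        self.num_free -= 1

    def popleft(self):
        if self.num_free == 0:
            raise ValueError("No free blocks available")
        first = self.head.next
        self.remove(first)
        return first


blocks = [Block(i) for i in range(3)]
queue = FreeQueue(blocks)
queue.remove(blocks[1])
first = queue.popleft()
second = queue.popleft()
```

Because the fake nodes guarantee that `block.prev` and `block.next` exist for every real block, `remove` and `append` need neither `None` checks nor any "is this the head/tail?" comparison.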
# Observations
Per vLLM profiling, kv_cache_manager.allocate_slots accounts for a non-negligible share of the cost of each prefill.
|{F1980260529}|{F1980260481}|{F1980260497}|
Zooming in, we can see that the FreeKVCacheBlockQueue.popleft stack is non-trivial: popleft -> remove -> string.__eq__, which mainly comes from the equality comparison of dataclasses (i.e. KVCacheBlock).
Per the [dataclasses python library doc](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass):
```
dataclasses.dataclass(*, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False, match_args=True, kw_only=False, slots=False, weakref_slot=False)
eq: If true (the default), an __eq__() method will be generated. This method compares the class as if it were a tuple of its fields, in order. Both instances in the comparison must be of the identical type.
If the class already defines __eq__(), this parameter is ignored.
```
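To make the documented behavior concrete, here is a small sketch with hypothetical `Item` classes (not vLLM code). It shows that `==` on a default dataclass compares fields, while `is` only checks object identity and never calls `__eq__`.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical example class; eq=True is the default.
    item_id: int
    label: str


@dataclass(eq=False)
class ItemNoEq:
    # With eq=False no __eq__ is generated, so == falls back to identity.
    item_id: int
    label: str


a = Item(1, "x")
b = Item(1, "x")
fields_equal = a == b      # generated __eq__ compares (item_id, label)
same_object = a is b       # identity check, no __eq__ call

c = ItemNoEq(1, "x")
d = ItemNoEq(1, "x")
no_eq_equal = c == d       # object.__eq__ compares identity only
```

Here `fields_equal` is `True` while `same_object` and `no_eq_equal` are `False`. A linked list that only needs to know whether two references point to the same node can use `is` and skip the field-by-field comparison entirely.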
Test Plan:
# Result
Typically, block_size is set to 16, so in production usage we are likely to allocate 10-1000 blocks. In this range, the optimization gave us up to ~1ms of TTFT savings (the improvement is more significant on long inputs).
|After|Before|
|{F1980286936}|{F1980286941}|
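For a rough sense of scale, the 10-1000 block range at block_size 16 corresponds to roughly 160-16,000 tokens. The helper below is a hypothetical sketch of that arithmetic only; it assumes one block per block_size tokens, rounded up, and ignores anything else the allocator may take into account.

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    # Hypothetical helper: ceiling division of tokens by block size.
    return -(-num_tokens // block_size)


small = blocks_needed(160)      # 160 tokens / 16 per block
large = blocks_needed(16_000)   # 16,000 tokens / 16 per block
partial = blocks_needed(17)     # a partially filled block still counts
```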
Rollback Plan:
Reviewed By: CuiCoco
Differential Revision: D78292345