v0.3.0
Pre-release
Release Notes
Compatibility with vLLM
- Aligned command-line parameters with real vLLM. All parameters supported by both the simulator and vLLM now share the same names and formats (see the launch sketch after this list):
  - Support for --served-model-name
  - Support for --seed
  - Support for --max-model-len
- Added support for tools in chat completions (see the request sketch after this list)
- Included usage information (token counts) in the response
- Added the object field to the response JSON
- Added support for multimodal inputs in chat completions
- Added health and readiness endpoints (see the probe sketch after this list)
- Added P/D (prefill/decode disaggregation) support; the connector type must be set to nixl
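
A minimal launch sketch using the newly aligned flags. The binary name and the --model and --port flags are assumptions; the three aligned flags come from the list above:

```shell
# Hypothetical invocation; binary name, --model, and --port are assumptions.
./llm-d-inference-sim \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name my-model \
  --seed 42 \
  --max-model-len 4096 \
  --port 8000
```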
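A sketch of a chat completions request exercising the new tool support. The endpoint path and payload shape follow the OpenAI-compatible API that vLLM exposes; the model name, port, and tool definition are placeholders:

```shell
# Placeholder model name, port, and tool definition.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

The response JSON now carries the object field (typically chat.completion for this endpoint) and a usage block with prompt, completion, and total token counts.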
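The new liveness and readiness endpoints can be probed directly. The /health and /ready paths are assumptions based on common conventions:

```shell
# Paths assumed; adjust to the simulator's actual routes if they differ.
curl -i http://localhost:8000/health   # liveness
curl -i http://localhost:8000/ready    # readiness
```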
Additional Features
- Introduced configuration file support. All parameters can now be loaded from a configuration file in addition to being set via the command line (see the sketch after this list).
- Added new test coverage
- Changed the Docker base image
- Added the ability to randomize time to first token, inter-token latency, and KV-cache transfer latency
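
A sketch of the configuration-file flow. The key names are assumed to mirror the command-line flags, and the --config flag name is an assumption modeled on vLLM's option of the same name:

```shell
# Hypothetical config file; keys are assumed to mirror the flag names.
cat > sim-config.yaml <<'EOF'
port: 8000
model: meta-llama/Llama-3.1-8B-Instruct
served-model-name: my-model
max-model-len: 4096
seed: 42
EOF

# Flag name assumed, mirroring vLLM's --config option; binary name is a placeholder.
./llm-d-inference-sim --config sim-config.yaml
```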
Migration Notes (for users upgrading from versions prior to v0.2.0)
- max-running-requests has been renamed to max-num-seqs
- lora has been replaced by lora-modules, which now accepts a list of JSON strings, e.g., '{"name": "name", "path": "lora_path", "base_model_name": "id"}' (see the sketch below)
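
A sketch of an invocation using the renamed flags. The binary name, values, and LoRA path are placeholders; the JSON fields follow the example above:

```shell
# New flag names: max-running-requests -> max-num-seqs, lora -> lora-modules (JSON strings).
# Binary name, values, and paths are placeholders.
./llm-d-inference-sim \
  --max-num-seqs 5 \
  --lora-modules '{"name": "lora1", "path": "/path/to/lora1", "base_model_name": "base-id"}'
```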
Change details since v0.2.2
- feat: add max-model-len configuration and validation for context window (#82) by @mohitpalsingh in #85
- Fixed readme, removed error for --help by @irar2 in #89
- Pd support by @mayabar in #94
- fix: crash when omitted stream_options by @jasonmadigan in #95
- style: 🔨 splits all import blocks into different sections by @yafengio in #98
- Fixed deployment.yaml by @irar2 in #99
- Enable configuration of various parameters in tools by @irar2 in #100
- Choose latencies randomly by @irar2 in #103
New Contributors
- @mohitpalsingh made their first contribution in #85
- @jasonmadigan made their first contribution in #95
Full Changelog: v0.2.2...v0.3.0