
Releases: llm-d/llm-d-inference-sim

v0.3.0

20 Jul 08:29 · 7f1f766
Pre-release

Release Notes

Compatibility with vLLM

  • Aligned command-line parameters with vLLM. All parameters supported by both the simulator and vLLM now share the same names and formats (see the invocation example after this list):
    • Support for --served-model-name
    • Support for --seed
    • Support for --max-model-len
  • Added support for tools in chat completions (see the request sketch after this list)
  • Included usage data in the response
  • Added object field to the response JSON
  • Added support for multimodal inputs in chat completions
  • Added health and readiness endpoints
  • Added P/D (prefill/decode disaggregation) support; the connector type must be set to nixl
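
A hypothetical invocation exercising the vLLM-aligned flags is sketched below; only --served-model-name, --seed, and --max-model-len are confirmed by these notes, while the binary name and the --model and --port flags are assumptions for illustration.

```sh
# Sketch only: --served-model-name, --seed, and --max-model-len are the
# vLLM-aligned flags listed above; the binary name and the --model/--port
# flags are assumptions.
./llm-d-inference-sim \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name my-model \
  --seed 42 \
  --max-model-len 4096 \
  --port 8000
```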
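
Since the simulator mirrors vLLM's OpenAI-compatible API, a tools-enabled chat completion request might look like the sketch below; the endpoint path, model name, and tool definition are illustrative assumptions, and the response should now carry the usage and object fields noted above.

```sh
# Hypothetical request against the OpenAI-compatible chat completions endpoint.
# The tool definition and model name are placeholders.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```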

Additional Features

  • Introduced configuration file support. All parameters can now be loaded from a configuration file in addition to being set via the command line (see the configuration sketch after this list).
  • Added new test coverage
  • Changed the Docker base image
  • Added the ability to randomize time to first token, inter-token latency, and KV-cache transfer latency
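
A minimal configuration-file sketch, assuming YAML keys that mirror the command-line flag names and a --config flag pointing at the file; the key names and the flag are assumptions, not confirmed by these notes.

```sh
# Sketch only: keys are assumed to mirror the flag names, and the --config
# flag is an assumption. Latency values are base values; as of v0.3.0 they
# can be randomized per request.
cat > sim-config.yaml <<'EOF'
served-model-name: my-model
seed: 42
max-model-len: 4096
time-to-first-token: 200   # ms, assumed key name
inter-token-latency: 50    # ms, assumed key name
EOF

./llm-d-inference-sim --config sim-config.yaml
```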

Migration Notes (for users upgrading from versions prior to v0.2.0)

  • max-running-requests has been renamed to max-num-seqs
  • lora has been replaced by lora-modules, which now accepts a list of JSON strings, e.g., '{"name": "name", "path": "lora_path", "base_model_name": "id"}' (see the migration sketch below)
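
A migration sketch, with placeholder values and an assumed binary name:

```sh
# Before v0.2.0 (old flag names, shown schematically):
#   --max-running-requests 5 --lora <adapter>
# From v0.2.0 onward:
./llm-d-inference-sim \
  --max-num-seqs 5 \
  --lora-modules '{"name": "name", "path": "lora_path", "base_model_name": "id"}'
```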

Change details since v0.2.2

  • feat: add max-model-len configuration and validation for context window (#82) by @mohitpalsingh in #85
  • Fixed readme, removed error for --help by @irar2 in #89
  • P/D support by @mayabar in #94
  • fix: crash when omitted stream_options by @jasonmadigan in #95
  • style: 🔨 splits all import blocks into different sections by @yafengio in #98
  • Fixed deployment.yaml by @irar2 in #99
  • Enable configuration of various parameters in tools by @irar2 in #100
  • Choose latencies randomly by @irar2 in #103

Full Changelog: v0.2.2...v0.3.0

v0.2.2

13 Jul 10:02 · 7656a3c
Pre-release

What's Changed

  • Initialize rand once, added seed to configuration by @irar2 in #79
  • use string when storing lora adapters in simulator by @mayabar in #81
  • Improved support for empty command line arguments by @irar2 in #80
  • Added tests for LoRA configuration, load and unload by @irar2 in #86

Full Changelog: v0.2.1...v0.2.2

v0.2.1

06 Jul 10:03 · 3e63a0d
Pre-release

What's Changed

  • fix: max-cpu-loras should be initialized from max-loras by @shmuelk in #77
  • Support space separated arguments, use correct format in config file by @irar2 in #78

Full Changelog: v0.2.0...v0.2.1

Migrating from releases prior to v0.2.0

  • max-running-requests was replaced by max-num-seqs
  • lora was replaced by lora-modules, which is now a list of JSON strings, e.g., '{"name": "name", "path": "lora_path", "base_model_name": "id"}'

v0.2.0

03 Jul 10:26 · 2119638
Pre-release

What's Changed

  • add support to multimodal in chat completions by @JuanmaBM in #49
  • Updated REST endpoint and Prometheus metric documentation by @shmuelk in #59
  • Support tools by @irar2 in #55
  • Split defs.go file containing data structures and functions into several files by @irar2 in #60
  • Calculate usage data once for both text and chat by @irar2 in #61
  • Improved tokenization by @irar2 in #62
  • feat: 🚀 Reduce the image size by @yafengio in #64
  • Allow array parameters in tools by @irar2 in #65
  • Support object parameters in tools by @irar2 in #66
  • Support minItems and maxItems for array parameters in tools by @irar2 in #67
  • Support integer and float in tools by @irar2 in #68
  • link checker - fails PR if links are broken - be consistent with llmd-scheduler by @mayabar in #70
  • Simplify tools parameters json schema by @irar2 in #72
  • Add the --served-model-name flag by @nerdalert in #69
  • Configuration improvements by @irar2 in #75

Migrating from releases prior to v0.2.0

Changes were made in release v0.2.0 to bring the command-line arguments and configuration file more in line with vLLM's. In particular:

  • max-running-requests was replaced by max-num-seqs
  • lora was replaced by lora-modules, which is now an array in JSON format, e.g., [{"name": "name", "path": "lora_path", "base_model_name": "id"}]

Full Changelog: v0.1.2...v0.2.0

v0.1.0

20 May 11:36 · a5c928e
Pre-release

The first release of the llm-d-inference-sim.

The llm-d-inference-sim is a lightweight vLLM simulator for use during development of the llm-d platform, in particular the development of the llm-d-inference-scheduler.

Full Changelog: 0.0.6...v0.1.0