What's Changed
- skip failing Chinese prompt on Win by @pavel-esir in #1573
- Bump product version 2025.1 by @akladiev in #1571
- Bump tokenizers submodule by @akladiev in #1575
- [LLM_BENCH] relax md5 checks and allow passing CB config without use_cb by @eaidova in #1570
- [VLM] Add Qwen2VL by @yatarkan in #1553
- Fix links, remind about ABI by @Wovchena in #1585
- Add nightly to instructions similar to requirements by @Wovchena in #1582
- GHA: use nightly from 2025.1.0 by @ilya-lavrenov in #1577
- NPU LLM Pipeline: Switch to STATEFUL by default by @dmatveev in #1561
- Verify not empty rendered chat template by @yatarkan in #1574
- [RTTI] Fix passes rtti definitions by @t-jankowski in #1588
- Test `add_special_tokens` properly by @pavel-esir in #1586
- Add indentation for llm_bench json report dumping by @nikita-savelyevv in #1584
- Prioritize config model type during path-based task determination by @eaidova in #1587
- Replace openvino.runtime imports with openvino by @helena-intel in #1579
- Add tests for Whisper static pipeline by @eshiryae in #1250
- CB: removed handle_dropped() misuse by @ilya-lavrenov in #1594
- Bump timm from 1.0.13 to 1.0.14 by @dependabot in #1595
- Update samples readme by @olpipi in #1545
- [ Speculative decoding ][ Prompt lookup ] Enable Perf Metrics for assisting pipelines by @iefode in #1599
- [LLM] [NPU] StaticLLMPipeline: Export blob by @smirnov-alexey in #1601
- [llm_bench] enable prompt permutations to prevent prefix caching and fix vlm image load by @eaidova in #1607
- LLM: use set_output_seq_len instead of WA by @ilya-lavrenov in #1611
- CB: support different number of K and V heads per layer by @ilya-lavrenov in #1610
- LLM: fixed Slice / Gather of last MatMul by @ilya-lavrenov in #1616
- Switch to VS 2022 by @mryzhov in #1598
- Add Phi-3.5-vision-instruct and Phi-3-vision-128k-instruct by @Wovchena in #1609
- Whisper pipeline: apply slice matmul by @as-suvorov in #1623
- GHA: use OV master in mac.yml by @ilya-lavrenov in #1622
- [Image Generation] Image2Image for FLUX by @likholat in #1621
- add missing ignore_eos in generation config by @eaidova in #1625
- Increase priority of rt_info to fix Phi-3.5-vision-instruct and Phi-3-vision-128k-instruct (master) by @Wovchena in #1626
- Correct model name by @wgzintel in #1624
- Token rotation by @vshampor in #987
- Whisper pipeline: use Sampler by @as-suvorov in #1615
- Fix setting eos_token_id with kwarg by @Wovchena in #1629
- Extract cacheopt E2E tests into separate test matrix field by @vshampor in #1630
- [CB] Split token streaming and generation to different threads for all CB based pipelines by @iefode in #1544
- Don't silence an error if a file can't be opened by @Wovchena in #1620
- [CMAKE]: use different version for macOS arm64 by @ilya-lavrenov in #1632
- Test invalid fields assignment raises in GenerationConfig by @Wovchena in #1633
- do_sample=False for NPU in chat_sample, add NPU to README by @helena-intel in #1637
- [JS] Add GenAI Node.js bindings by @vishniakov-nikolai in #1193
- CB: preparation for relying on KV cache precisions from plugins by @ilya-lavrenov in #1634
- [LLM bench] support providing adapter config mode by @eaidova in #1644
- Automatically apply chat template in non-chat scenarios by @sbalandi in #1533 (see the usage sketch after this list)
- beam_search_causal_lm.cpp: delete wrong comment by @Wovchena in #1639
- [WWB]: Fixed chat template usage in VLM GenAI pipeline by @AlexKoff88 in #1643
- [WWB]: Fixed nano-Llava preprocessor selection by @AlexKoff88 in #1646
- [WWB]: Added config to preprocessor call in VLMs by @AlexKoff88 in #1638
- CB: remove DeviceConfig class by @ilya-lavrenov in #1640
- [WWB]: Added initialization of nano-llava in case of Transformers model by @AlexKoff88 in #1649
- WWB: simplify code around start_chat / use_template by @ilya-lavrenov in #1650
- Tokenizers update by @ilya-lavrenov in #1653
- DOCS: reorganized support models for image generation by @ilya-lavrenov in #1655
- Fix using llm_bench/wwb with versions w/o apply_chat_template by @sbalandi in #1651
- Fix Qwen2VL generation without images by @yatarkan in #1645
- Parallel sampling with threadpool by @mzegla in #1252
- [Coverity] Enabling coverity scan by @akazakov-github in #1657
- [ CB ] Fix streaming in case of empty outputs by @iefode in #1647
- Allow overriding eos_token_id by @Wovchena in #1654
- CB: remove GenerationHandle:back by @ilya-lavrenov in #1662
- Fix tiny-random-llava-next in VLM Pipeline by @yatarkan in #1660
- [CB] Add KVHeadConfig parameters to PagedAttention's rt_info by @sshlyapn in #1666
- Bump py-build-cmake from 0.3.4 to 0.4.0 by @dependabot in #1668
- pin optimum version by @pavel-esir in #1675
- [LLM] Enabled CB by default by @ilya-lavrenov in #1455
- SAMPLER: fixed hang during destruction of ThreadPool by @ilya-lavrenov in #1681
- CB: use optimized scheduler config for cases when user explicitly asked CB backend by @ilya-lavrenov in #1679
- [CB] Return Block manager asserts to destructors by @iefode in #1569
- phi3_v: allow images, remove unused var by @Wovchena in #1670
- [Image Generation] Inpainting for FLUX by @likholat in #1685
- [WWB]: Added support for SchedulerConfig in LLMPipeline by @AlexKoff88 in #1671
- Add LongBench validation by @l-bat in #1220
- Fix Tokenizer for several added special tokens by @pavel-esir in #1659
- Unpin optimum-intel version by @ilya-lavrenov in #1680
- Image generation: proper error message when encode() is used w/o encoder passed to ctor by @ilya-lavrenov in #1683
- Fix excluding stop str from output for some tokenizers by @sbalandi in #1676
- [VLM] Fix chat template fallback in chat mode with defined system message by @yatarkan in #1674
- Set stop token ids from default generation config by @yatarkan in #1612
- [Tokenizers] add max_length parametrisation to encode by @pavel-esir in #1518
- WWB: run FLUX inpainting test by @ilya-lavrenov in #1692
- Whisper pipeline: parallel streaming with async/wait by @as-suvorov in #1687
- GHA: fixed build on Windows by @ilya-lavrenov in #1694
- Fix Falcon validation by @AlexKoff88 in #1664
- Add a choice of how to end streaming from a callback: STOP or CANCEL by @sbalandi in #1476 (see the usage sketch after this list)
- Bump py-build-cmake from 0.4.0 to 0.4.1 by @dependabot in #1699
- Bump einops from 0.8.0 to 0.8.1 in /samples by @dependabot in #1696
- Fixed warnings by @ilya-lavrenov in #1695
- Add egg package name to avoid issues after pip freeze by @wkobielx in #1703
- fix llm bench config based model search for gptj by @eaidova in #1701
- Fixed typo by @ilya-lavrenov in #1708
- Added GIL release for time consuming methods. by @popovaan in #1673
- fix analyze_model return value by @eaidova in #1711
- Fix error after second start_chat() for StatefulLLMPipeline by @sbalandi in #1684
- [ Test ][ PR1 ] Splitting Common.py by @iefode in #1691
- [CPU] Remove memset WA for PagedAttention by @luo-cheng2021 in #1678
- Update to the latest tokenizers with StringPack/Unpack from opset by @pavel-esir in #1562
- GHA: update workflows by @ilya-lavrenov in #1720
- [ Test ][ PR2 ] Splitting Common.py by @iefode in #1702
- [Sampler] Fix stop strings offset for speculative decoding by @iefode in #1719
- Omit `--trust-remote-code` from export command for whisper in the README by @nikita-savelyevv in #1726
- tokenizer: read simplified_chat_template by @Wovchena in #1712
- CB pipelines: use threaded streamer by @as-suvorov in #1690
- Remove core_tokenizers unused code by @mryzhov in #1710
- Set default permission for job_vlm by @jszczepa in #1721
- [GHA] Win pipeline refactoring by @mryzhov in #1714
- Enable test_load_special_tokens_from_tokenizer_config_json() by @Wovchena in #1729
- [llm bench] fix prompt file parsing for vlm by @eaidova in #1727
- CMAKE: use object library between shared OpenVINO GenAI and tests by @ilya-lavrenov in #1705
- Add more samples to ci check by @olpipi in #1724
- docs: 2024-2025 by @Wovchena in #1733
- Make `TextStreamer` public & add unit-tests by @pavel-esir in #1700
- [GHA] macOS pipeline refactoring by @mryzhov in #1731
- Simplify installation verification by @Wovchena in #1739
- Update for LoRA Adapters: Derived adapters and support for FLUX (#1602) (master) by @slyalin in #1652
- Threaded streamer: add tests by @as-suvorov in #1715
- Revert "Threaded streamer: add tests" by @akladiev in #1741
- [llm bench] improve catching Unicode encoding errors when printing generated text by @eaidova in #1736
- tokenizer: don't store CompiledModel by @Wovchena in #1740
- Set English language by default for all the LLM models by @AlexKoff88 in #1686
- PYTHON: remove py::object as ov::Any by @ilya-lavrenov in #1745
- [LLM] [NPU] StaticLLMPipeline: support weightless caching by @smirnov-alexey in #1635
- [WWB]: Fixed internvl inference with Transformers lib by @AlexKoff88 in #1749
- do not convert tokenizer on the fly in llm bench by @eaidova in #1752
- TESTS: skip test_perf_metrics by @ilya-lavrenov in #1754
- [GHA][WIN] Use self-hosted runners by @mryzhov in #1732
- VLM: updated chat template mapping for llava-next by @ilya-lavrenov in #1756
- Add flag to use full history on each generation in chat mode by @sbalandi in #1750
- Align streamers output by @Wovchena in #1759
- Bump py-build-cmake from 0.4.1 to 0.4.2 by @dependabot in #1757
- [JS] Add a dependency on the openvino-node package by @Retribution98 in #1667
- Revert "VLM: updated chat template mapping for llava-next" by @ilya-lavrenov in #1760
- Add benchmark_genai python sample to precommit by @olpipi in #1747
- Clean up Static LLM Pipeline by @TolyaTalamanov in #1748
- [LLM bench]: remove convert.py and split requirements by @eaidova in #1734
- add performance statistics for image generation by @xufang-lisa in #1405
- get_lm_encoded_results: use remote tensor by @Wovchena in #1669
- [JS] Setup genai nodejs bindings compilation for windows by @Retribution98 in #1697
- phi3_v: apply chat template by default by @Wovchena in #1762
- [TESTS] Retry model downloading and conversion by @mryzhov in #1758
- Whisper samples: align timestamps precision by @as-suvorov in #1770
- Simplify python read_image() by @Wovchena in #1763
- typo fix and message improvement by @isanghao in #1773
- Update README.md to update OV2025.0 GenAI Whisper NPU requirement by @luke-lin-vmc in #1772
- [Tokenizer] Fix max_length, pad_to_max_length for models with 2 RaggedToDense ops by @pavel-esir in #1764
- Speculative decoding fix by @dkalinowski in #1767
- SD3 Img2Img and Inpainting by @likholat in #1737
- Fix "images" parameter in VLM to allow single image. by @popovaan in #1761
- [ Test ][ PR3 ] Splitting Common.py by @iefode in #1718
- Whisper pipeline: use remote tensor for encoder->decoder by @as-suvorov in #1723
- VLMPipeline: remove duplicate reset by @Wovchena in #1776
- Fix speculative decoding internal metrics by @sammysun0711 in #1771
- [GHA] Samples tests refactoring by @mryzhov in #1661
- Parsing model names of deepseek and flan-t5-xxl by @wgzintel in #1735
- GHA: pin OpenVINO commit by @ilya-lavrenov in #1781
- [ Test ][ PR4 ] Splitting & Refactoring Common.py by @iefode in #1722
- Move tokenized/templated history difference handling to KVCacheState by @sbalandi in #1716
- [JS] Preparing the JS package for preview release by @Retribution98 in #1775
- Bump timm from 1.0.14 to 1.0.15 in /samples by @dependabot in #1787
- Deprecated API usage by @ilya-lavrenov in #1785
- VLM: more informative error message by @ilya-lavrenov in #1786
- reduce sleep during memcomp measurement, handle unicode in input by @eaidova in #1792
- Update SUPPORTED_MODELS.md by @Huanli-Gong in #1784
- [JS] Basic configuration of TypeScript for nodejs package by @Retribution98 in #1790
- [GHA] replaced cpp-prompt_lookup_decoding_lm-ubuntu by @mryzhov in #1795
- [GHA] Replaced cpp-beam_search_causal_lm-ubuntu by @mryzhov in #1793
- StreamerBase: add write tokens vector by @as-suvorov in #1769
- [GHA][MAC] Samples tests by @mryzhov in #1782
- Add py bindings for encrypted models and sample by @olpipi in #1751
- Bump pybind11-stubgen from 2.5.1 to 2.5.3 by @dependabot in #1801
- Continuous Batching in VLM [Draft] by @popovaan in #1704
- Fixed path to the Supported Models Section by @Huanli-Gong in #1804
- [GHA] Replaced cpp-speculative_decoding_lm-ubuntu by @mryzhov in #1794
- Streaming: use write with vector by @ilya-lavrenov in #1807
- [GHA][WIN] Samples tests by @mryzhov in #1780
- VLM: create per-model folder with implementation details by @ilya-lavrenov in #1803
- Samples build instructions by @DimaPastushenkov in #1604
- StatefulLLMPipeline: Fix attention mask by @smirnov-alexey in #1812
- LLaVA: align the number of tokens in history and kv_cache by @Wovchena in #1788
- Text2ImagePipeline heterogeneous compile by @RyanMetcalfeInt8 in #1768
- Fixed VLM metrics test. by @popovaan in #1810
- [TEST][PR5] Implementing test infra by @iefode in #1797
- InputsEmbedderLLaVANext: push_back() embeddings by @Wovchena in #1813
- Added mutex to add_request() with images. by @popovaan in #1808
- Allow build w/o python by @ilya-lavrenov in #1822
- GHA: pinned OpenVINO by @ilya-lavrenov in #1821
- Tokenizer: fixed decode of special tokens during init stage by @ilya-lavrenov in #1823
- Add ov::Tensor from_npy(), remove duplicate print_tensor() by @Wovchena in #1824
- SD3 Reshape + Heterogeneous Compile by @RyanMetcalfeInt8 in #1818
- [ImageGeneration] FLUX pipeline assert for strength by @likholat in #1825
- [GHA][WIN] use azure runners by @mryzhov in #1819
- flux_pipeline: Add support for heterogeneous compile by @RyanMetcalfeInt8 in #1828
- update supported model_ids by @eaidova in #1834
- Switch NPU LLM execution to ov::genai::StatefulLLMPipeline by @TolyaTalamanov in #1677
- Revert "GHA: pinned OpenVINO" by @ilya-lavrenov in #1835
- [GHA] Use TinyLlama-1.1B-Chat-v1.0 instead of LaMini-GPT-124M by @mryzhov in #1816
- Add support for inpainting/image2image pipeline to llm_bench by @sbalandi in #1806
- Update README.md by @SunnyLi2015 in #1837
- [JS] Setup genai nodejs bindings compilation for macos by @Retribution98 in #1738
- [CB]: allow int8 KV cache precision for CPU by @ilya-lavrenov in #1552
- [JS] Fix NPM CPACK_GENERATOR by @Retribution98 in #1842
- Update heterogeneous_stable_diffusion.py by @ilya-lavrenov in #1844
- [GHA] reworked lcm_dreamshaper and stable_diffusion_1_5_cpp pipelines by @mryzhov in #1836
- [llm_bench] Fix way with relative path of media for json prompts by @sbalandi in #1843
- CB: rely on CPU logic for KV cache precision and shape by @ilya-lavrenov in #1838
- [GHA] replaced benchmark_genai-ubuntu by @mryzhov in #1817
- [GHA] Replaced visual_language_chat_sample-ubuntu-llava by @mryzhov in #1802
- Strengthen perf_metrics test, rename var by @Wovchena in #1847
- Use get_max_new_tokens() instead of max_new_tokens field when stopping… by @michalkulakowski in #1417
- Fixed inference of HF version of internvl by @AlexKoff88 in #1849
- Prefix caching for sequences with embeddings. by @popovaan in #1841
- Extend VLM to run LM on NPU by @TolyaTalamanov in #1783
- [PR6][Test infra] Starting to use of pipeline types by @iefode in #1814
- CB: added error messages when CB backend is explicitly asked, but not available by @ilya-lavrenov in #1693
- Fix running generate with encoded inputs one after another with the same input data by @sbalandi in #1850
- [WWB]: align phi3_v by @Wovchena in #1853
- [Docs] Add initial docs pages version by @yatarkan in #1765
- Static Whisper: transformations for dual-model whisper by @eshiryae in #1820
- Implement CANCEL for streaming with VLM Pipeline by @sbalandi in #1725
- Add C API for LLMPipeline by @apinge in #1778
- Use Sampler for StaticWhisperPipeline by @eshiryae in #1713
- CB: rely on GPU logic for KV cache precision and shape by @sshlyapn in #1848
- [Image Generation] Flux Fill inpainting pipeline by @likholat in #1857
- MiniCPM-V-2_6: add native tag by @Wovchena in #1858
- [tokenizer] Fix setting max_length and special tokens flags by @pavel-esir in #1860
- GHA pin OpenVINO by @ilya-lavrenov in #1865
- FLUX: move FluxPipeline with pipeline type to protected by @ilya-lavrenov in #1864
- [JS] Split building and testing NodeJS for Linux CI by @Retribution98 in #1859
- NPU LLM: Add prefill hint (dynamic/static) by @dmatveev in #1867
- Add averaged results dumping to llm_bench output json by @nikita-savelyevv in #1862
- Add heterogeneous compile API for image2image & inpainting by @RyanMetcalfeInt8 in #1868
- CB constructor from ModelsMap by @popovaan in #1863
- BUILD: set rpath for GenAI to OpenVINO C when building wheel by @ilya-lavrenov in #1869
- SD3: fix case w/o T5 by @ilya-lavrenov in #1876
- Revert "GHA pin OpenVINO" by @ilya-lavrenov in #1880
- [Docs] Add supported models & introduction pages by @yatarkan in #1839
- Qwen2-VL: add native tag by @Wovchena in #1884
- Update tests to check encoded inputs with chat by @sbalandi in #1866
- BUILD: second attempt to fix RPATHs by @ilya-lavrenov in #1886
- Adjust the LLM pipeline C API to ensure it can determine the required sufficient size for the output. by @apinge in #1871
- [GHA] replaced cpp-Phi-1_5 by @mryzhov in #1798
- Tokenizers update by @ilya-lavrenov in #1879
- [GHA] removed cpp-greedy_causal_lm-windows by @mryzhov in #1887
- Chat mode for VLM Continuous Batching by @popovaan in #1872
- Removed LMS Discrete by @ilya-lavrenov in #1892
- Dedicated library for OpenVINO GenAI C by @ilya-lavrenov in #1896
- [Docs] Align list of supported LLMs with Optimum-Intel by @yatarkan in #1893
- Update cpp sample CMakeLists.txt by @sammysun0711 in #1898
- Fixed stubgen issue by @ilya-lavrenov in #1894
- Bump prismjs from 1.29.0 to 1.30.0 in /site in the npm_and_yarn group across 1 directory by @dependabot in #1881
- [GHA] Mac pipeline fixes by @mryzhov in #1903
- Added C API to cpack by @ilya-lavrenov in #1908
- [GHA] Removed visual_language_chat_sample-ubuntu-internvl2 and visual_language_chat_sample-ubuntu-qwen2vl by @mryzhov in #1888
- VLM: fix image separators by @Wovchena in #1902
- Use circular buffer of infer requests in VLM components by @mzegla in #1833
- Bump the npm_and_yarn group across 1 directory with 3 updates by @dependabot in #1899
- Update perf metrics for image generation pipeline to use get_performance_metrics by @sbalandi in #1895
- [llm_bench] Allow to provide scheduler config for vlm by @sbalandi in #1906
- Store EncodedImage objects in VLM CB chat history by @popovaan in #1901
- Move speculative decoding from streamer to metrics benchmarking approach by @sbalandi in #1904
- Updated tokenizers by @ilya-lavrenov in #1912
- Add function to create usm host tensor by @ahnyoung-paul in #1900
- VisionEncoderPhi3V: fix multiple infers by @Wovchena in #1914
- GHA: switch to 2025.1 branches by @ilya-lavrenov in #1919
- Enhance the flexibility of the C streamer by @apinge in #1940
- Revert perf regression changes by @dkalinowski in #1944
- VLM: change infer to start_async/wait by @dkalinowski in #1947
- Added possibility to generate base text on GPU for text evaluation by @ljaljushkin in #1955
- SDL tokenizers fixes by @mryzhov in #1958
- Synchronize entire embeddings calculation phase by @mzegla in #1967
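For #1533 (automatic chat template application in non-chat scenarios), a minimal Python sketch of how the feature is expected to be used; the model directory is a placeholder and the `apply_chat_template` opt-out field on `GenerationConfig` is an assumption based on the PR title, so verify it against the 2025.1 API reference:

```python
import openvino_genai as ov_genai

# Placeholder directory containing an OpenVINO GenAI LLM export.
pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")

# With #1533 the model's chat template is applied to plain generate() calls by default.
print(pipe.generate("What is OpenVINO?", max_new_tokens=64))

# Assumed opt-out: disable the template to pass the raw prompt to the model instead.
config = ov_genai.GenerationConfig()
config.max_new_tokens = 64
config.apply_chat_template = False
print(pipe.generate("What is OpenVINO?", config))
```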
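For #1476 and #1725 (choosing how to end streaming from a callback: STOP or CANCEL), a minimal Python sketch assuming the callback may return a `StreamingStatus` value; the enum name and its exact semantics should be checked against the released bindings:

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")

def streamer(subword: str):
    print(subword, end="", flush=True)
    # Assumed semantics per the PR titles: RUNNING continues generation,
    # STOP ends it but keeps the text produced so far, CANCEL ends it and
    # discards the partial result (and, in chat mode, the last turn).
    if "." in subword:
        return ov_genai.StreamingStatus.STOP
    return ov_genai.StreamingStatus.RUNNING

pipe.generate("Why is the Sun yellow?", max_new_tokens=64, streamer=streamer)
```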
New Contributors
- @t-jankowski made their first contribution in #1588
- @vishniakov-nikolai made their first contribution in #1193
- @akazakov-github made their first contribution in #1657
- @wkobielx made their first contribution in #1703
- @jszczepa made their first contribution in #1721
- @luke-lin-vmc made their first contribution in #1772
- @Huanli-Gong made their first contribution in #1784
- @SunnyLi2015 made their first contribution in #1837
- @michalkulakowski made their first contribution in #1417
- @ahnyoung-paul made their first contribution in #1900
Full Changelog: 2025.0.1.0...2025.1.0.0