
build(deps): bump vllm from 0.15.1 to 0.17.0 in /python/kserve #1172

Open
dependabot[bot] wants to merge 1 commit into master from dependabot/uv/python/kserve/vllm-0.17.0

Conversation


@dependabot dependabot bot commented on behalf of github Mar 10, 2026

Bumps vllm from 0.15.1 to 0.17.0.

Release notes

Sourced from vllm's releases.

v0.17.0

vLLM v0.17.0

Known Issue: If you are on CUDA 12.9+ and encounter a CUBLAS_STATUS_INVALID_VALUE error, this is caused by a CUDA library mismatch. To resolve, try one of the following:

  1. Remove the path to system CUDA shared library files (e.g. /usr/local/cuda) from LD_LIBRARY_PATH, or simply unset LD_LIBRARY_PATH.
  2. Install vLLM with uv pip install vllm --torch-backend=auto.
  3. Install vLLM with pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129 (change the CUDA version to match your system).
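Before applying one of the fixes above, it can help to confirm the mismatch: compare the CUDA version the installed torch wheel was built against with what LD_LIBRARY_PATH is pulling in. A minimal diagnostic sketch (assumes torch is importable; the path check is heuristic, not part of vLLM):

```python
import os
import torch

# CUDA version the installed torch wheel was built against (e.g. "12.9")
print("torch built with CUDA:", torch.version.cuda)

# System CUDA paths on LD_LIBRARY_PATH can shadow the wheel's bundled libs
ld_path = os.environ.get("LD_LIBRARY_PATH", "")
suspects = [p for p in ld_path.split(":") if "cuda" in p.lower()]
if suspects:
    print("LD_LIBRARY_PATH entries that may shadow bundled CUDA libs:", suspects)
```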

Highlights

This release features 699 commits from 272 contributors (48 new)!

  • PyTorch 2.10 Upgrade: This release upgrades to PyTorch 2.10.0, which is a breaking change for environment dependencies.
  • FlashAttention 4 Integration: vLLM now supports the FlashAttention 4 backend (#32974), bringing next-generation attention performance.
  • Model Runner V2 Maturation: Model Runner V2 has reached a major milestone with Pipeline Parallel (#33960), Decode Context Parallel (#34179), Eagle3 speculative decoding with CUDA graphs (#35029, #35040), pooling model support (#35120), piecewise & mixed CUDA graph capture (#32771), DP+EP for spec decoding (#35294), and a new ModelState architecture. Design docs are now available (#35819).
  • Qwen3.5 Model Family: Full support for the Qwen3.5 model family (#34110) featuring GDN (Gated Delta Networks), with FP8 quantization, MTP speculative decoding, and reasoning parser support.
  • New --performance-mode Flag: A new --performance-mode {balanced, interactivity, throughput} flag (#34936) simplifies performance tuning for common deployment scenarios (a launch sketch follows this list).
  • Anthropic API Compatibility: Added support for Anthropic thinking blocks (#33671), count_tokens API (#35588), tool_choice=none (#35835), and streaming/image handling fixes.
  • Weight Offloading V2 with Prefetching: The weight offloader now hides onloading latency via prefetching (#29941), plus selective CPU weight offloading (#34535) and CPU offloading without pinned memory doubling (#32993).
  • Elastic Expert Parallelism Milestone 2: Initial support for elastic expert parallelism enabling dynamic GPU scaling for MoE models (#34861).
  • Quantized LoRA Adapters: Users can now load quantized LoRA adapters (e.g. QLoRA) directly (#30286).
  • Transformers v5 Compatibility: Extensive work to ensure compatibility with HuggingFace Transformers v5 across models and utilities.
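As a concrete illustration of the new flag, here is a minimal launch sketch. The --performance-mode values come from the release notes above; the model name is a placeholder, and launching via subprocess is just one convenient way to invoke the existing `vllm serve` CLI:

```python
import subprocess

# Start the OpenAI-compatible server with the new performance preset.
# "Qwen/Qwen2.5-7B-Instruct" is a placeholder model; swap in your own.
subprocess.run([
    "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct",
    "--performance-mode", "throughput",  # one of: balanced, interactivity, throughput
])
```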

Model Support

  • New architectures: Qwen3.5 (#34110), COLQwen3 (#34398), ColModernVBERT (#34558), Ring 2.5 (#35102), skt/A.X-K1 (#32407), Ovis 2.6 (#34426), nvidia/llama-nemotron-embed-vl-1b-v2 (#35297), nvidia/llama-nemotron-rerank-vl-1b-v2 (#35735), nvidia/nemotron-colembed (#34574).
  • ASR models: FunASR (#33247), FireRedASR2 (#35727), Qwen3-ASR realtime streaming (#34613).
  • Multimodal: OpenPangu-VL video input (#34134), audio chunking for offline LLM (#34628), Parakeet audio encoder for nemotron-nano-vl (#35100), MiniCPM-o flagos (#34126).
  • LoRA: LFM2 (#34921), Llama 4 Vision tower/connector (#35147), max vocab size increased to 258048 (#34773), quantized LoRA adapters (#30286; see the sketch after this list).
  • Task expansion: ColBERT extended to non-standard BERT backbones (#34170), multimodal scoring for late-interaction models (#34574).
  • Performance: Qwen3.5 GDN projector fusion (#34697), FlashInfer cuDNN backend for Qwen3 VL ViT (#34580), Step3.5-Flash NVFP4 (#34478), Qwen3MoE tuned configs for H200 (#35457).
  • Fixes: DeepSeek-VL V2 simplified loading (#35203), Qwen3/Qwen3.5 reasoning parser (#34779), Qwen2.5-Omni/Qwen3-Omni mixed-modality (#35368), Ernie4.5-VL garbled output (#35587), Qwen-VL tokenizer (#36140), Qwen-Omni audio cache (#35994), Nemotron-3-Nano NVFP4 accuracy with TP>1 (#34476).
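The quantized LoRA support called out above (#30286) suggests adapters trained with QLoRA can be passed through vLLM's established LoRA request path. A hedged sketch using that existing API; the model and adapter paths are placeholders, and the exact quantized-adapter behavior in 0.17.0 is assumed from the notes, not verified:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora turns on the LoRA request path; paths below are placeholders.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["Summarize the KServe project in one sentence."],
    SamplingParams(max_tokens=64),
    # LoRARequest(adapter name, unique int id, path to the (quantized) adapter)
    lora_request=LoRARequest("qlora-adapter", 1, "/path/to/qlora-adapter"),
)
print(outputs[0].outputs[0].text)
```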

Engine Core

  • Model Runner V2: Pipeline Parallel (#33960), Decode Context Parallel (#34179), piecewise & mixed CUDA graphs (#32771), Eagle3 with CUDA graphs (#35029, #35040), pooling models (#35120), DP+EP for spec decoding (#35294), bad_words sampling (#33433), ModelState architecture (#35350, #35383, #35564, #35621, #35774), design docs (#35819).
  • Weight offloading: V2 prefetching to hide latency (#29941), selective CPU weight offloading (#34535), CPU offloading without pinned memory doubling (#32993); a configuration sketch follows this list.
  • Sleep level 0 mode with enqueue/wait pattern (#33195), pause/resume moved into engine (#34125).
  • Fixes: allreduce_rms_fusion disabled by default with PP > 1 (#35424), DCP + FA3 crash (#35082), prefix caching for Mamba "all" mode (#34874), num_active_loras fix (#34119), async TP reduce-scatter reduction fix (#33088).
  • Repetitive token pattern detection flags (#35451).
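For the weight offloading items, the long-standing cpu_offload_gb engine argument is the most likely entry point; whether the new selective and prefetching behavior (#34535, #29941) requires additional knobs is not confirmed here. A minimal sketch:

```python
from vllm import LLM

# Offload up to 4 GiB of weights to CPU; with the V2 offloader, prefetching
# should hide most of the onload latency (per the notes above).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    cpu_offload_gb=4,
)
```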

Kernel

  • FlashAttention 4 integration (#32974); see the backend-selection sketch after this list.
  • FlashInfer Sparse MLA backend (#33451).
  • Triton-based top-k and top-p sampler kernels (#33538).
  • Faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680).
  • Optimized grouped topk kernel (#34206).
  • TRTLLM DSV3 Router GEMM kernel, 6% batch-1 speedup (#34302).
  • FA3 swizzle optimization (#34043).
  • 256-bit LDG/STG activation kernels (#33022).
  • TMA support for fused_moe_lora kernel (#32195).
  • Helion kernel framework: silu_mul_fp8 kernel (#33373), autotuning infrastructure (#34025), num_tokens autotuning (#34185), fx tracing via HOP (#34390), GPU variant canonicalization (#34928).
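Attention backend selection in vLLM is typically done via the VLLM_ATTENTION_BACKEND environment variable. The identifier for the new FlashAttention 4 backend is not given in these notes, so the value below uses the existing FlashAttention name as a stand-in; check the 0.17.0 docs for the FA4-specific value:

```python
import os

# "FLASH_ATTN" is an existing backend identifier; the FA4-specific name
# (if distinct) is an assumption -- consult the 0.17.0 docs for the real value.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM  # import after setting the env var so it takes effect

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
```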

... (truncated)

Commits
  • b31e932 Bound openai to under 2.25.0
  • e346c08 [Release] Include source distribution (sdist) in PyPI uploads (#35136)
  • b7a423c [BUGFIX]Fix Qwen-Omni models audio max_token_per_item estimation error leadin...
  • fa78ec8 [Bugfix] Fix Qwen-VL tokenizer implementation (#36140)
  • 9a474ce [XPU] bump vllm-xpu-kernels to v0.1.3 (#35984)
  • 097eb54 [Bugfix] Improve engine ready timeout error message (#35616)
  • 7cdba98 [BugFix] Support tool_choice=none in the Anthropic API (#35835)
  • 3c85cd9 [Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) (#35913)
  • edba150 [Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions (#35711)
  • e379396 [Refactor] Clean up processor kwargs extraction (#35872)
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

You can disable automated security fix PRs for this repo from the Security Alerts page.

Bumps [vllm](https://github.com/vllm-project/vllm) from 0.15.1 to 0.17.0.
- [Release notes](https://github.com/vllm-project/vllm/releases)
- [Changelog](https://github.com/vllm-project/vllm/blob/main/RELEASE.md)
- [Commits](vllm-project/vllm@v0.15.1...v0.17.0)

---
updated-dependencies:
- dependency-name: vllm
  dependency-version: 0.17.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot added the dependencies (Pull requests that update a dependency file) and python:uv (Pull requests that update python:uv code) labels on Mar 10, 2026

openshift-ci bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dependabot[bot]
Once this PR has been reviewed and has the lgtm label, please assign mholder6 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Mar 10, 2026

Hi @dependabot[bot]. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

dependencies (Pull requests that update a dependency file), needs-ok-to-test, python:uv (Pull requests that update python:uv code)

Projects

Status: New/Backlog
