vLLM Speculative Decoding Penalty Parameter Handling Causes Engine Crash Vulnerability

Vulnerability

A vulnerability in vLLM, an inference engine for large language models, crashes the server when the 'extract_hidden_states' speculative decoding method is used together with sampling penalty parameters. It affects vLLM versions from 0.18.0 up to, but not including, 0.20.0. The crash occurs because the 'extract_hidden_states' proposer returns a tensor with an incorrect shape after the first decoding step, which produces a shape mismatch error. The failure is immediate and consistent whenever any request in a batch includes a sampling penalty, such as 'repetition_penalty', 'frequency_penalty', or 'presence_penalty'.

Impact

Exploiting this vulnerability causes the EngineCore process to crash, leading to a complete loss of service availability.

Reproduction

To reproduce this vulnerability, send a batch request to a vLLM server running a vulnerable version with the 'extract_hidden_states' speculative decoding method enabled. Ensure that at least one request in the batch includes a sampling penalty parameter, such as 'repetition_penalty'. The server will crash immediately after processing the request with the penalty parameter.
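For checking a test deployment, the request bodies described above might look like the following minimal sketch. It only builds the JSON bodies and does not send them. The model name and the helper functions are hypothetical, and the sketch assumes the server is reached through vLLM's OpenAI-compatible completions API, which the advisory does not state.

```python
import json

# Hypothetical model name for illustration; use the model your test server loads.
MODEL_NAME = "my-model"


def build_completion_request(prompt, repetition_penalty=None):
    """Build a JSON-serializable body for an OpenAI-style completions request."""
    body = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": 16}
    if repetition_penalty is not None:
        # 'repetition_penalty' is one of the penalty parameters named in the advisory.
        body["repetition_penalty"] = repetition_penalty
    return body


def build_test_batch():
    """Return two request bodies; only the second carries a sampling penalty."""
    return [
        build_completion_request("Hello"),
        build_completion_request("Hello again", repetition_penalty=1.2),
    ]


batch = build_test_batch()
print(json.dumps(batch, indent=2))
```

According to the advisory, a single request with a penalty in the batch is enough to trigger the crash, so the first body is included only to show that the other requests do not need one.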

Remediation

Upgrade to vLLM version 0.20.0 or later. If an upgrade is not possible, avoid using the 'extract_hidden_states' method with penalty parameters on affected versions.
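Where an upgrade has to wait, the workaround could be enforced in a gateway or client wrapper in front of the server. The sketch below is one possible approach, not part of vLLM: the helper names are hypothetical, the version check assumes plain three-part 'X.Y.Z' strings without pre-release suffixes, and dropping the penalty keys changes sampling behavior for the affected requests.

```python
# Penalty parameters named in the advisory.
PENALTY_KEYS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def parse_version(version):
    """Parse a plain 'X.Y.Z' string into a tuple of ints (assumes no suffixes)."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version):
    """True if the version lies in the advisory's range: >= 0.18.0 and < 0.20.0."""
    return (0, 18, 0) <= parse_version(version) < (0, 20, 0)


def strip_penalties(request_body):
    """Return a copy of the request body without any sampling penalty keys."""
    return {k: v for k, v in request_body.items() if k not in PENALTY_KEYS}


def sanitize(request_body, server_version):
    """Drop penalty parameters only when the server version is affected."""
    if is_affected(server_version):
        return strip_penalties(request_body)
    return dict(request_body)
```

For example, sanitize({"prompt": "Hi", "presence_penalty": 0.5}, "0.19.1") returns a body containing only the prompt, while the same call with "0.20.0" leaves the penalty in place.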

Added: May 12, 2026, 8:42 PM
Updated: May 12, 2026, 8:42 PM

Vulnerability Rating

Custom Algorithm
spread: 2.6
impact: 2.5
exploitability: 7.3
remediation: 7.7
relevance: 8.1
threat: 4.8
urgency: 2.9
incentive: 0.0

Our algorithm analyzes dozens of metrics to generate these 8 key vulnerability categories, which are then combined to calculate the overall risk score.