CVE-2024-8939: ilab/vllm: best_of param causes inference API DoS
MEDIUM | PoC AVAILABLE

An attacker with local API access can crash your vllm inference server by sending requests with an inflated best_of parameter, consuming all compute resources until the service becomes unresponsive. Patch ilab immediately if running model serve in any shared or multi-tenant environment. If patching is not immediate, cap best_of input at the API gateway and enforce per-request timeouts.
Risk Assessment
Despite a medium CVSS (6.2), the local attack vector (AV:L) narrows real-world exposure to contexts where the vllm API is reachable from localhost or an internal network. In containerized AI serving clusters, shared dev boxes, or misconfigured Kubernetes namespaces, that boundary is often non-existent. Attack complexity is low, no credentials are required, and the exploitation path is trivial — a single crafted request can trigger cascading resource exhaustion. Availability loss on inference infrastructure is high-impact in any production LLM serving stack.
Recommended Action
1. Apply the Red Hat patch referenced in the advisory (bugzilla.redhat.com/2312782).
2. Enforce a hard cap on best_of at the API gateway or reverse proxy layer: reject any request with best_of > 5 (see the gateway sketch after this list).
3. Configure per-request inference timeouts at the vllm level to bound maximum resource consumption.
4. Apply rate limiting per client IP or API token on all inference endpoints.
5. Restrict network access to the vllm API to authorized internal clients only (firewall or network policy).
6. Alert on sustained CPU/GPU utilization spikes above baseline on inference nodes as an early DoS detection signal (see the monitoring sketch after this list).
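The cap in step 2 can live in whatever gateway already fronts your inference tier. As one illustration, here is a minimal sketch assuming a FastAPI-based proxy in front of vllm; the cap value mirrors the recommendation above, while the stub response and the forwarding step are placeholders, not part of the advisory.

```python
# Minimal sketch: reject an inflated best_of at the gateway before it
# ever reaches vllm. Assumes a FastAPI-based proxy; the endpoint path
# follows the OpenAI-style API that vllm exposes.
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
BEST_OF_CAP = 5  # hard cap recommended in step 2

@app.post("/v1/chat/completions")
async def guard_completions(request: Request):
    body = await request.json()
    best_of = body.get("best_of", 1)
    if not isinstance(best_of, int) or not 1 <= best_of <= BEST_OF_CAP:
        # Fail closed: never forward an out-of-range best_of upstream.
        raise HTTPException(
            status_code=400,
            detail=f"best_of must be an integer in [1, {BEST_OF_CAP}]",
        )
    # ... forward the validated body to the internal vllm upstream here ...
    return {"status": "accepted"}
```

Run the gateway under uvicorn and point clients at it instead of at vllm directly, so the check cannot be bypassed. For step 6, a minimal polling sketch using the real nvidia-smi CLI is shown below; the threshold, polling window, and print-based alert are illustrative assumptions to be replaced by your actual monitoring stack.

```python
# Detection sketch for step 6: poll GPU utilization via nvidia-smi and
# flag sustained saturation, an early signal of a best_of resource DoS.
import subprocess
import time

THRESHOLD = 95       # percent utilization treated as anomalous (assumption)
SUSTAINED_POLLS = 6  # consecutive hot polls (~1 min at 10 s) before alerting

def gpu_utilization() -> list[int]:
    """Return per-GPU utilization percentages from nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

hot_polls = 0
while True:
    if any(u >= THRESHOLD for u in gpu_utilization()):
        hot_polls += 1
    else:
        hot_polls = 0
    if hot_polls >= SUSTAINED_POLLS:
        # Replace with a real alerting hook (pager, Slack webhook, etc.).
        print("ALERT: sustained GPU saturation; possible best_of DoS")
        hot_polls = 0
    time.sleep(10)
```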
CISA SSVC Assessment
Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.
Frequently Asked Questions
What is CVE-2024-8939?
An attacker with local API access can crash your vllm inference server by sending requests with an inflated best_of parameter, consuming all compute resources until the service becomes unresponsive. Patch ilab immediately if running model serve in any shared or multi-tenant environment. If patching is not immediate, cap best_of input at the API gateway and enforce per-request timeouts.
Is CVE-2024-8939 actively exploited?
Proof-of-concept exploit code is publicly available for CVE-2024-8939, increasing the risk of exploitation.
How to fix CVE-2024-8939?
1. Apply the Red Hat patch referenced in the advisory (bugzilla.redhat.com/2312782).
2. Enforce a hard cap on best_of at the API gateway or reverse proxy layer; reject any request with best_of > 5.
3. Configure per-request inference timeouts at the vllm level to bound maximum resource consumption.
4. Apply rate limiting per client IP or API token on all inference endpoints.
5. Restrict network access to the vllm API to authorized internal clients only (firewall or network policy).
6. Alert on sustained CPU/GPU utilization spikes above baseline on inference nodes as an early DoS detection signal.
What systems are affected by CVE-2024-8939?
This vulnerability affects the following AI/ML architecture patterns: model serving, LLM inference endpoints, RAG pipelines, agent frameworks.
What is the CVSS score for CVE-2024-8939?
CVE-2024-8939 has a CVSS v3.1 base score of 6.2 (MEDIUM). The EPSS exploitation probability is 0.03%.
Technical Details
NVD Description
A vulnerability was found in the ilab model serve component, where improper handling of the best_of parameter in the vllm JSON web API can lead to a Denial of Service (DoS). The API used for LLM-based sentence or chat completion accepts a best_of parameter to return the best completion from several options. When this parameter is set to a large value, the API does not handle timeouts or resource exhaustion properly, allowing an attacker to cause a DoS by consuming excessive system resources. This leads to the API becoming unresponsive, preventing legitimate users from accessing the service.
Exploitation Scenario
An attacker with access to the internal network or a compromised container sharing the namespace sends repeated POST requests to the vllm /v1/chat/completions endpoint with best_of set to an extreme value such as 500 or 1000. Each request forces vllm to generate and internally score hundreds of parallel completions before returning the best result. Within minutes, GPU VRAM is saturated and CPU threads are exhausted. The API stops responding to legitimate requests. Downstream applications — RAG pipelines, AI agents, user-facing chatbots — start throwing timeouts or errors. The attacker can sustain this with minimal tooling (a single curl loop), and the service does not self-recover without operator intervention.
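To validate a mitigation, the request shape described above can be reproduced in an isolated lab with a short script. This is a sketch only: the host, port, model name, and best_of value are illustrative assumptions, and it should never be pointed at a production endpoint.

```python
# Lab-only sketch of the request shape from the scenario above, used to
# confirm that a gateway cap or patched server rejects an inflated best_of.
import requests

payload = {
    "model": "test-model",  # illustrative model name
    "messages": [{"role": "user", "content": "hello"}],
    "best_of": 500,  # the inflated value a mitigated stack should reject
}
try:
    resp = requests.post(
        "http://127.0.0.1:8000/v1/chat/completions",  # assumed lab endpoint
        json=payload,
        timeout=10,
    )
    # A gateway enforcing the cap should answer 400 immediately.
    print(resp.status_code, resp.text[:200])
except requests.exceptions.Timeout:
    # A vulnerable server may simply stop answering while it burns
    # compute generating and scoring hundreds of parallel completions.
    print("request timed out")
```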
CVSS Vector
CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
Related Vulnerabilities
- CVE-2026-33660 (CVSS 10.0), TensorFlow: type confusion NPD in tensor conversion (same attack type: DoS)
- CVE-2022-35939 (CVSS 9.8), TensorFlow: ScatterNd OOB write enables RCE/crash (same attack type: DoS)
- CVE-2022-41900 (CVSS 9.8), TensorFlow: heap OOB RCE in FractionalMaxPool op (same attack type: DoS)
- CVE-2022-23587 (CVSS 9.8), TensorFlow: integer overflow in Grappler enables RCE (same attack type: DoS)
- CVE-2023-25668 (CVSS 9.8), TensorFlow: unauthenticated RCE via heap buffer overflow (same attack type: DoS)