CVE-2024-8939: ilab/vllm: best_of param causes inference API DoS
MEDIUM | PoC AVAILABLE
An attacker with local API access can crash your vllm inference server by sending requests with an inflated best_of parameter, consuming compute resources until the service becomes unresponsive. Patch ilab immediately if you run model serve in any shared or multi-tenant environment. If you cannot patch immediately, cap best_of at the API gateway and enforce per-request timeouts.
What is the risk?
The CVSS score is medium (6.2), and the local attack vector (AV:L) limits real-world exposure to contexts where the vllm API is reachable from localhost or an internal network. In containerized AI serving clusters, shared dev boxes, or misconfigured Kubernetes namespaces, that boundary often does not exist in practice. Attack complexity is low, no credentials are required, and exploitation is trivial: a single crafted request can trigger cascading resource exhaustion. Loss of availability on inference infrastructure is high-impact in any production LLM serving stack.
What should I do?
- Apply the Red Hat patch referenced in the advisory (bugzilla.redhat.com/2312782).
- Enforce a hard cap on best_of at the API gateway or reverse proxy layer: reject any request with best_of > 5.
- Configure per-request inference timeouts at the vllm level to bound maximum resource consumption.
- Apply rate limiting per client IP or API token on all inference endpoints.
- Restrict network access to the vllm API to authorized internal clients only (firewall or network policy).
- Alert on sustained CPU/GPU utilization spikes above baseline on inference nodes as an early DoS detection signal.
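The best_of cap in the steps above can be sketched as a small validation function that a gateway or proxy hook might call before forwarding a request. This is only an illustration under assumptions: the function name `check_best_of`, the constant `MAX_BEST_OF`, and the status/message return shape are hypothetical and not part of vllm or ilab; only the `best_of` field name and the threshold of 5 come from this advisory.

```python
import json

MAX_BEST_OF = 5  # threshold suggested in this advisory; tune to your workload


def check_best_of(raw_body: bytes, max_best_of: int = MAX_BEST_OF):
    """Return (status_code, message) for a JSON request body.

    200 means the request may be forwarded; 400 means it should be rejected.
    Hypothetical helper, not a vllm or ilab API.
    """
    try:
        payload = json.loads(raw_body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return 400, "body is not valid JSON"
    if not isinstance(payload, dict):
        return 400, "body must be a JSON object"

    best_of = payload.get("best_of")
    if best_of is None:
        # Parameter absent: nothing to cap, the server default applies.
        return 200, "ok"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(best_of, bool) or not isinstance(best_of, int):
        return 400, "best_of must be an integer"
    if best_of < 1 or best_of > max_best_of:
        return 400, f"best_of must be between 1 and {max_best_of}"
    return 200, "ok"
```

Rejecting at the edge means an oversized request never reaches the inference engine, so the check is useful both before and after the patch is applied.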
What does CISA's SSVC say?
Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.
Frequently Asked Questions
Is CVE-2024-8939 actively exploited?
Proof-of-concept exploit code is publicly available for CVE-2024-8939, increasing the risk of exploitation.
What systems are affected by CVE-2024-8939?
This vulnerability affects the following AI/ML architecture patterns: model serving, LLM inference endpoints, RAG pipelines, agent frameworks.
What is the CVSS score for CVE-2024-8939?
CVE-2024-8939 has a CVSS v3.1 base score of 6.2 (MEDIUM). The EPSS exploitation probability is 0.23%.
What is the AI security impact?
MITRE ATLAS Techniques
- AML.T0029 Denial of AI Service
- AML.T0034 Cost Harvesting
- AML.T0040 AI Model Inference API Access
What are the technical details?
Original Advisory
A vulnerability was found in the ilab model serve component, where improper handling of the best_of parameter in the vllm JSON web API can lead to a Denial of Service (DoS). The API used for LLM-based sentence or chat completion accepts a best_of parameter to return the best completion from several options. When this parameter is set to a large value, the API does not handle timeouts or resource exhaustion properly, allowing an attacker to cause a DoS by consuming excessive system resources. This leads to the API becoming unresponsive, preventing legitimate users from accessing the service.
Exploitation Scenario
An attacker with access to the internal network or a compromised container sharing the namespace sends repeated POST requests to the vllm /v1/chat/completions endpoint with best_of set to an extreme value such as 500 or 1000. Each request forces vllm to generate and internally score hundreds of parallel completions before returning the best result. Within minutes, GPU VRAM is saturated and CPU threads are exhausted. The API stops responding to legitimate requests. Downstream applications — RAG pipelines, AI agents, user-facing chatbots — start throwing timeouts or errors. The attacker can sustain this with minimal tooling (a single curl loop), and the service does not self-recover without operator intervention.
Weaknesses (CWE)
CWE-400 — Uncontrolled Resource Consumption: The product does not properly control the allocation and maintenance of a limited resource.
- [Architecture and Design] Design throttling mechanisms into the system architecture. The best protection is to limit the amount of resources that an unauthorized user can cause to be expended. A strong authentication and access control model will help prevent such attacks from occurring in the first place. The login application should be protected against DoS attacks as much as possible. Limiting the database access, perhaps by caching result sets, can help minimize the resources expended. To further limit the potential for a DoS attack, consider tracking the rate of requests received from users and blocking requests that exceed a defined rate threshold.
- [Architecture and Design] Mitigation of resource exhaustion attacks requires that the target system either: recognizes the attack and denies that user further access for a given amount of time, or uniformly throttles all requests in order to make it more difficult to consume resources more quickly than they can again be freed. The first of these solutions is an issue in itself though, since it may allow attackers to prevent the use of the system by a particular valid user. If the attacker impersonates the valid user, they may be able to prevent the user from accessing the server in question. The second solution is simply difficult to effectively institute, and even when properly done, it does not provide a full solution. It simply makes the attack require more resources on the part of the attacker.
Source: MITRE CWE corpus.
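The CWE guidance on tracking request rates and blocking clients that exceed a threshold can be illustrated with a sliding-window limiter keyed by client identifier. This is a sketch under assumptions: the class name `SlidingWindowLimiter`, its parameters, and the choice of a sliding window (rather than, say, a token bucket) are illustrative and not taken from the advisory or from vllm. The state is in-process only, so a multi-replica deployment would need a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_s seconds."""

    def __init__(self, max_requests: int, window_s: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # client_id -> timestamps of accepted requests

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

As the CWE text warns, per-client blocking can be abused if an attacker can impersonate a legitimate client, so the client identifier should be something the attacker cannot forge, such as an authenticated token.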
CVSS Vector
CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
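To show how this vector yields the 6.2 base score, here is a small calculator sketch using the metric weights and rounding rule from the CVSS v3.1 specification. It is deliberately limited to scope-unchanged (S:U) vectors like this one; the function names `roundup` and `base_score` are chosen for illustration.

```python
import math

# Metric weights from the CVSS v3.1 specification (PR values are for S:U).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(x: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a scope-unchanged CVSS v3.1 vector."""
    parts = dict(p.split(":") for p in vector.split("/")[1:])
    if parts["S"] != "U":
        raise ValueError("this sketch only handles S:U vectors")
    w = {k: WEIGHTS[k][parts[k]] for k in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H"))  # 6.2
```

For this vector the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is 8.22 × 0.55 × 0.77 × 0.85 × 0.85 ≈ 2.52, which sum to about 6.11 and round up to 6.2.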
Related Vulnerabilities
- CVE-2026-33660 (10.0) TensorFlow: type confusion NPD in tensor conversion. Same attack type: DoS
- CVE-2022-35939 (9.8) TensorFlow: ScatterNd OOB write enables RCE/crash. Same attack type: DoS
- CVE-2022-41900 (9.8) TensorFlow: heap OOB RCE in FractionalMaxPool op. Same attack type: DoS
- CVE-2022-23587 (9.8) TensorFlow: integer overflow in Grappler enables RCE. Same attack type: DoS
- CVE-2023-25668 (9.8) TensorFlow: unauthenticated RCE via heap buffer overflow. Same attack type: DoS