CVE-2024-8939: ilab/vllm: best_of param causes inference API DoS

MEDIUM PoC AVAILABLE
Published September 17, 2024
CISO Take

An attacker with local API access can make your vllm inference server unresponsive by sending requests with an inflated best_of parameter, which consumes compute resources until the service stops answering. Patch ilab immediately if running model serve in any shared or multi-tenant environment. If you cannot patch immediately, cap best_of at the API gateway and enforce per-request timeouts.

What is the risk?

The CVSS score is medium (6.2) largely because the local attack vector (AV:L) limits real-world exposure to contexts where the vllm API is reachable from localhost or an internal network. In containerized AI serving clusters, shared dev boxes, or misconfigured Kubernetes namespaces, that boundary is often weak or absent. Attack complexity is low, no credentials are required, and exploitation is trivial: a single crafted request can trigger cascading resource exhaustion. Loss of availability on inference infrastructure is high-impact in any production LLM serving stack.

How severe is it?

CVSS 3.1
6.2 / 10
EPSS
0.2%
chance of exploitation in 30 days
Higher than 13% of all CVEs
Exploitation Status
Exploit Available
Exploitation: MEDIUM
Sophistication
Trivial
Exploitation Confidence
medium
Public PoC indexed (trickest/cve)
Composite signal derived from CISA KEV, VulnCheck KEV, CISA SSVC, EPSS, Metasploit, Exploit-DB, trickest/cve, Nuclei templates, and inthewild.io exploitation reports.

What is the attack surface?

AV Local
AC Low
PR None
UI None
S Unchanged
C None
I None
A High

What should I do?

  1. Apply the Red Hat patch referenced in the advisory (bugzilla.redhat.com/2312782).

  2. Enforce a hard cap on best_of at the API gateway or reverse proxy layer — reject any request with best_of > 5.

  3. Configure per-request inference timeouts at the vllm level to bound maximum resource consumption.

  4. Apply rate limiting per client IP or API token on all inference endpoints.

  5. Restrict network access to the vllm API to authorized internal clients only (firewall or network policy).

  6. Alert on sustained CPU/GPU utilization spikes above baseline on inference nodes as an early DoS detection signal.
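
The best_of cap in step 2 can be sketched as a small validation function that a gateway or proxy hook calls before forwarding a request. This is a minimal sketch, not the vendor fix: the function name check_best_of and the constant MAX_BEST_OF are hypothetical, the limit of 5 comes from step 2, and treating a missing or null best_of as acceptable is an assumption.

```python
import json

MAX_BEST_OF = 5  # limit suggested in step 2; tune for your workload


def check_best_of(raw_body: bytes, limit: int = MAX_BEST_OF) -> tuple[bool, str]:
    """Return (allowed, reason) for a JSON completion request body."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "body must be a JSON object"

    best_of = payload.get("best_of")
    if best_of is None:
        # Assumption: an absent or null best_of falls back to the server default.
        return True, "ok"
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(best_of, bool) or not isinstance(best_of, int):
        return False, "best_of must be an integer"
    if best_of < 1 or best_of > limit:
        return False, f"best_of must be between 1 and {limit}"
    return True, "ok"
```

A gateway would return an HTTP 4xx response when the first element is False and forward the request otherwise. Rejecting non-integer values matters as much as the upper bound, since a validator that only compares numbers can be bypassed with unexpected types.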

What does CISA's SSVC say?

Decision Track
Exploitation none
Automatable No
Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
Art. 9 - Risk Management System
ISO 42001
A.9.2 - AI System Availability and Resilience
NIST AI RMF
MANAGE-2.2 - Mechanisms for sustaining value of deployed AI systems
OWASP LLM Top 10
LLM04 - Model Denial of Service

Frequently Asked Questions

What is CVE-2024-8939?

CVE-2024-8939 is a denial-of-service vulnerability in the ilab model serve component. The vllm JSON web API accepts a best_of parameter, and when it is set to a large value the API does not handle timeouts or resource exhaustion properly, so an attacker with local API access can consume excessive system resources until the service becomes unresponsive. Patch ilab immediately if running model serve in any shared or multi-tenant environment; if you cannot patch right away, cap best_of at the API gateway and enforce per-request timeouts.

Is CVE-2024-8939 actively exploited?

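CISA's SSVC assessment lists exploitation as none, so that data records no active exploitation. However, proof-of-concept exploit code is publicly available for CVE-2024-8939, which increases the risk of exploitation.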

How to fix CVE-2024-8939?

  1. Apply the Red Hat patch referenced in the advisory (bugzilla.redhat.com/2312782).

  2. Enforce a hard cap on best_of at the API gateway or reverse proxy layer: reject any request with best_of > 5.

  3. Configure per-request inference timeouts at the vllm level to bound maximum resource consumption.

  4. Apply rate limiting per client IP or API token on all inference endpoints.

  5. Restrict network access to the vllm API to authorized internal clients only (firewall or network policy).

  6. Alert on sustained CPU/GPU utilization spikes above baseline on inference nodes as an early DoS detection signal.

What systems are affected by CVE-2024-8939?

This vulnerability affects the following AI/ML architecture patterns: model serving, LLM inference endpoints, RAG pipelines, agent frameworks.

What is the CVSS score for CVE-2024-8939?

CVE-2024-8939 has a CVSS v3.1 base score of 6.2 (MEDIUM). The EPSS exploitation probability is 0.23%.

What is the AI security impact?

Affected AI Architectures

model serving, LLM inference endpoints, RAG pipelines, agent frameworks

MITRE ATLAS Techniques

AML.T0029 Denial of AI Service
AML.T0034 Cost Harvesting
AML.T0040 AI Model Inference API Access

Compliance Controls Affected

EU AI Act: Art. 9
ISO 42001: A.9.2
NIST AI RMF: MANAGE-2.2
OWASP LLM Top 10: LLM04

What are the technical details?

Original Advisory

A vulnerability was found in the ilab model serve component, where improper handling of the best_of parameter in the vllm JSON web API can lead to a Denial of Service (DoS). The API used for LLM-based sentence or chat completion accepts a best_of parameter to return the best completion from several options. When this parameter is set to a large value, the API does not handle timeouts or resource exhaustion properly, allowing an attacker to cause a DoS by consuming excessive system resources. This leads to the API becoming unresponsive, preventing legitimate users from accessing the service.
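
The missing timeout handling described above can be illustrated with a generic deadline wrapper. This is a minimal sketch using Python's asyncio.wait_for, not the ilab or vllm fix: run_with_deadline and fake_inference are hypothetical names, and fake_inference only sleeps in place of a real inference call.

```python
import asyncio


async def run_with_deadline(make_request, timeout_s: float):
    """Await make_request() but give up after timeout_s seconds.

    Returns (True, result) on success and (False, None) on timeout.
    """
    try:
        # wait_for cancels the awaited coroutine when the deadline passes.
        result = await asyncio.wait_for(make_request(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return False, None
    return True, result


async def fake_inference(delay_s: float) -> str:
    # Stand-in for a real inference call; it only sleeps.
    await asyncio.sleep(delay_s)
    return "completion"


if __name__ == "__main__":
    print(asyncio.run(run_with_deadline(lambda: fake_inference(0.01), 1.0)))
    print(asyncio.run(run_with_deadline(lambda: fake_inference(1.0), 0.05)))
```

A deadline enforced in a proxy only frees the proxy's own connection. Whether the backend also stops generating depends on how it handles cancelled or disconnected requests, so a server-side limit is still needed.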

Exploitation Scenario

An attacker with access to the internal network or a compromised container sharing the namespace sends repeated POST requests to the vllm /v1/chat/completions endpoint with best_of set to an extreme value such as 500 or 1000. Each request forces vllm to generate and internally score hundreds of parallel completions before returning the best result. Within minutes, GPU VRAM is saturated and CPU threads are exhausted. The API stops responding to legitimate requests. Downstream applications — RAG pipelines, AI agents, user-facing chatbots — start throwing timeouts or errors. The attacker can sustain this with minimal tooling (a single curl loop), and the service does not self-recover without operator intervention.
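
A rough way to see why one request is so expensive is to bound the work it asks for. The sketch below assumes that each of the best_of candidates may run up to a per-completion token limit; the max_tokens parameter name and the value 256 are illustrative assumptions, not figures from the advisory.

```python
def worst_case_generated_tokens(best_of: int, max_tokens: int) -> int:
    """Upper bound on tokens a single request can ask the server to generate.

    Assumes every one of the best_of candidates runs to max_tokens.
    """
    if best_of < 1 or max_tokens < 0:
        raise ValueError("best_of must be >= 1 and max_tokens must be >= 0")
    return best_of * max_tokens


if __name__ == "__main__":
    normal = worst_case_generated_tokens(best_of=1, max_tokens=256)
    inflated = worst_case_generated_tokens(best_of=1000, max_tokens=256)
    print(normal, inflated, inflated // normal)
```

Under these assumptions the requested work grows linearly with best_of, which is why capping the parameter bounds the cost of any single request.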

Weaknesses (CWE)

CWE-400 — Uncontrolled Resource Consumption: The product does not properly control the allocation and maintenance of a limited resource.

  • [Architecture and Design] Design throttling mechanisms into the system architecture. The best protection is to limit the amount of resources that an unauthorized user can cause to be expended. A strong authentication and access control model will help prevent such attacks from occurring in the first place. The login application should be protected against DoS attacks as much as possible. Limiting the database access, perhaps by caching result sets, can help minimize the resources expended. To further limit the potential for a DoS attack, consider tracking the rate of requests received from users and blocking requests that exceed a defined rate threshold.
  • [Architecture and Design] Mitigation of resource exhaustion attacks requires that the target system either recognizes the attack and denies that user further access for a given amount of time, or uniformly throttles all requests in order to make it more difficult to consume resources more quickly than they can again be freed. The first of these solutions is an issue in itself though, since it may allow attackers to prevent the use of the system by a particular valid user. If the attacker impersonates the valid user, they may be able to prevent the user from accessing the server in question. The second solution is simply difficult to effectively institute, and even when properly done, it does not provide a full solution. It simply makes the attack require more resources on the part of the attacker.

Source: MITRE CWE corpus.
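
The request-rate tracking mentioned in the first mitigation can be sketched as a per-client token bucket. This is an illustrative sketch and not part of the MITRE text: the class names TokenBucket and PerClientLimiter are hypothetical, and the client identifier could be an IP address or API token, as in the remediation steps.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientLimiter:
    """Keep one bucket per client identifier (IP address, API token, ...)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}  # grows with distinct clients; evict in real use

    def allow(self, client_id: str, cost: float = 1.0) -> bool:
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow(cost)
```

The cost argument lets a caller charge an expensive request more than one token, for example a cost proportional to best_of. As the CWE text notes, throttling raises the effort an attacker needs but is not a full solution.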

CVSS Vector

CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
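
The 6.2 base score can be reproduced from this vector with the CVSS v3.1 base equations. The sketch below handles only scope-unchanged vectors (S:U), which is enough for this CVE; the function names are illustrative, and the metric weights are those published in the CVSS v3.1 specification.

```python
import math

# Metric weights from the CVSS v3.1 specification (scope unchanged).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value: float) -> float:
    """CVSS v3.1 Roundup: smallest one-decimal number >= value."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(vector: str) -> float:
    """Compute the base score for a CVSS:3.1 vector with S:U."""
    parts = vector.split("/")
    if parts[0] != "CVSS:3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    m = dict(part.split(":") for part in parts[1:])
    if m["S"] != "U":
        raise ValueError("this sketch only handles S:U")

    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * PR_UNCHANGED[m["PR"]] * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


if __name__ == "__main__":
    print(base_score_scope_unchanged("CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H"))  # 6.2
```

For this vector the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is 8.22 × 0.55 × 0.77 × 0.85 × 0.85 ≈ 2.52, which sum to about 6.11 and round up to 6.2. The local attack vector weight (0.55 instead of 0.85 for network) is what keeps the score in the medium band.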

Timeline

Published
September 17, 2024
Last Modified
September 20, 2024
First Seen
September 17, 2024

Related Vulnerabilities