CVE-2026-0599: TGI DoS causes service disruption

Q: Is CVE-2026-0599 actively exploited?

No confirmed active exploitation of CVE-2026-0599 has been reported, but organizations should still patch proactively.

Q: How to fix CVE-2026-0599?

1. PATCH: Upgrade to text-generation-inference 3.3.7 immediately. This is the definitive fix.
2. INTERIM (if patching is blocked): Enable API authentication via the --authentication-config flag to require bearer tokens; this prevents unauthenticated exploitation.
3. ADD EGRESS CONTROLS: Restrict outbound HTTP from the TGI process/container to internal or whitelisted endpoints only. This breaks the attack chain by preventing external image fetching.
4. ENFORCE MEMORY LIMITS: Set container memory limits (Docker: --memory=Xg, Kubernetes: resources.limits.memory) to contain blast radius and prevent host OOM.
5. DEPLOY API GATEWAY: Place TGI behind an API gateway or reverse proxy with rate limiting and request body size limits (e.g., nginx client_max_body_size, Kong rate-limit plugin).
6. DETECTION: Alert on anomalous memory growth spikes in inference containers, unusual outbound bandwidth from inference pods, and repeated 429/413 response codes paired with sustained resource utilization.

Q: What systems are affected by CVE-2026-0599?

This vulnerability affects the following AI/ML architecture patterns: multimodal/VLM inference serving, LLM inference servers, model serving, self-hosted AI APIs, AI agent frameworks with vision capabilities.

Q: What is the CVSS score for CVE-2026-0599?

CVE-2026-0599 has a CVSS v3.1 base score of 7.5 (HIGH). The EPSS exploitation probability is 23.72%.

CISO Take

If you're running HuggingFace TGI in VLM (multimodal) mode, patch to 3.3.7 now — this is a trivial, unauthenticated DoS that can crash your inference host with a single crafted request. Default deployments have no memory limits and no authentication, meaning your entire AI inference stack is one HTTP request away from an OOM crash. Treat this as critical if your AI pipelines serve multimodal workloads without an auth layer or network egress controls.

What is the risk?

Effective risk is higher than CVSS 7.5 suggests for AI-specific deployments. The attack requires zero credentials, zero AI/ML knowledge, and zero user interaction: a single POST with a Markdown image URL pointing to a large resource is enough. Default TGI deployments (as documented by HuggingFace) expose the inference API without authentication, maximizing blast radius. No active exploitation has been confirmed, but the EPSS score of 23.72% is high and the technique is trivially discoverable. Organizations running multimodal LLM inference at scale face compounded risk: a single attacker can saturate bandwidth, exhaust memory, and spike CPU simultaneously, crashing the host before TGI's own token-limit validation rejects the request.

What systems are affected?

Package	Ecosystem	Vulnerable Range	Patched
TGI	pip	< 3.3.7	`3.3.7`

If you run a TGI version older than 3.3.7 in VLM mode, you are affected.

How severe is it?

CVSS 3.1: 7.5 / 10

EPSS: 23.7% chance of exploitation in 30 days (higher than 98% of all CVEs)

Source: EPSS v3, FIRST.org

Exploitation Status

Exploit Available

Exploitation: MEDIUM

Sophistication: Trivial

Exploitation Confidence: medium

○ CISA SSVC: Public PoC

○ EPSS exploit prediction: 24%

Composite signal derived from CISA KEV, VulnCheck KEV, CISA SSVC, EPSS, Metasploit, Exploit-DB, trickest/cve, Nuclei templates, and inthewild.io exploitation reports.

What is the attack surface?

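Metric	Value
Attack Vector (AV)	Network
Attack Complexity (AC)	Low
Privileges Required (PR)	None
User Interaction (UI)	None
Scope (S)	Unchanged
Confidentiality (C)	None
Integrity (I)	None
Availability (A)	High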
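
The 7.5 base score follows from the metrics above. The sketch below recomputes it with the CVSS v3.1 base-score equations for Scope: Unchanged. The vector string is assembled here from the listed metrics rather than quoted from the advisory, and the function names are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification (Scope: Unchanged only).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR = {"N": 0.85, "L": 0.62, "H": 0.27}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the CVSS v3.1 spec."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score of a Scope: Unchanged CVSS v3.1 vector."""
    # Skip the leading "CVSS:3.1" prefix, then split each "KEY:VALUE" pair.
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    if metrics["S"] != "U":
        raise ValueError("this sketch only handles Scope: Unchanged")
    iss = 1 - (
        (1 - CIA[metrics["C"]])
        * (1 - CIA[metrics["I"]])
        * (1 - CIA[metrics["A"]])
    )
    impact = 6.42 * iss
    exploitability = (
        8.22
        * AV[metrics["AV"]]
        * AC[metrics["AC"]]
        * PR[metrics["PR"]]
        * UI[metrics["UI"]]
    )
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


# Vector assembled from the metrics listed for CVE-2026-0599.
print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H"))
```

The availability-only impact (C and I both None) is what keeps the score at 7.5 rather than in the critical range.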

What should I do?

6 steps

PATCH

Upgrade to text-generation-inference 3.3.7 immediately. This is the definitive fix.

INTERIM (if patching is blocked)

Enable API authentication via the --authentication-config flag to require bearer tokens; this prevents unauthenticated exploitation.

ADD EGRESS CONTROLS

Restrict outbound HTTP from the TGI process/container to internal or whitelisted endpoints only. This breaks the attack chain by preventing external image fetching.

ENFORCE MEMORY LIMITS

Set container memory limits (Docker: --memory=Xg, Kubernetes: resources.limits.memory) to contain blast radius and prevent host OOM.

DEPLOY API GATEWAY

Place TGI behind an API gateway or reverse proxy with rate limiting and request body size limits (e.g., nginx client_max_body_size, Kong rate-limit plugin).

DETECTION

Alert on anomalous memory growth spikes in inference containers, unusual outbound bandwidth from inference pods, and repeated 429/413 response codes paired with sustained resource utilization.
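
Where a gateway or proxy layer can run custom logic, one possible complement to the egress and gateway steps is to reject prompts whose Markdown image links point outside an allowlist before they reach TGI. The following is a minimal sketch under assumptions: the allowlisted host, the function name, and the regular expression are all hypothetical, and the pattern is not guaranteed to match exactly what the TGI router parses, so it should not replace patching or network-level egress controls.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist; replace with hosts your deployment actually trusts.
ALLOWED_IMAGE_HOSTS = frozenset({"images.internal.example"})

# Matches Markdown image syntax ![alt](url ...) and captures the URL part.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def disallowed_image_urls(prompt, allowed_hosts=ALLOWED_IMAGE_HOSTS):
    """Return http(s) image URLs in the prompt whose host is not allowlisted."""
    rejected = []
    for url in MD_IMAGE.findall(prompt):
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        # Only remote fetches matter here; inline data: URIs are left alone.
        if parsed.scheme in ("http", "https") and host not in allowed_hosts:
            rejected.append(url)
    return rejected


prompt = "What do you see? ![img](http://attacker-controlled.com/10gb-random.bin)"
if disallowed_image_urls(prompt):
    print("reject request before forwarding to TGI")
```

A filter like this only narrows the exposure at the edge; the server-side fetch remains unbounded until the upgrade to 3.3.7 is applied.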

What does CISA's SSVC say?

Decision: Track*

Exploitation: poc

Automatable: Yes

Technical Impact: partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

How is it classified?

Tags: DoS, Inference Framework, API

AML.T0006 - Active Scanning
AML.T0029 - Denial of AI Service
AML.T0034 - Cost Harvesting
AML.T0040 - AI Model Inference API Access
AML.T0049 - Exploit Public-Facing Application

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act

Article 15 - Accuracy, robustness and cybersecurity

ISO 42001

A.6.2 - AI system operational management
A.9.2 - AI System Availability and Resilience

NIST AI RMF

GOVERN 1.1 - AI risk is integrated into organizational risk management
GOVERN 1.4 - Risks associated with AI system vulnerabilities are identified and managed
MANAGE 2.2 - Mechanisms to detect, respond to, and recover from AI system failures

OWASP LLM Top 10

LLM04 - Model Denial of Service

What is the AI security impact?

Affected AI Architectures

multimodal/VLM inference serving, LLM inference servers, model serving, self-hosted AI APIs, AI agent frameworks with vision capabilities

MITRE ATLAS Techniques

AML.T0006 Active Scanning

AML.T0029 Denial of AI Service

AML.T0034 Cost Harvesting

AML.T0040 AI Model Inference API Access

AML.T0049 Exploit Public-Facing Application

Compliance Controls Affected

EU AI Act: Article 15

ISO 42001: A.6.2, A.9.2

NIST AI RMF: GOVERN 1.1, GOVERN 1.4, MANAGE 2.2

OWASP LLM Top 10: LLM04

What are the technical details?

Original Advisory

A vulnerability in huggingface/text-generation-inference version 3.3.6 allows unauthenticated remote attackers to exploit unbounded external image fetching during input validation in VLM mode. The issue arises when the router scans inputs for Markdown image links and performs a blocking HTTP GET request, reading the entire response body into memory and cloning it before decoding. This behavior can lead to resource exhaustion, including network bandwidth saturation, memory inflation, and CPU overutilization. The vulnerability is triggered even if the request is later rejected for exceeding token limits. The default deployment configuration, which lacks memory usage limits and authentication, exacerbates the impact, potentially crashing the host machine. The issue is resolved in version 3.3.7.
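
The core flaw is reading a remote body of unknown size fully into memory. A common remedy for this class of bug is to read in chunks and abort once a byte cap is exceeded. The sketch below illustrates that pattern in Python; it is not the actual 3.3.7 patch (the TGI router is not written in Python), and the names `read_capped` and `BodyTooLarge` as well as the cap values are hypothetical.

```python
import io


class BodyTooLarge(Exception):
    """Raised when a stream yields more bytes than the configured cap."""


def read_capped(stream, max_bytes, chunk_size=64 * 1024):
    """Read a binary stream fully, but never buffer more than max_bytes."""
    chunks = []
    total = 0
    while True:
        # Ask for at most one byte past the cap so an overflow is detected
        # without buffering the rest of an oversized body.
        chunk = stream.read(min(chunk_size, max_bytes - total + 1))
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise BodyTooLarge(f"response body exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# Demo with in-memory streams; an HTTP response object exposing read()
# (for example the one returned by urllib.request.urlopen) could be
# passed in the same way.
print(len(read_capped(io.BytesIO(b"x" * 1000), max_bytes=1024)))
try:
    read_capped(io.BytesIO(b"x" * 5000), max_bytes=1024)
except BodyTooLarge as exc:
    print("rejected:", exc)
```

Pairing a cap like this with a request timeout and performing the fetch only after cheap validation has passed would address the remaining points the advisory raises.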

Exploitation Scenario

An attacker identifies a publicly accessible TGI endpoint running in VLM mode — discoverable via Shodan or by probing common ports (8080, 8000) with the /info endpoint. They craft a POST to /generate containing a prompt with a Markdown image reference: `What do you see? ![img](http://attacker-controlled.com/10gb-random.bin)`. The TGI router parses the Markdown, initiates a blocking HTTP GET to the attacker's server, and streams the full 10GB response into memory before any token-limit validation occurs. The attacker runs this concurrently from multiple IPs or even a single client with multiple threads. Within seconds to minutes (depending on bandwidth), the TGI process exhausts available RAM, triggering OOM kills and crashing the inference service — with no authentication required and no prior knowledge of the model or API needed.
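
The "seconds to minutes" estimate can be checked with simple arithmetic. The sketch below is a rough model only: it assumes fetched bytes accumulate in RAM at the full inbound link rate, and the 64 GiB and 10 Gbit/s figures are hypothetical examples, not values from the advisory. The factor of two reflects the advisory's statement that the body is cloned after being read.

```python
def seconds_to_fill(free_ram_gib: float, inbound_gbit_per_s: float) -> float:
    """Seconds for fetched bodies to fill free RAM at a given inbound rate."""
    bytes_per_second = inbound_gbit_per_s * 1e9 / 8
    return free_ram_gib * 2**30 / bytes_per_second


def peak_bytes_per_request(body_bytes: int) -> int:
    """Peak memory for one fetch: the buffered body plus its clone."""
    return 2 * body_bytes


# Hypothetical host: 64 GiB of free RAM behind a 10 Gbit/s link.
print(round(seconds_to_fill(64, 10)), "seconds to exhaust RAM")
# A single 10 GB body can need about 20 GB once it has been cloned.
print(peak_bytes_per_request(10 * 10**9), "bytes at peak")
```

On slower links the same model gives minutes rather than seconds, which is consistent with the range stated in the scenario.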

Weaknesses (CWE)

CWE-400 Uncontrolled Resource Consumption (Primary)

CWE-400 — Uncontrolled Resource Consumption: The product does not properly control the allocation and maintenance of a limited resource.

[Architecture and Design] Design throttling mechanisms into the system architecture. The best protection is to limit the amount of resources that an unauthorized user can cause to be expended. A strong authentication and access control model will help prevent such attacks from occurring in the first place. The login application should be protected against DoS attacks as much as possible. Limiting the database access, perhaps by caching result sets, can help minimize the resources expended. To further limit the potential for a DoS attack, consider tracking the rate of requests received from users and blocking requests that exceed a defined rate threshold.
[Architecture and Design] Mitigation of resource exhaustion attacks requires that the target system either: recognizes the attack and denies that user further access for a given amount of time, or uniformly throttles all requests in order to make it more difficult to consume resources more quickly than they can again be freed. The first of these solutions is an issue in itself though, since it may allow attackers to prevent the use of the system by a particular valid user. If the attacker impersonates the valid user, they may be able to prevent the user from accessing the server in question. The second solution is simply difficult to effectively institute -- and even when properly done, it does not provide a full solution. It simply makes the attack require more resources on the part of the attacker.
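
The rate-tracking mitigation described above can be sketched as a per-client token bucket. This is a generic illustration, not part of TGI or the CWE text; the class names, the rate and capacity values, and the idea of keying on a client identifier such as a source IP are all assumptions.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill based on elapsed time, never exceeding the bucket capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientLimiter:
    """Keep one bucket per client identifier (for example, a source IP)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        # Note: unbounded in this sketch; real use would evict idle clients.
        self.buckets = {}

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()


limiter = PerClientLimiter(rate=1.0, capacity=2)
print([limiter.allow("203.0.113.7") for _ in range(3)])
```

As the CWE text notes, throttling raises the cost of an attack without eliminating it: an attacker spread across many source addresses gets one bucket per address.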

Source: MITRE CWE corpus.