CVE-2026-54235: vLLM NaN/Inf bypass crashes GPU

Q: Is CVE-2026-54235 actively exploited?

No confirmed active exploitation of CVE-2026-54235 has been reported, but organizations should still patch proactively.

Q: How to fix CVE-2026-54235?

1. Patch: apply the fix in PR #45116 / commit d598d239737, which adds math.isfinite(self.temperature) validation to _verify_args() and returns HTTP 400 for non-finite values.
2. Gateway workaround (pre-patch): add input validation at the API ingestion layer. Frameworks such as FastAPI with Pydantic, or nginx with Lua, can reject temperature values that fail isfinite() checks before the request reaches vLLM.
3. Detection: monitor inference worker process restarts and CUDA error logs for unusual crash spikes; a burst of CUDA softmax errors or rapid worker recycling is the primary telemetry signal for exploitation attempts.
4. Prioritize exposure: audit all public or untrusted-user-accessible vLLM endpoints, gate those first, and treat internal multi-tenant deployments as equally urgent given the shared-worker impact.
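The gateway workaround can be sketched with the Python standard library alone. This is a minimal illustration, not FastAPI, Pydantic, or vLLM code: it assumes the request body is a JSON object with a top-level "temperature" field, and the names NonFiniteValueError and validate_request_body are hypothetical.

```python
import json
import math


class NonFiniteValueError(ValueError):
    """Raised when a request body carries NaN, Infinity, or -Infinity."""


def _reject_constant(name: str) -> float:
    # json.loads calls this hook for the bare tokens NaN, Infinity, -Infinity.
    raise NonFiniteValueError(f"non-finite JSON constant: {name}")


def validate_request_body(raw_body: str) -> dict:
    """Parse a JSON request body and reject a non-finite temperature.

    Hypothetical gateway helper; the field name "temperature" is assumed
    to sit at the top level of the JSON object.
    """
    payload = json.loads(raw_body, parse_constant=_reject_constant)
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    temperature = payload.get("temperature")
    # A literal such as 1e999 is syntactically valid JSON but overflows to
    # float('inf') when parsed, so the parsed value needs its own check.
    if isinstance(temperature, float) and not math.isfinite(temperature):
        raise NonFiniteValueError("temperature must be a finite number")
    return payload
```

A gateway would typically map NonFiniteValueError to an HTTP 400 response. The second check matters because rejecting the NaN and Infinity tokens alone would still let an overflowing numeric literal through.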

Q: What systems are affected by CVE-2026-54235?

This vulnerability affects the following AI/ML architecture patterns: model serving, LLM inference APIs, multi-tenant inference platforms, AI gateway deployments, self-hosted LLM deployments.

Q: What is the CVSS score for CVE-2026-54235?

No CVSS score has been assigned yet.

CISO Take

A validation bypass in vLLM's temperature parameter handling allows any caller with API access to crash GPU inference workers by passing NaN or positive Infinity as the temperature value. Under Python's IEEE 754 float semantics, the comparison guards in sampling_params.py silently evaluate to False for both values, so the invalid temperature reaches the CUDA kernels, where it triggers undefined behavior. vLLM has 130 downstream dependents, so any vLLM-based serving infrastructure, including self-hosted LLM APIs, multi-tenant inference platforms, and derivative middleware, faces service disruption from a single malformed request. The EPSS estimate is currently low (higher than 12% of all CVEs) and no public exploit or KEV listing exists yet, but the attack requires no special knowledge beyond the parameter name and is effective against all versions up to and including 0.23.0. Remediate by deploying the fix from PR #45116, which adds a math.isfinite() check, or, until that is deployed, reject non-finite float values at the API gateway before requests reach vLLM.

Sources: NVD, EPSS, GitHub Advisory, ATLAS

What is the risk?

Medium operational risk, with high exploitability relative to the skill required. The trigger is trivial: any API caller can set temperature to NaN or +Infinity, and no authentication bypass, credential theft, or advanced technique is needed. Impact is bounded to denial of service (a full inference worker crash that affects all concurrent users), with no data exfiltration or persistent system compromise. No CVSS score has been assigned yet and the EPSS estimate is low (0.0% chance of exploitation in 30 days, higher than 12% of all CVEs), so the urgency comes from how simple the trigger is rather than from observed attacker interest. Risk is most acute in multi-tenant or publicly accessible vLLM deployments, where a single request can disrupt shared serving capacity.

How does the attack unfold?

Crafted API Request

Adversary submits an inference request to the vLLM endpoint with the temperature parameter set to NaN or +Infinity. This requires no special tooling and no credentials beyond whatever access the endpoint already grants to callers.

AML.T0049

Validation Bypass

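Under Python's IEEE 754 float semantics, both comparison guards in sampling_params.py (lines 384 and 462) evaluate to False for NaN and +Infinity: NaN compares False against every value, and +Infinity is neither below _MAX_TEMP nor below 0.0. The non-finite value therefore passes validation unchecked.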

AML.T0043.003

GPU Kernel Crash

The non-finite temperature propagates to the CUDA sampling kernel, which receives an invalid softmax input and triggers undefined behavior or a hard CUDA error that kills the inference worker process.

Service Disruption

All concurrent users sharing the inference worker lose service until the process is manually restarted, completing the denial-of-service impact with a repeatable, low-cost mechanism.

AML.T0029
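
The validation-bypass step can be reproduced in isolation. The sketch below evaluates the two guard expressions quoted in the advisory for a valid value and for the non-finite ones. The value of _MAX_TEMP is an assumed placeholder (the advisory names the constant but not its value), and guard_results is a hypothetical helper, not a vLLM function.

```python
import math

# Assumed placeholder: the advisory names _MAX_TEMP but does not give its value.
_MAX_TEMP = 1e-2


def guard_results(temperature: float) -> tuple:
    """Evaluate the two comparison guards quoted in the advisory."""
    first_guard = 0 < temperature < _MAX_TEMP  # expression at sampling_params.py:384
    second_guard = temperature < 0.0           # expression at sampling_params.py:462
    return first_guard, second_guard


for value in (1.0, float("nan"), float("inf"), float("-inf")):
    # Columns: input, (first guard, second guard), math.isfinite(input)
    print(value, guard_results(value), math.isfinite(value))
```

NaN and +Infinity yield the same (False, False) pair as the valid temperature 1.0, so the guards cannot tell them apart; only -Infinity trips the second guard. math.isfinite() separates all three non-finite values from the valid one.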

What systems are affected?

Package	Ecosystem	Vulnerable Range	Patched
vLLM	pip	<= 0.23.0	No patch
82.8K, 130 dependents, pushed 3d ago, 35% patched, ~30d to patch

If you run vLLM 0.23.0 or earlier, you are affected.

How severe is it?

CVSS 3.1: N/A

EPSS: 0.0% chance of exploitation in 30 days; higher than 12% of all CVEs (Source: EPSS v3, FIRST.org)

Exploitation Status: No known exploitation

Sophistication: Trivial

How is it classified?

DoS, Inference

AML.T0029 - Denial of AI Service

AML.T0043.003 - Manual Modification

AML.T0049 - Exploit Public-Facing Application

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act

Art. 9 - Risk management system — robustness and resilience

ISO 42001

8.4 - AI system development — data quality and input validation

NIST AI RMF

MG-2.2 - Mechanisms to sustain AI system value and manage risks

OWASP LLM Top 10

LLM04 - Model Denial of Service

What is the AI security impact?

Affected AI Architectures

model serving, LLM inference APIs, multi-tenant inference platforms, AI gateway deployments, self-hosted LLM deployments

MITRE ATLAS Techniques

AML.T0029 Denial of AI Service

AML.T0043.003 Manual Modification

AML.T0049 Exploit Public-Facing Application

Compliance Controls Affected

EU AI Act: Art. 9

ISO 42001: 8.4

NIST AI RMF: MG-2.2

OWASP LLM Top 10: LLM04

What are the technical details?

Original Advisory

## Summary

All temperature validation gates use comparison operators (`<`, `>`), which silently evaluate to `False` for `NaN` and for positive `Infinity` in Python's IEEE 754 float semantics. Both values pass every guard and propagate to GPU sampling kernels, where they produce undefined behavior or CUDA errors that can crash the inference worker. Note: `-Infinity` is correctly caught.

## Root Cause

`sampling_params.py:384`:

```python
if 0 < self.temperature < _MAX_TEMP:  # NaN → False; +Inf → False
```

`sampling_params.py:462`:

```python
if self.temperature < 0.0:  # NaN → False; +Inf → False
    raise VLLMValidationError(...)
```

No `math.isnan()` or `math.isinf()` check exists anywhere in `sampling_params.py`.

Python semantics (verified): `float('nan') < 0.0` → `False`, `float('inf') < 0.0` → `False`.

## Impact

Crash of inference worker on GPU kernel execution with NaN/Inf softmax input, degrading service for all concurrent users.

## Remediation

Add `math.isfinite(self.temperature)` check in `_verify_args()`. Reject non-finite float values with a 400 error.

## Fix

A fix for this vulnerability was merged here: https://github.com/vllm-project/vllm/pull/45116
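
The remediation described in the advisory might look roughly like the following. This is a sketch under stated assumptions and not the code merged in PR #45116: SamplingParamsSketch and ValidationError are hypothetical stand-ins for vLLM's own classes, and only the temperature field is modeled.

```python
import math
from dataclasses import dataclass


class ValidationError(ValueError):
    """Hypothetical stand-in for the VLLMValidationError named in the advisory."""


@dataclass
class SamplingParamsSketch:
    """Hypothetical, reduced model of a sampling-parameters object."""

    temperature: float = 1.0

    def _verify_args(self) -> None:
        # Check finiteness first: NaN and +Infinity would slip past the
        # range comparison below because it evaluates to False for both.
        if not math.isfinite(self.temperature):
            raise ValidationError(
                f"temperature must be a finite number, got {self.temperature}"
            )
        if self.temperature < 0.0:
            raise ValidationError(
                f"temperature must be non-negative, got {self.temperature}"
            )
```

In a serving stack, the API layer would be expected to translate this exception into the HTTP 400 response the advisory calls for; that mapping is outside the sketch.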

Exploitation Scenario

An adversary with access to a vLLM inference API endpoint, whether via an open multi-tenant deployment, a compromised API key, or an internally exposed service, crafts a standard inference request and sets the temperature field to NaN or +Infinity (float('nan') or float('inf') in Python). Under Python's IEEE 754 semantics, both values make the comparison guards in sampling_params.py evaluate to False, so the value passes validation without raising any exception. The non-finite value is forwarded to the CUDA sampling kernel, which receives an invalid softmax input, produces undefined behavior or a hard CUDA error, and crashes the inference worker process. All serving capacity shared by concurrent users is immediately lost, and the disruption persists until the worker is restarted, which gives the adversary a repeatable, low-cost mechanism to keep the service down.
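
Because each successful trigger ends in a worker crash and restart, a burst of restarts is one usable signal for this scenario. The sketch below is a generic sliding-window counter, not part of vLLM or any monitoring product. The class name, the default thresholds, and the assumption that restart timestamps arrive in non-decreasing order (in seconds) are all illustrative.

```python
from collections import deque


class RestartBurstDetector:
    """Flags when too many worker restarts fall inside a sliding time window.

    Hypothetical helper; thresholds are placeholders to tune per deployment.
    """

    def __init__(self, max_restarts: int = 3, window_seconds: float = 300.0) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # restart timestamps, oldest first

    def record_restart(self, timestamp: float) -> bool:
        """Record one restart; return True if the window now exceeds the limit."""
        self._restarts.append(timestamp)
        # Drop restarts that have aged out of the window.
        while self._restarts and timestamp - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts
```

A process supervisor or log shipper would call record_restart() each time it observes a worker restart and raise an alert when the method returns True.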

Weaknesses (CWE)

CWE-1287: Improper Validation of Specified Type of Input (Primary)

CWE-1287 — Improper Validation of Specified Type of Input: The product receives input that is expected to be of a certain type, but it does not validate or incorrectly validates that the input is actually of the expected type.

[Implementation] Assume all input is malicious. Use an "accept known good" input validation strategy, i.e., use a list of acceptable inputs that strictly conform to specifications. Reject any input that does not strictly conform to specifications, or transform it into something that does. When performing input validation, consider all potentially relevant properties, including length, type of input, the full range of acceptable values, missing or extra inputs, syntax, consistency across related fields, and conformance to business rules. As an example of business rule logic, "boat" may be syntactically valid because it only contains alphanumeric characters, but it is not valid if the input is only expected to contain colors such as "red" or "blue." Do not rely exclusively on looking for malicious or malformed inputs. This is likely to miss at least one undesirable input, especially if the code's environment changes. This can give attackers enough room to bypass the intended validation. However, denylis

Source: MITRE CWE corpus.