CVE-2026-53923: vLLM integer truncation leaks GPU memory

CISO Take

A silent integer truncation bug in vLLM's GGUF dequantize CUDA kernels leaves output tensors only partially initialized, so the remainder holds stale GPU memory from prior operations. In multi-tenant inference deployments, that memory may contain other users' tensor data. An attacker needs only to publish a GGUF model file with tensor dimensions whose product exceeds INT_MAX: no error is thrown, no warning is logged, and contaminated tensors pass undetected through all downstream computation. vLLM has 130 downstream dependents. The EPSS score is currently low (0.0%, higher than only 13% of all CVEs) and no exploitation is known, but any team running shared vLLM inference infrastructure or loading externally sourced GGUF models still faces cross-tenant data leakage exposure. No patched release is published yet, so apply the upstream fix from PR #44971 (commit f219788f) and validate all GGUF model files for oversized tensor dimensions before loading them in production.

Sources: NVD, EPSS, GitHub Advisory, ATLAS

What is the risk?

Medium CVE severity, but elevated contextual risk for multi-tenant AI inference environments. The attack surface is realistic: exploitation requires only publishing a malicious GGUF model file, or social-engineering a victim into loading one, which is plausible given how widely models are sourced from public hubs. Once triggered, the flaw is silent: the dequantize kernel processes a truncated element count, and the uninitialized remainder of the GPU buffer propagates through model computation. Single-tenant deployments with trusted model sources face low risk; shared inference platforms and model-as-a-service providers running GGUF-quantized models face genuine confidentiality exposure. No patched release is currently published. The fix exists only as a commit on the upstream repository, so users must apply or cherry-pick the patch manually.

How does the attack unfold?

Supply Chain Staging

Attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds INT_MAX (e.g., shape [65536, 65536] = 4.29B elements) and publishes it to a public model hub such as HuggingFace.

AML.T0010.003

Model Loading

Victim's vLLM multi-tenant inference server loads the malicious GGUF model, triggering the vulnerable dequantize CUDA kernel path (ggml_dequantize or ggml_mul_mat_*) during weight initialization.

AML.T0011.000

Silent Integer Truncation

The 64-bit tensor size (m*n) is silently narrowed to a 32-bit int, which keeps only the low 32 bits (exactly 0 for the [65536, 65536] example); the CUDA kernel fills at most that many elements of the torch::empty output buffer while the remainder retains raw GPU memory from prior operations.

AML.T0049

Cross-Tenant Data Leakage

The contaminated output tensor — containing residual GPU memory from concurrent users' inference sessions — propagates silently through all downstream model computation, enabling potential recovery of other tenants' prompt content or activation data.

AML.T0025
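
The truncation step can be checked with plain integer arithmetic. The sketch below is host-side C++ only, not vLLM code; `truncate_to_int32` and `unwritten_elements` are hypothetical helper names, and the sketch assumes a kernel told to process a non-positive count writes nothing.

```cpp
#include <cassert>
#include <cstdint>

// Mimics passing an int64_t element count to a parameter declared `int k`.
// Going through uint32_t makes the modulo-2^32 wrap explicit; the final
// conversion to a signed type wraps on mainstream compilers (and is
// guaranteed to wrap from C++20 onward).
inline std::int32_t truncate_to_int32(std::int64_t value) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(value));
}

// Number of output elements left unwritten when a kernel processes only the
// truncated count `k` out of the `m * n` allocated elements.
// Assumes m * n itself fits in int64_t.
inline std::int64_t unwritten_elements(std::int64_t m, std::int64_t n) {
    const std::int64_t total = m * n;
    const std::int32_t k = truncate_to_int32(total);
    const std::int64_t written = k > 0 ? k : 0;  // assumption: k <= 0 means no work
    return total - written;
}
```

For the [65536, 65536] shape the product is exactly 2^32, so the low 32 bits are all zero and every element of the output buffer is left unwritten.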

What systems are affected?

Package	Ecosystem	Vulnerable Range	Patched
vLLM	pip	>= 0.5.5, <= 0.23.0	No patch
82.8K | 130 dependents | Pushed 3d ago | 35% patched | ~30d to patch

Do you use vLLM? If you run a version in the range above and load GGUF models, you are affected.

How severe is it?

CVSS 3.1

N/A

EPSS

0.0%

chance of exploitation in 30 days

Higher than 13% of all CVEs

Source: EPSS v3 — FIRST.org

Exploitation Status

No known exploitation

Sophistication

Moderate

What should I do?

5 steps

1. Apply the upstream fix from PR #44971 (commit f219788f91952827132fa4fdf916427cd20d225e), which changes the int k parameter to int64_t in to_cuda_ggml_t and all derived dequantize functions.
2. Until patched, implement a pre-load validation layer that rejects any GGUF model file containing weight tensor dimensions whose product exceeds INT_MAX (2,147,483,647).
3. Audit all GGUF models currently loaded in production for weight matrices with shapes such as [65536, 65536] or any m×n > 2.1B configuration.
4. For multi-tenant deployments, consider process-level isolation for GGUF model loading with dedicated GPU memory allocations pending the patch.
5. Enforce model provenance controls: only load GGUF files from cryptographically attested, internal, or verified sources, and disable loading from arbitrary public model hubs in production.
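
The pre-load validation described in the steps above could look roughly like this. It is a minimal sketch: `TensorInfo` is a hypothetical, simplified stand-in for a tensor entry already parsed from a GGUF header, and reading the file itself is out of scope here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical, simplified view of one tensor entry parsed from a GGUF header.
struct TensorInfo {
    std::string name;
    std::vector<std::uint64_t> dims;
};

// Returns true if the product of all dimensions fits in a signed 32-bit int.
// The division-based check avoids overflowing the running product itself.
// It is conservative: a huge dimension is rejected even if a later one is 0.
inline bool element_count_fits_int32(const std::vector<std::uint64_t>& dims) {
    constexpr std::uint64_t kLimit = std::numeric_limits<std::int32_t>::max();
    std::uint64_t product = 1;
    for (std::uint64_t d : dims) {
        if (d != 0 && product > kLimit / d) {
            return false;  // product * d would exceed INT_MAX
        }
        product *= d;
    }
    return true;
}

// Collects the names of tensors a loader should refuse before dequantization.
inline std::vector<std::string> oversized_tensors(const std::vector<TensorInfo>& tensors) {
    std::vector<std::string> rejected;
    for (const TensorInfo& t : tensors) {
        if (!element_count_fits_int32(t.dims)) {
            rejected.push_back(t.name);
        }
    }
    return rejected;
}
```

A loader would refuse the whole file if `oversized_tensors` returns anything. Checking the product also catches any single dimension above INT_MAX, which covers call sites that narrow one dimension rather than `m * n`.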

How is it classified?

Data Extraction Supply Chain Privacy Violation Inference Model
AML.T0010.003 - Model
AML.T0011.000 - Unsafe AI Artifacts
AML.T0025 - Exfiltration via Cyber Means
AML.T0049 - Exploit Public-Facing Application

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act

Article 15 - Accuracy, robustness and cybersecurity

ISO 42001

A.10.1 - Supply chain management
A.6.2 - AI system design and development

NIST AI RMF

MANAGE 2.2 - Mechanisms are in place and applied to sustain the value of deployed AI systems

OWASP LLM Top 10

LLM05:2025 - Supply Chain Vulnerabilities
LLM06:2025 - Sensitive Information Disclosure

Frequently Asked Questions

What is CVE-2026-53923?

CVE-2026-53923 is a silent integer truncation bug in vLLM's GGUF dequantize CUDA kernels. Output tensors are only partially initialized, and the remainder holds stale GPU memory from prior operations, which in multi-tenant inference deployments may contain other users' tensor data. An attacker needs only to publish a GGUF model file with tensor dimensions whose product exceeds INT_MAX: no error is thrown, no warning is logged, and contaminated tensors pass undetected through all downstream computation. EPSS currently rates exploitation likelihood as low (higher than only 13% of all CVEs), but with 130 downstream dependents, any team running shared vLLM inference infrastructure or loading externally sourced GGUF models faces cross-tenant data leakage exposure. Apply the upstream fix from PR #44971 (commit f219788f) and validate all GGUF model files for oversized tensor dimensions before loading them in production.

Is CVE-2026-53923 actively exploited?

No confirmed active exploitation of CVE-2026-53923 has been reported, but organizations should still patch proactively.

How do I fix CVE-2026-53923?

1. Apply the upstream fix from PR #44971 (commit f219788f91952827132fa4fdf916427cd20d225e), which changes the int k parameter to int64_t in to_cuda_ggml_t and all derived dequantize functions.
2. Until patched, implement a pre-load validation layer that rejects any GGUF model file containing weight tensor dimensions whose product exceeds INT_MAX (2,147,483,647).
3. Audit all GGUF models currently loaded in production for weight matrices with shapes such as [65536, 65536] or any m×n > 2.1B configuration.
4. For multi-tenant deployments, consider process-level isolation for GGUF model loading with dedicated GPU memory allocations pending the patch.
5. Enforce model provenance controls: only load GGUF files from cryptographically attested, internal, or verified sources, and disable loading from arbitrary public model hubs in production.

What systems are affected by CVE-2026-53923?

This vulnerability affects the following AI/ML architecture patterns: multi-tenant LLM inference, model serving, GGUF model pipelines, quantized model deployment, shared GPU inference clusters.

What is the CVSS score for CVE-2026-53923?

No CVSS score has been assigned yet.

What is the AI security impact?

Affected AI Architectures

multi-tenant LLM inference, model serving, GGUF model pipelines, quantized model deployment, shared GPU inference clusters

MITRE ATLAS Techniques

AML.T0010.003 Model

AML.T0011.000 Unsafe AI Artifacts

AML.T0025 Exfiltration via Cyber Means

AML.T0049 Exploit Public-Facing Application

Compliance Controls Affected

EU AI Act: Article 15

ISO 42001: A.10.1, A.6.2

NIST AI RMF: MANAGE 2.2

OWASP LLM Top 10: LLM05:2025, LLM06:2025

What are the technical details?

Original Advisory

## Summary

Integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (`csrc/quantization/gguf/gguf_kernel.cu`) causes partial tensor processing. The output tensor is allocated at full size via `torch::empty` (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, constituting information disclosure.

## Root Cause

The `to_cuda_ggml_t` function pointer type at `ggml-common.h:1067` declares its element count parameter as `int` (32-bit):

```cpp
using to_cuda_ggml_t = void (*)(const void * __restrict__ x,
                                dst_t * __restrict__ y,
                                int k,  // 32-bit
                                cudaStream_t stream);
```

All dequantize kernel functions (`dequantize_block_cuda`, `dequantize_row_q2_K_cuda`, etc. in `dequantize.cuh`) inherit this `int k` parameter and use it as the kernel launch grid size:

```cpp
static void dequantize_block_cuda(..., const int k, cudaStream_t stream) {
    const int num_blocks = (k + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE);
    dequantize_block<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k);
}
```

In `ggml_dequantize()` at `gguf_kernel.cu:85`, the caller passes `m * n` (an `int64_t` product) to this `int k` parameter:

```cpp
at::Tensor DW = torch::empty({m, n}, options);  // line 80: full-size, UNINITIALIZED
// ...
to_cuda((void*)W.data_ptr(), (scalar_t*)DW.data_ptr(), m * n, stream);  // line 85: m*n truncated to int
```

When `m * n > INT_MAX`, the truncated `k` is smaller than the actual tensor size. The kernel processes `k` elements. The remaining `(m * n) - k` elements in `DW` are never written and contain stale GPU memory.

This is a single root cause -- the `int` type on the `k` parameter in `to_cuda_ggml_t` -- with a single fix: change `int k` to `int64_t k`. All dequantize functions inherit this type through the same typedef.

## Affected Functions

All in `csrc/quantization/gguf/gguf_kernel.cu`:

| Function | Line | Allocation | Info Disclosure? |
|----------|------|-----------|-----------------|
| `ggml_dequantize` | 74 | `torch::empty({m, n})` at line 80 | Yes -- `m*n` truncated to `int k` at line 85 |
| `ggml_mul_mat_vec_a8` | 91 | `torch::empty({vecs, row})` at line 99 | Yes -- `int col = X.sizes()[1]` at line 94 |
| `ggml_mul_mat_a8` | 207 | `torch::empty({batch, row})` at line 215 | Yes -- `int col = X.sizes()[1]` at line 210 |
| `ggml_moe_a8` | 279 | `torch::empty({tokens*top_k, row})` at line 289 | Yes -- `int col = X.sizes()[1]` at line 285 |

All four functions allocate output tensors with `torch::empty` (uninitialized) and then run CUDA kernels that use truncated dimension values as loop bounds. The unfilled portion of each output tensor retains stale GPU memory.

`ggml_moe_a8_vec` (line 382) uses `torch::zeros` instead of `torch::empty`, so it is not affected by the info disclosure variant.

## Impact: Information Disclosure in Multi-Tenant Serving

vLLM is designed for multi-tenant inference serving. GPU memory is reused across requests from different users. When the dequantize kernel partially fills an output tensor:

1. The output tensor `DW` is allocated with `torch::empty` -- the buffer contains whatever was previously in that GPU memory region
2. The dequantize kernel fills only a truncated portion of the buffer
3. The unfilled portion retains residual data from prior GPU operations, which may include tensor data from other users' inference requests
4. The contaminated tensor proceeds through the model computation
5. No error or warning is generated -- the partial fill is silent

This is a confidentiality violation. In shared inference deployments (the primary vLLM use case), one user's inference data can leak into another user's model computation through residual GPU memory.

## Attacker Control

The attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds `INT_MAX` (e.g., a matrix with shape `[65536, 65536]` gives `m * n = 4,294,967,296`). The model is hosted on HuggingFace or any model hub. The victim loads the model with vLLM for inference serving. The truncation happens automatically during model weight dequantization.

## Fix

A fix for this vulnerability was added here: https://github.com/vllm-project/vllm/pull/44971
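
To illustrate why widening `k` matters for the launch-grid expression quoted in the advisory, here is a host-side sketch in plain C++. It is not the upstream patch: `kDequantizeBlockSize` is an assumed value standing in for `CUDA_DEQUANTIZE_BLOCK_SIZE`, and `num_blocks` and `checked_narrow_to_int` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Assumed value for illustration only; the real constant is defined in the
// vLLM GGUF kernel sources.
constexpr std::int64_t kDequantizeBlockSize = 256;

// Ceil-divides the element count by the elements handled per block
// (2 * block size, matching the expression quoted in the advisory),
// computed in 64-bit arithmetic.
inline std::int64_t num_blocks(std::int64_t k) {
    const std::int64_t per_block = 2 * kDequantizeBlockSize;
    return (k + per_block - 1) / per_block;
}

// Defensive option for call sites that must keep an `int` parameter:
// refuse to narrow instead of silently truncating.
inline int checked_narrow_to_int(std::int64_t value) {
    if (value < 0 || value > std::numeric_limits<int>::max()) {
        throw std::overflow_error("element count does not fit in int");
    }
    return static_cast<int>(value);
}
```

With a 64-bit `k`, the [65536, 65536] example asks for 2^32 / 512 = 8,388,608 blocks under the assumed block size, whereas a `k` truncated to 0 asks for none. A checked narrowing like the one above turns the silent partial fill into a visible error.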

Exploitation Scenario

An adversary targeting a shared vLLM inference platform publishes a GGUF model to HuggingFace with a weight tensor shaped [65536, 65536], yielding m*n = 4,294,967,296, which is exactly 2^32 and roughly twice INT_MAX. A victim organization loads this model onto its multi-tenant GPU cluster serving multiple enterprise clients. During GGUF weight dequantization at model load time, the int64 product is silently narrowed to a 32-bit int; for this shape the low 32 bits are all zero, so the truncated k is 0. The CUDA kernel fills essentially none of the output buffer (~16GB if elements are 4 bytes wide), leaving it populated with whatever earlier GPU allocations, including those of active user sessions, left behind. The contaminated weight tensor is then used in all subsequent inference matrix multiplications. A co-located adversary submitting concurrent inference requests may craft queries that amplify or surface fragments of this residual data, potentially recovering portions of other users' prompt text, token embeddings, or intermediate activations without any authentication bypass or elevated privilege.

Weaknesses (CWE)

CWE-200 Exposure of Sensitive Information to an Unauthorized Actor (Primary)
CWE-681 Incorrect Conversion between Numeric Types (Primary)

CWE-200 — Exposure of Sensitive Information to an Unauthorized Actor: The product exposes sensitive information to an actor that is not explicitly authorized to have access to that information.

[Architecture and Design] Compartmentalize the system to have "safe" areas where trust boundaries can be unambiguously drawn. Do not allow sensitive data to go outside of the trust boundary and always be careful when interfacing with a compartment outside of the safe area. Ensure that appropriate compartmentalization is built into the system design, and the compartmentalization allows for and reinforces privilege separation functionality. Architects and designers should rely on the principle of least privilege to decide the appropriate time to use privileges and the time to drop privileges.

Source: MITRE CWE corpus.