A silent integer truncation bug in vLLM's GGUF dequantize CUDA kernels leaves output tensors only partially initialized. The unwritten remainder holds stale GPU memory from prior operations, which in multi-tenant inference deployments may include other users' tensor data. To trigger it, an attacker only needs to publish a GGUF model file with tensor dimensions whose product exceeds INT_MAX: no error is thrown, no warning is logged, and contaminated tensors pass undetected through all downstream computation. With 130 downstream dependents and an EPSS score in the 87th percentile for exploitation likelihood, any team running shared vLLM inference infrastructure or loading externally sourced GGUF models faces real cross-tenant data leakage exposure. Apply the upstream fix from PR #44971 (commit f219788f) immediately, and check all GGUF model files for oversized tensor dimensions before loading them in production.
What is the risk?
Medium CVE severity, but elevated contextual risk for multi-tenant AI inference environments. The attack surface is realistic: exploitation requires only publishing a malicious GGUF model file, or social-engineering a victim into loading one, which is plausible given how widely models are sourced from public hubs. Once triggered, the vulnerability is silent: the dequantize kernel processes a truncated element count, and the uninitialized remainder of the GPU buffer propagates through model computation. Single-tenant deployments with trusted model sources face low risk; shared inference platforms and model-as-a-service providers running GGUF-quantized models face genuine confidentiality exposure. No patched release is currently published. The fix exists only as a commit in the upstream repository, so users must apply or cherry-pick the patch manually.
How does the attack unfold?
What systems are affected?
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| vLLM | pip | >= 0.5.5, <= 0.23.0 | No patch |
Do you use vLLM? You're affected.
How severe is it?
What should I do?
1. Apply the upstream fix from PR #44971 (commit f219788f91952827132fa4fdf916427cd20d225e), which changes the `int k` parameter to `int64_t` in `to_cuda_ggml_t` and all derived dequantize functions.
2. Until patched, implement a pre-load validation layer that rejects any GGUF model file containing a weight tensor whose dimension product exceeds INT_MAX (2,147,483,647).
3. Audit all GGUF models currently loaded in production for weight matrices with shapes such as [65536, 65536], or any shape whose element count m × n exceeds INT_MAX (about 2.1 billion).
4. For multi-tenant deployments, consider process-level isolation for GGUF model loading, with dedicated GPU memory allocations, pending the patch.
5. Enforce model provenance controls: only load GGUF files from cryptographically attested, internal, or verified sources, and disable loading from arbitrary public model hubs in production.
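The pre-load validation step can be sketched as a shape check that runs before any tensor reaches the dequantize kernels. This is a minimal illustration, not vLLM code: `checked_numel` and `shape_fits_int_kernels` are hypothetical names, and the sketch assumes the tensor shapes have already been read from the GGUF header by some other component.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not part of vLLM): multiplies the dimensions of a shape
// with overflow checking. Returns false if a dimension is negative or the
// product does not fit in int64_t; otherwise stores the product in `numel`.
bool checked_numel(const std::vector<std::int64_t>& shape, std::int64_t& numel) {
    constexpr std::int64_t kMax64 = std::numeric_limits<std::int64_t>::max();
    std::int64_t total = 1;
    for (std::int64_t dim : shape) {
        if (dim < 0) return false;
        if (dim != 0 && total > kMax64 / dim) return false;  // product would overflow
        total *= dim;
    }
    numel = total;
    return true;
}

// Hypothetical policy check: accept a tensor only if its element count fits in
// a 32-bit signed int, the type the vulnerable kernels use for `k`.
bool shape_fits_int_kernels(const std::vector<std::int64_t>& shape) {
    std::int64_t numel = 0;
    return checked_numel(shape, numel) &&
           numel <= std::numeric_limits<int>::max();
}
```

A loader could call `shape_fits_int_kernels` on every weight tensor and refuse the whole file on the first failure. The overflow check matters because a hostile file could also choose dimensions whose product wraps around in 64-bit arithmetic and appears small.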
Frequently Asked Questions
What is CVE-2026-53923?
CVE-2026-53923 is a silent integer truncation bug in vLLM's GGUF dequantize CUDA kernels. When a GGUF model contains a tensor whose dimension product exceeds INT_MAX, the kernels write only part of the output tensor, and the rest retains stale GPU memory that, in multi-tenant inference deployments, may include other users' tensor data. No error is thrown and no warning is logged. The upstream fix is in PR #44971 (commit f219788f).
Is CVE-2026-53923 actively exploited?
No confirmed active exploitation of CVE-2026-53923 has been reported, but organizations should still patch proactively.
How to fix CVE-2026-53923?
1. Apply the upstream fix from PR #44971 (commit f219788f91952827132fa4fdf916427cd20d225e), which changes the `int k` parameter to `int64_t` in `to_cuda_ggml_t` and all derived dequantize functions.
2. Until patched, implement a pre-load validation layer that rejects any GGUF model file containing a weight tensor whose dimension product exceeds INT_MAX (2,147,483,647).
3. Audit all GGUF models currently loaded in production for weight matrices with shapes such as [65536, 65536], or any shape whose element count m × n exceeds INT_MAX (about 2.1 billion).
4. For multi-tenant deployments, consider process-level isolation for GGUF model loading, with dedicated GPU memory allocations, pending the patch.
5. Enforce model provenance controls: only load GGUF files from cryptographically attested, internal, or verified sources, and disable loading from arbitrary public model hubs in production.
What systems are affected by CVE-2026-53923?
This vulnerability affects the following AI/ML architecture patterns: multi-tenant LLM inference, model serving, GGUF model pipelines, quantized model deployment, shared GPU inference clusters.
What is the CVSS score for CVE-2026-53923?
No CVSS score has been assigned yet.
What is the AI security impact?
Affected AI Architectures
MITRE ATLAS Techniques
AML.T0010.003 Model
AML.T0011.000 Unsafe AI Artifacts
AML.T0025 Exfiltration via Cyber Means
AML.T0049 Exploit Public-Facing Application
Compliance Controls Affected
What are the technical details?
Original Advisory
## Summary

Integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (`csrc/quantization/gguf/gguf_kernel.cu`) causes partial tensor processing. The output tensor is allocated at full size via `torch::empty` (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, constituting information disclosure.

## Root Cause

The `to_cuda_ggml_t` function pointer type at `ggml-common.h:1067` declares its element count parameter as `int` (32-bit):

```cpp
using to_cuda_ggml_t = void (*)(const void * __restrict__ x,
                                dst_t * __restrict__ y,
                                int k,  // 32-bit
                                cudaStream_t stream);
```

All dequantize kernel functions (`dequantize_block_cuda`, `dequantize_row_q2_K_cuda`, etc. in `dequantize.cuh`) inherit this `int k` parameter and use it as the kernel launch grid size:

```cpp
static void dequantize_block_cuda(..., const int k, cudaStream_t stream) {
    const int num_blocks = (k + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE);
    dequantize_block<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k);
}
```

In `ggml_dequantize()` at `gguf_kernel.cu:85`, the caller passes `m * n` (an `int64_t` product) to this `int k` parameter:

```cpp
at::Tensor DW = torch::empty({m, n}, options);  // line 80: full-size, UNINITIALIZED
// ...
to_cuda((void*)W.data_ptr(), (scalar_t*)DW.data_ptr(), m * n, stream);  // line 85: m*n truncated to int
```

When `m * n > INT_MAX`, the truncated `k` is smaller than the actual tensor size. The kernel processes `k` elements. The remaining `(m * n) - k` elements in `DW` are never written and contain stale GPU memory.

This is a single root cause -- the `int` type on the `k` parameter in `to_cuda_ggml_t` -- with a single fix: change `int k` to `int64_t k`. All dequantize functions inherit this type through the same typedef.

## Affected Functions

All in `csrc/quantization/gguf/gguf_kernel.cu`:

| Function | Line | Allocation | Info Disclosure? |
|----------|------|-----------|-----------------|
| `ggml_dequantize` | 74 | `torch::empty({m, n})` at line 80 | Yes -- `m*n` truncated to `int k` at line 85 |
| `ggml_mul_mat_vec_a8` | 91 | `torch::empty({vecs, row})` at line 99 | Yes -- `int col = X.sizes()[1]` at line 94 |
| `ggml_mul_mat_a8` | 207 | `torch::empty({batch, row})` at line 215 | Yes -- `int col = X.sizes()[1]` at line 210 |
| `ggml_moe_a8` | 279 | `torch::empty({tokens*top_k, row})` at line 289 | Yes -- `int col = X.sizes()[1]` at line 285 |

All four functions allocate output tensors with `torch::empty` (uninitialized) and then run CUDA kernels that use truncated dimension values as loop bounds. The unfilled portion of each output tensor retains stale GPU memory.

`ggml_moe_a8_vec` (line 382) uses `torch::zeros` instead of `torch::empty`, so it is not affected by the info disclosure variant.

## Impact: Information Disclosure in Multi-Tenant Serving

vLLM is designed for multi-tenant inference serving. GPU memory is reused across requests from different users. When the dequantize kernel partially fills an output tensor:

1. The output tensor `DW` is allocated with `torch::empty` -- the buffer contains whatever was previously in that GPU memory region
2. The dequantize kernel fills only a truncated portion of the buffer
3. The unfilled portion retains residual data from prior GPU operations, which may include tensor data from other users' inference requests
4. The contaminated tensor proceeds through the model computation
5. No error or warning is generated -- the partial fill is silent

This is a confidentiality violation. In shared inference deployments (the primary vLLM use case), one user's inference data can leak into another user's model computation through residual GPU memory.

## Attacker Control

The attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds `INT_MAX` (e.g., a matrix with shape `[65536, 65536]` gives `m * n = 4,294,967,296`). The model is hosted on HuggingFace or any model hub. The victim loads the model with vLLM for inference serving. The truncation happens automatically during model weight dequantization.

## Fix

A fix for this vulnerability was added here: https://github.com/vllm-project/vllm/pull/44971
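The block-count formula quoted in the Root Cause section can be evaluated with both parameter widths to see what widening `k` changes. The sketch below is an illustration only and is not the upstream patch: `num_blocks_64` and `num_blocks_32` are hypothetical names, and the block size of 256 is an assumption, since the advisory does not state the value of `CUDA_DEQUANTIZE_BLOCK_SIZE`.

```cpp
#include <cstdint>

// Assumed value for illustration; the advisory does not give the real
// CUDA_DEQUANTIZE_BLOCK_SIZE.
constexpr std::int64_t kAssumedBlockSize = 256;

// Block count from the advisory's formula, evaluated entirely in 64-bit
// arithmetic, as it would be once `k` is an int64_t.
std::int64_t num_blocks_64(std::int64_t k) {
    return (k + 2 * kAssumedBlockSize - 1) / (2 * kAssumedBlockSize);
}

// Same formula after first narrowing `k` to 32 bits, mimicking the vulnerable
// `int k` parameter. The narrowing keeps the low 32 bits, which is what
// mainstream two's-complement compilers do. The division itself is done in
// 64-bit here so this sketch never hits signed overflow.
std::int64_t num_blocks_32(std::int64_t k) {
    const int k32 = static_cast<int>(static_cast<std::uint32_t>(k));
    return num_blocks_64(static_cast<std::int64_t>(k32));
}
```

For the `[65536, 65536]` example, `num_blocks_64(65536LL * 65536)` gives 8,388,608 blocks under the assumed block size, while `num_blocks_32` of the same value gives 0 because the narrowed count is 0.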
Exploitation Scenario
An adversary targeting a shared vLLM inference platform publishes a GGUF model to HuggingFace with a weight tensor shaped [65536, 65536], yielding m*n = 4,294,967,296 (2^32), roughly twice INT_MAX. A victim organization loads this model onto a multi-tenant GPU cluster serving multiple enterprise clients. During GGUF weight dequantization at model load time, the int64 product is silently narrowed to a 32-bit int. Because the low 32 bits of 2^32 are all zero, the truncated k is exactly 0. The CUDA kernel therefore fills none of the output buffer (about 8 GiB for a 16-bit output type, 16 GiB for a 32-bit one), leaving it populated entirely with the contents of prior GPU allocations, which may include data from active user sessions. The contaminated weight tensor is then used in all subsequent inference matrix multiplications. A co-located adversary submitting concurrent inference requests may craft queries that amplify or surface fragments of this residual data, potentially recovering portions of other users' prompt text, token embeddings, or intermediate activations without any authentication bypass or elevated privilege.
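The arithmetic in this scenario can be checked with a few lines of host code. This is a sketch under stated assumptions: `narrow_to_int`, `unwritten_elements`, and `buffer_bytes` are hypothetical helpers, the narrowing is modelled as keeping the low 32 bits, and a zero or negative count is treated as "nothing written", which the advisory does not confirm for a real kernel launch.

```cpp
#include <cstdint>

// Illustrative stand-in for the implicit int64_t -> int conversion at the
// vulnerable call site: keeps the low 32 bits, as mainstream
// two's-complement compilers do.
int narrow_to_int(std::int64_t value) {
    return static_cast<int>(static_cast<std::uint32_t>(value));
}

// Number of output elements never written when the kernel is handed the
// narrowed count. Assumption: a non-positive count writes nothing.
std::int64_t unwritten_elements(std::int64_t m, std::int64_t n) {
    const std::int64_t numel = m * n;
    const int k = narrow_to_int(numel);
    const std::int64_t written = k > 0 ? k : 0;
    return numel - written;
}

// Size in bytes of the full m x n output buffer for a given element size
// (2 for a 16-bit type, 4 for a 32-bit type).
std::int64_t buffer_bytes(std::int64_t m, std::int64_t n, std::int64_t element_size) {
    return m * n * element_size;
}
```

With `m = n = 65536`, `narrow_to_int(m * n)` is 0 and all 4,294,967,296 elements go unwritten. A shape such as `[65536, 65537]` narrows to 65,536, so only that small prefix is filled.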
Weaknesses (CWE)
CWE-200 Exposure of Sensitive Information to an Unauthorized Actor
Primary
CWE-681 Incorrect Conversion between Numeric Types
Primary
CWE-200 — Exposure of Sensitive Information to an Unauthorized Actor: The product exposes sensitive information to an actor that is not explicitly authorized to have access to that information.
- [Architecture and Design] Compartmentalize the system to have "safe" areas where trust boundaries can be unambiguously drawn. Do not allow sensitive data to go outside of the trust boundary and always be careful when interfacing with a compartment outside of the safe area. Ensure that appropriate compartmentalization is built into the system design, and the compartmentalization allows for and reinforces privilege separation functionality. Architects and designers should rely on the principle of least privilege to decide the appropriate time to use privileges and the time to drop privileges.
Source: MITRE CWE corpus.
Related Vulnerabilities
CVE-2024-9053 9.8 vllm: RCE via unsafe pickle deserialization in RPC server
Same package: vllm
CVE-2024-11041 9.8 vllm: RCE via unsafe pickle deserialization in MessageQueue
Same package: vllm
CVE-2026-25960 9.8 vllm: SSRF allows internal network access
Same package: vllm
CVE-2025-47277 9.8 vLLM: RCE via exposed TCPStore in distributed inference
Same package: vllm
CVE-2025-32444 9.8 vLLM: RCE via pickle deserialization on ZeroMQ
Same package: vllm