Impact
vLLM's GGUF dequantize kernels truncate tensor dimensions when converting them to a narrower integer type. The output tensor is allocated at its full size, but the CUDA kernel processes only the truncated number of elements. The rest of the tensor is never written and may hold whatever data previously occupied that GPU memory. In a multi-tenant inference setup this stale memory can contain tensor data belonging to other users, allowing an attacker to read confidential information. The flaw combines an incorrect numeric conversion, here a truncation (CWE-681), with information disclosure (CWE-200).
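The arithmetic behind this kind of flaw can be illustrated with a minimal Python sketch. It is not vLLM's code: the function names and the example dimensions are hypothetical, and it assumes the element count is computed in 64 bits and then narrowed to a signed 32-bit integer before the kernel launch.

```python
import ctypes


def truncate_to_int32(value):
    """Mimic a C-style narrowing conversion from a 64-bit to a 32-bit signed int."""
    # ctypes integer types do not check for overflow; the value wraps.
    return ctypes.c_int32(value).value


def dequantize_coverage(rows, cols):
    """Return (allocated, processed) element counts for a hypothetical launch.

    Assumption: the output buffer is sized with the full 64-bit product,
    while the kernel receives the count as a 32-bit int.
    """
    allocated = rows * cols
    # A negative truncated count would process nothing in this model.
    processed = max(truncate_to_int32(allocated), 0)
    return allocated, processed


# Hypothetical dimensions whose product exceeds the 32-bit range.
allocated, processed = dequantize_coverage(65536, 65537)
uninitialized = allocated - processed
print(f"allocated={allocated} processed={processed} uninitialized={uninitialized}")
```

With these illustrative dimensions the product is 2^32 + 65536, so only the low 32 bits (65536 elements) survive the conversion and the remaining 2^32 elements of the buffer would never be written.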
Affected Systems
vLLM, the inference engine for large language models, is affected from version 0.5.5 up to, but not including, 0.23.1rc0. Versions 0.23.1rc0 and newer incorporate the fix and are not affected.
Risk and Exploitability
The CVSS score of 5.3 indicates medium severity. No EPSS score is available, and the vulnerability is not listed in the CISA KEV catalog. Exploitation requires the ability to submit inference requests to a deployment whose GPU is shared with other tenants; the attacker can then read data left behind in the uninitialized part of the output tensor. The primary consequence is loss of confidentiality for tenant data in a shared-GPU environment.