CVE-2025-6211: llama-index: DocugamiReader MD5 hash collision drops chunks

GHSA-5hq9-5r78-2gjh MEDIUM CISA: TRACK*
Published July 10, 2025
CISO Take

LlamaIndex's DocugamiReader silently loses document chunks when structurally distinct sections contain identical text: chunk IDs are MD5 hashes of the chunk text, so the IDs collide and one chunk overwrites the other. AI responses are then based on incomplete or wrong context, with no error surfaced. This is a critical integrity risk for any RAG or document Q&A pipeline handling Docugami-formatted documents, especially in legal, compliance, or audit workflows where a missing clause can go undetected. Patch to llama-index-readers-docugami >= 0.3.1 and llama-index >= 0.12.41, then re-index all previously processed document stores.

What is the risk?

The medium CVSS score (6.5) understates operational risk for AI pipelines. Silent chunk loss means affected systems hallucinate or omit critical content with no visible error or alert, which is a severe integrity failure in regulated industries. The EPSS of 0.00067 indicates a low predicted likelihood of exploitation. The vulnerability triggers passively under normal usage: repeated clauses (boilerplate, standard terms) common in business and legal documents are sufficient to cause collisions without adversarial input. Risk is highest for organizations using LlamaIndex with Docugami in legal review, compliance evidence, or regulated document workflows.

What systems are affected?

Package                        Ecosystem  Vulnerable Range  Patched
llama-index                    pip        < 0.12.41         0.12.41
llama-index-readers-docugami   pip        < 0.3.1           0.3.1
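
To check whether an environment falls inside the vulnerable ranges above, a minimal sketch along the following lines may help. It assumes the pip distribution names are `llama-index` and `llama-index-readers-docugami` (as used in the remediation guidance), and its `parse_version` helper is a deliberately simplified, hypothetical parser that ignores pre-release suffixes; a full PEP 440 parser would be more robust.

```python
import re
from importlib import metadata

# Minimum patched versions as stated in this advisory.
PATCHED = {
    "llama-index": "0.12.41",
    "llama-index-readers-docugami": "0.3.1",
}


def parse_version(version: str) -> tuple:
    """Parse the leading dotted numeric part of a version string (simplified)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_vulnerable(installed: str, patched: str) -> bool:
    """Return True when the installed version sorts strictly below the patched one."""
    return parse_version(installed) < parse_version(patched)


def check_environment() -> dict:
    """Map each package to 'vulnerable', 'patched', or 'not installed'."""
    report = {}
    for name, patched in PATCHED.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        report[name] = "vulnerable" if is_vulnerable(installed, patched) else "patched"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

A "patched" result only describes the installed code; indexes built with an affected version still need to be rebuilt.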

How severe is it?

CVSS 3.1: 6.5 / 10
EPSS: 0.3% chance of exploitation in 30 days (higher than 23% of all CVEs)
Exploitation Status: Exploit Available
Exploitation: MEDIUM
Sophistication: Moderate
Exploitation Confidence: medium
CISA SSVC: Public PoC

Composite signal derived from CISA KEV, VulnCheck KEV, CISA SSVC, EPSS, Metasploit, Exploit-DB, trickest/cve, Nuclei templates, and inthewild.io exploitation reports.

What is the attack surface?

AV Network
AC Low
PR None
UI None
S Unchanged
C None
I Low
A Low

What should I do?

  1. Patch immediately: upgrade llama-index-readers-docugami to >= 0.3.1 and llama-index to >= 0.12.41.

  2. Re-index: all documents processed with affected versions must be re-ingested after patching to restore chunk integrity.

  3. Audit outputs: cross-check AI responses against source documents for any critical legal, compliance, or contractual content generated before patching.

  4. Detect past exposure: compare chunk counts before and after re-indexing; a higher count after re-indexing with the patched reader indicates that chunks were previously dropped.

  5. Workaround if patching is delayed: prepend positional or structural metadata to chunk text before hashing to enforce uniqueness at the application layer.
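
The workaround in step 5 can be sketched as follows. This is an illustration under stated assumptions, not the library's fix: the function names, the XPath-like `structure_path` strings, and the sample clause text are all hypothetical, and SHA-256 is swapped in for MD5 in the position-aware variant as a general hygiene choice rather than something the advisory prescribes.

```python
import hashlib


def text_only_id(text: str) -> str:
    """ID derived from chunk text alone: identical text yields identical IDs."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def position_aware_id(text: str, structure_path: str, index: int) -> str:
    """ID derived from structural path, ordinal position, and text together."""
    # A NUL separator keeps the field boundaries unambiguous.
    payload = "\x00".join([structure_path, str(index), text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical chunks: the same boilerplate clause in two different sections.
chunks = [
    ("/contract/section[1]/clause[3]", "This Agreement is governed by the laws of the State."),
    ("/contract/section[4]/clause[1]", "This Agreement is governed by the laws of the State."),
]

text_ids = {text_only_id(text) for _, text in chunks}
aware_ids = {
    position_aware_id(text, path, i) for i, (path, text) in enumerate(chunks)
}

print(len(text_ids))   # one ID shared by both chunks
print(len(aware_ids))  # one ID per chunk
```

Including both the path and the ordinal index is a belt-and-braces choice: either alone would separate these two chunks, but together they also cover documents where the structural path is not unique.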

What does CISA's SSVC say?

Decision Track*
Exploitation poc
Automatable Yes
Technical Impact partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
Article 10 - Data and data governance
ISO 42001
8.4 - Data for AI systems
NIST AI RMF
MAP 2.2 - Data quality and integrity risks to AI
OWASP LLM Top 10
LLM03:2025 - Supply Chain
LLM08:2025 - Vector and Embedding Weaknesses

Frequently Asked Questions

What is CVE-2025-6211?

CVE-2025-6211 is a vulnerability in LlamaIndex's DocugamiReader, which uses MD5 hashes of chunk text as chunk IDs. When structurally distinct sections contain identical text, their IDs collide and one chunk silently overwrites the other, so AI responses may be based on incomplete or wrong context with no error surfaced. It is fixed in llama-index-readers-docugami 0.3.1 and llama-index 0.12.41.

Is CVE-2025-6211 actively exploited?

No confirmed active exploitation of CVE-2025-6211 has been reported, but organizations should still patch proactively.

How to fix CVE-2025-6211?

Upgrade llama-index-readers-docugami to >= 0.3.1 and llama-index to >= 0.12.41, then re-ingest all documents processed with affected versions to restore chunk integrity. Cross-check AI responses generated before patching against their source documents, particularly for legal, compliance, or contractual content. Comparing chunk counts before and after re-indexing reveals past exposure: a higher count afterward indicates that chunks were previously dropped. If patching is delayed, prepend positional or structural metadata to chunk text before hashing so that chunk IDs are unique at the application layer.

What systems are affected by CVE-2025-6211?

This vulnerability affects the following AI/ML architecture patterns: RAG pipelines, document processing pipelines, legal document analysis, compliance document workflows.

What is the CVSS score for CVE-2025-6211?

CVE-2025-6211 has a CVSS v3.1 base score of 6.5 (MEDIUM). The EPSS exploitation probability is 0.31%.

What is the AI security impact?

Affected AI Architectures

RAG pipelines, document processing pipelines, legal document analysis, compliance document workflows

MITRE ATLAS Techniques

AML.T0010.001 AI Software
AML.T0031 Erode AI Model Integrity
AML.T0059 Erode Dataset Integrity
AML.T0070 RAG Poisoning

Compliance Controls Affected

EU AI Act: Article 10
ISO 42001: 8.4
NIST AI RMF: MAP 2.2
OWASP LLM Top 10: LLM03:2025, LLM08:2025

What are the technical details?

Original Advisory

A vulnerability in the DocugamiReader class of the run-llama/llama_index repository, up to but excluding version 0.12.41, involves the use of MD5 hashing to generate IDs for document chunks. This approach leads to hash collisions when structurally distinct chunks contain identical text, resulting in one chunk overwriting another. This can cause loss of semantically or legally important document content, breakage of parent-child chunk hierarchies, and inaccurate or hallucinated responses in AI outputs. The issue is resolved in version 0.3.1.
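
The overwrite behavior described in the advisory can be reproduced with a simplified model. The sketch below is not the DocugamiReader source; `build_index`, the chunk dictionaries, and the section labels are hypothetical stand-ins that only mimic the pattern of keying a store by the MD5 of chunk text.

```python
import hashlib


def build_index(chunks):
    """Store chunks in a dict keyed by the MD5 hex digest of their text."""
    index = {}
    for chunk in chunks:
        chunk_id = hashlib.md5(chunk["text"].encode("utf-8")).hexdigest()
        # No collision check: a later chunk with the same text silently
        # replaces the earlier one.
        index[chunk_id] = chunk
    return index


# Hypothetical document: the liability clause appears in two places.
chunks = [
    {"section": "Master Agreement / 7. Liability", "text": "Liability is limited to fees paid."},
    {"section": "Schedule B / 3. Liability", "text": "Liability is limited to fees paid."},
    {"section": "Schedule B / 4. Term", "text": "The term is twelve months."},
]

index = build_index(chunks)
surviving = [chunk["section"] for chunk in index.values()]

print(f"{len(chunks)} chunks in, {len(index)} chunks indexed")
print(surviving)
```

In this model the Master Agreement clause is gone from the index, and any parent or child record that pointed to it by ID now resolves to the Schedule B chunk instead, which corresponds to the hierarchy breakage the advisory mentions.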

Exploitation Scenario

An attacker with document upload access crafts a contract or policy document with identical clause text in structurally distinct positions—a common pattern in real legal documents (boilerplate, repeated definitions). When DocugamiReader processes the document, the MD5 hash collision causes the later chunk to silently overwrite the earlier one, deleting an obligation, liability clause, or exclusion from the index. The AI assistant subsequently confirms compliance or contract terms based on the corrupted index, with the dropped clause entirely absent from its context. In a compliance evidence workflow, this could result in an audit gap going undetected until challenged externally. No attacker is required for passive exploitation—natural repetition in real business documents triggers collisions without any adversarial input.
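
Because natural repetition is enough to trigger the problem, one way to estimate exposure is to scan already-parsed chunks for identical text at different structural positions. The following is a rough sketch that assumes chunks are available as `(position, text)` pairs; that shape, the function names, and the sample data are hypothetical and would need adapting to a real pipeline.

```python
from collections import defaultdict


def find_repeated_text(chunks):
    """Return {text: [positions]} for text that occurs at more than one position."""
    positions_by_text = defaultdict(list)
    for position, text in chunks:
        positions_by_text[text].append(position)
    return {
        text: positions
        for text, positions in positions_by_text.items()
        if len(positions) > 1
    }


def expected_loss(chunks):
    """Number of chunks that would vanish if IDs depended on text alone."""
    return len(chunks) - len({text for _, text in chunks})


# Hypothetical parsed chunks from a policy document.
chunks = [
    ("policy/2.1", "Records must be retained for seven years."),
    ("policy/5.4", "Records must be retained for seven years."),
    ("policy/6.0", "Exceptions require written approval."),
    ("annex/1.2", "Records must be retained for seven years."),
]

repeats = find_repeated_text(chunks)
for text, positions in repeats.items():
    print(f"{len(positions)} positions share: {text!r} -> {positions}")
print(f"chunks at risk of being dropped: {expected_loss(chunks)}")
```

This compares exact text only, matching the condition under which identical hashes arise; it says nothing about which of the colliding chunks actually survived in an existing index.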

Weaknesses (CWE)

CWE-440 — Expected Behavior Violation: A feature, API, or function does not perform according to its specification.

Source: MITRE CWE corpus.

CVSS Vector

CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:L
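
As a worked check, the 6.5 base score can be recomputed from this vector using the CVSS v3 base-score equations. The sketch below handles only Scope: Unchanged vectors, which is all this vector needs, and it uses the v3.1-style `roundup` helper; v3.0 defines rounding as a plain ceiling to one decimal place, which gives the same result here. The function names are illustrative.

```python
import math

# Metric weights from the CVSS v3 specification (Scope: Unchanged values for PR).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place while avoiding floating-point artifacts."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a Scope: Unchanged CVSS v3 vector string."""
    parts = vector.split("/")
    metrics = dict(part.split(":") for part in parts[1:])  # skip the "CVSS:3.x" prefix
    if metrics["S"] != "U":
        raise ValueError("this sketch only handles Scope: Unchanged")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score("CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:L")
print(score)  # 6.5
```

For this vector the impact sub-score is about 2.51 and the exploitability sub-score about 3.89; their sum, roughly 6.40, rounds up to 6.5.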

Timeline

Published: July 10, 2025
Last Modified: July 10, 2025
First Seen: March 24, 2026

Related Vulnerabilities