CVE-2025-3044: llama-index ArxivReader: MD5 collision corrupts training data

GHSA-p7j4-jwjf-5x9w | Severity: MEDIUM | CISA SSVC: Track*
Published July 7, 2025
CISO Take

If your AI pipelines use LlamaIndex to ingest arXiv papers for training or RAG knowledge bases, papers whose titles map to the same MD5-derived filename (per the advisory, identical titles are enough) silently overwrite each other on disk, corrupting datasets without raising any error. This is a silent data integrity failure, not a loud exploit. Patch to llama-index-readers-papers 0.3.1 and re-audit any datasets built with the affected ArxivReader.

What is the risk?

Low external exploitability in the traditional sense: no RCE, no auth bypass. However, the risk to AI-native teams is underrated. Silent data corruption in training pipelines or RAG knowledge bases can degrade model quality or introduce subtle data poisoning without triggering any alert. CVSS 5.3 Medium is accurate for the base vulnerability, but the practical risk to AI systems is higher because the failure mode is invisible. No active exploitation has been observed, and an EPSS of 0.07% suggests low attacker interest for now.

What systems are affected?

Package                                   Ecosystem   Vulnerable Range   Patched
LlamaIndex (llama-index-readers-papers)   pip         < 0.3.1            0.3.1

You are affected if you use the ArxivReader class from llama-index-readers-papers at a version below 0.3.1.
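A quick way to check an environment is to compare the installed package version against the patched release. The following is a minimal sketch, assuming the distribution is installed under the name `llama-index-readers-papers`; the helper names (`parse_version`, `is_vulnerable`, `installed_status`) are hypothetical, and the version parser is deliberately simple (it ignores pre-release suffixes).

```python
from importlib import metadata

PACKAGE = "llama-index-readers-papers"  # distribution name from the advisory
FIXED = (0, 3, 1)                       # first patched release per the advisory


def parse_version(text):
    """Turn '0.3.1' into (0, 3, 1). Simplification: suffixes like 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_vulnerable(version_text):
    """True when the version is older than the patched release."""
    return parse_version(version_text) < FIXED


def installed_status():
    """Report whether the locally installed package falls in the vulnerable range."""
    try:
        installed = metadata.version(PACKAGE)
    except metadata.PackageNotFoundError:
        return f"{PACKAGE} is not installed"
    state = "VULNERABLE (< 0.3.1)" if is_vulnerable(installed) else "patched"
    return f"{PACKAGE} {installed}: {state}"


if __name__ == "__main__":
    print(installed_status())
```

Running the script inside each virtual environment or container image used by an ingestion job gives a per-environment answer; a dependency scanner is the more robust option for fleet-wide checks.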

How severe is it?

CVSS 3.1: 5.3 / 10
EPSS: 0.3% chance of exploitation in 30 days (higher than 20% of all CVEs)
Exploitation Status: Exploit Available
Exploitation: MEDIUM
Sophistication: Moderate
Exploitation Confidence: Medium
CISA SSVC: Public PoC
Composite signal derived from CISA KEV, VulnCheck KEV, CISA SSVC, EPSS, Metasploit, Exploit-DB, trickest/cve, Nuclei templates, and inthewild.io exploitation reports.

What is the attack surface?

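Attack Vector (AV): Network
Attack Complexity (AC): Low
Privileges Required (PR): None
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): None
Integrity (I): Low
Availability (A): None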

What should I do?

5 steps
  1. Patch immediately: upgrade llama-index-readers-papers to >= 0.3.1 (fixed in llama-index 0.12.28).

  2. Audit existing datasets: if you have corpora built with ArxivReader on affected versions, verify paper counts against expected totals and check for missing entries.

  3. Re-ingest affected datasets after patching to ensure completeness.

  4. Detection: compare file counts pre/post ingestion runs; add hash integrity checks (SHA-256) on downloaded files as a defense-in-depth measure.

  5. For RAG systems: validate knowledge base entry counts after ingestion jobs complete.
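The audit and detection steps above can be sketched as a small script. This is a minimal illustration under stated assumptions, not the library's code: it models the affected naming scheme as the MD5 hex digest of the title plus `.pdf`, assumes you can obtain the list of titles you expected to ingest, and uses hypothetical helper names (`title_filename`, `find_overwrites`, `sha256_manifest`, `audit`).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def title_filename(title):
    """ASSUMPTION: models the affected scheme as md5(title) + '.pdf'."""
    return hashlib.md5(title.encode("utf-8")).hexdigest() + ".pdf"


def find_overwrites(titles):
    """Group expected titles by derived filename; keep groups with more than one title."""
    groups = defaultdict(list)
    for title in titles:
        groups[title_filename(title)].append(title)
    return {name: ts for name, ts in groups.items() if len(ts) > 1}


def sha256_manifest(directory):
    """Map each PDF filename to the SHA-256 of its bytes, for later comparison."""
    manifest = {}
    for path in sorted(Path(directory).glob("*.pdf")):
        manifest[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def audit(expected_titles, directory):
    """Compare the expected paper count with the files actually on disk."""
    on_disk = len(list(Path(directory).glob("*.pdf")))
    return {
        "expected": len(expected_titles),
        "on_disk": on_disk,
        "missing": len(expected_titles) - on_disk,
        "colliding": find_overwrites(expected_titles),
    }
```

A non-zero `missing` count or a non-empty `colliding` map points to papers that were likely overwritten. Storing the output of `sha256_manifest` after each run and diffing it against the previous run shows when a file's content changed under an unchanged filename.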

What does CISA's SSVC say?

Decision: Track*
Exploitation: PoC
Automatable: Yes
Technical Impact: Partial

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act: Article 10 - Data and Data Governance
ISO 42001: A.8.2 - Data for AI systems
NIST AI RMF: MANAGE 2.2 - Data Quality Management
OWASP LLM Top 10: LLM03 - Training Data Poisoning

Frequently Asked Questions

What is CVE-2025-3044?

CVE-2025-3044 is a data integrity vulnerability in the ArxivReader class of LlamaIndex (llama-index-readers-papers before 0.3.1). The reader uses an MD5 hash when generating filenames for downloaded papers, so papers with identical titles but different contents can overwrite each other, and some papers never reach the training or RAG pipeline. No error is raised when this happens. The issue is fixed in llama-index-readers-papers 0.3.1.

Is CVE-2025-3044 actively exploited?

No confirmed active exploitation of CVE-2025-3044 has been reported, but organizations should still patch proactively.

How to fix CVE-2025-3044?

  1. Patch immediately: upgrade llama-index-readers-papers to >= 0.3.1 (fixed in llama-index 0.12.28).

  2. Audit existing datasets: if you have corpora built with ArxivReader on affected versions, verify paper counts against expected totals and check for missing entries.

  3. Re-ingest affected datasets after patching to ensure completeness.

  4. Detection: compare file counts pre/post ingestion runs; add hash integrity checks (SHA-256) on downloaded files as a defense-in-depth measure.

  5. For RAG systems: validate knowledge base entry counts after ingestion jobs complete.

What systems are affected by CVE-2025-3044?

This vulnerability affects the following AI/ML architecture patterns: RAG pipelines, training pipelines, data ingestion, LLM framework integrations.

What is the CVSS score for CVE-2025-3044?

CVE-2025-3044 has a CVSS v3.1 base score of 5.3 (MEDIUM). The EPSS exploitation probability is 0.28%.

What is the AI security impact?

Affected AI Architectures

RAG pipelines, training pipelines, data ingestion, LLM framework integrations

MITRE ATLAS Techniques

AML.T0010.001 AI Software
AML.T0010.002 Data
AML.T0019 Publish Poisoned Datasets
AML.T0020 Poison Training Data
AML.T0059 Erode Dataset Integrity

Compliance Controls Affected

EU AI Act: Article 10
ISO 42001: A.8.2
NIST AI RMF: MANAGE 2.2
OWASP LLM Top 10: LLM03

What are the technical details?

Original Advisory

A vulnerability in the ArxivReader class of the run-llama/llama_index repository allows for MD5 hash collisions when generating filenames for downloaded papers. This can lead to data loss as papers with identical titles but different contents may overwrite each other, preventing some papers from being processed for AI model training. The issue is resolved in llama-index-readers-papers version 0.3.1 (in llama-index 0.12.28).
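The overwrite described in the advisory can be reproduced in isolation. The sketch below is not the ArxivReader source; it assumes a simplified naming scheme (MD5 hex digest of the title plus `.pdf`) and contrasts it with a hypothetical alternative that keys the filename on a unique identifier. The paper records, identifiers, and helper names are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_title_name(title):
    # ASSUMPTION: simplified stand-in for the affected title-based naming scheme.
    return hashlib.md5(title.encode("utf-8")).hexdigest() + ".pdf"


def id_based_name(arxiv_id):
    # Hypothetical alternative: derive the filename from a unique identifier.
    return arxiv_id.replace("/", "_") + ".pdf"


def save_all(papers, directory, namer):
    """Write every paper to disk and return the resulting filenames."""
    for paper in papers:
        (Path(directory) / namer(paper)).write_bytes(paper["content"])
    return sorted(p.name for p in Path(directory).iterdir())


# Two distinct (made-up) papers that share a title.
papers = [
    {"id": "0000.00001", "title": "A Survey of Transformers", "content": b"paper one"},
    {"id": "0000.00002", "title": "A Survey of Transformers", "content": b"paper two"},
]

with tempfile.TemporaryDirectory() as d:
    by_title = save_all(papers, d, lambda p: md5_title_name(p["title"]))

with tempfile.TemporaryDirectory() as d:
    by_id = save_all(papers, d, lambda p: id_based_name(p["id"]))

print(len(by_title))  # 1: the second paper replaced the first, with no error
print(len(by_id))     # 2: both papers survive
```

The point of the sketch is that the write succeeds both times, so nothing in the pipeline signals that one paper has gone missing.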

Exploitation Scenario

An adversary targeting an organization's AI training pipeline or RAG knowledge base could publish arXiv papers whose titles map to the same MD5-derived filename as high-value legitimate papers (e.g., key security research or domain-specific papers); per the advisory, an identical title is sufficient. When the victim's automated LlamaIndex pipeline re-ingests arXiv content, the adversary's paper can silently replace the legitimate one on disk. The victim's RAG system then retrieves adversary-controlled content in response to queries about the overwritten topic. This is a form of indirect RAG poisoning that does not require compromising the victim's infrastructure directly.

Weaknesses (CWE)

CWE-440 — Expected Behavior Violation: A feature, API, or function does not perform according to its specification.

Source: MITRE CWE corpus.

CVSS Vector

CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:N
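As a worked check, the 5.3 base score can be recomputed from this vector with the published CVSS v3 base equations. The sketch below handles only Scope: Unchanged vectors (which is all this CVE needs) and uses the rounding procedure from the CVSS v3.1 specification; for this vector the v3.0 ceiling-based rounding gives the same result. Function names are illustrative.

```python
import math

# Metric weights from the CVSS v3 specification (PR values are for Scope: Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value):
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the base score for a Scope: Unchanged CVSS v3.x vector string."""
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    if metrics["S"] != "U":
        raise ValueError("this sketch only supports Scope: Unchanged")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])      # impact sub-score
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:N"))  # 5.3
```

Here the impact term is 6.42 × 0.22 ≈ 1.41 and the exploitability term is 8.22 × 0.85 × 0.77 × 0.85 × 0.85 ≈ 3.89, so the sum of about 5.30 rounds up to 5.3. Nearly all of the score comes from ease of exploitation; the Low integrity impact contributes little.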

Timeline

Published: July 7, 2025
Last Modified: July 8, 2025
First Seen: March 24, 2026
