CVE-2026-0847: NLTK: path traversal exposes sensitive server files

GHSA-68j8-pq59-fqgm HIGH
Published March 4, 2026
CISO Take

NLTK versions up to 3.9.2 contain a path traversal flaw (CWE-22) in multiple CorpusReader classes — WordListCorpusReader, TaggedCorpusReader, and BracketParseCorpusReader — that lets unauthenticated remote attackers read arbitrary files from the server, including SSH private keys, API tokens, and cloud provider credentials. With a CVSS of 8.6 and an EPSS ranking in the top 77%, any organization running NLTK-backed NLP services, chatbots, or ML APIs that accept user-controlled file paths faces material credential exposure risk with no authentication barrier. No public exploit or active exploitation is confirmed and CISA has not added this to KEV, but path traversal flaws are trivial to weaponize once the affected code paths are understood, and NLTK is deeply embedded in production NLP pipelines. Until an official patch ships for a version beyond 3.9.2, restrict CorpusReader file path inputs to validated allowlists, run NLTK processes under least-privilege accounts, and audit file access logs for reads outside expected data directories.

Sources: NVD, EPSS, GitHub Advisory, ATLAS

What is the risk?

High risk for production NLP deployments. CVSS 8.6 with network-accessible, zero-authentication-required exploitation lowers the bar significantly. EPSS top 77% indicates above-average exploitation likelihood for the class of vulnerability. No patch is available — affected up to and including 3.9.2 with no fixed version listed. The attack surface is broad: any HTTP endpoint that passes user-supplied strings to affected CorpusReader classes is exploitable without special tooling. Credential theft via SSH key or API token exfiltration can cascade into full infrastructure compromise, making the effective blast radius larger than the base CVSS score implies. Risk is highest for multi-tenant NLP platforms and cloud-hosted ML APIs.

How does the attack unfold?

Reconnaissance
Adversary identifies an NLP API or chatbot endpoint that accepts user-controlled file path inputs and fingerprints the backend as NLTK-based through error messages or documentation.
AML.T0006
Initial Access
Adversary sends a crafted HTTP request with a path traversal payload (e.g., '../../../../home/ubuntu/.ssh/id_rsa') targeting an exposed NLTK CorpusReader-backed endpoint without authentication.
AML.T0049
Collection
NLTK CorpusReader reads the traversed file without path sanitization; attacker retrieves SSH private keys, cloud API tokens, or application secrets from the server filesystem.
AML.T0037
Impact
Stolen credentials enable lateral movement into ML infrastructure — accessing training data storage, cloud resources, or enabling follow-on attacks such as training data poisoning.
AML.T0055

What systems are affected?

Package   Ecosystem   Vulnerable Range   Patched
nltk      pip         <= 3.9.2           No patch

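Do you use nltk 3.9.2 or earlier? You may be affected, especially if user-controlled file paths reach the affected CorpusReader classes.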

How severe is it?

CVSS 3.1
8.6 / 10
EPSS
0.7%
chance of exploitation in 30 days
Higher than 50% of all CVEs
Exploitation Status
No known exploitation
Sophistication
Trivial

What is the attack surface?

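Attack Vector (AV): Network
Attack Complexity (AC): Low
Privileges Required (PR): None
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): High
Integrity (I): Low
Availability (A): Low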

What should I do?

6 steps
  1. Audit all code paths where user input reaches NLTK CorpusReader classes — grep for WordListCorpusReader, TaggedCorpusReader, BracketParseCorpusReader instantiation with dynamic path arguments.

  2. Implement strict allowlist validation: reject any file path input containing '../', '..\', absolute path prefixes (/), or null bytes; only accept bare filenames validated against a known-safe corpus directory.

  3. Run NLTK processes under a dedicated least-privilege service account with filesystem access scoped to NLTK data directories only.

  4. Monitor file access audit logs (auditd or equivalent) for reads outside expected /usr/share/nltk_data or equivalent corpus paths.

  5. Track the GHSA-68j8-pq59-fqgm advisory on GitHub and upgrade immediately when a patched release is published.

  6. For containerized deployments, use read-only filesystem mounts with explicit volume allowlists to limit traversal impact.
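
Step 2 can be sketched as a small validation helper. This is a minimal illustration, not an official NLTK fix: the `CORPUS_ROOT` path and the function name `safe_corpus_filename` are assumptions made for the example. It accepts only bare filenames and then confirms that the resolved path still sits under the corpus directory.

```python
import os
from pathlib import Path

# Hypothetical corpus directory; replace with the deployment's real data path.
CORPUS_ROOT = Path("/usr/share/nltk_data/corpora/custom")


def safe_corpus_filename(user_value: str, root: Path = CORPUS_ROOT) -> Path:
    """Return the resolved path of a bare filename under root, or raise ValueError."""
    if not isinstance(user_value, str) or not user_value or "\x00" in user_value:
        raise ValueError("empty, non-string, or null-byte filename")
    # Only bare filenames pass: no separators of either style, no dot segments.
    if "/" in user_value or "\\" in user_value or user_value in {".", ".."}:
        raise ValueError("path separators and dot segments are not allowed")
    root_resolved = root.resolve()
    candidate = (root_resolved / user_value).resolve()
    # Defense in depth: after symlink resolution the file must remain under root.
    if os.path.commonpath([str(root_resolved), str(candidate)]) != str(root_resolved):
        raise ValueError("resolved path escapes the corpus root")
    return candidate
```

The name of the validated file, rather than the raw request value, is what should then be handed to the corpus reader as its file identifier. The containment check after `resolve()` is there to catch symlinks inside the corpus directory that point elsewhere.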

How is it classified?

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
Article 15 - Accuracy, Robustness and Cybersecurity
Article 9 - Risk Management System
ISO 42001
A.9.5 - Security of AI System Operations
Clause 6.1 - Actions to Address Risks and Opportunities
NIST AI RMF
GOVERN 1.7 - Processes for AI Risk Identification
MANAGE 2.4 - Vulnerability Management for AI Systems
OWASP LLM Top 10
LLM05:2025 - Supply Chain Vulnerabilities
LLM06:2025 - Sensitive Information Disclosure

Frequently Asked Questions

What is CVE-2026-0847?

CVE-2026-0847 is a path traversal flaw (CWE-22) in NLTK versions up to and including 3.9.2. Several CorpusReader classes, including WordListCorpusReader, TaggedCorpusReader, and BracketParseCorpusReader, do not validate file paths, so an unauthenticated remote attacker who controls a path passed to them can read arbitrary files from the server, such as SSH private keys, API tokens, and cloud provider credentials. It carries a CVSS score of 8.6, and no patched release is listed.

Is CVE-2026-0847 actively exploited?

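No confirmed active exploitation of CVE-2026-0847 has been reported, and CISA has not added it to KEV. Because no patched release is available yet, organizations should apply the mitigations below proactively and upgrade as soon as a fix is published.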

How to fix CVE-2026-0847?

  1. Audit all code paths where user input reaches NLTK CorpusReader classes: grep for WordListCorpusReader, TaggedCorpusReader, BracketParseCorpusReader instantiation with dynamic path arguments.

  2. Implement strict allowlist validation: reject any file path input containing '../', '..\', absolute path prefixes (/), or null bytes; only accept bare filenames validated against a known-safe corpus directory.

  3. Run NLTK processes under a dedicated least-privilege service account with filesystem access scoped to NLTK data directories only.

  4. Monitor file access audit logs (auditd or equivalent) for reads outside expected /usr/share/nltk_data or equivalent corpus paths.

  5. Track the GHSA-68j8-pq59-fqgm advisory on GitHub and upgrade immediately when a patched release is published.

  6. For containerized deployments, use read-only filesystem mounts with explicit volume allowlists to limit traversal impact.
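
The audit in step 1 can be partly automated. The sketch below is one possible approach, not a complete static analysis: it walks Python source with the standard `ast` module and flags calls to the three named reader classes whose arguments are not plain literals. The function names are hypothetical, and the check is deliberately conservative, so it will also flag harmless cases such as list literals or module-level constants.

```python
import ast
from pathlib import Path

READERS = {"WordListCorpusReader", "TaggedCorpusReader", "BracketParseCorpusReader"}


def call_name(node: ast.Call) -> str:
    """Return the called name for `Reader(...)` or `module.Reader(...)` calls."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""


def find_dynamic_reader_calls(source: str, filename: str = "<string>"):
    """Yield (line, reader_name) for reader calls with any non-literal argument."""
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and call_name(node) in READERS:
            args = list(node.args) + [kw.value for kw in node.keywords]
            # Conservative: anything other than a plain constant is reported.
            if any(not isinstance(arg, ast.Constant) for arg in args):
                yield node.lineno, call_name(node)


def scan_tree(root: str):
    """Return (path, line, reader_name) findings for every .py file under root."""
    findings = []
    for path in Path(root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8")
            hits = list(find_dynamic_reader_calls(text, str(path)))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        findings.extend((str(path), line, name) for line, name in hits)
    return findings
```

Calling `scan_tree("src")` on a project directory returns candidate call sites to review by hand. Each hit still needs a human to trace whether the argument can actually be influenced by a request.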

What systems are affected by CVE-2026-0847?

This vulnerability affects the following AI/ML architecture patterns: NLP pipelines, chatbot backends, ML APIs, training pipelines, model serving.

What is the CVSS score for CVE-2026-0847?

CVE-2026-0847 has a CVSS v3.1 base score of 8.6 (HIGH). The EPSS exploitation probability is 0.75%.

What is the AI security impact?

Affected AI Architectures

NLP pipelines, chatbot backends, ML APIs, training pipelines, model serving

MITRE ATLAS Techniques

AML.T0010.001 AI Software
AML.T0025 Exfiltration via Cyber Means
AML.T0037 Data from Local System
AML.T0049 Exploit Public-Facing Application
AML.T0055 Unsecured Credentials

Compliance Controls Affected

EU AI Act: Article 15, Article 9
ISO 42001: A.9.5, Clause 6.1
NIST AI RMF: GOVERN 1.7, MANAGE 2.4
OWASP LLM Top 10: LLM05:2025, LLM06:2025

What are the technical details?

Original Advisory

A vulnerability in NLTK versions up to and including 3.9.2 allows arbitrary file read via path traversal in multiple CorpusReader classes, including WordListCorpusReader, TaggedCorpusReader, and BracketParseCorpusReader. These classes fail to properly sanitize or validate file paths, enabling attackers to traverse directories and access sensitive files on the server. This issue is particularly critical in scenarios where user-controlled file inputs are processed, such as in machine learning APIs, chatbots, or NLP pipelines. Exploitation of this vulnerability can lead to unauthorized access to sensitive files, including system files, SSH private keys, and API tokens, and may potentially escalate to remote code execution when combined with other vulnerabilities.

Exploitation Scenario

An adversary scanning an ML API discovers an NLP preprocessing endpoint that accepts a corpus file path parameter. They craft a POST request supplying '../../../../home/ubuntu/.ssh/id_rsa' as the file path argument to an endpoint backed by WordListCorpusReader. NLTK reads the file without path sanitization and the API returns its contents in the response body or a verbose error. The attacker harvests the SSH private key and uses it to directly access the ML server, pivoting to cloud provider credential files (.aws/credentials) to access the S3 bucket storing the production training dataset. With write access to the training bucket, they inject poisoned data samples, causing silent model degradation in the next training run. In an agentic AI pipeline using NLTK for preprocessing, the same path traversal could expose the system prompt or RAG database credentials.
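
To see why the payload in this scenario escapes the corpus directory, the following sketch joins it to a corpus root and normalizes the result. It uses only generic path arithmetic from the standard library and does not reproduce NLTK's internal code; the `corpus_root` value is a hypothetical example.

```python
import posixpath

corpus_root = "/srv/app/corpora/wordlists"        # hypothetical corpus directory
payload = "../../../../home/ubuntu/.ssh/id_rsa"   # payload from the scenario

# Naive joining keeps the '..' segments; normalizing shows where they lead.
joined = posixpath.join(corpus_root, payload)
resolved = posixpath.normpath(joined)
print(resolved)  # /home/ubuntu/.ssh/id_rsa

# The resolved path no longer shares the corpus root as its common prefix.
escapes_root = posixpath.commonpath([corpus_root, resolved]) != corpus_root
print(escapes_root)  # True
```

Each `..` segment cancels one directory of the root, so four of them are enough to reach `/` from a root that is four levels deep. Attackers typically send more segments than needed, since extra ones are harmless once the filesystem root is reached.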

Weaknesses (CWE)

CWE-22 — Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal'): The product uses external input to construct a pathname that is intended to identify a file or directory that is located underneath a restricted parent directory, but the product does not properly neutralize special elements within the pathname that can cause the pathname to resolve to a location that is outside of the restricted directory.

  • [Implementation] Assume all input is malicious. Use an "accept known good" input validation strategy, i.e., use a list of acceptable inputs that strictly conform to specifications. Reject any input that does not strictly conform to specifications, or transform it into something that does. When performing input validation, consider all potentially relevant properties, including length, type of input, the full range of acceptable values, missing or extra inputs, syntax, consistency across related fields, and conformance to business rules. As an example of business rule logic, "boat" may be syntactically valid because it only contains alphanumeric characters, but it is not valid if the input is only expected to contain colors such as "red" or "blue." Do not rely exclusively on looking for malicious or malformed inputs. This is likely to miss at least one undesirable input, especially if the code's environment changes. This can give attackers enough room to bypass the intended validation. However, denylis
  • [Architecture and Design] For any security checks that are performed on the client side, ensure that these checks are duplicated on the server side, in order to avoid CWE-602. Attackers can bypass the client-side checks by modifying values after the checks have been performed, or by changing the client to remove the client-side checks entirely. Then, these modified values would be submitted to the server.

Source: MITRE CWE corpus.
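
The "accept known good" strategy can be illustrated with an identifier-to-file mapping, so that a request never carries a path at all. This is a minimal sketch: the corpus identifiers, filenames, and the function name `resolve_corpus_id` are all hypothetical.

```python
from types import MappingProxyType

# Hypothetical public identifiers mapped to files known to exist under the corpus root.
# MappingProxyType makes the table read-only at runtime.
ALLOWED_CORPORA = MappingProxyType({
    "stopwords-en": "stopwords_en.txt",
    "colors": "colors.txt",
})


def resolve_corpus_id(corpus_id: str) -> str:
    """Map a client-supplied identifier to a server-chosen filename, or raise ValueError."""
    if not isinstance(corpus_id, str):
        raise ValueError("corpus id must be a string")
    try:
        return ALLOWED_CORPORA[corpus_id]
    except KeyError:
        raise ValueError(f"unknown corpus id: {corpus_id!r}") from None
```

Because the filename comes from the server-side table rather than the request, traversal sequences in the input have nothing to act on; they simply fail the lookup.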

CVSS Vector

CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:L/A:L
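
As a worked check, the base score can be recomputed from this vector with the published CVSS v3 base-score equations. The sketch below is an illustrative calculator, not an official tool; the function names are hypothetical, and the rounding helper follows the integer-based Roundup definition from the CVSS v3.1 specification, which gives the same result for this vector.

```python
import math

# Metric weights from the CVSS v3 specification.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required depends on whether Scope is Unchanged (U) or Changed (C).
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, avoiding floating-point artifacts."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a vector such as 'CVSS:3.0/AV:N/...'."""
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    scope = metrics["S"]
    w = {key: WEIGHTS[key][metrics[key]] for key in WEIGHTS}
    pr = PR_WEIGHTS[scope][metrics["PR"]]
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = 8.22 * w["AV"] * w["AC"] * pr * w["UI"]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope == "C":
        total *= 1.08
    return roundup(min(total, 10))


print(base_score("CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:L/A:L"))  # 8.6
```

For this vector the impact sub-score is about 4.70 and the exploitability sub-score about 3.89; their sum, 8.588, rounds up to the 8.6 reported above.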

Timeline

Published
March 4, 2026
Last Modified
May 6, 2026
First Seen
March 4, 2026

Related Vulnerabilities