CVE-2024-5206: scikit-learn: TfidfVectorizer leaks training data tokens
Severity: MEDIUM · PoC available · CISA SSVC: Track

Any scikit-learn pipeline that processes sensitive text (logs, emails, documents containing credentials or PII) via TfidfVectorizer may have inadvertently baked that data into serialized model files via the `stop_words_` attribute. Upgrade to scikit-learn 1.5.0+ immediately and audit existing serialized models for credential exposure. Risk is amplified in MLOps environments where model artifacts are shared across teams or stored in accessible registries.
Risk Assessment
Nominal CVSS score (4.7 medium) understates real-world risk in ML deployment contexts. Local attack vector assumes direct system access, but serialized model files (pickle, joblib) are routinely transferred across environments — dev, staging, prod, model registries, S3 buckets. Any party with read access to these artifacts can trivially inspect stop_words_. Organizations processing sensitive text corpora (security logs, HR documents, customer emails) face confidentiality risk that escalates with the sensitivity of training data.
Affected Systems
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| scikit-learn | pip | <= 1.4.1.post1 | 1.5.0 |
If you fit and serialize TfidfVectorizer models on scikit-learn 1.4.1.post1 or earlier, assume you are affected.
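A quick way to check whether an environment predates the 1.5.0 fix is a minimal sketch like the following. It uses only `importlib.metadata`; the simple `(major, minor)` comparison is an assumption that works for release strings like `1.4.1.post1`, but full PEP 440 semantics would require `packaging.version`.

```python
from importlib.metadata import PackageNotFoundError, version

def is_vulnerable_version(v: str) -> bool:
    """True if a scikit-learn version string predates the 1.5.0 fix.

    Minimal (major, minor) comparison; for full PEP 440 handling of
    pre/post-release tags, use packaging.version.Version instead.
    """
    major, minor = (int(p) for p in v.split(".")[:2])
    return (major, minor) < (1, 5)

def installed_sklearn_vulnerable() -> bool:
    """Check the scikit-learn installed in this environment, if any."""
    try:
        return is_vulnerable_version(version("scikit-learn"))
    except PackageNotFoundError:
        return False  # scikit-learn not installed at all
```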
Recommended Action
1. PATCH: Upgrade scikit-learn to >=1.5.0 across all environments (`pip install 'scikit-learn>=1.5.0'`).
2. AUDIT: Inspect existing serialized models: load each one and compare `len(vectorizer.stop_words_)` against the expected stop-word count. Any count significantly higher than your stop-word list indicates leaked tokens.
3. RETRAIN: Retrain and replace affected models after upgrading. Do not assume that patching the library fixes already-serialized models.
4. SCAN ARTIFACTS: Audit model registries and artifact stores for TfidfVectorizer models trained pre-1.5.0; treat them as potentially containing sensitive training data.
5. ACCESS CONTROL: Restrict access to model artifact stores; apply least-privilege to model download endpoints.
6. DETECT: Add model scanning to CI/CD; flag serialized sklearn vectorizers with anomalously large `stop_words_` sets.
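The audit and detect steps can be sketched with the standard library alone. The threshold below is a hypothetical default (curated English stop-word lists run to a few hundred entries, so tens of thousands of "stop words" is a red flag); note that unpickling a real TfidfVectorizer requires scikit-learn to be importable, and that `pickle.load` itself can execute code, so only scan artifacts you already trust (cf. CVE-2020-13092).

```python
import pickle

# Hypothetical threshold: tune for your own stop-word list and corpus.
EXPECTED_MAX_STOP_WORDS = 1000

def audit_vectorizer(obj, expected_max=EXPECTED_MAX_STOP_WORDS):
    """Return (suspicious, count) for any object exposing stop_words_."""
    stop_words = getattr(obj, "stop_words_", None)
    if stop_words is None:
        return False, 0
    count = len(stop_words)
    return count > expected_max, count

def scan_pickle(path, expected_max=EXPECTED_MAX_STOP_WORDS):
    """Load a serialized model and flag an anomalously large stop_words_ set.

    Caution: pickle.load executes code during deserialization; only
    run this against artifacts from sources you trust.
    """
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    return audit_vectorizer(obj, expected_max)
```

A CI/CD job could run `scan_pickle` over every `.pkl`/`.joblib` artifact in a registry export and fail the pipeline on any flagged model.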
CISA SSVC Assessment
Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.
Frequently Asked Questions
What is CVE-2024-5206?
CVE-2024-5206 is a sensitive data leakage vulnerability in scikit-learn's TfidfVectorizer (versions up to and including 1.4.1.post1, fixed in 1.5.0). The `stop_words_` attribute unexpectedly stores every token seen in the training data rather than only true stop words, so serialized models can leak sensitive content from the training corpus, such as credentials or PII.
Is CVE-2024-5206 actively exploited?
Proof-of-concept exploit code is publicly available for CVE-2024-5206, increasing the risk of exploitation.
How to fix CVE-2024-5206?
1. PATCH: Upgrade scikit-learn to >=1.5.0 across all environments (`pip install 'scikit-learn>=1.5.0'`).
2. AUDIT: Inspect existing serialized models: load each one and compare `len(vectorizer.stop_words_)` against the expected stop-word count. Any count significantly higher than your stop-word list indicates leaked tokens.
3. RETRAIN: Retrain and replace affected models after upgrading. Do not assume that patching the library fixes already-serialized models.
4. SCAN ARTIFACTS: Audit model registries and artifact stores for TfidfVectorizer models trained pre-1.5.0; treat them as potentially containing sensitive training data.
5. ACCESS CONTROL: Restrict access to model artifact stores; apply least-privilege to model download endpoints.
6. DETECT: Add model scanning to CI/CD; flag serialized sklearn vectorizers with anomalously large `stop_words_` sets.
What systems are affected by CVE-2024-5206?
This vulnerability affects the following AI/ML architecture patterns: NLP training pipelines, text classification systems, document processing pipelines, model serving (serialized artifacts), MLOps artifact registries.
What is the CVSS score for CVE-2024-5206?
CVE-2024-5206 has a CVSS v3.1 base score of 4.7 (MEDIUM). The EPSS exploitation probability is 0.04%.
Technical Details
NVD Description
A sensitive data leakage vulnerability was identified in scikit-learn's TfidfVectorizer, specifically in versions up to and including 1.4.1.post1, which was fixed in version 1.5.0. The vulnerability arises from the unexpected storage of all tokens present in the training data within the `stop_words_` attribute, rather than only storing the subset of tokens required for the TF-IDF technique to function. This behavior leads to the potential leakage of sensitive information, as the `stop_words_` attribute could contain tokens that were meant to be discarded and not stored, such as passwords or keys. The impact of this vulnerability varies based on the nature of the data being processed by the vectorizer.
Exploitation Scenario
An attacker with read access to a model artifact store (misconfigured S3 bucket, shared MLflow registry, or via compromised developer credentials) downloads a serialized TfidfVectorizer trained on internal documents — say, a SIEM alert classifier trained on raw log data containing auth tokens, or an email classifier trained on customer support tickets. The attacker loads the pickle file and iterates vectorizer.stop_words_ to extract a list of tokens from the original training corpus. Depending on the training data, this yields internal hostnames, usernames, API keys embedded in log lines, or PII from customer communications. No ML expertise required — a single Python one-liner extracts the full token list.
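The extraction step described above can be sketched in a few lines of standard-library Python. The artifact path is hypothetical; unpickling a real TfidfVectorizer additionally requires scikit-learn to be importable, but the attacker needs no ML tooling beyond that.

```python
import pickle

def extract_leaked_tokens(path):
    """Attacker's-eye sketch: recover the leaked training vocabulary
    from a serialized vectorizer. Works on any pickle whose payload
    exposes a stop_words_ attribute.
    """
    with open(path, "rb") as fh:
        model = pickle.load(fh)
    return sorted(model.stop_words_)
```

The equivalent one-liner, with a hypothetical artifact name, is `sorted(pickle.load(open("model.pkl", "rb")).stop_words_)`.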
CVSS Vector
CVSS:3.1/AV:L/AC:H/PR:L/UI:N/S:U/C:H/I:N/A:N
Related Vulnerabilities
- CVE-2020-13092 (9.8) — scikit-learn: RCE via malicious joblib model deserialization (same package: scikit-learn)
- CVE-2020-28975 (7.5) — scikit-learn: DoS via crafted SVM model deserialization (same package: scikit-learn)
- CVE-2025-53767 (10.0) — Azure OpenAI: SSRF EoP, no auth required (same attack type: Data Extraction)
- CVE-2025-2828 (10.0) — LangChain RequestsToolkit: SSRF exposes cloud metadata (same attack type: Data Extraction)
- CVE-2023-3765 (10.0) — MLflow: path traversal allows arbitrary file read (same attack type: Data Leakage)