CVE-2024-5206: scikit-learn: TfidfVectorizer leaks training data tokens
Severity: MEDIUM · PoC available · CISA SSVC: Track

Any scikit-learn pipeline that processes sensitive text (logs, emails, documents containing credentials or PII) via TfidfVectorizer may have inadvertently baked that data into serialized model files via the `stop_words_` attribute. Upgrade to scikit-learn 1.5.0+ immediately and audit existing serialized models for credential exposure. Risk is amplified in MLOps environments where model artifacts are shared across teams or stored in accessible registries.
Risk Assessment
Nominal CVSS score (4.7 medium) understates real-world risk in ML deployment contexts. Local attack vector assumes direct system access, but serialized model files (pickle, joblib) are routinely transferred across environments — dev, staging, prod, model registries, S3 buckets. Any party with read access to these artifacts can trivially inspect stop_words_. Organizations processing sensitive text corpora (security logs, HR documents, customer emails) face confidentiality risk that escalates with the sensitivity of training data.
Affected Systems
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| scikit-learn | pip | <= 1.4.1.post1 | 1.5.0 |
If you fit and serialize TfidfVectorizer models on scikit-learn 1.4.1.post1 or earlier, assume you are affected.
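A quick way to check whether an environment predates the 1.5.0 fix is a minimal sketch like the following. It uses only `importlib.metadata`; the simple `(major, minor)` comparison is an assumption that works for release strings like `1.4.1.post1`, but full PEP 440 semantics would require `packaging.version`.

```python
from importlib.metadata import PackageNotFoundError, version

def is_vulnerable_version(v: str) -> bool:
    """True if a scikit-learn version string predates the 1.5.0 fix.

    Minimal (major, minor) comparison; for full PEP 440 handling of
    pre/post-release tags, use packaging.version.Version instead.
    """
    major, minor = (int(p) for p in v.split(".")[:2])
    return (major, minor) < (1, 5)

def installed_sklearn_vulnerable() -> bool:
    """Check the scikit-learn installed in this environment, if any."""
    try:
        return is_vulnerable_version(version("scikit-learn"))
    except PackageNotFoundError:
        return False  # scikit-learn not installed at all
```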
Recommended Action
1. PATCH: Upgrade scikit-learn to >=1.5.0 across all environments (`pip install 'scikit-learn>=1.5.0'`).
2. AUDIT: Inspect existing serialized models: load each one and compare `len(vectorizer.stop_words_)` against the expected stop-word count. Any count significantly higher than your stop-word list indicates leaked tokens.
3. RETRAIN: Retrain and replace affected models after upgrading. Do not assume that patching the library fixes already-serialized models.
4. SCAN ARTIFACTS: Audit model registries and artifact stores for TfidfVectorizer models trained pre-1.5.0; treat them as potentially containing sensitive training data.
5. ACCESS CONTROL: Restrict access to model artifact stores; apply least-privilege to model download endpoints.
6. DETECT: Add model scanning to CI/CD; flag serialized sklearn vectorizers with anomalously large `stop_words_` sets.
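The audit and detect steps can be sketched with the standard library alone. The threshold below is a hypothetical default (curated English stop-word lists run to a few hundred entries, so tens of thousands of "stop words" is a red flag); note that unpickling a real TfidfVectorizer requires scikit-learn to be importable, and that `pickle.load` itself can execute code, so only scan artifacts you already trust (cf. CVE-2020-13092).

```python
import pickle

# Hypothetical threshold: tune for your own stop-word list and corpus.
EXPECTED_MAX_STOP_WORDS = 1000

def audit_vectorizer(obj, expected_max=EXPECTED_MAX_STOP_WORDS):
    """Return (suspicious, count) for any object exposing stop_words_."""
    stop_words = getattr(obj, "stop_words_", None)
    if stop_words is None:
        return False, 0
    count = len(stop_words)
    return count > expected_max, count

def scan_pickle(path, expected_max=EXPECTED_MAX_STOP_WORDS):
    """Load a serialized model and flag an anomalously large stop_words_ set.

    Caution: pickle.load executes code during deserialization; only
    run this against artifacts from sources you trust.
    """
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    return audit_vectorizer(obj, expected_max)
```

A CI/CD job could run `scan_pickle` over every `.pkl`/`.joblib` artifact in a registry export and fail the pipeline on any flagged model.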
CISA SSVC Assessment
Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.
Frequently Asked Questions
What is CVE-2024-5206?
CVE-2024-5206 is a sensitive data leakage vulnerability in scikit-learn's TfidfVectorizer (versions up to and including 1.4.1.post1, fixed in 1.5.0). The `stop_words_` attribute unexpectedly stores every token seen in the training data rather than only true stop words, so serialized models can leak sensitive content from the training corpus, such as credentials or PII.
Is CVE-2024-5206 actively exploited?
Proof-of-concept exploit code is publicly available for CVE-2024-5206, increasing the risk of exploitation.
How to fix CVE-2024-5206?
1. PATCH: Upgrade scikit-learn to >=1.5.0 across all environments (`pip install 'scikit-learn>=1.5.0'`).
2. AUDIT: Inspect existing serialized models: load each one and compare `len(vectorizer.stop_words_)` against the expected stop-word count. Any count significantly higher than your stop-word list indicates leaked tokens.
3. RETRAIN: Retrain and replace affected models after upgrading. Do not assume that patching the library fixes already-serialized models.
4. SCAN ARTIFACTS: Audit model registries and artifact stores for TfidfVectorizer models trained pre-1.5.0; treat them as potentially containing sensitive training data.
5. ACCESS CONTROL: Restrict access to model artifact stores; apply least-privilege to model download endpoints.
6. DETECT: Add model scanning to CI/CD; flag serialized sklearn vectorizers with anomalously large `stop_words_` sets.
What systems are affected by CVE-2024-5206?
This vulnerability affects the following AI/ML architecture patterns: NLP training pipelines, text classification systems, document processing pipelines, model serving (serialized artifacts), MLOps artifact registries.
What is the CVSS score for CVE-2024-5206?
CVE-2024-5206 has a CVSS v3.1 base score of 4.7 (MEDIUM). The EPSS exploitation probability is 0.04%.
Technical Details
NVD Description
A sensitive data leakage vulnerability was identified in scikit-learn's TfidfVectorizer, specifically in versions up to and including 1.4.1.post1, which was fixed in version 1.5.0. The vulnerability arises from the unexpected storage of all tokens present in the training data within the `stop_words_` attribute, rather than only storing the subset of tokens required for the TF-IDF technique to function. This behavior leads to the potential leakage of sensitive information, as the `stop_words_` attribute could contain tokens that were meant to be discarded and not stored, such as passwords or keys. The impact of this vulnerability varies based on the nature of the data being processed by the vectorizer.
Exploitation Scenario
An attacker with read access to a model artifact store (misconfigured S3 bucket, shared MLflow registry, or via compromised developer credentials) downloads a serialized TfidfVectorizer trained on internal documents — say, a SIEM alert classifier trained on raw log data containing auth tokens, or an email classifier trained on customer support tickets. The attacker loads the pickle file and iterates vectorizer.stop_words_ to extract a list of tokens from the original training corpus. Depending on the training data, this yields internal hostnames, usernames, API keys embedded in log lines, or PII from customer communications. No ML expertise required — a single Python one-liner extracts the full token list.
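The extraction step described above can be sketched in a few lines of standard-library Python. The artifact path is hypothetical; unpickling a real TfidfVectorizer additionally requires scikit-learn to be importable, but the attacker needs no ML tooling beyond that.

```python
import pickle

def extract_leaked_tokens(path):
    """Attacker's-eye sketch: recover the leaked training vocabulary
    from a serialized vectorizer. Works on any pickle whose payload
    exposes a stop_words_ attribute.
    """
    with open(path, "rb") as fh:
        model = pickle.load(fh)
    return sorted(model.stop_words_)
```

The equivalent one-liner, with a hypothetical artifact name, is `sorted(pickle.load(open("model.pkl", "rb")).stop_words_)`.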
CVSS Vector
CVSS:3.1/AV:L/AC:H/PR:L/UI:N/S:U/C:H/I:N/A:N
Related Vulnerabilities
- CVE-2020-13092 (9.8) — scikit-learn: RCE via malicious joblib model deserialization (same package: scikit-learn)
- CVE-2020-28975 (7.5) — scikit-learn: DoS via crafted SVM model deserialization (same package: scikit-learn)
- CVE-2025-53767 (10.0) — Azure OpenAI: SSRF EoP, no auth required (same attack type: Data Extraction)
- CVE-2025-2828 (10.0) — LangChain RequestsToolkit: SSRF exposes cloud metadata (same attack type: Data Extraction)
- CVE-2023-3765 (10.0) — MLflow: path traversal allows arbitrary file read (same attack type: Data Leakage)