CVE-2021-41220: TensorFlow: use-after-free in async collective ops

HIGH PoC AVAILABLE
Published November 5, 2021
CISO Take

TensorFlow's distributed training operations contain a use-after-free and memory leak in CollectiveReduceV2, exploitable locally with low privileges (CVSS 7.8). Any org running multi-GPU or multi-node TensorFlow training workloads on TF 2.6.0 should patch immediately to 2.6.1 or 2.7.0. Training infrastructure is high-value — a compromised training node enables model poisoning, data exfiltration, or lateral movement within ML pipelines.

What is the risk?

High severity (CVSS 7.8), but the local attack vector limits exposure to environments where untrusted users share TensorFlow compute resources, such as multi-tenant GPU clusters, JupyterHub environments, or shared ML training infrastructure. Use-after-free vulnerabilities can often be turned into arbitrary code execution by skilled attackers; the low attack complexity and the absence of any user-interaction requirement amplify the risk once local access exists. Organizations with shared ML compute (academic clusters, cloud ML notebooks with multi-tenancy) face the highest exposure.

What systems are affected?

Package | Ecosystem | Vulnerable Range | Patched
TensorFlow | pip | 2.6.0 | 2.6.1, 2.7.0

Do you use TensorFlow 2.6.0? You're affected.

How severe is it?

CVSS 3.1: 7.8 / 10
EPSS: 0.2% chance of exploitation in 30 days (higher than 10% of all CVEs)
Exploitation Status: Exploit Available
Exploitation: MEDIUM
Sophistication: Moderate
Exploitation Confidence: medium
Public PoC indexed (trickest/cve)

Composite signal derived from CISA KEV, VulnCheck KEV, CISA SSVC, EPSS, Metasploit, Exploit-DB, trickest/cve, Nuclei templates, and inthewild.io exploitation reports.

What is the attack surface?

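Attack Vector (AV): Local
Attack Complexity (AC): Low
Privileges Required (PR): Low
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): High
Integrity (I): High
Availability (A): High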

What should I do?

  1. Patch: upgrade to TensorFlow 2.7.0 or 2.6.1 (backport available). Verify with `pip show tensorflow | grep Version`.

  2. Isolate training workloads: enforce one-job-per-node policies on shared compute; avoid multi-tenant GPU clusters until patched.

  3. Detect: monitor for anomalous process crashes or memory faults in TF training jobs (SIGABRT, SIGSEGV from the TF runtime).

  4. Audit exposure: identify all internal services running TF 2.6.0 in distributed mode — check CI/CD pipelines, MLOps platforms (Kubeflow, Vertex, SageMaker custom containers), and Jupyter environments.

  5. Enforce image pinning in container-based ML pipelines to prevent accidental rollback to vulnerable versions.
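The version check in step 1 can also be scripted so it runs inside each container or notebook kernel. The following is a minimal sketch, assuming (per this advisory) that 2.6.0 is the only affected release; the function names and the list of distribution names are illustrative assumptions, not part of any TensorFlow tooling.

```python
import re
from importlib import metadata

# Assumption taken from this advisory: 2.6.0 is the only affected release;
# 2.6.1 and 2.7.0 carry the fix.
AFFECTED_RELEASE = (2, 6, 0)

# Distribution names to look for (assumed list; extend for custom builds).
DISTRIBUTIONS = ("tensorflow", "tensorflow-gpu", "tensorflow-cpu")


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release, e.g. '2.6.0rc1' -> (2, 6, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_affected(version: str) -> bool:
    """True when the version belongs to the 2.6.0 release.

    Pre-releases such as 2.6.0rc1 are flagged too; that is a conservative
    assumption of this sketch, not a statement from the advisory.
    """
    return release_tuple(version) == AFFECTED_RELEASE


def installed_versions() -> dict:
    """Map each installed TensorFlow distribution to its version string."""
    found = {}
    for name in DISTRIBUTIONS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        status = "AFFECTED" if is_affected(version) else "ok"
        print(f"{name} {version}: {status}")
```

Reading the version from package metadata avoids importing TensorFlow itself, so the check stays cheap and does not load the affected runtime.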

How is it classified?

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
Article 15 - Accuracy, robustness and cybersecurity
ISO 42001
A.6.2.6 - AI system security
NIST AI RMF
MANAGE 2.2 - Mechanisms for managing AI risks
OWASP LLM Top 10
LLM05:2025 - Supply Chain Vulnerabilities

Frequently Asked Questions

What is CVE-2021-41220?

CVE-2021-41220 is a use-after-free and memory leak in the async implementation of TensorFlow's `CollectiveReduceV2` op, which is part of its distributed training operations. It is exploitable locally with low privileges (CVSS 7.8) and affects TensorFlow 2.6.0; the fix ships in 2.6.1 and 2.7.0.

Is CVE-2021-41220 actively exploited?

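Proof-of-concept exploit code for CVE-2021-41220 is publicly available (indexed in trickest/cve), which increases the risk of exploitation. No confirmed in-the-wild exploitation is recorded here, and EPSS puts the probability of exploitation within 30 days at 0.2%.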

How to fix CVE-2021-41220?

  1. Patch: upgrade to TensorFlow 2.7.0 or 2.6.1 (backport available). Verify with `pip show tensorflow | grep Version`.

  2. Isolate training workloads: enforce one-job-per-node policies on shared compute; avoid multi-tenant GPU clusters until patched.

  3. Detect: monitor for anomalous process crashes or memory faults in TF training jobs (SIGABRT, SIGSEGV from the TF runtime).

  4. Audit exposure: identify all internal services running TF 2.6.0 in distributed mode; check CI/CD pipelines, MLOps platforms (Kubeflow, Vertex, SageMaker custom containers), and Jupyter environments.

  5. Enforce image pinning in container-based ML pipelines to prevent accidental rollback to vulnerable versions.
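The detection step can be approximated by a thin wrapper around the training command. This is a sketch under stated assumptions: it relies on the POSIX convention that Python's `subprocess` reports death-by-signal as a negative return code, and `run_and_flag` is a hypothetical helper name, not an existing tool. A crash with one of these signals is only a hint worth investigating, not proof of exploitation.

```python
import signal
import subprocess
import sys

# Signals named in the detection step.
WATCHED = {signal.SIGSEGV, signal.SIGABRT}


def fatal_signal(returncode: int):
    """Return the signal name if the exit was caused by a watched signal, else None.

    On POSIX, subprocess encodes "killed by signal N" as return code -N.
    """
    if returncode >= 0:
        return None
    try:
        sig = signal.Signals(-returncode)
    except ValueError:
        return None
    return sig.name if sig in WATCHED else None


def run_and_flag(cmd: list) -> int:
    """Run a training command (hypothetical wrapper) and report memory-fault exits."""
    result = subprocess.run(cmd)
    name = fatal_signal(result.returncode)
    if name is not None:
        # Replace the print with whatever alerting channel the cluster uses.
        print(f"ALERT: {cmd[0]} terminated by {name}", file=sys.stderr)
    return result.returncode
```

For example, `run_and_flag([sys.executable, "train.py"])` would run a hypothetical `train.py` and write an alert to stderr if the process died from SIGSEGV or SIGABRT.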

What systems are affected by CVE-2021-41220?

This vulnerability affects the following AI/ML architecture patterns: distributed training pipelines, multi-GPU training infrastructure, MLOps platforms (Kubeflow, Vertex AI, SageMaker custom containers), shared Jupyter/notebook environments, model training pipelines.
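One way to locate pipelines that still pin the vulnerable release is to scan their requirements files. The sketch below is illustrative only: it assumes pip-style `==` pins and the distribution names `tensorflow`, `tensorflow-gpu`, and `tensorflow-cpu`, and it does not evaluate range specifiers such as `>=2.6`, which would need a proper resolver.

```python
import re

# Matches exact pins such as "tensorflow==2.6.0" or "tensorflow-gpu == 2.6.0".
# Commented-out lines and unrelated packages (e.g. tensorflow-estimator) do not match.
PIN = re.compile(
    r"^\s*(tensorflow(?:-gpu|-cpu)?)\s*==\s*([0-9][^\s;#]*)",
    re.IGNORECASE,
)

# Assumption from this advisory: 2.6.0 is the only affected release.
VULNERABLE_VERSION = "2.6.0"


def vulnerable_pins(requirements_text: str) -> list:
    """Return (line_number, package, version) for every exact pin to 2.6.0."""
    hits = []
    for number, line in enumerate(requirements_text.splitlines(), start=1):
        match = PIN.match(line)
        if match and match.group(2) == VULNERABLE_VERSION:
            hits.append((number, match.group(1).lower(), match.group(2)))
    return hits
```

Passing the contents of each `requirements.txt` found in a repository or container image to `vulnerable_pins` gives a quick list of lines to update to 2.6.1 or 2.7.0.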

What is the CVSS score for CVE-2021-41220?

CVE-2021-41220 has a CVSS v3.1 base score of 7.8 (HIGH). The EPSS exploitation probability is 0.20%.

What is the AI security impact?

Affected AI Architectures

  • distributed training pipelines
  • multi-GPU training infrastructure
  • MLOps platforms (Kubeflow, Vertex AI, SageMaker custom containers)
  • shared Jupyter/notebook environments
  • model training pipelines

MITRE ATLAS Techniques

AML.T0010.001 AI Software
AML.T0018.000 Poison AI Model
AML.T0020 Poison Training Data

Compliance Controls Affected

EU AI Act: Article 15
ISO 42001: A.6.2.6
NIST AI RMF: MANAGE 2.2
OWASP LLM Top 10: LLM05:2025

What are the technical details?

Original Advisory

TensorFlow is an open source platform for machine learning. In affected versions the async implementation of `CollectiveReduceV2` suffers from a memory leak and a use after free. This occurs due to the asynchronous computation and the fact that objects that have been `std::move()`d from are still accessed. The fix will be included in TensorFlow 2.7.0. We will also cherrypick this commit on TensorFlow 2.6.1, as this version is the only one that is also affected.

Exploitation Scenario

An attacker with low-privileged access to a shared GPU training cluster (e.g., a compromised ML engineer account or a rogue training job submitted via an MLOps pipeline) launches a specially crafted distributed training job that triggers the async CollectiveReduceV2 code path. The std::move() misuse causes the runtime to access freed memory, which the attacker controls via heap shaping to redirect execution. With code execution on the training node, the attacker can inject malicious gradient updates to poison the model under training, exfiltrate proprietary training data in-flight, or install persistence on the ML infrastructure. In Kubernetes-based ML platforms (Kubeflow, Argo Workflows), this could mean escaping the training pod boundary depending on cluster configuration.

Weaknesses (CWE)

CWE-416 — Use After Free: The product reuses or references memory after it has been freed. At some point afterward, the memory may be allocated again and saved in another pointer, while the original pointer references a location somewhere within the new allocation. Any operations using the original pointer are no longer valid because the memory "belongs" to the code that operates on the new pointer.

  • [Architecture and Design] Choose a language that provides automatic memory management.
  • [Implementation] When freeing pointers, be sure to set them to NULL once they are freed. However, the utilization of multiple or complex data structures may lower the usefulness of this strategy.

Source: MITRE CWE corpus.

CVSS Vector

CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H
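The 7.8 base score can be reproduced from this vector. The sketch below follows the base-score equations and metric weights of the CVSS v3.1 specification; it covers base metrics only (no temporal or environmental metrics), and the function names are this example's own.

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value: float) -> float:
    """CVSS v3.1 Roundup: smallest number with one decimal that is >= value."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the CVSS v3.1 base score for a base-metric vector string."""
    prefix, *pairs = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    m = dict(pair.split(":") for pair in pairs)

    changed = m["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]

    # Impact sub-score.
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss

    # Exploitability sub-score.
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]

    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))


print(base_score("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H"))  # 7.8
```

For this vector the impact sub-score is about 5.87 and the exploitability sub-score about 1.83; their sum of roughly 7.71 rounds up to 7.8.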

Timeline

Published: November 5, 2021
Last Modified: November 21, 2024
First Seen: November 5, 2021

Related Vulnerabilities