CVE-2021-41218: TensorFlow AllToAll DoS

CISO Take

A local attacker with low privileges can crash TensorFlow processes by passing split_count=0 to AllToAll, causing an unhandled division by zero in shape inference. Patch to TF 2.7.0 / 2.6.1 / 2.5.2 / 2.4.4 immediately if running distributed training or serving workloads. No workaround exists beyond input validation at the application layer.

What is the risk?

Medium severity in isolation, but context elevates risk for AI/ML environments. AllToAll is a collective communication primitive used in distributed training, so a crash can propagate across the participating nodes of a training job, potentially taking down an entire GPU cluster mid-run and destroying in-progress training state. The local access requirement limits the external attack surface, but insider threat, compromised notebooks, or shared multi-tenant ML platforms (Kubeflow, SageMaker multi-user) make this realistic. EPSS is low (about 0.1%), and the CVSS score of 5.5 underrepresents the operational cost of crashing long-running training jobs.

What systems are affected?

Package	Ecosystem	Vulnerable Range	Patched
TensorFlow	pip	before 2.4.4; 2.5.x before 2.5.2; 2.6.x before 2.6.1	2.4.4, 2.5.2, 2.6.1, 2.7.0

Do you use TensorFlow? You're affected if your release predates the fix on its release line (2.4.4, 2.5.2, 2.6.1, or 2.7.0).
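To check an installed version against the fixed releases named in the advisory, a minimal sketch along these lines may help. The helper names (`parse_version`, `is_patched`) are hypothetical, and the sketch assumes plain `major.minor.patch` version strings; pre-release or local suffixes are rejected rather than guessed at.

```python
import re

# Fixed patch level per release line, as named in the advisory.
FIXED_PATCH_LEVEL = {(2, 4): 4, (2, 5): 2, (2, 6): 1}
# From this release line onward, every release contains the fix.
FIRST_FIXED_LINE = (2, 7)


def parse_version(version: str) -> tuple:
    """Parse 'major.minor.patch'; raise ValueError for anything else."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_patched(version: str) -> bool:
    """Return True if the version includes the fix for CVE-2021-41218."""
    major, minor, patch = parse_version(version)
    if (major, minor) >= FIRST_FIXED_LINE:
        return True
    fixed = FIXED_PATCH_LEVEL.get((major, minor))
    # Release lines older than 2.4 never received a fix.
    return fixed is not None and patch >= fixed
```

In practice the input would be the installed version string, for example `tf.__version__`.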

How severe is it?

CVSS 3.1

5.5 / 10

EPSS

0.1%

chance of exploitation in 30 days

Higher than 3% of all CVEs

Source: EPSS v3 — FIRST.org

Exploitation Status

No known exploitation

Sophistication

Trivial

What is the attack surface?

AV Local

AC Low

PR Low

UI None

S Unchanged

C None

I None

A High
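These metrics correspond to the vector CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H. As a worked check, the sketch below applies the CVSS 3.1 base-score formula for the scope-unchanged case and reproduces 5.5. The function and constant names are illustrative; the numeric weights are those published in the CVSS 3.1 specification.

```python
import math

# Metric weights from the CVSS 3.1 specification (scope unchanged).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR = {"N": 0.85, "L": 0.62, "H": 0.27}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the CVSS 3.1 spec."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av, ac, pr, ui, c, i, a) -> float:
    """Compute the CVSS 3.1 base score when Scope is Unchanged."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


# AV:L / AC:L / PR:L / UI:N / C:N / I:N / A:H
score = base_score_scope_unchanged("L", "L", "L", "N", "N", "N", "H")
print(score)  # 5.5
```

Here the impact sub-score is 6.42 × 0.56 ≈ 3.60 and the exploitability sub-score is 8.22 × 0.55 × 0.77 × 0.62 × 0.85 ≈ 1.83, which sum to about 5.43 and round up to 5.5.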

What should I do?

5 steps

PATCH

Upgrade to TensorFlow 2.7.0, 2.6.1, 2.5.2, or 2.4.4. Cherry-pick commit a8ad3e5e79c75f36edb81e0ba3f3c0c5442aeddc if pinned to an older release.

VALIDATE INPUT

Add application-level guards asserting split_count >= 1 before calling AllToAll.

SANDBOX

Run training jobs in isolated containers with resource limits to contain crash blast radius.

DETECT

Monitor for unexpected TF process terminations or SIGFPE/SIGABRT signals in training infrastructure.

MULTI-TENANT

On shared ML platforms, enforce input schema validation and restrict custom op execution.
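The input-validation step could be sketched as a thin wrapper that checks split_count before delegating to the op. The wrapper name `guarded_all_to_all` is hypothetical, and the op is passed in as a callable so the sketch runs without TensorFlow installed; it is one possible guard, not part of TensorFlow's API.

```python
from typing import Any, Callable


def guarded_all_to_all(op: Callable[..., Any], *, split_count: int, **kwargs: Any) -> Any:
    """Validate split_count, then call an AllToAll-style op with keyword args."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(split_count, bool) or not isinstance(split_count, int):
        raise TypeError(f"split_count must be an int, got {type(split_count).__name__}")
    if split_count < 1:
        # split_count == 0 is the value that triggers the division by zero.
        raise ValueError(f"split_count must be >= 1, got {split_count}")
    return op(split_count=split_count, **kwargs)
```

A call might then look like `guarded_all_to_all(tf.raw_ops.AllToAll, input=x, group_assignment=groups, concat_dimension=0, split_dimension=1, split_count=n)`, where `x`, `groups`, and `n` stand in for application values. The guard only covers call sites that go through the wrapper, so it complements patching rather than replacing it.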

How is it classified?

DoS, Framework, Training Data

AML.T0010.001 - AI Software

AML.T0029 - Denial of AI Service

AML.T0034 - Cost Harvesting

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act

Article 15 - Accuracy, robustness and cybersecurity

ISO 42001

8.4 - AI system robustness and reliability

NIST AI RMF

MANAGE-2.2 - Mechanisms are in place and applied to sustain the value of deployed AI systems

OWASP LLM Top 10

LLM06 - Sensitive Information Disclosure / Insecure Design

Frequently Asked Questions

What is CVE-2021-41218?

CVE-2021-41218 is a denial-of-service vulnerability in TensorFlow's shape inference for the AllToAll op. A local attacker with low privileges can crash TensorFlow processes by passing split_count=0, which triggers an unhandled division by zero. It is fixed in TensorFlow 2.7.0, 2.6.1, 2.5.2, and 2.4.4; until you can patch, the only workaround is input validation at the application layer.

Is CVE-2021-41218 actively exploited?

No confirmed active exploitation of CVE-2021-41218 has been reported, but organizations should still patch proactively.

How to fix CVE-2021-41218?

1. PATCH: Upgrade to TensorFlow 2.7.0, 2.6.1, 2.5.2, or 2.4.4. Cherry-pick commit a8ad3e5e79c75f36edb81e0ba3f3c0c5442aeddc if pinned to an older release.
2. VALIDATE INPUT: Add application-level guards asserting split_count >= 1 before calling AllToAll.
3. SANDBOX: Run training jobs in isolated containers with resource limits to contain crash blast radius.
4. DETECT: Monitor for unexpected TF process terminations or SIGFPE/SIGABRT signals in training infrastructure.
5. MULTI-TENANT: On shared ML platforms, enforce input schema validation and restrict custom op execution.
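For the detection step, one simple approach is to have a job launcher inspect the exit status of each worker process. The sketch below assumes the launcher starts workers with Python's `subprocess` module, which on POSIX reports a child killed by signal N as return code -N. The helper names are hypothetical.

```python
import signal
import subprocess
import sys

# Signals named in the detection guidance above.
WATCHED_SIGNALS = {signal.SIGFPE, signal.SIGABRT}


def crash_signal(returncode: int):
    """Return the signal name if the child died from a watched signal, else None."""
    if returncode >= 0:
        return None  # normal exit (zero or non-zero status)
    try:
        sig = signal.Signals(-returncode)
    except ValueError:
        return None  # not a signal number known on this platform
    return sig.name if sig in WATCHED_SIGNALS else None


def run_and_report(cmd):
    """Run a worker command and log to stderr if it died from a watched signal."""
    proc = subprocess.run(cmd)
    name = crash_signal(proc.returncode)
    if name is not None:
        print(f"worker {cmd!r} terminated by {name}", file=sys.stderr)
    return name
```

If workers are started through a shell wrapper instead, the shell typically reports such a death as exit status 128 + N, so the check would need adjusting.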

What systems are affected by CVE-2021-41218?

This vulnerability affects the following AI/ML architecture patterns: distributed training pipelines, model training infrastructure, shared ML platforms.

What is the CVSS score for CVE-2021-41218?

CVE-2021-41218 has a CVSS v3.1 base score of 5.5 (MEDIUM). The EPSS exploitation probability is 0.13%.

What is the AI security impact?

Affected AI Architectures

distributed training pipelines, model training infrastructure, shared ML platforms

MITRE ATLAS Techniques

AML.T0010.001 AI Software

AML.T0029 Denial of AI Service

AML.T0034 Cost Harvesting

Compliance Controls Affected

EU AI Act: Article 15

ISO 42001: 8.4

NIST AI RMF: MANAGE-2.2

OWASP LLM Top 10: LLM06

What are the technical details?

Original Advisory

TensorFlow is an open source platform for machine learning. In affected versions the shape inference code for `AllToAll` can be made to execute a division by 0. This occurs whenever the `split_count` argument is 0. The fix will be included in TensorFlow 2.7.0. We will also cherrypick this commit on TensorFlow 2.6.1, TensorFlow 2.5.2, and TensorFlow 2.4.4, as these are also affected and still in supported range.
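To illustrate the failure mode, here is a simplified Python model of the kind of shape arithmetic involved. It is not TensorFlow's actual C++ shape function, and the function name and `validate` flag are hypothetical. It assumes the output shape is derived by dividing the split dimension by split_count and multiplying the concat dimension by it.

```python
from typing import List, Sequence


def all_to_all_output_shape(
    input_shape: Sequence[int],
    concat_dimension: int,
    split_dimension: int,
    split_count: int,
    validate: bool = True,
) -> List[int]:
    """Simplified model of AllToAll output-shape computation."""
    if validate and split_count < 1:
        # The kind of check that prevents the division by zero.
        raise ValueError("split_count must be at least 1")
    out = list(input_shape)
    # Without the check above, split_count == 0 reaches this division.
    out[split_dimension] //= split_count
    out[concat_dimension] *= split_count
    return out


# Input [1, 2] split in two along dimension 1 and concatenated along dimension 0.
print(all_to_all_output_shape([1, 2], concat_dimension=0, split_dimension=1, split_count=2))
# [2, 1]
```

With `validate=False` and `split_count=0`, Python raises ZeroDivisionError. In native code an integer division by zero is not a catchable exception and typically terminates the process, which is why the advisory treats it as a crash.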

Exploitation Scenario

A malicious insider, or an attacker using a compromised data scientist account, on a shared ML platform (e.g., Kubeflow, JupyterHub) submits a training script that calls tf.raw_ops.AllToAll with split_count=0. TensorFlow's shape inference executes a division by zero, crashing the TF process. In a distributed training job across 64 GPUs, this can terminate all workers, discarding training progress made since the last checkpoint and wasting significant compute budget. On a shared cluster, the attacker could repeat this to deny GPU resources to other teams or disrupt production model retraining schedules.