CVE-2021-41203: TensorFlow: malformed checkpoint triggers overflow/crash

HIGH PoC AVAILABLE
Published November 5, 2021
CISO Take

Attackers who can modify TensorFlow checkpoint files on disk can crash training or inference processes via integer overflows and undefined behavior. Patch to TF 2.7.0 / 2.6.1 / 2.5.2 / 2.4.4 immediately — any shared storage or model registry accessible to low-privileged users is a viable attack path. Treat checkpoint files as untrusted inputs and enforce integrity checks (checksums, access controls) before loading.

What is the risk?

CVSS 7.8 High with local attack vector and low complexity/privileges. Risk is elevated in MLOps environments with shared storage (NFS, S3, NAS) where checkpoints are written by one process and loaded by another — a compromised low-privilege account or insider threat can trigger crashes or undefined behavior across the ML stack. Not in CISA KEV and no known active exploitation, but the attack primitive (craft malicious file → crash ML process) is trivial once filesystem access is obtained.

What systems are affected?

Package      Ecosystem   Vulnerable Range   Patched
TensorFlow   pip         (not listed)       2.4.4, 2.5.2, 2.6.1, 2.7.0
195.8K · OpenSSF 7.1 · 3.7K dependents · 4% patched · ~1372d to patch

If you run a TensorFlow release older than 2.7.0, 2.6.1, 2.5.2, or 2.4.4 (on its respective branch), you are affected.
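A quick way to triage an inventory is to compare each installed version string against the patched releases named in the advisory. The sketch below is illustrative only: the helper names `parse_version` and `is_affected` are hypothetical, and it assumes that branches older than 2.4 received no fix and that pre-release suffixes (such as `rc1`) can be ignored.

```python
from __future__ import annotations

import re

# First fixed patch level per minor branch, taken from the advisory text.
FIRST_FIXED_PATCH = {(2, 4): 4, (2, 5): 2, (2, 6): 1}
# 2.7.0 and every later branch ship with the fix.
FIRST_FIXED_BRANCH = (2, 7)


def parse_version(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '2.6.0' or '2.7.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_affected(version: str) -> bool:
    """Return True if the version predates the patched release on its branch."""
    major, minor, patch = parse_version(version)
    if (major, minor) >= FIRST_FIXED_BRANCH:
        return False
    fixed_patch = FIRST_FIXED_PATCH.get((major, minor))
    if fixed_patch is None:
        # Assumption: branches older than 2.4 were out of support and never patched.
        return True
    return patch < fixed_patch
```

On a host with TensorFlow installed, the check would be `is_affected(tf.__version__)` after `import tensorflow as tf`.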

How severe is it?

CVSS 3.1: 7.8 / 10
EPSS: 0.2% chance of exploitation in 30 days (higher than 8% of all CVEs)
Exploitation Status: Exploit Available
Exploitation: MEDIUM
Sophistication: Moderate
Exploitation Confidence: medium (public PoC indexed in trickest/cve)
Composite signal derived from CISA KEV, VulnCheck KEV, CISA SSVC, EPSS, Metasploit, Exploit-DB, trickest/cve, Nuclei templates, and inthewild.io exploitation reports.

What is the attack surface?

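Attack Vector (AV): Local
Attack Complexity (AC): Low
Privileges Required (PR): Low
User Interaction (UI): None
Scope (S): Unchanged
Confidentiality (C): High
Integrity (I): High
Availability (A): High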

What should I do?

  1. Patch: Upgrade to TensorFlow 2.7.0, 2.6.1, 2.5.2, or 2.4.4 immediately.

  2. Restrict filesystem permissions: checkpoint directories should be writable only by the process that creates them; separate write/read service accounts.

  3. Integrity verification: implement SHA-256 checksums on checkpoint files and validate before loading — reject any checkpoint that fails verification.

  4. Immutable storage: use write-once/append-only storage policies for checkpoint artifacts in production.

  5. Detection: monitor for unexpected process crashes (segfaults, OOM) in TF training/inference workloads — repeated crashes against checkpoint-loading paths may indicate active exploitation.

  6. Audit: inventory all systems running unpatched TF versions, prioritize those with shared checkpoint storage.
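The integrity-verification step above can be sketched as a SHA-256 manifest that is written when a checkpoint is produced and checked before it is loaded. This is a minimal sketch, not a TensorFlow feature: the function names and the JSON manifest format are assumptions, and the manifest is assumed to live outside the checkpoint directory, in a location the attacker cannot write to.

```python
from __future__ import annotations

import hashlib
import hmac
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoint shards do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(checkpoint_dir: Path) -> dict[str, Path]:
    """Map each file's relative POSIX-style name to its path."""
    return {
        p.relative_to(checkpoint_dir).as_posix(): p
        for p in sorted(checkpoint_dir.rglob("*"))
        if p.is_file()
    }


def write_manifest(checkpoint_dir: Path, manifest_path: Path) -> None:
    """Record a digest for every file; run this right after the checkpoint is saved."""
    entries = {name: sha256_of(p) for name, p in list_files(checkpoint_dir).items()}
    manifest_path.write_text(json.dumps(entries, indent=2, sort_keys=True))


def verify_manifest(checkpoint_dir: Path, manifest_path: Path) -> list[str]:
    """Return a list of problems; an empty list means the checkpoint matches."""
    expected = json.loads(manifest_path.read_text())
    actual = list_files(checkpoint_dir)
    problems = []
    for name, digest in expected.items():
        if name not in actual:
            problems.append(f"missing: {name}")
        elif not hmac.compare_digest(sha256_of(actual[name]), digest):
            problems.append(f"digest mismatch: {name}")
    for name in sorted(set(actual) - set(expected)):
        problems.append(f"unexpected file: {name}")
    return problems
```

A loader would call `verify_manifest(...)` first and refuse to restore the checkpoint if the returned list is non-empty. A checksum only helps if the manifest itself is protected; a signed manifest would be a stronger variant.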

How is it classified?

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act
  • Article 15 - Accuracy, robustness and cybersecurity
  • Article 9 - Risk management system
ISO 42001
  • 8.4 - AI system lifecycle: data and model integrity
  • 9.1 - Monitoring, measurement, analysis and evaluation
NIST AI RMF
  • GOVERN 1.1 - Policies, processes, and procedures for AI risk management
  • MANAGE 2.2 - Mechanisms to sustain the value of deployed AI systems

Frequently Asked Questions

What is CVE-2021-41203?

CVE-2021-41203 is a vulnerability in TensorFlow's checkpoint-loading code, which does not validate malformed checkpoint files. An attacker who can modify saved checkpoints from outside TensorFlow can trigger integer overflows, undefined behavior, segfaults, and `CHECK`-fail crashes in training or inference processes. The fixes are in TensorFlow 2.7.0, 2.6.1, 2.5.2, and 2.4.4.

Is CVE-2021-41203 actively exploited?

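No active exploitation is known, and CVE-2021-41203 is not listed in CISA KEV. However, proof-of-concept exploit code is publicly available (indexed in trickest/cve), which increases the risk of exploitation.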

How to fix CVE-2021-41203?

  1. Patch: Upgrade to TensorFlow 2.7.0, 2.6.1, 2.5.2, or 2.4.4 immediately.
  2. Restrict filesystem permissions: checkpoint directories should be writable only by the process that creates them; separate write/read service accounts.
  3. Integrity verification: implement SHA-256 checksums on checkpoint files and validate before loading; reject any checkpoint that fails verification.
  4. Immutable storage: use write-once/append-only storage policies for checkpoint artifacts in production.
  5. Detection: monitor for unexpected process crashes (segfaults, OOM) in TF training/inference workloads; repeated crashes against checkpoint-loading paths may indicate active exploitation.
  6. Audit: inventory all systems running unpatched TF versions, prioritize those with shared checkpoint storage.
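For the filesystem-permissions step, a small audit script can flag checkpoint paths that are writable by more than their owner. The sketch below assumes a POSIX filesystem with classic mode bits (it does not inspect ACLs or object-store policies), and the function name `find_loosely_writable` is hypothetical.

```python
from __future__ import annotations

import stat
from pathlib import Path


def find_loosely_writable(root: Path) -> list[Path]:
    """Return the paths under root (including root) that are group- or world-writable."""
    flagged = []
    for path in [root, *sorted(root.rglob("*"))]:
        mode = path.lstat().st_mode
        if stat.S_ISLNK(mode):
            # Symlink mode bits are not meaningful; the target is checked if it is under root.
            continue
        if mode & (stat.S_IWGRP | stat.S_IWOTH):
            flagged.append(path)
    return flagged
```

Any path returned here is a candidate for tightening, for example to mode `0o750` on directories and `0o640` on files owned by the writing service account.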

What systems are affected by CVE-2021-41203?

This vulnerability affects the following AI/ML architecture patterns: training pipelines, model serving, MLOps CI/CD pipelines, transfer learning workflows, distributed training infrastructure.

What is the CVSS score for CVE-2021-41203?

CVE-2021-41203 has a CVSS v3.1 base score of 7.8 (HIGH). The EPSS exploitation probability is 0.18%.

What is the AI security impact?

Affected AI Architectures

  • training pipelines
  • model serving
  • MLOps CI/CD pipelines
  • transfer learning workflows
  • distributed training infrastructure

MITRE ATLAS Techniques

AML.T0010.001 AI Software
AML.T0011.000 Unsafe AI Artifacts
AML.T0018 Manipulate AI Model
AML.T0076 Corrupt AI Model

Compliance Controls Affected

EU AI Act: Article 15, Article 9
ISO 42001: 8.4, 9.1
NIST AI RMF: GOVERN 1.1, MANAGE 2.2

What are the technical details?

Original Advisory

TensorFlow is an open source platform for machine learning. In affected versions an attacker can trigger undefined behavior, integer overflows, segfaults and `CHECK`-fail crashes if they can change saved checkpoints from outside of TensorFlow. This is because the checkpoints loading infrastructure is missing validation for invalid file formats. The fixes will be included in TensorFlow 2.7.0. We will also cherrypick these commits on TensorFlow 2.6.1, TensorFlow 2.5.2, and TensorFlow 2.4.4, as these are also affected and still in supported range.

Exploitation Scenario

An adversary with low-privilege access to a shared MLOps environment (e.g., a compromised data scientist account, a malicious insider, or a supply chain compromise of a model registry) locates the checkpoint storage directory for a production training or fine-tuning job. They craft a malformed checkpoint file, manipulating file format fields to trigger integer overflow conditions, and replace or inject it into the expected checkpoint path. When the TensorFlow training process resumes from the checkpoint (e.g., in a nightly scheduled training job), it loads the malicious file without validation, triggering undefined behavior, segfaults, or CHECK-fail crashes. In a Kubernetes-based ML training cluster, this could repeatedly crash pods and disrupt model delivery pipelines. In the worst case, the undefined behavior could be exploited for code execution under the training process's service account.
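One way to contain and detect the crash pattern in this scenario is to attempt the checkpoint load in a child process first and inspect how that child exits. The following is a sketch under stated assumptions: it targets POSIX systems, where a negative return code from `subprocess` means the child was killed by a signal (a segfault appears as SIGSEGV, an `abort()` as SIGABRT), and the helper name `run_isolated` is hypothetical. It limits the blast radius of a crash; it does not defend against code execution.

```python
from __future__ import annotations

import signal
import subprocess
import sys


def run_isolated(python_code: str, timeout: float = 300.0) -> dict:
    """Run a snippet in a fresh interpreter and report whether it was killed by a signal."""
    result = subprocess.run(
        [sys.executable, "-c", python_code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    signal_name = None
    if result.returncode < 0:
        # POSIX convention: returncode == -N means "terminated by signal N".
        try:
            signal_name = signal.Signals(-result.returncode).name
        except ValueError:
            signal_name = f"signal {-result.returncode}"
    return {
        "ok": result.returncode == 0,
        "returncode": result.returncode,
        "signal": signal_name,
    }
```

In practice the snippet passed in would be a trial load of the checkpoint path, and a non-`None` `signal` value would be logged and alerted on rather than retried.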

Weaknesses (CWE)

CWE-190 — Integer Overflow or Wraparound: The product performs a calculation that can produce an integer overflow or wraparound when the logic assumes that the resulting value will always be larger than the original value. This occurs when an integer value is incremented to a value that is too large to store in the associated representation. When this occurs, the value may become a very small or negative number.

  • [Requirements] Ensure that all protocols are strictly defined, such that all out-of-bounds behavior can be identified simply, and require strict conformance to the protocol.
  • [Requirements] Use a language that does not allow this weakness to occur or provides constructs that make this weakness easier to avoid. If possible, choose a language or compiler that performs automatic bounds checking.

Source: MITRE CWE corpus.
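To make the weakness concrete, the sketch below simulates what happens when an element count is computed from attacker-controlled dimensions with 64-bit arithmetic that wraps, and contrasts it with a validated computation. It is an illustration only and is not TensorFlow's code: the dimension list stands in for a hypothetical shape field in a file header, and the wraparound mimics typical two's-complement behavior (in C++, signed overflow is formally undefined).

```python
INT64_MAX = 2**63 - 1
UINT64_MASK = 2**64 - 1


def wrapping_product_int64(dims):
    """Multiply as unchecked 64-bit arithmetic typically would: overflow wraps silently."""
    total = 1
    for dim in dims:
        total = (total * dim) & UINT64_MASK  # keep only the low 64 bits
    # Reinterpret the 64-bit pattern as a signed value.
    return total - 2**64 if total > INT64_MAX else total


def checked_product_int64(dims):
    """Multiply with validation: reject negative dimensions and results beyond int64."""
    total = 1
    for dim in dims:
        if dim < 0:
            raise ValueError(f"negative dimension: {dim}")
        total *= dim  # Python integers are arbitrary precision, so this cannot wrap
        if total > INT64_MAX:
            raise OverflowError("element count does not fit in int64")
    return total


# Two dimensions of 2**32 each wrap to an element count of 0.
print(wrapping_product_int64([2**32, 2**32]))  # 0
# 2**62 * 2 wraps to the most negative int64 value.
print(wrapping_product_int64([2**62, 2]))  # -9223372036854775808
```

A wrapped count of zero or a negative number is exactly the kind of value that later feeds an undersized allocation or an out-of-bounds index, which is why the checked variant rejects the input instead.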

CVSS Vector

CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H
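The 7.8 base score can be reproduced from this vector using the CVSS v3.1 base-score equations. The sketch below covers base metrics only (no temporal or environmental metrics); the function names are illustrative, and the weights are the published v3.1 metric values.

```python
import math

WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required is weighted differently when Scope is Changed.
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value):
    """Round up to one decimal place, following the v3.1 integer-based definition."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def parse_vector(vector):
    """Split 'CVSS:3.1/AV:L/...' into a {metric: value} dictionary."""
    prefix, *pairs = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError(f"unsupported vector prefix: {prefix!r}")
    return dict(pair.split(":", 1) for pair in pairs)


def base_score(vector):
    m = parse_vector(vector)
    scope = m["S"]
    iss = 1 - (
        (1 - WEIGHTS["C"][m["C"]])
        * (1 - WEIGHTS["I"][m["I"]])
        * (1 - WEIGHTS["A"][m["A"]])
    )
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = (
        8.22
        * WEIGHTS["AV"][m["AV"]]
        * WEIGHTS["AC"][m["AC"]]
        * PR_WEIGHTS[scope][m["PR"]]
        * WEIGHTS["UI"][m["UI"]]
    )
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope == "C":
        total *= 1.08
    return roundup(min(total, 10))


print(base_score("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H"))  # 7.8
```

For this vector the impact sub-score is about 5.87 and the exploitability sub-score about 1.83; their sum, roughly 7.71, rounds up to 7.8.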

Timeline

Published: November 5, 2021
Last Modified: November 21, 2024
First Seen: November 5, 2021

Related Vulnerabilities