CVE-2021-37653: TensorFlow DoS via divide-by-zero

CISO Take

A local attacker with low privileges can crash a TensorFlow process by triggering a divide-by-zero, which surfaces as a floating-point exception, in the ResourceGather operation. Unpatched versions have no built-in protection. Upgrade to TF 2.6.0, or to one of the patched maintenance releases: 2.5.1, 2.4.3, or 2.3.4. Priority is moderate for isolated deployments but elevated in shared ML infrastructure where multiple users or workloads share TF processes.

What is the risk?

Medium risk overall, but context-dependent. Exploitation requires local access with only low privileges, which limits the remote attack surface considerably. In shared ML environments (JupyterHub clusters, multi-tenant training platforms, containerized inference pods serving multiple clients) the effective risk is significantly higher. The crash is deterministic and trivially reproducible, so targeted availability attacks are straightforward once minimal access is obtained. There is no confidentiality or integrity impact, but sustained availability disruption can cause training job loss and inference downtime.

What systems are affected?

Package	Ecosystem	Vulnerable Range	Patched
TensorFlow	pip	< 2.3.4; 2.4.0 to 2.4.2; 2.5.0	2.3.4, 2.4.3, 2.5.1, 2.6.0

Do you use TensorFlow? You are affected if you run a version older than the patched releases (2.3.4, 2.4.3, 2.5.1, 2.6.0).

How severe is it?

CVSS 3.1

5.5 / 10

EPSS

0.2%

chance of exploitation in 30 days

Higher than 5% of all CVEs

Source: EPSS v3 — FIRST.org

Exploitation Status

No known exploitation

Sophistication

Trivial

What is the attack surface?

AV Local

AC Low

PR Low

UI None

S Unchanged

C None

I None

A High
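
The 5.5 base score can be reproduced from the metrics listed above. The following is a minimal sketch, assuming the standard CVSS v3.1 metric weights and base-score equations published by FIRST.org. The function names are illustrative and not part of any library.

```python
import math

# CVSS v3.1 weights for AV:L / AC:L / PR:L / UI:N / S:U / C:N / I:N / A:H.
AV_LOCAL = 0.55
AC_LOW = 0.77
PR_LOW_SCOPE_UNCHANGED = 0.62
UI_NONE = 0.85
C_NONE, I_NONE, A_HIGH = 0.0, 0.0, 0.56


def roundup(value: float) -> float:
    """Round up to one decimal place, following the CVSS v3.1 Roundup rule."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged() -> float:
    """Base score for the vector above (scope unchanged)."""
    iss = 1 - (1 - C_NONE) * (1 - I_NONE) * (1 - A_HIGH)
    impact = 6.42 * iss
    exploitability = 8.22 * AV_LOCAL * AC_LOW * PR_LOW_SCOPE_UNCHANGED * UI_NONE
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score_scope_unchanged())  # 5.5
```

Only the availability metric contributes to the impact term here, which is why the score stays in the medium band despite the low attack complexity.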

What should I do?

5 steps

PATCH

Upgrade to TensorFlow ≥2.6.0, or to a patched maintenance release (2.5.1, 2.4.3, or 2.3.4). All of these include the fix from commit ac117ee8a8ea57b73d34665cdf00ef3303bc0b11.

WORKAROUND

If immediate patching is not possible, reject inputs that would make batch_size equal to 0 before any ResourceGather call in custom or user-submitted operations.

DETECTION

Monitor ML serving and training logs for unexpected SIGFPE signals or TF process crashes; anomalous crash spikes may indicate exploitation attempts.

ISOLATION

In multi-tenant environments, run TF workloads in per-user or per-tenant isolated containers or VMs to limit the blast radius of a triggered crash.

INVENTORY

Audit all internal services and pipelines running TensorFlow to identify unpatched versions.
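
For the inventory step, a version check along the following lines may help. It is a sketch only: `is_patched` and `FIRST_FIXED_PATCH` are hypothetical names, the fixed releases are taken from the advisory text, and treating branches older than 2.3 as unpatched is an assumption, since the advisory lists no fix for them.

```python
import re

# First fixed patch level per release branch, as named in the advisory.
FIRST_FIXED_PATCH = {(2, 3): 4, (2, 4): 3, (2, 5): 1}


def is_patched(version: str) -> bool:
    """Return True if a plain X.Y.Z TensorFlow version contains the fix."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        # Pre-release or local suffixes (rc, dev, +cpu) need manual review.
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if (major, minor) >= (2, 6):
        return True
    first_fixed = FIRST_FIXED_PATCH.get((major, minor))
    if first_fixed is None:
        # Assumption: branches older than 2.3 received no fix.
        return False
    return patch >= first_fixed
```

On a host with TensorFlow installed, the input could come from `tf.__version__` or from `importlib.metadata.version("tensorflow")`.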

How is it classified?

DoS, Framework, Inference

AML.T0010.001 - AI Software

AML.T0029 - Denial of AI Service

AML.T0049 - Exploit Public-Facing Application

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act

Art. 17 - Quality management system for high-risk AI

ISO 42001

A.9.3 - AI system performance and availability monitoring

NIST AI RMF

MANAGE 2.2 - Manage residual risk from AI system dependencies

Frequently Asked Questions

What is CVE-2021-37653?

CVE-2021-37653 is a denial-of-service vulnerability in TensorFlow. A local attacker with low privileges can crash a TensorFlow process by triggering a divide-by-zero, which surfaces as a floating-point exception, in the ResourceGather operation. It is fixed in TF 2.6.0 and in the patched maintenance releases 2.5.1, 2.4.3, and 2.3.4. Priority is moderate for isolated deployments but elevated in shared ML infrastructure where multiple users or workloads share TF processes.

Is CVE-2021-37653 actively exploited?

No confirmed active exploitation of CVE-2021-37653 has been reported, but organizations should still patch proactively.

How to fix CVE-2021-37653?

1. PATCH: Upgrade to TensorFlow ≥2.6.0, or to a patched maintenance release (2.5.1, 2.4.3, or 2.3.4). All of these include the fix from commit ac117ee8a8ea57b73d34665cdf00ef3303bc0b11.
2. WORKAROUND: If immediate patching is not possible, reject inputs that would make batch_size equal to 0 before any ResourceGather call in custom or user-submitted operations.
3. DETECTION: Monitor ML serving and training logs for unexpected SIGFPE signals or TF process crashes; anomalous crash spikes may indicate exploitation attempts.
4. ISOLATION: In multi-tenant environments, run TF workloads in per-user or per-tenant isolated containers or VMs to limit the blast radius of a triggered crash.
5. INVENTORY: Audit all internal services and pipelines running TensorFlow to identify unpatched versions.

What systems are affected by CVE-2021-37653?

This vulnerability affects the following AI/ML architecture patterns: training pipelines, model serving, multi-tenant ML platforms.

What is the CVSS score for CVE-2021-37653?

CVE-2021-37653 has a CVSS v3.1 base score of 5.5 (MEDIUM). The EPSS exploitation probability is 0.15%.

What is the AI security impact?

Affected AI Architectures

training pipelines, model serving, multi-tenant ML platforms

MITRE ATLAS Techniques

AML.T0010.001 AI Software

AML.T0029 Denial of AI Service

AML.T0049 Exploit Public-Facing Application

Compliance Controls Affected

EU AI Act: Art. 17

ISO 42001: A.9.3

NIST AI RMF: MANAGE 2.2

What are the technical details?

Original Advisory

TensorFlow is an end-to-end open source platform for machine learning. In affected versions an attacker can trigger a crash via a floating point exception in `tf.raw_ops.ResourceGather`. The [implementation](https://github.com/tensorflow/tensorflow/blob/f24faa153ad31a4b51578f8181d3aaab77a1ddeb/tensorflow/core/kernels/resource_variable_ops.cc#L725-L731) computes the value of a value, `batch_size`, and then divides by it without checking that this value is not 0. We have patched the issue in GitHub commit ac117ee8a8ea57b73d34665cdf00ef3303bc0b11. The fix will be included in TensorFlow 2.6.0. We will also cherrypick this commit on TensorFlow 2.5.1, TensorFlow 2.4.3, and TensorFlow 2.3.4, as these are also affected and still in supported range.
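
The unchecked division described in the advisory can be illustrated with a simplified pure-Python model. This is not TensorFlow's code: the helper names are hypothetical, and the assumption that batch_size is the product of the variable's leading batch_dims dimensions is not spelled out in the advisory text. In Python the unchecked form raises a catchable ZeroDivisionError, whereas the same integer division in the C++ kernel kills the process.

```python
from math import prod


def batch_size_of(params_shape, batch_dims):
    """Product of the leading `batch_dims` dimensions (assumed model)."""
    return prod(params_shape[:batch_dims])


def split_indices_unchecked(num_indices, params_shape, batch_dims):
    # Vulnerable pattern: divide by batch_size without checking it.
    return num_indices // batch_size_of(params_shape, batch_dims)


def split_indices_checked(num_indices, params_shape, batch_dims):
    # Patched pattern: reject a zero divisor with a recoverable error.
    batch_size = batch_size_of(params_shape, batch_dims)
    if batch_size == 0:
        raise ValueError("batch_size is 0; refusing to divide")
    return num_indices // batch_size
```

A shape such as (0, 1) with batch_dims=1 yields a zero batch_size under this model, so the checked variant returns an error to the caller instead of dividing.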

Exploitation Scenario

An attacker with low-privilege access to a shared ML training cluster (for example, a data scientist account on a shared JupyterHub or a compromised service account) submits a crafted TensorFlow operation invoking tf.raw_ops.ResourceGather with a tensor configuration that produces a zero batch_size. The integer division by zero raises a floating-point exception (SIGFPE) that immediately crashes the TensorFlow process. On a shared inference server, this brings down active model endpoints serving other users. On a training cluster, it terminates in-progress training jobs, potentially causing hours of lost compute. The attack requires no ML expertise, only knowledge of the TF op API, and leaves minimal forensic evidence beyond a crash log.
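
Because the crash surfaces as SIGFPE, a job supervisor can flag it from the worker's exit status. The sketch below assumes workers are launched as child processes. `died_from_sigfpe` is a hypothetical helper, and the 128 + signal convention applies when the status is reported by a shell rather than by Python's subprocess module.

```python
import signal

SIGFPE_NUMBER = int(signal.SIGFPE)


def died_from_sigfpe(returncode: int) -> bool:
    """True if an exit status indicates termination by SIGFPE.

    On POSIX, subprocess reports death by signal N as return code -N;
    shells conventionally report it as 128 + N.
    """
    return returncode in (-SIGFPE_NUMBER, 128 + SIGFPE_NUMBER)
```

The value to test would typically be the `returncode` attribute of the object returned by `subprocess.run`. Counting matches per user or per job over time is one way to spot the crash spikes mentioned under detection.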