CVE-2021-37648: TensorFlow SaveV2: null ptr deref, local crash/RCE
HIGH
A validation bypass in TensorFlow's SaveV2 kernel allows any local user to trigger a null pointer dereference, crashing training processes or potentially escalating to code execution. Shared ML compute environments (Kubeflow, JupyterHub, on-prem GPU clusters) are the primary exposure surface. Patch to TF 2.5.1, 2.4.3, 2.3.4, or 2.6.0 immediately; the fix is a single cherry-picked commit.
What is the risk?
CVSS 7.8 (High): local attack vector, low attack complexity, low privileges required. The root cause is that OP_REQUIRES, used inside the ValidateInputs helper, sets an error status on the OpKernelContext and returns only from that helper. Execution then continues in the calling Compute function, so the validation has no effect. In single-tenant environments the risk is moderate. In shared ML clusters (multi-user notebook servers, distributed training farms) where low-privilege users can run arbitrary ops, the attack surface is significantly broader and triggering the crash takes little effort.
What systems are affected?
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| TensorFlow | pip | < 2.3.4; 2.4.0 to 2.4.2; 2.5.0 | 2.3.4, 2.4.3, 2.5.1, 2.6.0 |
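If you run a TensorFlow release that predates 2.3.4, 2.4.3, or 2.5.1 in its respective series, or any release before 2.6.0 outside those series, you are affected.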
What should I do?
1) Upgrade to TF 2.6.0, 2.5.1, 2.4.3, or 2.3.4 (commit 9728c60e).
2) If patching is blocked, containerize training workloads in isolated single-user environments to eliminate the lateral movement path.
3) Restrict execution of raw TF ops in shared notebook environments via OPA or admission controllers.
4) Monitor for unexpected process crashes or null-deref signals in training job logs; add alerting on SIGSEGV from TF worker processes.
5) Audit internal model training infrastructure for pinned TF versions and update dependency lock files in CI/CD pipelines.
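The upgrade and audit steps can be partly automated. The following is a minimal sketch, not an official tool: the names `is_vulnerable` and `find_vulnerable_pins` are hypothetical, the version table is taken from the patched releases named in the advisory, and the code assumes that release series older than 2.3 received no backport. Pre-release suffixes such as `rc0` are ignored, and only exact `==` pins of `tensorflow`, `tensorflow-gpu`, and `tensorflow-cpu` are inspected.

```python
import re

# First fixed patch level per minor series, as named in the advisory.
FIRST_FIXED_PATCH = {(2, 3): 4, (2, 4): 3, (2, 5): 1}
# 2.6.0 and every later series ship with the fix.
FIRST_FIXED_SERIES = (2, 6)

# Matches exact pins such as "tensorflow==2.4.1" or "tensorflow-gpu == 2.5.0".
PIN_PATTERN = re.compile(
    r"^\s*(tensorflow(?:-gpu|-cpu)?)\s*==\s*([^\s;#]+)", re.IGNORECASE
)


def parse_version(version):
    """Return (major, minor, patch) from strings like '2.4.1' or '2.5.0+cpu'."""
    match = re.match(r"\s*(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_vulnerable(version):
    """Return True if the version predates the fix for CVE-2021-37648."""
    major, minor, patch = parse_version(version)
    if (major, minor) >= FIRST_FIXED_SERIES:
        return False
    fixed_patch = FIRST_FIXED_PATCH.get((major, minor))
    if fixed_patch is None:
        # Assumption: series older than 2.3 never received a backport.
        return True
    return patch < fixed_patch


def find_vulnerable_pins(requirements_text):
    """Return (line_number, package, version) for each vulnerable exact pin."""
    findings = []
    for number, line in enumerate(requirements_text.splitlines(), start=1):
        match = PIN_PATTERN.match(line)
        if match and is_vulnerable(match.group(2)):
            findings.append((number, match.group(1), match.group(2)))
    return findings
```

At runtime the same check can be applied to the installed package by passing `tensorflow.__version__` to `is_vulnerable`; in CI, `find_vulnerable_pins` can be run over the contents of each requirements or lock file.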
Frequently Asked Questions
Is CVE-2021-37648 actively exploited?
No confirmed active exploitation of CVE-2021-37648 has been reported, but organizations should still patch proactively.
What systems are affected by CVE-2021-37648?
This vulnerability affects the following AI/ML architecture patterns: training pipelines, model serving, model registry.
What is the CVSS score for CVE-2021-37648?
CVE-2021-37648 has a CVSS v3.1 base score of 7.8 (HIGH). The EPSS exploitation probability is 0.19%.
What is the AI security impact?
MITRE ATLAS Techniques
- AML.T0010.001 AI Software
- AML.T0035 AI Artifact Collection
- AML.T0049 Exploit Public-Facing Application
What are the technical details?
Original Advisory
TensorFlow is an end-to-end open source platform for machine learning. In affected versions the code for `tf.raw_ops.SaveV2` does not properly validate the inputs and an attacker can trigger a null pointer dereference. The [implementation](https://github.com/tensorflow/tensorflow/blob/8d72537c6abf5a44103b57b9c2e22c14f5f49698/tensorflow/core/kernels/save_restore_v2_ops.cc) uses `ValidateInputs` to check that the input arguments are valid. This validation would have caught the illegal state represented by the reproducer in the upstream advisory. However, the validation uses `OP_REQUIRES` which translates to setting the `Status` object of the current `OpKernelContext` to an error status, followed by an empty `return` statement which just terminates the execution of the function it is present in. However, this does not mean that the kernel execution is finalized: instead, execution continues from the next line in `Compute` that follows the call to `ValidateInputs`. This is equivalent to lacking the validation. We have patched the issue in GitHub commit 9728c60e136912a12d99ca56e106b7cce7af5986. The fix will be included in TensorFlow 2.6.0. We will also cherrypick this commit on TensorFlow 2.5.1, TensorFlow 2.4.3, and TensorFlow 2.3.4, as these are also affected and still in supported range.
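The control-flow mistake described in the advisory can be modeled in a few lines. The sketch below is a Python analogy, not TensorFlow's C++ code: `Context`, `validate_inputs`, `compute_unchecked`, and `compute_checked` are hypothetical names, and the length-mismatch check is an illustrative stand-in for whatever `ValidateInputs` actually verifies. It shows a helper that records an error and returns, a caller that ignores the recorded error, and one possible way for the caller to stop instead.

```python
class Context:
    """Stand-in for OpKernelContext: holds at most one error status."""

    def __init__(self):
        self.error = None

    def ok(self):
        return self.error is None


def validate_inputs(ctx, tensor_names, shape_and_slices):
    # Mirrors the OP_REQUIRES pattern: record the error, then return
    # from THIS function only. The caller is not interrupted.
    if len(tensor_names) != len(shape_and_slices):
        ctx.error = "tensor_names and shape_and_slices differ in length"
        return


def compute_unchecked(ctx, tensor_names, shape_and_slices):
    validate_inputs(ctx, tensor_names, shape_and_slices)
    # Execution continues here even when ctx.error is set, so the
    # invalid inputs are used anyway.
    return [(name, shape_and_slices[i]) for i, name in enumerate(tensor_names)]


def compute_checked(ctx, tensor_names, shape_and_slices):
    validate_inputs(ctx, tensor_names, shape_and_slices)
    if not ctx.ok():
        return None  # stop before touching the invalid inputs
    return [(name, shape_and_slices[i]) for i, name in enumerate(tensor_names)]
```

Calling `compute_unchecked(Context(), ["v"], [])` raises `IndexError`, the Python counterpart of the invalid access, even though the error was recorded on the context. `compute_checked` with the same arguments returns `None` and leaves the error available for the caller to report.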
Exploitation Scenario
An attacker with low-privilege shell or notebook access on a shared Kubeflow cluster calls tf.raw_ops.SaveV2 with intentionally malformed tensor shape arguments. ValidateInputs sets an error on the OpKernelContext and returns, but Compute continues executing past the call and dereferences a null pointer. This reliably crashes the training pod, a denial of service that disrupts active training runs. The CVSS vector also rates confidentiality and integrity impact as high: if the fault can be escalated to code execution, that code runs under the training process's service account and can exfiltrate model checkpoints, training data, or cloud provider credentials stored in the pod's environment.
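A crash of this kind is visible to whatever process supervises the worker. The following sketch, using only the Python standard library, shows one way a wrapper might distinguish a segmentation fault from an ordinary failure. The function name `run_and_classify` is hypothetical, and the approach assumes a POSIX system, where `subprocess` reports death by signal N as return code -N and shells report it as exit status 128 + N.

```python
import signal
import subprocess

# Exit status a shell reports for a child killed by SIGSEGV (128 + 11).
SEGV_SHELL_STATUS = 128 + signal.SIGSEGV


def run_and_classify(command):
    """Run a worker command and classify how it ended (POSIX only)."""
    completed = subprocess.run(command)
    code = completed.returncode
    # A negative return code -N means the child was killed by signal N.
    if code == -signal.SIGSEGV or code == SEGV_SHELL_STATUS:
        return "segfault"
    if code != 0:
        return "failed"
    return "ok"
```

A supervisor could call `run_and_classify(["python", "train.py"])` (the script name is a placeholder) and raise an alert whenever the result is `"segfault"`, rather than retrying the job silently.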
Weaknesses (CWE)
CWE-476 — NULL Pointer Dereference: The product dereferences a pointer that it expects to be valid but is NULL.
- [Implementation] For any pointers that could have been modified or provided from a function that can return NULL, check the pointer for NULL before use. When working with a multithreaded or otherwise asynchronous environment, ensure that proper locking APIs are used to lock before the check, and unlock when it has finished [REF-1484].
- [Requirements] Select a programming language that is not susceptible to these issues.
Source: MITRE CWE corpus.
CVSS Vector
CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H
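The 7.8 base score follows from this vector through the CVSS v3.1 base-score equations. The sketch below reimplements those equations with the metric weights from the specification. The function names `roundup` and `base_score` are this example's own, and temporal and environmental metrics are not handled.

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
# Privileges Required weights depend on Scope (Unchanged / Changed).
PR = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value):
    """Round up to one decimal place, avoiding floating-point artifacts."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the CVSS v3.1 base score from a vector string."""
    parts = vector.split("/")
    if parts[0] != "CVSS:3.1":
        raise ValueError("only CVSS:3.1 vectors are supported")
    m = dict(part.split(":") for part in parts[1:])
    scope = m["S"]

    # Impact sub-score.
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15

    # Exploitability sub-score.
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * PR[scope][m["PR"]] * UI[m["UI"]]

    if impact <= 0:
        return 0.0
    if scope == "U":
        return roundup(min(impact + exploitability, 10))
    return roundup(min(1.08 * (impact + exploitability), 10))
```

For this CVE the impact sub-score is about 5.87 and the exploitability sub-score about 1.83; their sum, roughly 7.71, rounds up to 7.8. The local attack vector (AV:L) is what keeps the score below the 9.8 that the same impact would receive with a network vector and no privileges required.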
Related Vulnerabilities
| CVE | CVSS | Summary | Relation |
|---|---|---|---|
| CVE-2020-15196 | 9.9 | TensorFlow: heap OOB read in sparse/ragged count ops | Same package: tensorflow |
| CVE-2020-15205 | 9.8 | TensorFlow: heap overflow in StringNGrams, ASLR bypass | Same package: tensorflow |
| CVE-2020-15208 | 9.8 | TFLite: OOB read/write via tensor dimension mismatch | Same package: tensorflow |
| CVE-2019-16778 | 9.8 | TensorFlow: heap overflow in UnsortedSegmentSum op | Same package: tensorflow |
| CVE-2022-23587 | 9.8 | TensorFlow: integer overflow in Grappler enables RCE | Same package: tensorflow |