PyTorch Denial-of-Service Vulnerability in NCCL Reduce Function

Vulnerability

A denial-of-service vulnerability has been identified in the CUDA NCCL reduce function, `torch.cuda.nccl.reduce`, in PyTorch versions 2.6.0 and later. When the function is called with an invalid reduction operation code, the program aborts with a core dump. It does not validate the input or raise an appropriate error. The vulnerability can be exploited locally.

Impact

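Exploiting this vulnerability crashes the PyTorch application, creating a denial-of-service condition. Because the process aborts rather than raising a Python exception, the calling code cannot catch the error with `try`/`except`, and the application terminates unexpectedly.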

Reproduction

To reproduce the vulnerability, create an input tensor on a CUDA device, allocate an output tensor, and call `torch.cuda.nccl.reduce` with an invalid operation code such as 0xFF or 0xAA. The function does not raise an error. Instead, the program terminates with an 'Aborted (core dumped)' message.
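The following is a minimal sketch of a safe reproduction harness. It assumes a POSIX host and, for the actual reproduction, a CUDA build of an affected PyTorch version with at least one GPU. Because the crash is a hard abort, the reproducer runs in a separate Python interpreter. The parent process survives and can check whether the child was killed by SIGABRT. The names `REPRO_SNIPPET`, `run_isolated`, and `crashed_with_abort`, as well as the tensor shapes, are hypothetical and chosen for illustration only.

```python
import signal
import subprocess
import sys

# Hypothetical reproducer following the description above: a CUDA input
# tensor, an output tensor, and an invalid reduction op code (0xFF).
# Needs a CUDA build of an affected PyTorch version and a GPU.
REPRO_SNIPPET = """
import torch
x = torch.randn(4, 4, device="cuda")
out = torch.zeros(4, 4, device="cuda")
torch.cuda.nccl.reduce([x], output=out, root=0, op=0xFF)
print("no crash")
"""


def run_isolated(code: str, timeout: float = 120.0) -> subprocess.CompletedProcess:
    """Run `code` in a fresh Python interpreter so that a hard abort
    cannot take down the calling process."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def crashed_with_abort(result: subprocess.CompletedProcess) -> bool:
    """On POSIX, a child killed by a signal reports -signum as its return
    code. SIGABRT is the signal behind the shell's 'Aborted (core dumped)'."""
    return result.returncode == -signal.SIGABRT


if __name__ == "__main__":
    result = run_isolated(REPRO_SNIPPET)
    if crashed_with_abort(result):
        print("reproduced: child process aborted (SIGABRT)")
    else:
        print(f"not reproduced: exit code {result.returncode}")
        print(result.stdout, result.stderr, sep="\n")
```

On an affected build, the advisory indicates the child should abort. On a patched build, or on a machine without CUDA, the harness prints a non-abort exit code together with the child's output.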

Remediation

Upgrade to a patched version of PyTorch. The fix adds validation checks for the NCCL reduce operation code, so invalid codes are rejected before they can crash the process.
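Until an upgrade is possible, applications can perform a similar check themselves before calling into native code. The sketch below is a hypothetical wrapper and does not reproduce the upstream fix. It assumes that the valid codes are the built-in values of NCCL's `ncclRedOp_t` enum from `nccl.h` (`ncclSum`=0, `ncclProd`=1, `ncclMax`=2, `ncclMin`=3, `ncclAvg`=4, where `ncclAvg` requires NCCL 2.10 or later). The names `VALID_NCCL_OPS`, `validate_nccl_op`, and `checked_nccl_reduce` are illustrative, not PyTorch APIs. Whether a given PyTorch build accepts every code in this table depends on the NCCL version it was built against.

```python
from typing import Sequence

# Assumed whitelist: built-in ncclRedOp_t values from nccl.h.
# This is a hypothetical table for this sketch, not a PyTorch API.
VALID_NCCL_OPS = {
    0: "sum",
    1: "prod",
    2: "max",
    3: "min",
    4: "avg",  # ncclAvg, NCCL 2.10+
}


def validate_nccl_op(op: object) -> int:
    """Return `op` unchanged if it is a known NCCL reduction code;
    otherwise raise a catchable Python exception."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(op, bool) or not isinstance(op, int):
        raise TypeError(f"op must be an int, got {type(op).__name__}")
    if op not in VALID_NCCL_OPS:
        raise ValueError(
            f"invalid NCCL reduce op {op:#x}; expected one of {sorted(VALID_NCCL_OPS)}"
        )
    return op


def checked_nccl_reduce(inputs: Sequence, output=None, root: int = 0, op: int = 0) -> None:
    """Hypothetical wrapper: validate `op` in Python first, then delegate
    to torch.cuda.nccl.reduce with the same arguments."""
    validate_nccl_op(op)
    # Import lazily so that validation does not depend on torch being present.
    import torch.cuda.nccl

    torch.cuda.nccl.reduce(inputs, output, root=root, op=op)
```

With this wrapper, `checked_nccl_reduce([x], out, op=0xFF)` raises `ValueError` before any native code runs, so the caller can handle the bad input instead of losing the process.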

Added: Jun 9, 2025, 7:46 PM
Updated: Jun 9, 2025, 7:46 PM

Vulnerability Rating

Custom Algorithm
spread: 0.0
impact: 2.5
exploitability: 6.0
remediation: 7.7
relevance: 0.0
threat: 6.4
urgency: 2.9
incentive: 1.7

Our algorithm analyzes dozens of metrics to generate these 8 key vulnerability categories, which are then combined to calculate the overall risk score.