Linux Kernel RDMA/mlx5 UMR QP Recovery Flow Vulnerability

Vulnerability

A vulnerability in the mlx5 driver of the Linux kernel's RDMA subsystem has been fixed. It affected the recovery flow of the User-Mode Memory Registration (UMR) Queue Pair (QP) and could leave tasks stuck. When the UMR QP enters the error state, the driver recovers it by moving it to RESET. Before that transition, software must wait for all outstanding Work Requests (WRs) to complete. Otherwise the firmware may skip generating some of the flushed-with-error Completion Queue Entries (CQEs) and simply discard them at RESET. This race loses CQEs, and the tasks waiting on them hang. The patch posts a final WR that serves only as a barrier: once its CQE arrives, no earlier WRs remain outstanding, so the QP can be reset safely and normal operation restored.
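The ordering argument above can be illustrated with a small toy model. This is only a sketch, not the mlx5 driver code: the class `ToyQP` and the functions `recover_without_barrier` and `recover_with_barrier` are hypothetical names invented for illustration, and the model assumes that WRs on the queue complete in posting order.

```python
from collections import deque


class ToyQP:
    """Toy model of a queue pair in the error state (not the real driver)."""

    def __init__(self):
        self.outstanding = deque()  # WRs posted but not yet completed
        self.cq = deque()           # completions visible to software

    def post(self, wr_id):
        self.outstanding.append(wr_id)

    def flush_one(self):
        """Model the device flushing the oldest outstanding WR with an error CQE."""
        if self.outstanding:
            self.cq.append((self.outstanding.popleft(), "flushed-with-error"))

    def reset(self):
        """Model RESET: WRs still outstanding are discarded without a CQE."""
        lost = list(self.outstanding)
        self.outstanding.clear()
        return lost


def recover_without_barrier(qp):
    # Racy flow: reset immediately, so any WR not yet flushed never gets a CQE.
    return qp.reset()


def recover_with_barrier(qp, barrier_id="barrier"):
    # Post a final WR whose only job is to mark the end of the queue.
    qp.post(barrier_id)
    # In-order completion: the barrier's CQE implies all earlier WRs completed.
    while not qp.cq or qp.cq[-1][0] != barrier_id:
        qp.flush_one()
    return qp.reset()


# Racy case: only wr1 has been flushed when the reset happens.
racy = ToyQP()
for wr in ("wr1", "wr2", "wr3"):
    racy.post(wr)
racy.flush_one()
lost_racy = recover_without_barrier(racy)

# Fixed case: same starting point, but the reset waits for the barrier CQE.
safe = ToyQP()
for wr in ("wr1", "wr2", "wr3"):
    safe.post(wr)
safe.flush_one()
lost_safe = recover_with_barrier(safe)

print("lost without barrier:", lost_racy)
print("lost with barrier:", lost_safe)
```

In the racy case the waiters for `wr2` and `wr3` would never be woken, which corresponds to the hung tasks described here. With the barrier, every earlier WR has a CQE before the reset, so nothing is lost.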

Impact

Tasks waiting on the lost CQEs could remain blocked for extended periods, disrupting normal operation. System logs showed the kernel reporting tasks blocked for more than 120 seconds.
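As a rough aid for spotting this symptom, the sketch below scans log text for hung-task warnings. It assumes the usual kernel wording "INFO: task NAME:PID blocked for more than N seconds"; the function name `find_hung_tasks` and the sample log line are hypothetical and are not taken from an affected system.

```python
import re

# Assumed wording of the kernel hung-task warning; adjust if your logs differ.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for each hung-task warning found."""
    return [
        (m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(log_text)
    ]


# Hypothetical sample line, for illustration only.
sample = (
    "[ 1234.567890] INFO: task kworker/u8:2:4321 "
    "blocked for more than 120 seconds.\n"
)
hits = find_hung_tasks(sample)
print(hits)
```

A match only shows that some task was blocked; attributing it to this issue still requires checking that the blocked task was waiting on the mlx5 UMR path.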

Reproduction

The vulnerability could be reproduced by starting recovery of a UMR QP while WRs were still outstanding, for example by manually triggering a QP reset with WRs pending. The waiting tasks then blocked on completions that never arrived, because the firmware discarded the flushed CQEs during the reset.

Remediation

Users should update to a Linux kernel release that includes the fix for the mlx5 UMR QP recovery flow.

Added: Jun 9, 2025, 7:46 PM
Updated: Jun 9, 2025, 7:46 PM

Vulnerability Rating

Custom Algorithm
spread: 9.0
impact: 2.5
exploitability: 3.9
remediation: 0.0
relevance: 0.0
threat: 4.8
urgency: 2.9
incentive: 1.7

Our algorithm analyzes dozens of metrics to generate these 8 key vulnerability categories, which are then combined to calculate the overall risk score.