ByteDance verl Arbitrary Code Execution Vulnerability

Vulnerability

ByteDance's verl framework, in versions up to and including 0.7.0, contains an arbitrary code execution vulnerability. The issue is in the math answer grading module, in the 'math_equal' function of 'prime_math/grader.py'. It is triggered when the ground truth answer is a matrix type and the model's response is formatted as a list. In that case, the function passes the model's output to Python's 'eval()' without any sanitization or sandboxing, so any expression embedded in the response is executed. The flaw can be exploited remotely, and an exploit is publicly available.
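The pattern described above can be illustrated with a minimal sketch. This is not the actual verl source: the helper name 'unsafe_parse_list' and the sample strings are assumptions made for illustration, and the stand-in payload is deliberately harmless (it only reads the current process ID).

```python
# Hypothetical illustration of the unsafe pattern; NOT the actual verl code.

def unsafe_parse_list(prediction: str):
    """Parse a list-formatted answer the unsafe way, with eval()."""
    # eval() evaluates the whole string as a Python expression, so any
    # function call embedded in the "list" runs while the list is built.
    return eval(prediction)

# A benign list-formatted answer parses as expected.
benign = unsafe_parse_list("[[1, 2], [3, 4]]")

# A crafted "list" whose first element is a function call. The call here is
# a harmless stand-in (os.getpid); an attacker could place any expression.
crafted = "[__import__('os').getpid(), 2]"
result = unsafe_parse_list(crafted)

# The first element is the value returned by the executed call, which shows
# that the string was run as code rather than parsed as data.
print(type(result[0]).__name__, result[1])
```

The point of the sketch is that the string still looks like a list to a format check, yet evaluating it imports a module and calls a function.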

Impact

Exploitation of this vulnerability allows for arbitrary code execution on the server where the model is being evaluated. This could lead to unauthorized access to sensitive information such as API keys, cloud credentials, and model weights. Additionally, it could allow an attacker to manipulate model checkpoints, potentially introducing backdoors that affect all users of the model.

Reproduction

To reproduce this vulnerability, inject a matrix problem whose text contains prompt injection instructions into a public math dataset. When verl is used to train a model on this dataset, the injected instructions cause the model to output a list-formatted response that includes malicious Python code. Because the ground truth for the problem is a matrix, the response is processed by the 'math_equal' function, which executes the code via 'eval()', resulting in arbitrary code execution on the training server.

Remediation

Replace the 'eval()' calls in the vulnerable 'math_equal' function with 'ast.literal_eval()', which safely parses Python literals without executing code. Additionally, conduct a global audit of all 'eval()' and 'exec()' calls in the codebase to identify and replace unsafe usages.
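Both remediation steps can be sketched as follows. This is a minimal sketch under stated assumptions, not a patch for verl: the helper names 'parse_list_answer' and 'find_dynamic_exec_calls' are hypothetical, and the audit only detects direct calls written as 'eval(...)' or 'exec(...)', not aliased or attribute-based calls.

```python
import ast

def parse_list_answer(prediction: str):
    """Parse a list-formatted answer without executing it (hypothetical helper).

    Returns the parsed list or tuple, or None if the string is not a plain
    Python literal of that kind.
    """
    try:
        value = ast.literal_eval(prediction)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        # literal_eval rejects anything that is not a literal, including
        # function calls, attribute access, and names.
        return None
    return value if isinstance(value, (list, tuple)) else None

def find_dynamic_exec_calls(source: str, filename: str = "<string>"):
    """Return sorted (line, name) pairs for direct eval()/exec() calls."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)

# Usage: a literal list parses; a list containing a call is rejected.
print(parse_list_answer("[[1, 2], [3, 4]]"))
print(parse_list_answer("[__import__('os').getpid(), 2]"))

# Usage: audit a source string for direct eval()/exec() calls.
sample = "x = eval('1')\nexec('y = 2')\n"
print(find_dynamic_exec_calls(sample))
```

Note that 'ast.literal_eval()' does not execute code but can still consume significant memory or CPU on very large or deeply nested input, so a length limit on model output before parsing may still be worthwhile.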

Added: Apr 23, 2026, 12:20 AM
Updated: Apr 23, 2026, 12:20 AM

Vulnerability Rating

Custom Algorithm

spread: 0.0
impact: 10.0
exploitability: 7.0
remediation: 0.0
relevance: 6.5
threat: 6.4
urgency: 2.9
incentive: 0.0

Our algorithm analyzes dozens of metrics to generate these 8 key vulnerability categories, which are then combined to calculate the overall risk score.