lm-sys FastChat Content Moderation Bypass Vulnerability in Arena Side-by-Side Views

Vulnerability

A content moderation bypass vulnerability has been identified in lm-sys FastChat versions through 0.2.36. The issue is in the Arena Side-by-Side View Handler, specifically in the add_text function. The content moderation filter does not assess the full conversation history before each turn, so potentially harmful content can persist unmoderated. In two affected files, the moderation step references the same model's history for both sides, so the other model's context is never checked. In a third file, only the current user input is sent to moderation and all prior conversation is disregarded, which effectively bypasses moderation of the conversation. The vulnerability can be exploited remotely, and an exploit is publicly available.
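The two flawed patterns described above can be illustrated with a minimal sketch. This is not FastChat's actual code: SideState, moderation_text_same_side, and moderation_text_input_only are hypothetical names invented for illustration, and the example only shows which text would reach a moderation filter under each pattern.

```python
from dataclasses import dataclass, field


@dataclass
class SideState:
    """Hypothetical stand-in for one model's conversation state."""
    model_name: str
    history: list = field(default_factory=list)

    def get_prompt(self) -> str:
        # Flatten the stored turns into a single string.
        return "\n".join(self.history)


def moderation_text_same_side(states, user_text):
    # Flawed pattern 1 (assumed shape): both halves read from states[0],
    # so the second model's turns never reach the filter.
    left = states[0].get_prompt()
    right = states[0].get_prompt()  # should read from states[1]
    return left + "\n" + right + "\nuser: " + user_text


def moderation_text_input_only(states, user_text):
    # Flawed pattern 2 (assumed shape): prior turns are ignored entirely.
    return user_text


states = [
    SideState("model_a", ["user: hi", "assistant: reply from A"]),
    SideState("model_b", ["user: hi", "assistant: reply from B"]),
]

same_side = moderation_text_same_side(states, "continue")
input_only = moderation_text_input_only(states, "continue")

print("reply from B" in same_side)   # False: model B's output is never checked
print("reply from A" in input_only)  # False: no history is checked at all
```

In both cases a benign follow-up such as "continue" is evaluated without the model output that would have triggered the filter.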

Impact

Exploitation of this vulnerability leads to a complete bypass of the content moderation system in the affected arena modes. This allows users to generate and maintain harmful or policy-violating content across multiple turns without any moderation oversight.

Reproduction

To reproduce this vulnerability, navigate to the FastChat Anonymous Arena or Vision Anonymous Arena. Start a conversation with two models and prompt them to generate content that approaches moderation boundaries. Then send a benign follow-up message, such as 'tell me more' or 'continue'. In the background, the moderation check never sees the second model's context. In the Vision Anonymous Arena it processes only the short follow-up input, so the context from both models is missed.

Remediation

The vulnerability has been fixed in the file gradio_block_arena_named.py, but the same issue persists in three other files. Users should manually apply the same fix by correcting the conversation history references in the add_text function of the Arena Side-by-Side View Handler.
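The shape of such a fix can be sketched as follows. This is a hedged illustration, not the upstream patch: build_moderation_text, toy_filter, and the tail_chars limit are hypothetical names and values chosen for this example. The point is only that the text sent to moderation should draw on every side's history plus the new user input.

```python
def build_moderation_text(side_prompts, user_text, tail_chars=1000):
    """Join the tail of each side's prompt with the new user message.

    side_prompts: one flattened conversation string per model side.
    tail_chars: hypothetical cap on how much of each history is checked.
    """
    tails = [prompt[-tail_chars:] for prompt in side_prompts]
    return "\n".join(tails) + "\nuser: " + user_text


def toy_filter(text, blocked=("FORBIDDEN",)):
    # Stand-in for a real moderation call; flags on a simple keyword match.
    return any(term in text for term in blocked)


side_prompts = [
    "user: hi\nassistant: harmless reply",
    "user: hi\nassistant: FORBIDDEN reply",  # only the second side misbehaves
]

text = build_moderation_text(side_prompts, "continue")
flagged = toy_filter(text)
print(flagged)  # True: the second side's history is now part of the check
```

Iterating over all sides, rather than indexing each side by hand, avoids the copy-paste mistake of reading the same side twice.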

Added: Apr 20, 2026, 6:22 AM
Updated: Apr 20, 2026, 6:22 AM

Vulnerability Rating

Custom Algorithm

Category        Score
spread          0.0
impact          0.6
exploitability  8.7
remediation     0.0
relevance       6.3
threat          6.4
urgency         2.9
incentive       4.2

Our algorithm analyzes dozens of metrics to generate these 8 key vulnerability categories, which are then combined to calculate the overall risk score.