run-llama llama_index ArxivReader Class MD5 Hash Collision Vulnerability

Vulnerability

A vulnerability exists in the ArxivReader class of the run-llama/llama_index repository, in versions prior to 0.12.22.post1. The class uses an MD5 hash to generate filenames for downloaded papers, and papers with the same title but different content can end up with the same filename. When that happens, one paper overwrites the other on disk, so the overwritten paper is lost and never processed for AI model training.

Impact

Exploitation of this vulnerability can result in data loss, as papers may be overwritten and not processed for AI model training.

Reproduction

To reproduce this vulnerability, use the ArxivReader class from a version of llama_index prior to 0.12.22.post1 to download papers that have identical titles but different contents. The collision shows up as a single file on disk where two were expected: the paper downloaded later overwrites the earlier one.
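
The overwrite can be illustrated without network access. The following is a minimal sketch, assuming the filename is the MD5 hex digest of the paper title, which is what the description above implies. It is not the ArxivReader source, and `title_filename`, `save_papers`, and the sample papers are hypothetical names introduced for illustration only.

```python
import hashlib
import tempfile
from pathlib import Path


def title_filename(title: str) -> str:
    """Hypothetical helper: name a PDF after the MD5 hex digest of its title."""
    return hashlib.md5(title.encode("utf-8")).hexdigest() + ".pdf"


def save_papers(papers, directory):
    """Write each (title, content) pair to disk and list the files left behind."""
    for title, content in papers:
        (Path(directory) / title_filename(title)).write_bytes(content)
    return sorted(p.name for p in Path(directory).iterdir())


# Two made-up papers that share a title but differ in content.
papers = [
    ("A Survey of Transformers", b"content of paper one"),
    ("A Survey of Transformers", b"content of paper two"),
]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_papers(papers, tmp)
    surviving = (Path(tmp) / saved[0]).read_bytes()

print(len(saved), surviving)
```

Two papers go in, but only one file remains, and it holds the content of the paper written last. Nothing signals that the first paper was dropped.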

Remediation

Users can upgrade to llama_index version 0.12.28 or later, where this vulnerability has been fixed.
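
For code that names downloaded files itself, one way to avoid this class of overwrite is to derive the filename from a value that is unique per paper rather than from the title alone. The sketch below assumes each paper exposes a unique identifier string. It is not necessarily how the upstream fix works, and `unique_filename` and the identifiers shown are hypothetical placeholders.

```python
import hashlib


def unique_filename(entry_id: str, title: str) -> str:
    """Hypothetical helper: hash a unique paper ID together with the title."""
    # The entry ID differs per paper, so identical titles no longer collide.
    digest = hashlib.sha256(f"{entry_id}\n{title}".encode("utf-8")).hexdigest()
    return f"{digest}.pdf"


# Placeholder identifiers, not real arXiv entries.
name_a = unique_filename("2401.00001v1", "A Survey of Transformers")
name_b = unique_filename("2401.00002v1", "A Survey of Transformers")

print(name_a != name_b)
```

The design choice here is to make the hash input unique rather than to change the hash function. Switching from MD5 to a stronger hash alone would not help, because identical titles produce identical digests under any hash.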

Added: Jul 7, 2025, 10:48 AM
Updated: Jul 7, 2025, 10:48 AM

Vulnerability Rating

Custom Algorithm

spread: 4.2
impact: 2.5
exploitability: 5.7
remediation: 7.7
relevance: 0.2
threat: 4.8
urgency: 2.9
incentive: 1.7

Our algorithm analyzes dozens of metrics to generate these 8 key vulnerability categories, which are then combined to calculate the overall risk score.