llama.cpp Buffer Overflow Vulnerability in Vocabulary Loading Prior to Version b5662
Vulnerability
A buffer overflow vulnerability has been identified in llama.cpp, a C/C++ inference library for large language models, in versions prior to b5662. The issue is in the vocabulary-loading code, specifically in the function 'llama_vocab::impl::token_to_piece()'. An attacker-supplied GGUF model vocabulary can trigger it by declaring a token whose length exceeds INT32_MAX. The oversized length bypasses the length check, so 'memcpy' is still called with that length and writes past the end of the destination buffer. The resulting memory corruption can potentially be exploited for code execution.
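The bypass described above can be sketched in isolation. This is a minimal illustration, not the llama.cpp source: it assumes the check fails because the token length is narrowed to a signed 32-bit integer before the comparison, which is what the INT32_MAX threshold suggests. The function name 'flawed_check_allows_copy' and its parameters are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reduction of the flawed check; NOT the actual llama.cpp code.
// Returns true when the check would let the copy proceed.
bool flawed_check_allows_copy(std::int32_t buf_len, std::size_t token_size) {
    // Assumption: the length is narrowed before the comparison. A size_t above
    // INT32_MAX becomes negative here (modular conversion is guaranteed since
    // C++20 and is the usual behavior of mainstream compilers before that).
    const std::int32_t narrowed = static_cast<std::int32_t>(token_size);

    if (buf_len < narrowed) {
        return false;  // buffer too small: the copy is refused
    }

    // The copy would now run with the original, un-narrowed token_size.
    return true;
}
```

Under that assumption, a length of INT32_MAX + 1 narrows to a negative value, the comparison against the buffer length is false, and the check lets the copy through even though the token is far larger than the buffer.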
Impact
Exploiting this vulnerability overflows the destination buffer and corrupts adjacent memory. The corruption can cause crashes or instability and, in some cases, remote code execution if critical data such as return addresses or vtable pointers is overwritten. Even when the overflow is not otherwise exploitable, the crash itself is a denial of service. Builds instrumented with a memory sanitizer detect the invalid write and abort the program.
Reproduction
To reproduce this vulnerability, load a GGUF model file that contains a vocabulary entry whose token size exceeds INT32_MAX. When 'token_to_piece()' is called during model loading, the oversized token length bypasses the length check and triggers the buffer overflow. The same code path can also be reached through the public API 'llama_token_to_piece()' or during detokenization.
Remediation
Users should update to llama.cpp version b5662 or later, where this vulnerability has been patched.
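For code that cannot be updated immediately, or that performs similar bounded copies of its own, the general shape of a safe check can be sketched as follows. This is an illustration only and is not the upstream patch. The helper name 'bounded_copy' and its return convention (-1 on refusal) are hypothetical. The key point is that the comparison happens in the unsigned domain, before any narrowing.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper; NOT the upstream fix.
// Copies token_size bytes into buf if they fit.
// Returns the number of bytes copied, or -1 if the copy is refused.
std::int32_t bounded_copy(char * buf, std::int32_t buf_len,
                          const char * token, std::size_t token_size) {
    if (buf == nullptr || token == nullptr || buf_len < 0) {
        return -1;
    }

    // Compare as size_t so an oversized length can never wrap to a negative value.
    if (token_size > static_cast<std::size_t>(buf_len)) {
        return -1;
    }

    std::memcpy(buf, token, token_size);

    // Safe to narrow: token_size <= buf_len <= INT32_MAX at this point.
    return static_cast<std::int32_t>(token_size);
}
```

With this ordering, a token length above INT32_MAX is rejected by the size comparison and never reaches 'memcpy'.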