BIBLIOS

  Bibliographic Reference Management System of Ciências


Reference Details

Type
Conference Papers

Document Type
Full Paper

Title
Are Large Language Models Memorizing Bug Benchmarks?

Publication participants
Daniel Ramos (Author)
Claudia Mamede (Author)
Kush Jain (Author)
Paulo Canelas (Author)
Dept. of Informatics
R&D and Innovation Unit
LASIGE
Catarina Gamboa (Author)
Dept. of Informatics
LASIGE
Claire Le Goues (Author)

Abstract
Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets like LLaMa 3.1 exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models' capabilities.
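The n-gram accuracy metric mentioned in the abstract can be sketched as follows: at each position in a benchmark text, ask the model for a greedy continuation and check whether it reproduces the next n ground-truth tokens exactly. This is a minimal illustration, not the paper's implementation; `predict_next` and the toy predictors below are hypothetical stand-ins for a real model's greedy decoding.

```python
def ngram_accuracy(tokens, predict_next, n=4):
    """Fraction of positions where the model's greedy continuation
    exactly reproduces the next n ground-truth tokens.

    tokens:       the benchmark text as a token list
    predict_next: callable (prefix, n) -> list of n predicted tokens
                  (hypothetical stand-in for a model's greedy decode)
    """
    hits, total = 0, 0
    for i in range(1, len(tokens) - n + 1):
        prediction = predict_next(tokens[:i], n)
        hits += prediction == tokens[i:i + n]  # exact n-gram match
        total += 1
    return hits / total if total else 0.0


# Toy predictors: a "memorizing" model that has the text stored verbatim,
# and a model that never saw it.
text = list("abcdef")
memorizer = lambda prefix, n: text[len(prefix):len(prefix) + n]
stranger = lambda prefix, n: ["x"] * n

perfect_recall = ngram_accuracy(text, memorizer, n=4)   # 1.0
no_recall = ngram_accuracy(text, stranger, n=4)         # 0.0
```

High n-gram accuracy on benchmark text that the model was never prompted with is treated as evidence of memorization; scores near chance suggest the benchmark did not leak into training data.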

Submission/Request Date
2024-11-18
Acceptance Date
2024-12-16
Publication Date
2025-05-03

Event
The Second International Workshop on Large Language Models for Code

Number of Pages
8


Export reference

APA
Daniel Ramos, Claudia Mamede, Kush Jain, Paulo Canelas, Catarina Gamboa, & Claire Le Goues (2025). Are Large Language Models Memorizing Bug Benchmarks? The Second International Workshop on Large Language Models for Code.

IEEE
Daniel Ramos, Claudia Mamede, Kush Jain, Paulo Canelas, Catarina Gamboa, and Claire Le Goues, "Are Large Language Models Memorizing Bug Benchmarks?" in The Second International Workshop on Large Language Models for Code, 2025.

BIBTEX
@InProceedings{62803,
  author    = {Daniel Ramos and Claudia Mamede and Kush Jain and Paulo Canelas and Catarina Gamboa and Claire Le Goues},
  title     = {Are Large Language Models Memorizing Bug Benchmarks?},
  booktitle = {The Second International Workshop on Large Language Models for Code},
  year      = {2025}
}