A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

Jonathan Katzy, Yongcheng Huang, Gopal Raj Panchu, Maksym Ziemlewski, Paris Loizides, Sander Vermeulen, Arie van Deursen, Maliheh Izadi

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Large Language Models have become essential coding assistants, yet their training remains predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identify a taxonomy of 26 distinct error categories in model-generated code comments, which highlight variations in language cohesion, informativeness, and syntax adherence across natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics for assessing generated comments.
Original language: English
Title of host publication: PROMISE 2025 - Proceedings of the 2025 21st International Conference on Predictive Models and Data Analytics in Software Engineering
Place of Publication: New York, NY
Publisher: Association for Computing Machinery (ACM)
Pages: 31-40
Number of pages: 10
ISBN (Electronic): 9798400715945
Publication status: Published - 2025
Event: 21st International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2025, co-located with the International Conference on the Foundations of Software Engineering, FSE 2025 - Trondheim, Norway
Duration: 26 Jun 2025 - 26 Jun 2025
Internet address: https://conf.researchr.org/home/promise-2025

Keywords

  • Comment Generation
  • Large Language Models
  • Multilingual
  • Open Coding
  • Qualitative Evaluation
