Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
Original language | English |
---|---|
Title of host publication | Proceedings of the 2nd International Workshop on NL-based Software Engineering |
Pages | 9-10 |
Number of pages | 2 |
ISBN (Electronic) | 979-8-3503-0178-6 |
DOIs | |
Publication status | Published - 2023 |
Event | 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE) - Melbourne, Australia. Duration: 14 May 2023 → 20 May 2023. Conference number: 2. https://nlbse2023.github.io/ |
Publication series
Name | Proceedings - 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering, NLBSE 2023 |
---|
Workshop
Workshop | 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE) |
---|---|
Abbreviated title | NLBSE 2023 |
Country/Territory | Australia |
City | Melbourne |
Period | 14/05/23 → 20/05/23 |
Internet address |
Bibliographical note
Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project: https://www.openaccess.nl/en/you-share-we-take-care. Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.