The (ab)use of Open Source Code to Train Large Language Models

A. Al-Kaswan; M. Izadi

doi:10.1109/NLBSE59153.2023.00008

Software Engineering

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.

Original language	English
Title of host publication	Proceedings of the 2nd International Workshop on NL-based Software Engineering
Pages	9-10
Number of pages	2
ISBN (Electronic)	979-8-3503-0178-6
DOIs	https://doi.org/10.1109/NLBSE59153.2023.00008
Publication status	Published - 2023
Event	2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE), Melbourne, Australia; Duration: 14 May 2023 → 20 May 2023; Conference number: 2; https://nlbse2023.github.io/

Publication series

Name	Proceedings - 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering, NLBSE 2023

Workshop

Workshop	2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)
Abbreviated title	NLBSE 2023
Country/Territory	Australia
City	Melbourne
Period	14/05/23 → 20/05/23
Internet address	https://nlbse2023.github.io/

Bibliographical note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Access to Document

10.1109/NLBSE59153.2023.00008

NLBSE___Position_Paper___ (2): Accepted author manuscript, 87 KB
The_abuse_of_Open_Source_Code_to_Train_Large_Language_Models: Final published version, 166 KB

Cite this

@inproceedings{44972319b624470a8ea6fa758ac6cec3,

title = "The (ab)use of Open Source Code to Train Large Language Models",

abstract = "In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.",

author = "A. Al-Kaswan and M. Izadi",

note = "Green Open Access added to TU Delft Institutional Repository {\textquoteleft}You share, we take care!{\textquoteright} – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public. ; 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE), NLBSE 2023 ; Conference date: 14-05-2023 Through 20-05-2023",

year = "2023",

doi = "10.1109/NLBSE59153.2023.00008",

language = "English",

series = "Proceedings - 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering, NLBSE 2023",

pages = "9--10",

booktitle = "Proceedings of the 2nd International Workshop on NL-based Software Engineering",

url = "https://nlbse2023.github.io/",

}

Al-Kaswan, A & Izadi, M 2023, The (ab)use of Open Source Code to Train Large Language Models. in Proceedings of the 2nd International Workshop on NL-based Software Engineering. Proceedings - 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering, NLBSE 2023, pp. 9-10, 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE), Melbourne, Australia, 14/05/23. https://doi.org/10.1109/NLBSE59153.2023.00008

The (ab)use of Open Source Code to Train Large Language Models. / Al-Kaswan, A.; Izadi, M.
Proceedings of the 2nd International Workshop on NL-based Software Engineering. 2023. p. 9-10 (Proceedings - 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering, NLBSE 2023).

TY - GEN

T1 - The (ab)use of Open Source Code to Train Large Language Models

AU - Al-Kaswan, A.

AU - Izadi, M.

N1 - Conference code: 2

PY - 2023

Y1 - 2023

N2 - In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.

AB - In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.

UR - http://www.scopus.com/inward/record.url?scp=85167945950&partnerID=8YFLogxK

U2 - 10.1109/NLBSE59153.2023.00008

DO - 10.1109/NLBSE59153.2023.00008

M3 - Conference contribution

T3 - Proceedings - 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering, NLBSE 2023

SP - 9

EP - 10

BT - Proceedings of the 2nd International Workshop on NL-based Software Engineering

T2 - 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)

Y2 - 14 May 2023 through 20 May 2023

ER -
