TY - GEN
T1 - The Software Heritage graph dataset
T2 - 16th IEEE/ACM International Conference on Mining Software Repositories, MSR 2019
AU - Pietri, Antoine
AU - Spinellis, Diomidis
AU - Zacchiroli, Stefano
PY - 2019/5
Y1 - 2019/5
N2 - Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing. Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.
KW - Dataset
KW - Development history graph
KW - Digital preservation
KW - Free software
KW - Mining software repositories
KW - Open source software
KW - Source code
UR - http://www.scopus.com/inward/record.url?scp=85072338237&partnerID=8YFLogxK
U2 - 10.1109/MSR.2019.00030
DO - 10.1109/MSR.2019.00030
M3 - Conference contribution
AN - SCOPUS:85072338237
T3 - IEEE International Working Conference on Mining Software Repositories
SP - 138
EP - 142
BT - Proceedings - 2019 IEEE/ACM 16th International Conference on Mining Software Repositories, MSR 2019
PB - IEEE
Y2 - 26 May 2019 through 27 May 2019
ER -