TY - GEN
T1 - The promises and perils of mining GitHub
AU - Kalliamvakou, Eirini
AU - Singer, Leif
AU - Gousios, Georgios
AU - German, Daniel M.
AU - Blincoe, Kelly
AU - Damian, Daniela
PY - 2014/5/31
Y1 - 2014/5/31
N2 - With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.
AB - With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.
KW - Bias
KW - Code reviews
KW - Git
KW - GitHub
KW - Mining software repositories
UR - http://www.scopus.com/inward/record.url?scp=84914172716&partnerID=8YFLogxK
U2 - 10.1145/2597073.2597074
DO - 10.1145/2597073.2597074
M3 - Conference contribution
AN - SCOPUS:84914172716
T3 - 11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings
SP - 92
EP - 101
BT - 11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings
PB - Association for Computing Machinery (ACM)
T2 - 11th International Working Conference on Mining Software Repositories, MSR 2014
Y2 - 31 May 2014 through 1 June 2014
ER -