The promises and perils of mining GitHub

Eirini Kalliamvakou; Leif Singer; Georgios Gousios; Daniel M. German; Kelly Blincoe; Daniela Damian

doi:10.1145/2597073.2597074

Software Engineering

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

474 Citations (Scopus)

Abstract

With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.

Original language	English
Title of host publication	11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings
Publisher	Association for Computing Machinery (ACM)
Pages	92-101
Number of pages	10
ISBN (Electronic)	9781450328630
DOIs	https://doi.org/10.1145/2597073.2597074
Publication status	Published - 31 May 2014
Event	11th International Working Conference on Mining Software Repositories, MSR 2014 - Hyderabad, India
Duration	31 May 2014 → 1 Jun 2014

Publication series

Name	11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings

Conference

Conference	11th International Working Conference on Mining Software Repositories, MSR 2014
Country/Territory	India
City	Hyderabad
Period	31/05/14 → 1/06/14

Keywords

Bias
Code reviews
Git
GitHub
Mining software repositories

Access to Document

10.1145/2597073.2597074

Cite this

Kalliamvakou, E., Singer, L., Gousios, G., German, D. M., Blincoe, K., & Damian, D. (2014). The promises and perils of mining GitHub. In 11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings (pp. 92-101). (11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings). Association for Computing Machinery (ACM). https://doi.org/10.1145/2597073.2597074

@inproceedings{d9115632cfdd4211bc889fb27952f6a6,
title = "The promises and perils of mining GitHub",
abstract = "With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.",
keywords = "Bias, Code reviews, Git, GitHub, Mining software repositories",
author = "Eirini Kalliamvakou and Leif Singer and Georgios Gousios and German, {Daniel M.} and Kelly Blincoe and Daniela Damian",
year = "2014",
month = may,
day = "31",
doi = "10.1145/2597073.2597074",
language = "English",
series = "11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings",
publisher = "Association for Computing Machinery (ACM)",
pages = "92--101",
booktitle = "11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings",
address = "United States",
note = "11th International Working Conference on Mining Software Repositories, MSR 2014 ; Conference date: 31-05-2014 Through 01-06-2014",
}

Kalliamvakou, E, Singer, L, Gousios, G, German, DM, Blincoe, K & Damian, D 2014, The promises and perils of mining GitHub. in 11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings. 11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings, Association for Computing Machinery (ACM), pp. 92-101, 11th International Working Conference on Mining Software Repositories, MSR 2014, Hyderabad, India, 31/05/14. https://doi.org/10.1145/2597073.2597074

The promises and perils of mining GitHub. / Kalliamvakou, Eirini; Singer, Leif; Gousios, Georgios et al.
11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings. Association for Computing Machinery (ACM), 2014. p. 92-101 (11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings).

TY - GEN
T1 - The promises and perils of mining GitHub
AU - Kalliamvakou, Eirini
AU - Singer, Leif
AU - Gousios, Georgios
AU - German, Daniel M.
AU - Blincoe, Kelly
AU - Damian, Daniela
PY - 2014/5/31
Y1 - 2014/5/31
N2 - With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.
AB - With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.
KW - Bias
KW - Code reviews
KW - Git
KW - GitHub
KW - Mining software repositories
UR - http://www.scopus.com/inward/record.url?scp=84914172716&partnerID=8YFLogxK
U2 - 10.1145/2597073.2597074
DO - 10.1145/2597073.2597074
M3 - Conference contribution
AN - SCOPUS:84914172716
T3 - 11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings
SP - 92
EP - 101
BT - 11th Working Conference on Mining Software Repositories, MSR 2014 - Proceedings
PB - Association for Computing Machinery (ACM)
T2 - 11th International Working Conference on Mining Software Repositories, MSR 2014
Y2 - 31 May 2014 through 1 June 2014
ER -
