Mining File Histories: Should we consider branches?

Vladimir Kovalenko, Fabio Palomba, Alberto Bacchelli

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review



Modern distributed version control systems, such as Git, support branching: the ability to develop parts of the software outside the master trunk. Accounting for repository structure in Mining Software Repositories (MSR) studies requires a thorough approach to mining, yet there is no well-documented, widespread methodology for handling merge commits and branches. Moreover, little is known about the extent to which considering branches affects the results of MSR studies. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories from four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm follows only the first parent of each commit when traversing the repository; the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but that the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history (reviewer recommendation, change recommendation, and defect prediction) under the two file history retrieval algorithms. We find that considering full file histories leads to only a modest increase in the techniques’ performance.
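The difference between the two traversal strategies named in the abstract can be illustrated on a toy commit graph. The sketch below is not the authors' implementation: the `COMMITS` dictionary, the commit ids, and both function names are hypothetical, and the graph is simplified so that a merge commit itself modifies no files. The first-parent walk is roughly analogous to what `git log --first-parent -- <path>` reports.

```python
from collections import deque

# Hypothetical toy repository: commit id -> (parent ids, files modified).
# By convention, the first parent of a merge is the branch merged into.
COMMITS = {
    "c1": ([], {"a.txt"}),
    "c2": (["c1"], {"b.txt"}),
    "f1": (["c1"], {"a.txt"}),      # made on a feature branch
    "f2": (["f1"], {"a.txt"}),      # made on a feature branch
    "m1": (["c2", "f2"], set()),    # merge commit; first parent is c2
    "c3": (["m1"], {"a.txt"}),
}


def first_parent_history(commits, head, path):
    """Collect commits touching `path`, following only first parents."""
    history = []
    current = head
    while current is not None:
        parents, files = commits[current]
        if path in files:
            history.append(current)
        # Ignore every parent except the first one.
        current = parents[0] if parents else None
    return history


def full_history(commits, head, path):
    """Collect commits touching `path` across all reachable branches."""
    history = []
    seen = {head}
    queue = deque([head])
    while queue:
        current = queue.popleft()
        parents, files = commits[current]
        if path in files:
            history.append(current)
        # Visit every parent, so merged branches are included.
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return history


if __name__ == "__main__":
    print(first_parent_history(COMMITS, "c3", "a.txt"))
    print(sorted(full_history(COMMITS, "c3", "a.txt")))
```

In this toy graph the first-parent walk misses the two feature-branch commits `f1` and `f2`, which is the kind of discrepancy the study quantifies on real repositories.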
Original language: English
Title of host publication: ASE 2018
Subtitle of host publication: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering
Place of Publication: New York, NY
Publisher: Association for Computing Machinery (ACM)
Number of pages: 12
ISBN (Print): 978-1-4503-5937-5
Publication status: Published - 2018
Event: ASE 2018: 33rd IEEE/ACM International Conference on Automated Software Engineering - Montpellier, France
Duration: 3 Jul 2018 → 7 Jul 2018


Conference: ASE 2018
Abbreviated title: ASE 2018


  • Version Control Systems
  • Branches
  • Mining Software Repositories

