A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits

Audris Mockus, Diomidis Spinellis, Zoe Kotti, Gabriel John Dusing

Research output: Chapter in Book/Conference proceedings/Edited volume > Conference contribution > Scientific > peer-review

12 Citations (Scopus)

Abstract

In order to understand the state and evolution of the entirety of open source software, we need to get a handle on the set of distinct software projects. Most open source projects presently use Git, a distributed version control system that makes cloning easy and thus yields numerous repositories that are almost entirely based on the parent repository from which they were cloned. Identical Git commits are unlikely to be produced independently, so shared commits offer a way to group cloned repositories. We use the World of Code infrastructure, containing approximately 2B commits and 100M repositories, to create and share such a map. We discover that the largest group contains almost 14M repositories, most of which are unrelated to each other. As it turns out, developers can push Git objects to an arbitrary repository or pull objects from unrelated repositories, thus linking unrelated repositories. To address this, we apply the Louvain community detection algorithm to this very large graph consisting of links between commits and projects. The approach successfully reduces the size of the megacluster, with the largest group of highly interconnected projects containing under 400K repositories. We expect that the resulting map of related projects, as well as the tools and methods for handling the very large graph, will serve as a reference set for mining software projects and for other applications. Further work is needed to determine the different types of relationships among projects induced by shared commits and by other relationships, for example shared source code or similar filenames.
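The grouping step described in the abstract can be illustrated with a minimal sketch. It assumes each repository is represented as a set of commit hashes and uses a plain union-find to collect repositories that are transitively linked by a shared commit. The function name `group_by_shared_commits` and the toy repository names are hypothetical and are not taken from the paper or from World of Code. The sketch covers only the connected-component grouping, not the Louvain community detection the authors apply afterwards.

```python
from collections import defaultdict


def group_by_shared_commits(repo_commits):
    """Group repositories that are transitively linked by shared commit IDs.

    repo_commits: dict mapping a repository name to an iterable of commit hashes.
    Returns a sorted list of sorted repository-name lists, one per group.
    """
    parent = {repo: repo for repo in repo_commits}

    def find(repo):
        # Path-halving lookup of the group representative.
        while parent[repo] != repo:
            parent[repo] = parent[parent[repo]]
            repo = parent[repo]
        return repo

    first_seen = {}  # commit hash -> first repository observed to contain it
    for repo, commits in repo_commits.items():
        for commit in commits:
            if commit in first_seen:
                # The same commit appears in two repositories: merge their groups.
                parent[find(repo)] = find(first_seen[commit])
            else:
                first_seen[commit] = repo

    groups = defaultdict(list)
    for repo in repo_commits:
        groups[find(repo)].append(repo)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical toy data for illustration only.
toy = {
    "alice/app": {"c1", "c2", "c3"},
    "bob/app-fork": {"c1", "c2", "c4"},
    "carol/lib": {"c7", "c8"},
    "dave/lib-fork": {"c7", "c8", "c9"},
    "erin/solo": {"c10"},
}
groups = group_by_shared_commits(toy)

# A single stray commit pushed into an unrelated repository merges two groups.
toy_linked = dict(toy)
toy_linked["carol/lib"] = toy["carol/lib"] | {"c3"}
groups_linked = group_by_shared_commits(toy_linked)
```

On the toy data, the two fork pairs form separate groups until one commit from `alice/app` also appears in `carol/lib`, at which point all four repositories collapse into one group. This is a small-scale analogue of the megacluster effect the abstract reports, and it suggests why transitive linking alone is too coarse and a community detection step is applied on top.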

Original language: English
Title of host publication: Proceedings - 2020 IEEE/ACM 17th International Conference on Mining Software Repositories, MSR 2020
Publisher: Association for Computing Machinery (ACM)
Pages: 513-517
Number of pages: 5
ISBN (Electronic): 9781450379571
DOIs
Publication status: Published - 29 Jun 2020
Externally published: Yes
Event: 17th IEEE/ACM International Conference on Mining Software Repositories, MSR 2020, co-located with the 42nd International Conference on Software Engineering, ICSE 2020 - Virtual, Online, Korea, Republic of
Duration: 29 Jun 2020 to 30 Jun 2020

Publication series

Name: Proceedings - 2020 IEEE/ACM 17th International Conference on Mining Software Repositories, MSR 2020

Conference

Conference: 17th IEEE/ACM International Conference on Mining Software Repositories, MSR 2020, co-located with the 42nd International Conference on Software Engineering, ICSE 2020
Country/Territory: Korea, Republic of
City: Virtual, Online
Period: 29/06/20 to 30/06/20

Keywords

  • forks and clones
