Word embeddings for the software engineering domain

Vasiliki Efstathiou, Christos Chatzilenas, Diomidis Spinellis

Research output: Chapter in Book/Conference proceedings/Edited volume; Conference contribution; Scientific; peer-review

22 Citations (Scopus)

Abstract

The software development process produces vast amounts of textual data expressed in natural language. Software engineering research has adapted outcomes from the natural language processing community to leverage this rich textual information; these include methods and readily available tools, often furnished with pre-trained models. State-of-the-art pre-trained models, however, capture general, common-sense knowledge and are of limited value when handling data specific to a specialized domain. There is currently a lack of domain-specific pre-trained models that would further enhance the processing of natural language artefacts related to software engineering. To this end, we release a word2vec model trained over 15 GB of textual data from Stack Overflow posts. We illustrate how the model disambiguates polysemous words by interpreting them within their software engineering context. In addition, we present examples of fine-grained semantics captured by the model, which suggest that these results transfer to diverse, targeted information retrieval tasks in software engineering and motivate further reuse of the model.

Original language: English
Title of host publication: Proceedings - 2018 ACM/IEEE 15th International Conference on Mining Software Repositories, MSR 2018
Publisher: IEEE
Pages: 38-41
Number of pages: 4
ISBN (Print): 9781450357166
Publication status: Published - 28 May 2018
Externally published: Yes
Event: 15th ACM/IEEE International Conference on Mining Software Repositories, MSR 2018, co-located with the 40th International Conference on Software Engineering, ICSE 2018 - Gothenburg, Sweden
Duration: 28 May 2018 to 29 May 2018

Publication series

Name: Proceedings - International Conference on Software Engineering
ISSN (Print): 0270-5257

Conference

Conference: 15th ACM/IEEE International Conference on Mining Software Repositories, MSR 2018, co-located with the 40th International Conference on Software Engineering, ICSE 2018
Country: Sweden
City: Gothenburg
Period: 28/05/18 to 29/05/18

Keywords

  • natural language processing
  • skip-gram
  • stack overflow
  • word2vec
