GEMMS: A Generic and Extensible Metadata Management System for Data Lakes

Christoph Quix, Rihan Hai, Ivan Vatov

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review


The heterogeneity of sources in Big Data systems requires new integration approaches which can handle the large volume of the data as well as its variety. Data lakes have been proposed to reduce the upfront integration costs and to provide more flexibility in integrating and analyzing information. In data lakes, data from the sources is copied in its original structure to a repository; only a syntactic integration is done as data is stored in a common semi-structured format. Metadata plays an important role, as the source data is not loaded into an integrated repository with a unified schema; the data has to come with its own metadata. This paper presents GEMMS, a Generic and Extensible Metadata Management System for data lakes which extracts metadata from the sources and manages the structural and semantical information in an extensible metamodel. The system has been developed with a focus on scientific data management in the life sciences which is often only file-based with limited query functionality. The evaluation shows the usefulness in this domain, but also the flexibility and extensibility of our approach which makes GEMMS also applicable to other domains.
Original language: English
Title of host publication: 28th International Conference on Advanced Information Systems Engineering (CAiSE)
Editors: Sergio España
Publication status: Published - 2016
Externally published: Yes

