TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review


Abstract

Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities (e.g. “WebKB”, “StatSnowball”) are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. State-of-the-art NER approaches employ supervised machine learning models, trained on expensive type-labeled data laboriously produced by human annotators. A common workaround is the generation of labeled training data from knowledge bases; this approach is not suitable for long-tail entity types that are, by definition, scarcely represented in KBs.
This paper presents an iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result entity filtering. We evaluate our approach on scientific publications, focusing on the long-tail entity types Datasets and Methods in computer science publications, and Proteins in biomedical publications.
Original language: English
Title of host publication: The Semantic Web – ISWC 2018
Subtitle of host publication: Proceedings of the 17th International Semantic Web Conference
Editors: D. Vrandečić, K. Bontcheva, M.C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.M. Kaffee, E. Simperl
Place of Publication: Cham
Publisher: Springer
Pages: 127-143
Number of pages: 16
ISBN (Electronic): 978-3-030-00671-6
ISBN (Print): 978-3-030-00670-9
Publication status: Published - 2018
Event: ISWC 2018: 17th International Semantic Web Conference - Monterey, CA, United States
Duration: 8 Oct 2018 to 12 Oct 2018
Conference number: 17
http://iswc2018.semanticweb.org/

Publication series

Name: Lecture Notes in Computer Science (LNCS)
Volume: 11136

Conference

Conference: ISWC 2018
Country/Territory: United States
City: Monterey, CA
Period: 8/10/18 to 12/10/18

Bibliographical note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
