Semantic-Enhanced Training Data Augmentation Methods for Long-Tail Entity Recognition Models

Sepideh Mesbah

doi:10.4233/uuid:dbbfe1fc-bf63-45f0-8cf2-28ed7dab90eb

Human-Centred Artificial Intelligence

Research output: Thesis › Dissertation (TU Delft)

Abstract

Named Entity Recognition (NER) is an essential information retrieval task. It enables a wide range of natural language processing applications, such as semantic search and machine translation. NER can be formulated as the task of identifying and typing words or phrases in a text that refer to certain classes of interest (e.g., disease, Adverse Drug Reactions). There are different techniques to tackle NER, such as dictionary-based, rule-based, and machine learning-based approaches. Machine learning-based NER techniques have been shown to perform best for entities with large amounts of human-labeled training data.
However, their performance is limited when dealing with long-tail entities. Long-tail entities are entities that occur with low frequency in document collections and usually have no reference in existing Knowledge Bases. Obtaining human-labeled datasets is expensive and time-consuming, especially for long-tail entities, which are scarce in document collections. This dissertation focuses on the lack of training data, arguably the largest bottleneck in training machine learning-based NER techniques. We investigated efficient and effective ways to augment training data by automatically enhancing its size and quality. Our work aimed to show how enhancing the size and quality of the training data using different techniques makes it possible to improve the performance of Long-tail Entity Recognition (L-tER).

Original language	English
Qualification	Doctor of Philosophy
Awarding Institution	Delft University of Technology
Supervisors/Advisors	Houben, G.J.P.M. (Supervisor); Bozzon, A. (Supervisor); Lofi, C. (Advisor)
Award date	20 May 2020
Print ISBNs	978-94-6380-808-8
DOIs	https://doi.org/10.4233/uuid:dbbfe1fc-bf63-45f0-8cf2-28ed7dab90eb
Publication status	Published - 2020

Keywords

Long-tail Named Entity Recognition
Semantic Enrichment
Training Data Augmentation

Cite this

@phdthesis{dbbfe1fcbf6345f08cf228ed7dab90eb,

title = "Semantic-Enhanced Training Data Augmentation Methods for Long-Tail Entity Recognition Models",

abstract = "Named Entity Recognition (NER) is an essential information retrieval task. It enables a wide range of natural language processing applications, such as semantic search and machine translation. NER can be formulated as the task of identifying and typing words or phrases in a text that refer to certain classes of interest (e.g., disease, Adverse Drug Reactions). There are different techniques to tackle NER, such as dictionary-based, rule-based, and machine learning-based approaches. Machine learning-based NER techniques have been shown to perform best for entities with large amounts of human-labeled training data. However, their performance is limited when dealing with long-tail entities. Long-tail entities are entities that occur with low frequency in document collections and usually have no reference in existing Knowledge Bases. Obtaining human-labeled datasets is expensive and time-consuming, especially for long-tail entities, which are scarce in document collections. This dissertation focuses on the lack of training data, arguably the largest bottleneck in training machine learning-based NER techniques. We investigated efficient and effective ways to augment training data by automatically enhancing its size and quality. Our work aimed to show how enhancing the size and quality of the training data using different techniques makes it possible to improve the performance of Long-tail Entity Recognition (L-tER).",

keywords = "Long-tail Named Entity Recognition, Semantic Enrichment, Training Data Augmentation",

author = "Sepideh Mesbah",

year = "2020",

doi = "10.4233/uuid:dbbfe1fc-bf63-45f0-8cf2-28ed7dab90eb",

language = "English",

isbn = "978-94-6380-808-8",

type = "Dissertation (TU Delft)",

school = "Delft University of Technology",

}

TY - THES

T1 - Semantic-Enhanced Training Data Augmentation Methods for Long-Tail Entity Recognition Models

AU - Mesbah, Sepideh

PY - 2020

Y1 - 2020

AB - Named Entity Recognition (NER) is an essential information retrieval task. It enables a wide range of natural language processing applications, such as semantic search and machine translation. NER can be formulated as the task of identifying and typing words or phrases in a text that refer to certain classes of interest (e.g., disease, Adverse Drug Reactions). There are different techniques to tackle NER, such as dictionary-based, rule-based, and machine learning-based approaches. Machine learning-based NER techniques have been shown to perform best for entities with large amounts of human-labeled training data. However, their performance is limited when dealing with long-tail entities. Long-tail entities are entities that occur with low frequency in document collections and usually have no reference in existing Knowledge Bases. Obtaining human-labeled datasets is expensive and time-consuming, especially for long-tail entities, which are scarce in document collections. This dissertation focuses on the lack of training data, arguably the largest bottleneck in training machine learning-based NER techniques. We investigated efficient and effective ways to augment training data by automatically enhancing its size and quality. Our work aimed to show how enhancing the size and quality of the training data using different techniques makes it possible to improve the performance of Long-tail Entity Recognition (L-tER).

KW - Long-tail Named Entity Recognition

KW - Semantic Enrichment

KW - Training Data Augmentation

U2 - 10.4233/uuid:dbbfe1fc-bf63-45f0-8cf2-28ed7dab90eb

DO - 10.4233/uuid:dbbfe1fc-bf63-45f0-8cf2-28ed7dab90eb

M3 - Dissertation (TU Delft)

SN - 978-94-6380-808-8

ER -
