Abstract
Training machine learning (ML) models for natural language processing usually requires large amounts of data, often acquired through crowdsourcing. The way this data is collected and aggregated can affect the outputs of the trained model; for example, aggregation can discard the labels that differ from the majority. In this paper we investigate how label aggregation can bias ML results towards certain data samples, and we propose a methodology to highlight and mitigate this bias. Although our work is applicable to any kind of label aggregation for data subject to multiple interpretations, we focus on the effects of the bias introduced by majority voting on toxicity prediction over sentences. Our preliminary results indicate that we can mitigate the majority bias and increase prediction accuracy for minority opinions if we take the different annotator labels into account when training adapted models, rather than relying on the aggregated labels.
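As a rough illustration of the majority-voting effect described in the abstract, the sketch below aggregates per-sentence annotations in two ways: a single majority label, and the full distribution of annotator labels. The sentences, label names, and helper functions are hypothetical and are not taken from the paper; this is not the authors' implementation.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def label_distribution(labels):
    """Return the share of annotators who chose each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical crowdsourced annotations: one list of labels per sentence.
annotations = {
    "sentence_a": ["toxic", "not_toxic", "not_toxic"],
    "sentence_b": ["toxic", "toxic", "not_toxic", "toxic", "not_toxic"],
}

# Majority voting keeps one label per sentence and drops the minority opinion.
aggregated = {sid: majority_vote(labels) for sid, labels in annotations.items()}

# Keeping the distribution preserves the disagreement between annotators.
soft_labels = {sid: label_distribution(labels) for sid, labels in annotations.items()}

for sid in annotations:
    print(sid, aggregated[sid], soft_labels[sid])
```

In this toy data, one of three annotators finds `sentence_a` toxic, yet the aggregated label records only `not_toxic`; a model trained on the aggregated label never sees that minority judgment.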
Original language | English |
---|---|
Title of host publication | Proceedings of the 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, and Short Paper Proceedings of the 1st Workshop on Disentangling the Relation Between Crowdsourcing and Bias Management |
Editors | Lora Aroyo, Anca Dumitrache, Praveen Paritosh, Alex Quinn, Chris Welty, Alessandro Checco, Gianluca Demartini, Ujwal Gadiraju, Cristina Sarasua |
Publisher | CEUR |
Pages | 67-71 |
Number of pages | 5 |
Volume | 2276 |
Publication status | Published - 2018 |
Event | 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, and 1st Workshop on Disentangling the Relation Between Crowdsourcing and Bias Management - University of Zurich, Zurich, Switzerland; Duration: 5 Jul 2018 → 5 Jul 2018; https://sites.google.com/view/crowdbias |
Publication series
Name | CEUR Workshop Proceedings |
---|---|
Volume | 2276 |
ISSN (Electronic) | 1613-0073 |
Conference
Conference | 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, and 1st Workshop on Disentangling the Relation Between Crowdsourcing and Bias Management |
---|---|
Abbreviated title | SAD2018 CrowdBias2018 |
Country/Territory | Switzerland |
City | Zurich |
Period | 5/07/18 → 5/07/18 |
Internet address |
Bibliographical note
Accepted Author Manuscript
Keywords
- dataset bias
- Machine Learning fairness
- crowdsourcing
- annotation aggregation