Scalpel-CD: Leveraging Crowdsourcing and Deep Probabilistic Modeling for Debugging Noisy Training Data

Jie Yang; Alisa Smirnova; Dingqi Yang; Gianluca Demartini; Yuan Lu; Philippe Cudré-Mauroux

Research output: Chapter in Book/Conference proceedings/Edited volume › Chapter › Scientific › peer-review

Abstract

This paper presents Scalpel-CD, a first-of-its-kind system that leverages both human and machine intelligence to debug noisy labels in the training data of machine learning systems. Our system identifies potentially wrong labels using a deep probabilistic model, which is able to infer the latent class of a high-dimensional data instance by exploiting data distributions in the underlying latent feature space. To minimize crowd effort, it employs a data sampler that selects the data instances that would benefit the most from being inspected by the crowd. The manually verified labels are then propagated to similar data instances in the original training data by exploiting the underlying data structure, thus scaling out the contribution from the crowd. Scalpel-CD is designed with a set of algorithmic solutions to automatically search for the optimal configurations for different types of training data, in terms of the underlying data structure, noise ratio, and noise types (random vs. structural). In a real deployment on multiple machine learning tasks, we demonstrate that Scalpel-CD is able to improve label quality by 12.9% with only 2.8% of instances inspected by the crowd.
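
The detect, sample, and propagate loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a simple class-centroid posterior stands in for the deep probabilistic model, an oracle with access to the true labels stands in for the crowd, and all function names, the synthetic data, the budget, and the neighbourhood size `k` are assumptions made purely for illustration.

```python
import math
import random
from collections import defaultdict

def centroids(X, y):
    """Mean feature vector of the instances currently carrying each label."""
    groups = defaultdict(list)
    for x, label in zip(X, y):
        groups[label].append(x)
    return {c: [sum(col) / len(pts) for col in zip(*pts)] for c, pts in groups.items()}

def suspicion_scores(X, y):
    """1 - P(given label | x), using a softmax over negative squared
    distances to the class centroids (a stand-in for a learned model)."""
    cents = centroids(X, y)
    scores = []
    for x, label in zip(X, y):
        neg_d2 = {c: -sum((a - b) ** 2 for a, b in zip(x, m)) for c, m in cents.items()}
        top = max(neg_d2.values())  # subtract the max for numerical stability
        weights = {c: math.exp(v - top) for c, v in neg_d2.items()}
        scores.append(1.0 - weights[label] / sum(weights.values()))
    return scores

def pick_for_crowd(scores, budget):
    """Indices of the `budget` most suspicious instances."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:budget]

def propagate(X, y, verified, k=3):
    """Apply the verified labels, then copy each one to the k nearest
    instances that were not themselves verified."""
    y_new = list(y)
    for i, label in verified.items():
        y_new[i] = label
    unverified = [j for j in range(len(X)) if j not in verified]
    for i, label in verified.items():
        nearest = sorted(unverified, key=lambda j: math.dist(X[i], X[j]))[:k]
        for j in nearest:
            y_new[j] = label
    return y_new

def accuracy(labels, truth):
    return sum(a == b for a, b in zip(labels, truth)) / len(truth)

# Synthetic data (assumption): two well-separated 2-D clusters, 10% of labels flipped.
rng = random.Random(0)
truth = [0] * 100 + [1] * 100
X = [[rng.gauss(10.0 * t, 1.0), rng.gauss(10.0 * t, 1.0)] for t in truth]
noisy = list(truth)
for i in rng.sample(range(len(truth)), 20):
    noisy[i] = 1 - noisy[i]

scores = suspicion_scores(X, noisy)
chosen = pick_for_crowd(scores, budget=10)
verified = {i: truth[i] for i in chosen}  # oracle standing in for crowd workers
cleaned = propagate(X, noisy, verified)

print("label accuracy before:", accuracy(noisy, truth))
print("label accuracy after: ", accuracy(cleaned, truth))
```

On such cleanly separated toy data the propagation step is nearly risk-free; on real data the neighbourhood definition and the choice of which instances to send to the crowd are exactly the configuration choices the abstract says the system searches over.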

Original language	English
Title of host publication	WWW '19: The World Wide Web Conference
Publisher	Association for Computing Machinery (ACM)
Pages	2158–2168
ISBN (Electronic)	978-1-4503-6674-8
Publication status	Published - 2019
Externally published	Yes

Cite this

@inbook{6fb7482162074289b09fee232f74f417,
title = "Scalpel-CD: Leveraging Crowdsourcing and Deep Probabilistic Modeling for Debugging Noisy Training Data",
abstract = "This paper presents Scalpel-CD, a first-of-its-kind system that leverages both human and machine intelligence to debug noisy labels from the training data of machine learning systems. Our system identifies potentially wrong labels using a deep probabilistic model, which is able to infer the latent class of a high-dimensional data instance by exploiting data distributions in the underlying latent feature space. To minimize crowd efforts, it employs a data sampler which selects data instances that would benefit the most from being inspected by the crowd. The manually verified labels are then propagated to similar data instances in the original training data by exploiting the underlying data structure, thus scaling out the contribution from the crowd. Scalpel-CD is designed with a set of algorithmic solutions to automatically search for the optimal configurations for different types of training data, in terms of the underlying data structure, noise ratio, and noise types (random vs. structural). In a real deployment on multiple machine learning tasks, we demonstrate that Scalpel-CD is able to improve label quality by 12.9% with only 2.8% instances inspected by the crowd.",
author = "Jie Yang and Alisa Smirnova and Dingqi Yang and Gianluca Demartini and Yuan Lu and Philippe Cudr{\'e}-Mauroux",
year = "2019",
language = "English",
pages = "2158--2168",
booktitle = "WWW '19: The World Wide Web Conference",
publisher = "Association for Computing Machinery (ACM)",
address = "United States",
}

Scalpel-CD: Leveraging Crowdsourcing and Deep Probabilistic Modeling for Debugging Noisy Training Data. / Yang, Jie; Smirnova, Alisa; Yang, Dingqi et al.
WWW '19: The World Wide Web Conference. Association for Computing Machinery (ACM), 2019. p. 2158–2168.

TY  - CHAP
T1  - Scalpel-CD: Leveraging Crowdsourcing and Deep Probabilistic Modeling for Debugging Noisy Training Data
AU  - Yang, Jie
AU  - Smirnova, Alisa
AU  - Yang, Dingqi
AU  - Demartini, Gianluca
AU  - Lu, Yuan
AU  - Cudré-Mauroux, Philippe
PY  - 2019
Y1  - 2019
N2  - This paper presents Scalpel-CD, a first-of-its-kind system that leverages both human and machine intelligence to debug noisy labels from the training data of machine learning systems. Our system identifies potentially wrong labels using a deep probabilistic model, which is able to infer the latent class of a high-dimensional data instance by exploiting data distributions in the underlying latent feature space. To minimize crowd efforts, it employs a data sampler which selects data instances that would benefit the most from being inspected by the crowd. The manually verified labels are then propagated to similar data instances in the original training data by exploiting the underlying data structure, thus scaling out the contribution from the crowd. Scalpel-CD is designed with a set of algorithmic solutions to automatically search for the optimal configurations for different types of training data, in terms of the underlying data structure, noise ratio, and noise types (random vs. structural). In a real deployment on multiple machine learning tasks, we demonstrate that Scalpel-CD is able to improve label quality by 12.9% with only 2.8% instances inspected by the crowd.
AB  - This paper presents Scalpel-CD, a first-of-its-kind system that leverages both human and machine intelligence to debug noisy labels from the training data of machine learning systems. Our system identifies potentially wrong labels using a deep probabilistic model, which is able to infer the latent class of a high-dimensional data instance by exploiting data distributions in the underlying latent feature space. To minimize crowd efforts, it employs a data sampler which selects data instances that would benefit the most from being inspected by the crowd. The manually verified labels are then propagated to similar data instances in the original training data by exploiting the underlying data structure, thus scaling out the contribution from the crowd. Scalpel-CD is designed with a set of algorithmic solutions to automatically search for the optimal configurations for different types of training data, in terms of the underlying data structure, noise ratio, and noise types (random vs. structural). In a real deployment on multiple machine learning tasks, we demonstrate that Scalpel-CD is able to improve label quality by 12.9% with only 2.8% instances inspected by the crowd.
M3  - Chapter
SP  - 2158
EP  - 2168
BT  - WWW '19: The World Wide Web Conference
PB  - Association for Computing Machinery (ACM)
ER  - 
