The representation of speech and its processing in the human brain and deep neural networks

Odette Scharenborg

doi:10.1007/978-3-030-26061-3_1

Multimedia Computing

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

For most of the world's languages, and for speech that deviates from the standard pronunciation, too little (annotated) speech data is available to train an automatic speech recognition (ASR) system. Moreover, human intervention is needed to adapt an ASR system to a new language or type of speech. Human listeners, on the other hand, adapt quickly to non-standard speech and can learn the sound categories of a new language without being explicitly taught to do so. In this paper, I present comparisons between human speech processing and deep neural network (DNN)-based ASR, and argue that cross-fertilisation between the two research fields can provide valuable information for the development of ASR systems that adapt flexibly to any type of speech in any language. Specifically, I present results of several experiments, carried out on both human listeners and DNN-based ASR systems, on the representation of speech and on lexically-guided perceptual learning, i.e., the ability to adapt a sound category on the basis of new incoming information, resulting in improved processing of subsequent speech. The results show that DNNs appear to learn structures that humans use to process speech without being explicitly trained to do so, and that, like humans, DNN systems learn speaker-adapted phone category boundaries from a few labelled examples. These results are first steps towards building ASR systems, inspired by human speech processing, that, like human listeners, can adjust flexibly and quickly to all kinds of new speech.

Original language	English
Title of host publication	Speech and Computer
Subtitle of host publication	21st International Conference, SPECOM 2019, Proceedings
Editors	Albert Ali Salah, Alexey Karpov, Rodmonga Potapova
Place of Publication	Cham
Publisher	Springer
Pages	1-8
Number of pages	8
ISBN (Electronic)	978-3-030-26061-3
ISBN (Print)	978-3-030-26060-6
DOIs	https://doi.org/10.1007/978-3-030-26061-3_1
Publication status	Published - 2019
Event	SPECOM 2019: The 21st International Conference on Speech and Computer, Istanbul, Turkey; Duration: 20 Aug 2019 → 25 Aug 2019; Conference number: 21

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	11658 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	SPECOM 2019
Country/Territory	Turkey
City	Istanbul
Period	20/08/19 → 25/08/19

Keywords

Adaptation
Deep neural networks
Human speech processing
Non-standard speech
Perceptual learning
Speech representations

Cite this

Scharenborg, O. (2019). The representation of speech and its processing in the human brain and deep neural networks. In A. A. Salah, A. Karpov, & R. Potapova (Eds.), Speech and Computer: 21st International Conference, SPECOM 2019, Proceedings (pp. 1-8). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11658 LNAI). Springer. https://doi.org/10.1007/978-3-030-26061-3_1

@inproceedings{3b886ba21c3e4ff2b91c7ff32e2bcf3c,

title = "The representation of speech and its processing in the human brain and deep neural networks",

abstract = "For most languages in the world and for speech that deviates from the standard pronunciation, not enough (annotated) speech data is available to train an automatic speech recognition (ASR) system. Moreover, human intervention is needed to adapt an ASR system to a new language or type of speech. Human listeners, on the other hand, are able to quickly adapt to nonstandard speech and can learn the sound categories of a new language without having been explicitly taught to do so. In this paper, I will present comparisons between human speech processing and deep neural network (DNN)-based ASR and will argue that the cross-fertilisation of the two research fields can provide valuable information for the development of ASR systems that can flexibly adapt to any type of speech in any language. Specifically, I present results of several experiments carried out on both human listeners and DNN-based ASR systems on the representation of speech and lexically-guided perceptual learning, i.e., the ability to adapt a sound category on the basis of new incoming information resulting in improved processing of subsequent speech. The results showed that DNNs appear to learn structures that humans use to process speech without being explicitly trained to do so, and that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labelled examples. These results are the first steps towards building human-speech processing inspired ASR systems that, similar to human listeners, can adjust flexibly and fast to all kinds of new speech.",

keywords = "Adaptation, Deep neural networks, Human speech processing, Non-standard speech, Perceptual learning, Speech representations",

author = "Odette Scharenborg",

year = "2019",

doi = "10.1007/978-3-030-26061-3_1",

language = "English",

isbn = "978-3-030-26060-6",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer",

pages = "1--8",

editor = "Salah, {Albert Ali} and Alexey Karpov and Rodmonga Potapova",

booktitle = "Speech and Computer",

note = "SPECOM 2019 : The 21st International Conference on Speech and Computer ; Conference date: 20-08-2019 Through 25-08-2019",

}

TY - GEN

T1 - The representation of speech and its processing in the human brain and deep neural networks

AU - Scharenborg, Odette

N1 - Conference code: 21st

PY - 2019

Y1 - 2019

N2 - For most languages in the world and for speech that deviates from the standard pronunciation, not enough (annotated) speech data is available to train an automatic speech recognition (ASR) system. Moreover, human intervention is needed to adapt an ASR system to a new language or type of speech. Human listeners, on the other hand, are able to quickly adapt to nonstandard speech and can learn the sound categories of a new language without having been explicitly taught to do so. In this paper, I will present comparisons between human speech processing and deep neural network (DNN)-based ASR and will argue that the cross-fertilisation of the two research fields can provide valuable information for the development of ASR systems that can flexibly adapt to any type of speech in any language. Specifically, I present results of several experiments carried out on both human listeners and DNN-based ASR systems on the representation of speech and lexically-guided perceptual learning, i.e., the ability to adapt a sound category on the basis of new incoming information resulting in improved processing of subsequent speech. The results showed that DNNs appear to learn structures that humans use to process speech without being explicitly trained to do so, and that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labelled examples. These results are the first steps towards building human-speech processing inspired ASR systems that, similar to human listeners, can adjust flexibly and fast to all kinds of new speech.

KW - Adaptation

KW - Deep neural networks

KW - Human speech processing

KW - Non-standard speech

KW - Perceptual learning

KW - Speech representations

UR - http://www.scopus.com/inward/record.url?scp=85071427354&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-26061-3_1

DO - 10.1007/978-3-030-26061-3_1

M3 - Conference contribution

AN - SCOPUS:85071427354

SN - 978-3-030-26060-6

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 1

EP - 8

BT - Speech and Computer

A2 - Salah, Albert Ali

A2 - Karpov, Alexey

A2 - Potapova, Rodmonga

PB - Springer

CY - Cham

T2 - SPECOM 2019

Y2 - 20 August 2019 through 25 August 2019

ER -
