TY - GEN
T1 - Unsupervised Domain Adaptation for Question Generation with Domain Data Selection and Self-training
AU - Zhu, Peide
AU - Hauff, Claudia
PY - 2022
Y1 - 2022
AB - Question generation (QG) approaches based on large neural models require (i) large-scale and (ii) high-quality training data. These two requirements pose difficulties for specific application domains where training data is expensive and difficult to obtain. A trained QG model's effectiveness can degrade significantly when it is applied to a different domain due to domain shift. In this paper, we explore an unsupervised domain adaptation approach that addresses the lack of training data and the domain-shift issue with domain data selection and self-training. We first present a novel answer-aware strategy for domain data selection that selects the data most similar to a new domain. The selected data are then used as pseudo in-domain data to retrain the QG model. We then present generation-confidence-guided self-training with two generation confidence modeling methods: (i) the generated questions' perplexity and (ii) their fluency score. We test our approaches on three large public datasets with different domain similarities, using a transformer-based pre-trained QG model. The results show that our proposed approaches outperform the baselines and demonstrate the viability of unsupervised domain adaptation with answer-aware data selection and self-training for the QG task. The code is available at https://github.com/zpeide/transfer_qg.
UR - http://www.scopus.com/inward/record.url?scp=85137354293&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85137354293
T3 - Findings of the Association for Computational Linguistics: NAACL 2022
SP - 2388
EP - 2401
BT - Findings of the Association for Computational Linguistics: NAACL 2022
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the Association for Computational Linguistics: NAACL 2022
Y2 - 10 July 2022 through 15 July 2022
ER -