How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Statistical significance tests are the main tool that IR practitioners use to determine the reliability of their experimental evaluation results. The question of which test behaves best with IR evaluation data has been around for decades, and has seen all kinds of results and recommendations. A definitive answer to this question has recently been attempted via stochastic simulation of IR evaluation data, allowing researchers to compute actual Type I error rates because they can control the null hypothesis. One such research line simulates metric scores for a fixed set of systems on random topics, and concludes that the t-test behaves the best. Another such line simulates retrieval runs by random systems on a fixed set of topics, and concludes that the Wilcoxon test behaves the best. Interestingly, two recent surveys of the IR literature have shown that the community has a clear preference precisely for these two tests, so further investigation is critical to understand why the above simulation studies reach opposite conclusions. It has recently been postulated that one reason for the disagreement is the distributions of metric scores used by one of these simulation methods. In this paper we investigate this issue and extend the argument to another key aspect of the simulation, namely the dependence between systems. Following a principled approach, we analyze the robustness of statistical tests to different factors, thus identifying under what conditions they behave well or not with respect to the Type I error rate. Our results suggest that differences between the Wilcoxon and t-test may be explained by the skewness of score differences.
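
To make the idea of estimating a Type I error rate by simulation concrete, the following is a minimal sketch and not either of the simulation methods discussed in the abstract. It assumes hypothetical per-topic score differences drawn independently from a distribution with mean zero, so the null hypothesis of equal mean effectiveness holds by construction, and it counts how often each test rejects at the 5% level. To stay dependency-free, both tests use large-sample normal approximations (an assumption of this sketch; an exact t-test would use the t distribution). The function names, the number of topics, and the two example distributions are all illustrative choices.

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Two-sided 5% critical value of the standard normal distribution.
Z_CRIT = NormalDist().inv_cdf(0.975)


def t_like_rejects(diffs):
    """Paired test on the mean difference.

    Assumption: uses a normal approximation in place of the t distribution.
    """
    n = len(diffs)
    z = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return abs(z) > Z_CRIT


def wilcoxon_rejects(diffs):
    """Wilcoxon signed-rank test via its large-sample normal approximation."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    # Rank by absolute value; the data are continuous, so ties are not handled.
    ranked = sorted(nonzero, key=abs)
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return abs((w_plus - mu) / sigma) > Z_CRIT


def type_i_error_rates(sample_diff, n_topics=50, n_trials=2000, seed=0):
    """Return the fraction of trials in which each test rejects.

    sample_diff(rng) must return one score difference with mean zero.
    """
    rng = random.Random(seed)
    t_rejections = 0
    w_rejections = 0
    for _ in range(n_trials):
        diffs = [sample_diff(rng) for _ in range(n_topics)]
        t_rejections += t_like_rejects(diffs)
        w_rejections += wilcoxon_rejects(diffs)
    return t_rejections / n_trials, w_rejections / n_trials


def symmetric_diff(rng):
    """Hypothetical symmetric differences: standard normal."""
    return rng.gauss(0.0, 1.0)


def skewed_diff(rng):
    """Hypothetical skewed differences: shifted exponential with mean zero."""
    return rng.expovariate(1.0) - 1.0


if __name__ == "__main__":
    for name, sampler in [("symmetric", symmetric_diff), ("skewed", skewed_diff)]:
        t_rate, w_rate = type_i_error_rates(sampler)
        print(f"{name}: t-like {t_rate:.3f}, Wilcoxon {w_rate:.3f}")
```

In the skewed case the differences have mean zero but a nonzero median, so the symmetry assumed by the signed-rank test does not hold even though the mean-based null does. This is one simple way in which skewness of score differences can separate the two tests; it is an illustration of that mechanism under the stated assumptions, not a reproduction of the paper's experiments.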

Original language: English
Title of host publication: ICTIR 2021
Subtitle of host publication: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval
Place of Publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 245-250
Number of pages: 6
ISBN (Print): 978-1-4503-8611-1
Publication status: Published - 2021
Event: 11th ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2021 - Virtual, Online, Canada
Duration: 11 Jul 2021 → 11 Jul 2021

Conference

Conference: 11th ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2021
Country: Canada
City: Virtual, Online
Period: 11/07/21 → 11/07/21

Keywords

  • simulation
  • skewness
  • statistical significance
  • type I error
