TY - GEN
T1 - Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison
AU - Sun, Zhu
AU - Yu, Di
AU - Fang, Hui
AU - Yang, Jie
AU - Qu, Xinghua
AU - Zhang, Jie
AU - Geng, Cong
PY - 2020
Y1 - 2020
N2 - With a tremendous number of recommendation algorithms proposed every year, one critical issue has attracted considerable attention: there are no effective benchmarks for evaluation, which leads to two major concerns, namely unreproducible evaluation and unfair comparison. This paper aims to conduct rigorous (i.e., reproducible and fair) evaluation for implicit-feedback-based top-N recommendation algorithms. We first systematically review 85 recommendation papers published at eight top-tier conferences (e.g., RecSys, SIGIR) to summarize important evaluation factors, such as data splitting and parameter tuning strategies. Through a holistic empirical study, the impacts of different factors on recommendation performance are then analyzed in depth. Following that, we create benchmarks with standardized procedures and provide the performance of seven well-tuned state-of-the-art algorithms across six metrics on six widely used datasets as a reference for later studies. Additionally, we release a user-friendly Python toolkit, which differs from existing ones in addressing the broad scope of rigorous evaluation for recommendation. Overall, our work sheds light on the issues in recommendation evaluation and lays the foundation for further investigation. Our code and datasets are available on GitHub (https://github.com/AmazingDD/daisyRec).
KW - Benchmarks
KW - Recommender Systems
KW - Reproducible Evaluation
UR - http://www.scopus.com/inward/record.url?scp=85092688569&partnerID=8YFLogxK
U2 - 10.1145/3383313.3412489
DO - 10.1145/3383313.3412489
M3 - Conference contribution
AN - SCOPUS:85092688569
T3 - RecSys 2020 - 14th ACM Conference on Recommender Systems
SP - 23
EP - 32
BT - RecSys 2020 - 14th ACM Conference on Recommender Systems
PB - Association for Computing Machinery (ACM)
T2 - 14th ACM Conference on Recommender Systems, RecSys 2020
Y2 - 22 September 2020 through 26 September 2020
ER -