TY - GEN
T1 - Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison
AU - Sun, Zhu
AU - Yu, Di
AU - Fang, Hui
AU - Yang, Jie
AU - Qu, Xinghua
AU - Zhang, Jie
AU - Geng, Cong
PY - 2020
Y1 - 2020
N2 - With a tremendous number of recommendation algorithms proposed every year, one critical issue has attracted considerable attention: there are no effective benchmarks for evaluation, which leads to two major concerns, namely unreproducible evaluation and unfair comparison. This paper aims to conduct rigorous (i.e., reproducible and fair) evaluation for implicit-feedback-based top-N recommendation algorithms. We first systematically review 85 recommendation papers published at eight top-tier conferences (e.g., RecSys, SIGIR) to summarize important evaluation factors, such as data splitting and parameter tuning strategies. Through a holistic empirical study, the impacts of different factors on recommendation performance are then analyzed in depth. Following that, we create benchmarks with standardized procedures and provide the performance of seven well-tuned state-of-the-art algorithms across six metrics on six widely used datasets as a reference for later studies. Additionally, we release a user-friendly Python toolkit, which differs from existing ones in addressing the broad scope of rigorous evaluation for recommendation. Overall, our work sheds light on the issues in recommendation evaluation and lays the foundation for further investigation. Our code and datasets are available on GitHub (https://github.com/AmazingDD/daisyRec).
KW - Benchmarks
KW - Recommender Systems
KW - Reproducible Evaluation
UR - http://www.scopus.com/inward/record.url?scp=85092688569&partnerID=8YFLogxK
U2 - 10.1145/3383313.3412489
DO - 10.1145/3383313.3412489
M3 - Conference contribution
AN - SCOPUS:85092688569
T3 - RecSys 2020 - 14th ACM Conference on Recommender Systems
SP - 23
EP - 32
BT - RecSys 2020 - 14th ACM Conference on Recommender Systems
PB - Association for Computing Machinery (ACM)
T2 - 14th ACM Conference on Recommender Systems, RecSys 2020
Y2 - 22 September 2020 through 26 September 2020
ER -