QUICK REVIEW

[Paper Review] On the Difficulty of Evaluating Baselines: A Study on Recommender Systems

Steffen Rendle, Li Zhang|arXiv (Cornell University)|May 4, 2019
Recommender Systems and Techniques | Cited by 92
One-Sentence Summary

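This paper shows that baselines in recommender systems are hard to run properly: carefully tuned simple baselines can sometimes outperform newly proposed methods. The authors argue for standardized benchmarks and community-driven baseline tuning.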

ABSTRACT

Numerical evaluations with comparisons to baselines play a central role when judging research in recommender systems. In this paper, we show that running baselines properly is difficult. We demonstrate this issue on two extensively studied datasets. First, we show that results for baselines that have been used in numerous publications over the past five years for the Movielens 10M benchmark are suboptimal. With a careful setup of a vanilla matrix factorization baseline, we are not only able to improve upon the reported results for this baseline but even outperform the reported results of any newly proposed method. Secondly, we recap the tremendous effort that was required by the community to obtain high quality results for simple methods on the Netflix Prize. Our results indicate that empirical findings in research papers are questionable unless they were obtained on standardized benchmarks where baselines have been tuned extensively by the research community.

Motivation and Objectives

  • Show that properly tuning baselines on standard benchmarks yields strong results in recommender systems.
  • Assess how well-known baselines on Movielens 10M compare to newly proposed methods under careful setup.
  • Compare Movielens 10M findings with Netflix Prize experiences to discuss experimental reliability.

Proposed Method

  • Re-run and tune standard baselines on Movielens 10M with a vanilla matrix factorization setup.
  • Use a factorization machine framework (libFM) with five features (user, item, time, implicit user info, implicit item info).
  • Explore Bayesian matrix factorization with Gibbs sampling and SGD-based matrix factorization under varying embedding dimensions and sampling iterations.
  • Incorporate time dynamics and implicit feedback models (e.g., timeSVD++, SVD++ variants) to reproduce strong baselines.
  • Present a consolidated table of RMSE values for baselines and newer methods so that previously reported and re-tuned results can be compared directly.
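
The matrix factorization baseline named in the list above can be sketched in a few lines. What follows is a minimal illustration, assuming a biased matrix factorization trained with plain SGD on synthetic ratings. It is not libFM, not the Bayesian (Gibbs sampling) variant, and not the authors' configuration. The function names (`rmse`, `predict`, `train_mf_sgd`, `make_synthetic_ratings`) and all hyperparameter values are hypothetical choices made for this example.

```python
import numpy as np


def rmse(predictions, targets):
    """Root mean squared error between two equal-length arrays."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))


def predict(model, users, items):
    """Biased MF prediction: global mean + user bias + item bias + dot product."""
    mu, bu, bi, P, Q = model
    return mu + bu[users] + bi[items] + np.sum(P[users] * Q[items], axis=1)


def train_mf_sgd(users, items, ratings, n_users, n_items,
                 dim=16, lr=0.02, reg=0.02, epochs=30, seed=0):
    """Train biased matrix factorization with SGD and L2 regularization.

    All default hyperparameters are illustrative assumptions, not tuned values.
    """
    rng = np.random.default_rng(seed)
    mu = float(np.mean(ratings))
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    P = rng.normal(0.0, 0.1, size=(n_users, dim))  # user embeddings
    Q = rng.normal(0.0, 0.1, size=(n_items, dim))  # item embeddings
    for _ in range(epochs):
        # Visit the training ratings in a fresh random order each epoch.
        for idx in rng.permutation(len(ratings)):
            u, i = users[idx], items[idx]
            err = ratings[idx] - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()  # keep the old value for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q


def make_synthetic_ratings(n_users=60, n_items=40, true_dim=3,
                           n_obs=1500, seed=1):
    """Generate low-rank synthetic ratings (a stand-in for a real dataset)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_users, true_dim))
    V = rng.normal(size=(n_items, true_dim))
    user_bias = rng.normal(0.0, 0.3, size=n_users)
    item_bias = rng.normal(0.0, 0.3, size=n_items)
    users = rng.integers(0, n_users, size=n_obs)
    items = rng.integers(0, n_items, size=n_obs)
    signal = 0.5 * np.sum(U[users] * V[items], axis=1)
    noise = rng.normal(0.0, 0.1, size=n_obs)
    ratings = 3.5 + user_bias[users] + item_bias[items] + signal + noise
    return users, items, ratings


if __name__ == "__main__":
    n_users, n_items = 60, 40
    users, items, ratings = make_synthetic_ratings(n_users, n_items)

    # Simple 80/20 holdout split (the samples are already in random order).
    cut = int(0.8 * len(ratings))
    tr = slice(0, cut)
    te = slice(cut, None)

    # Reference point: always predict the training mean.
    mean_pred = np.full(len(ratings[te]), ratings[tr].mean())
    print("global-mean RMSE:", rmse(mean_pred, ratings[te]))

    # Sweep one hyperparameter (embedding dimension) and report held-out RMSE.
    for dim in (2, 8, 32):
        model = train_mf_sgd(users[tr], items[tr], ratings[tr],
                             n_users, n_items, dim=dim)
        score = rmse(predict(model, users[te], items[te]), ratings[te])
        print(f"dim={dim:>2}  held-out RMSE: {score:.4f}")
```

The loop at the end varies only `dim`. Sweeping `lr`, `reg`, and `epochs` on held-out data in the same way is the kind of tuning effort the paper suggests baselines often do not receive.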

Experimental Results

Research Questions

  • RQ1: Can well-tuned vanilla baselines outperform recently proposed recommender methods on a standard benchmark?
  • RQ2: How does the difficulty of running baselines affect the reliability of empirical results in recommender systems?
  • RQ3: What lessons from Netflix Prize experiments transfer to Movielens 10M regarding baseline calibration and evaluation practices?
  • RQ4: What experimental practices are necessary to obtain reliable, comparable baseline results across studies?

Key Findings

  • A carefully set up vanilla matrix factorization baseline not only improves on the results previously reported for that baseline on Movielens 10M but also outperforms the reported results of the newly proposed methods.
  • Bayesian MF (BPMF) and SGD-based MF methods can achieve substantially better RMSE when properly configured, sometimes beating newer models.
  • Time-aware and implicit-feedback enhancements (e.g., timeSVD++, timeSVD++ flipped) provide notable RMSE gains beyond standard MF baselines.
  • The Netflix Prize experience showed that obtaining high-quality results, even for simple methods, took tremendous community effort, including extensive retraining and ensemble approaches; Movielens 10M baselines have not consistently received comparable attention.
  • Standard statistical significance and reproducibility do not guarantee reliable conclusions if baselines are not properly tuned; standardized benchmarks and community tuning are essential.
  • The study questions the reliability of empirical findings from one-off evaluations on non-standardized benchmarks and emphasizes community-driven baseline improvements.

This review was generated by AI and reviewed by human editors.