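[Paper Digest] What are the best systems? New perspectives on NLP Benchmarking
Proposes a ranking method based on the Kemeny consensus for aggregating NLP benchmark results across multiple tasks, and shows that it is more reliable and robust than simple mean aggregation.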
In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics together with a way to aggregate different systems' performance. They are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.
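The scale problem mentioned in the abstract can be illustrated with a small sketch. The scores, system names, and task names below are invented for illustration and do not come from the paper; `rank_sum_ranking` is a simple rank-based aggregation (a Borda-style sum of per-task ranks), used here only to contrast with the mean.

```python
def ranks_from_scores(scores):
    """Map {system: score} to {system: rank}; rank 0 is the worst system.

    Assumes higher scores are better and that there are no ties.
    """
    ordered = sorted(scores, key=scores.get)
    return {name: rank for rank, name in enumerate(ordered)}


def mean_ranking(task_scores):
    """Order systems (best first) by their mean raw score across tasks."""
    systems = list(next(iter(task_scores.values())))
    mean = {
        s: sum(scores[s] for scores in task_scores.values()) / len(task_scores)
        for s in systems
    }
    return sorted(systems, key=mean.get, reverse=True)


def rank_sum_ranking(task_scores):
    """Order systems (best first) by the sum of their per-task ranks."""
    systems = list(next(iter(task_scores.values())))
    total = {s: 0 for s in systems}
    for scores in task_scores.values():
        for s, rank in ranks_from_scores(scores).items():
            total[s] += rank
    return sorted(systems, key=total.get, reverse=True)


# Hypothetical scores: all three tasks reported on a 0-1 scale.
scores = {
    "task_1": {"A": 0.80, "B": 0.70, "C": 0.60},
    "task_2": {"A": 0.75, "B": 0.70, "C": 0.65},
    "task_3": {"A": 0.55, "B": 0.60, "C": 0.50},
}

# Same results, but task_3 is now reported on a 0-100 scale.
rescaled = dict(scores, task_3={"A": 55.0, "B": 60.0, "C": 50.0})

print(mean_ranking(scores))        # ['A', 'B', 'C']
print(mean_ranking(rescaled))      # ['B', 'A', 'C']
print(rank_sum_ranking(scores))    # ['A', 'B', 'C']
print(rank_sum_ranking(rescaled))  # ['A', 'B', 'C']
```

In this toy case, rescaling a single task lets it dominate the mean and flips the top two systems, while the rank-based ordering is unchanged because ranks depend only on the order of scores within each task.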
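Motivation and goals
- Improve how NLP benchmarks are aggregated, moving beyond a simple average across tasks and metrics.
- Introduce a rank-aggregation framework grounded in social choice theory (the Kemeny consensus).
- Provide a scalable approximation (the Borda count) and practical aggregation procedures for both task-level and instance-level information.
- Evaluate the robustness and reliability of the proposed method on large-scale NLP benchmark data.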
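Proposed method
- Define task-level and instance-level aggregation settings for NLP benchmarks.
- Use the Kemeny consensus to aggregate task-level rankings into a final system ranking.
- Use the Borda count as a scalable approximation to the NP-hard Kemeny optimization.
- Provide two instance-level aggregation procedures: two-level (2l) aggregation and one-level (l) aggregation.
- Compare rankings using the Kendall distance and the Kendall τ correlation.
- Demonstrate robustness to score manipulation and rescaling in synthetic experiments and on large-scale empirical data.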
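The building blocks listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released implementation: rankings are lists of system names ordered best first, there are no ties, and the function names and example rankings are hypothetical. The exhaustive Kemeny search is only feasible for a handful of systems, which is why a Borda-style approximation is attractive.

```python
from itertools import combinations, permutations


def kendall_distance(r1, r2):
    """Count the system pairs that two rankings order differently."""
    pos1 = {s: i for i, s in enumerate(r1)}
    pos2 = {s: i for i, s in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def kendall_tau(r1, r2):
    """Kendall tau correlation for two tie-free rankings of the same systems."""
    n = len(r1)
    return 1 - 4 * kendall_distance(r1, r2) / (n * (n - 1))


def kemeny_consensus(rankings):
    """Brute-force Kemeny consensus: the ordering that minimizes the summed
    Kendall distance to all input rankings. Cost grows factorially with the
    number of systems, so this is for tiny examples only."""
    systems = sorted(rankings[0])
    best = min(
        permutations(systems),
        key=lambda cand: sum(kendall_distance(cand, r) for r in rankings),
    )
    return list(best)


def borda_count(rankings):
    """Borda count: a system in position p (0 = best) among n systems earns
    n - 1 - p points per ranking; systems are ordered by total points."""
    n = len(rankings[0])
    points = {s: 0 for s in rankings[0]}
    for ranking in rankings:
        for position, s in enumerate(ranking):
            points[s] += n - 1 - position
    return sorted(points, key=points.get, reverse=True)


# Hypothetical per-task rankings of four systems, best first.
task_rankings = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
    ["A", "B", "D", "C"],
]

print(kemeny_consensus(task_rankings))  # ['A', 'B', 'C', 'D']
print(borda_count(task_rankings))       # ['A', 'B', 'C', 'D']
print(kendall_distance(["A", "B", "C", "D"], ["D", "C", "B", "A"]))  # 6
```

On this small example the Borda ordering coincides with the exact Kemeny consensus; in general the two can differ, and ties (not handled here) would need an explicit tie-breaking rule.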
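Experimental results
Research questions
- RQ1: Does Kemeny-consensus-based ranking yield a more reliable system ordering than mean aggregation on multi-task NLP benchmarks?
- RQ2: How robust is rank aggregation to score manipulation and scale changes across tasks?
- RQ3: How does adding or removing tasks or metrics affect the resulting ranking?
- RQ4: On large NLP benchmarks, how do task-level and instance-level aggregation differ in practice?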
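Key findings
- Ranking via the Kemeny consensus can produce a different top system than mean aggregation.
- Among the proposed methods, two-level (2l) aggregation is the most robust to manipulation and to changes in the task set.
- Rank aggregation is more robust than mean aggregation to adding or removing tasks.
- In large-scale experiments across GLUE, SGLUE, XTREM, and NLG datasets, task-level rankings differ from mean-based rankings: the two largely agree on which systems belong at the top but order them differently.
- The authors release code and data to encourage use of the method in multi-task and multi-criteria benchmarks.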
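The difference between the two instance-level procedures can be sketched as follows. This reflects one reading of the summary above and may not match the paper's exact definitions: the two-level (2l) variant first aggregates instance rankings within each task and then aggregates the resulting task rankings, while the one-level (l) variant pools all instance rankings and aggregates once. Both use a Borda count here; the data and function names are hypothetical.

```python
def borda_count(rankings):
    """Borda count over tie-free rankings (lists of systems, best first)."""
    n = len(rankings[0])
    points = {s: 0 for s in rankings[0]}
    for ranking in rankings:
        for position, s in enumerate(ranking):
            points[s] += n - 1 - position
    return sorted(points, key=points.get, reverse=True)


def two_level(instance_rankings):
    """Aggregate within each task first, then across tasks.

    instance_rankings maps a task name to a list of per-instance rankings.
    """
    task_level = [borda_count(rs) for rs in instance_rankings.values()]
    return borda_count(task_level)


def one_level(instance_rankings):
    """Pool every instance ranking from every task and aggregate once."""
    pooled = [r for rs in instance_rankings.values() for r in rs]
    return borda_count(pooled)


# Hypothetical data: one task with three instances, two tasks with one each.
instance_rankings = {
    "big_task": [["B", "A", "C"], ["B", "A", "C"], ["B", "A", "C"]],
    "small_task_1": [["A", "B", "C"]],
    "small_task_2": [["A", "B", "C"]],
}

print(two_level(instance_rankings))  # ['A', 'B', 'C']
print(one_level(instance_rankings))  # ['B', 'A', 'C']
```

In this toy case the two procedures disagree because pooling lets the task with more instances carry more weight, whereas the two-level variant gives each task a single vote. The example only shows that the procedures can differ; it does not reproduce the paper's robustness results.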
This digest was generated by AI and reviewed by a human editor.