QUICK REVIEW

[Paper Review] Wider and Deeper LLM Networks are Fairer LLM Evaluators

Xinghua Zhang, Bowen Yu|arXiv (Cornell University)|Aug 3, 2023

Natural Language Processing Techniques|Cited by 17

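One-Sentence Summary

This paper proposes WideDeep, a wider and deeper LLM-based evaluator in which each neuron plays a distinct role. It shows that a wider two-layer network evaluates LLM outputs more fairly and speeds up evaluation, and it introduces the LLMEval2 benchmark.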

ABSTRACT

Measuring the quality of responses generated by LLMs is a challenging task, particularly when it comes to evaluating whether the response is aligned with human preference. A novel approach involves using the LLM itself to make evaluation and stabilizing the results through multiple independent evaluations, similar to a single-layer narrow LLM network. This network consists of a fixed number of neurons, with each neuron being the same LLM. In this paper, we draw upon the extensive research on deep neural networks to explore whether deeper and wider networks can lead to fairer evaluations. Specifically, inspired by the observation that different neurons in a neural network are responsible for detecting different concepts, we first adaptively generate as many neuron roles as possible for each evaluation sample. Each perspective corresponds to the role of a specific LLM neuron in the first layer. In subsequent layers, we follow the idea that higher layers in deep networks are responsible for more comprehensive features, each layer receives representations from all neurons in the previous layer, integrating the locally learned evaluation information to obtain a more comprehensive evaluation result. Interestingly, this network design resembles the process of academic paper reviewing. To validate the effectiveness of our method, we construct the largest and most diverse English evaluation benchmark LLMEval$^2$ for LLM evaluators, comprising 15 tasks, 8 abilities, and 2,553 samples. Experimental results demonstrate that a wider network (involving many reviewers) with 2 layers (one round of discussion) performs the best, improving kappa correlation coefficient from 0.28 to 0.34. We also leverage WideDeep to aid in the assessment of Chinese LLMs, which has accelerated the evaluation time by 4.6 times, resulting in a 60% cost saving. WideDeep achieves a remarkable 93% agreement level among humans.

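Research Motivation and Goals

Motivate and formalize the use of multi-layer, multi-role LLM evaluators to improve agreement with human preferences.
Study how much widening and deepening the evaluator network matters for the fairness and reliability of evaluation.
Validate WideDeep on English and Chinese LLM evaluation benchmarks and analyze the neuron roles.
Provide LLMEval2, a large, diverse, multi-task, multi-ability evaluation benchmark for LLM evaluators.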

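Proposed Method

Define a multi-layer, wide LLM network in which each neuron is a frozen LLM with a specific evaluation role.
Use a neuron-role prompt to generate adaptive neuron roles for each sample, creating diverse perspectives.
Connect the layers without trainable weights; a prompt (pi2) stands in for the weights and establishes the connections between neurons.
Aggregate the layer outputs into a final score with the c1 (averaging) and c2 (neuron voting) strategies.
Use the analogy of academic paper reviewing (blind review, discussion, chair decision) to motivate the evaluation process.
Construct LLMEval2, a large and diverse benchmark (15 tasks, 8 abilities, 2,553 samples) for LLM evaluators.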
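
The network described above can be illustrated with a small Python sketch. This is not the authors' implementation: the `Judge` interface, the `Review` record, the role names, and `toy_judge` (a deterministic stand-in for a prompted, frozen LLM) are all assumptions made for illustration, and aggregating only the final layer is likewise an assumption. The sketch shows the general shape of the idea: role-specific neurons in the first layer, a second layer in which every neuron sees all first-layer reviews, and two aggregation strategies (averaging and voting).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple


@dataclass
class Review:
    """One neuron's output: its role plus a score for each candidate response."""
    role: str
    score_a: float
    score_b: float

    def verdict(self) -> str:
        if self.score_a > self.score_b:
            return "A"
        if self.score_b > self.score_a:
            return "B"
        return "tie"


# Hypothetical interface: (role, sample, previous-layer reviews) -> (score_a, score_b).
# A real judge would wrap a prompted call to a frozen LLM.
Judge = Callable[[str, str, List[Review]], Tuple[float, float]]


def run_network(judge: Judge, roles: List[str], sample: str,
                num_layers: int = 2) -> List[List[Review]]:
    """Run every role neuron in each layer; each layer sees all reviews of the previous one."""
    layers: List[List[Review]] = []
    previous: List[Review] = []
    for _ in range(num_layers):
        previous = [Review(role, *judge(role, sample, previous)) for role in roles]
        layers.append(previous)
    return layers


def aggregate_average(reviews: List[Review]) -> str:
    """Averaging strategy: compare the mean scores of the two responses."""
    a = mean(r.score_a for r in reviews)
    b = mean(r.score_b for r in reviews)
    return "A" if a > b else "B" if b > a else "tie"


def aggregate_vote(reviews: List[Review]) -> str:
    """Voting strategy: each neuron casts one vote; equal counts give a tie."""
    votes_a = sum(r.verdict() == "A" for r in reviews)
    votes_b = sum(r.verdict() == "B" for r in reviews)
    return "A" if votes_a > votes_b else "B" if votes_b > votes_a else "tie"


def toy_judge(role: str, sample: str, previous: List[Review]) -> Tuple[float, float]:
    """Deterministic stand-in for an LLM call, with made-up scores per role."""
    base = {"accuracy": (8.0, 6.0), "fluency": (5.0, 7.0), "relevance": (7.0, 6.0)}[role]
    if not previous:
        return base  # first layer: independent, role-specific review
    # Later layers: blend the neuron's own view with the previous layer's consensus.
    avg_a = mean(r.score_a for r in previous)
    avg_b = mean(r.score_b for r in previous)
    return ((base[0] + avg_a) / 2, (base[1] + avg_b) / 2)


layers = run_network(toy_judge, ["accuracy", "fluency", "relevance"],
                     "question + response A + response B")
print(aggregate_average(layers[-1]), aggregate_vote(layers[-1]))  # prints: A A
```

Passing the full list of previous-layer reviews to every neuron mirrors the "discussion" round in the peer-review analogy; with real LLM calls, those reviews would be serialized into the prompt rather than averaged numerically.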

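Experimental Results

Research Questions

RQ1: Do wider and deeper LLM evaluator networks improve agreement with human preferences?
RQ2: Which neuron roles are more effective for evaluating different tasks, and how do they affect the results?
RQ3: Can WideDeep speed up human annotation and reduce the cost of real-world LLM evaluation?
RQ4: How does WideDeep perform in English and Chinese LLM evaluation settings?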

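Key Findings

Compared with the single-layer FairEval, WideDeep (two layers, wide network) significantly improves evaluation accuracy and kappa on the FairEval, PandaLM, and LLMEval2 benchmarks; the abstract reports the kappa correlation coefficient rising from 0.28 to 0.34.
With two layers, increasing the width (more neurons) yields better results; networks deeper than two layers may perform worse because the information becomes homogenized.
Diverse neuron roles matter: removing neuron-role guidance lowers performance, while an unlimited number of role-bearing neurons achieves higher accuracy.
On Chinese LLM evaluation, WideDeep outperforms the baselines, reaching higher annotation accuracy (74%), 93% human agreement, and substantial time and cost savings (4.6 times faster, 60% lower cost).
LLMEval2 is a comprehensive, diverse benchmark that addresses the limitations of earlier datasets and supports robust evaluation of LLM evaluators.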
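
The kappa values cited in the findings measure agreement beyond chance. As a rough illustration, and assuming the "kappa correlation coefficient" is Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance, the sketch below computes it for two label sequences. The labels are invented for the example and are not data from the paper.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa between two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented preference labels: which response each rater prefers per sample.
human = ["A", "A", "B", "B", "tie", "A", "B", "A"]
judge = ["A", "B", "B", "B", "tie", "A", "A", "A"]
print(round(cohen_kappa(human, judge), 4))  # prints: 0.5789
```

Here the raters agree on 6 of 8 items (accuracy 0.75), but because about 0.41 agreement would be expected by chance, kappa is lower than accuracy. This is why a kappa of 0.34 can coexist with much higher raw agreement figures.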

This review was generated by AI and checked by a human editor.