QUICK REVIEW

[Paper Review] SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Ansar Aynetdinov, Alan Akbik|arXiv (Cornell University)|2024. 01. 30.

Natural Language Processing Techniques | Citations: 8

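One-line summary

SemScore evaluates the quality of instruction-tuned LLM outputs by measuring their semantic similarity to gold target responses; across 12 models, it shows the highest correlation with human judgment among the 9 metrics compared.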

ABSTRACT

Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions. However, many current works rely on manual evaluation to judge the quality of generated responses. Since such manual evaluation is time-consuming, it does not easily scale to the evaluation of multiple models and model variants. In this short paper, we propose a straightforward but remarkably effective evaluation metric called SemScore, in which we directly compare model outputs to gold target responses using semantic textual similarity (STS). We conduct a comparative evaluation of the model outputs of 12 prominent instruction-tuned LLMs using 8 widely-used evaluation metrics for text generation. We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation to human evaluation. These findings indicate the utility of our proposed metric for the evaluation of instruction-tuned LLMs.

Research motivation and goals

Motivate scalable, automated evaluation of instruction-tuned LLMs to replace time-consuming manual judgments.
Assess the shortcomings of traditional n-gram metrics for evaluating diverse instruction-following tasks.
Propose a simple, effective semantic similarity-based metric (SemScore) and compare it to existing metrics against human rankings.
Provide insights into the robustness of SemScore across models and tasks.

Proposed method

Compute SemScore by embedding model outputs and target responses with a sentence transformer (all-mpnet-base-v2) and taking cosine similarity.
Collect human evaluations for 12 models (GPT-4, GPT-3.5-turbo, text-davinci variants, LLaMA, Alpaca) on 252 instructions.
Evaluate 8 baseline text-generation metrics (BLEU, ROUGE-L, BERTScore, BLEURT, BARTScore, BARTScore para, DiscoScore, G-Eval) plus SemScore.
Correlate automated metric scores with human rankings using Kendall’s tau and Pearson r.
Ablation: compare SemScore using different pooling strategies (CLS vs mean-pooling) and alternative transformers.
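
The scoring step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. `toy_embed` is a hypothetical bag-of-words stand-in for a sentence encoder such as all-mpnet-base-v2, used only so the example runs without downloading a model. The names `semscore`, `mean_pool`, and `cls_pool` are illustrative, and averaging per-example similarities into a single model-level score is an assumption about the aggregation.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean pooling: average all token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cls_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """CLS pooling: use the first token's vector as the sentence vector."""
    return list(token_vectors[0])


def semscore(
    outputs: Sequence[str],
    targets: Sequence[str],
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Average cosine similarity between each output and its gold target."""
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("outputs and targets must be non-empty and aligned")
    sims = [cosine_similarity(embed(o), embed(t)) for o, t in zip(outputs, targets)]
    return sum(sims) / len(sims)


# Hypothetical stand-in encoder: word counts over a tiny fixed vocabulary.
# It ignores word order and meaning, so it is NOT a semantic encoder.
VOCAB = ["paris", "is", "the", "capital", "of", "france", "berlin"]


def toy_embed(text: str) -> List[float]:
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


if __name__ == "__main__":
    targets = ["paris is the capital of france"]
    print(semscore(["the capital of france is paris"], targets, toy_embed))
    print(semscore(["berlin is the capital"], targets, toy_embed))
```

In practice the `embed` argument would wrap a real encoder; with the sentence-transformers library, for instance, one could pass the `encode` method of `SentenceTransformer("all-mpnet-base-v2")`. The two pooling helpers only illustrate the difference named in the ablation: mean pooling averages every token vector, while CLS pooling keeps the first one.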

Experimental results

Research questions

RQ1: How well does SemScore correlate with human judgments compared to 8 existing metrics?
RQ2: Does a simple embedding-based STS approach suffice for evaluating instruction-tuned LLM outputs across diverse tasks?
RQ3: What is the impact of the underlying transformer model and pooling strategy on SemScore’s performance?
RQ4: How do instruction-tuned models rank relative to non-instruction-tuned baselines in human evaluations?

Key results

SemScore achieves the strongest correlation to human judgments among all metrics tested (Kendall τ = 0.879, Pearson r = 0.970).
SemScore outperforms LLM-based evaluators like G-Eval in correlation with human rankings under the reported setup.
Among embedding-based metrics, SemScore slightly outperforms BERTScore for the evaluated dataset.
Ablation shows SemScore with all-mpnet-base-v2 and mean pooling performs best compared to DeBERTa variants.
G-Eval and BERTScore also show high correlations, but SemScore remains the top performer in this study.
The method remains simple, reproducible, and does not require special access to proprietary evaluators.
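
The correlation figures reported above compare model-level metric scores against human judgments. A minimal sketch of the two statistics follows. The function names are illustrative, the Kendall variant shown is tau-a (which makes no correction for ties, an assumption that may differ from the variant used in the paper), and the numbers in the usage section are made-up placeholders, not results from the study.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both sequences order this pair the same way
        elif sign < 0:
            discordant += 1  # the two sequences disagree on this pair
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical placeholder values for four models (higher is better).
    metric_scores = [0.81, 0.74, 0.62, 0.55]
    human_scores = [4.0, 3.0, 3.5, 1.0]
    print(kendall_tau_a(metric_scores, human_scores))
    print(pearson_r(metric_scores, human_scores))
```

Kendall's tau depends only on whether the metric orders pairs of models the same way humans do, while Pearson r also reflects how linearly the score magnitudes track each other, which is why reporting both gives a fuller picture. If the human judgments are expressed as ranks where 1 is best, the sign of both statistics flips relative to a higher-is-better metric.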

This review was generated by AI and reviewed by a human editor.