QUICK REVIEW

[Paper Review] The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs

Laura Ruis, Akbir Khan | arXiv (Cornell University) | October 26, 2022

Natural Language Processing Techniques | Citations: 21

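One-line summary

This paper evaluates how fine-tuning strategy affects the ability to interpret conversational implicatures, shows that example-level instruction tuning yields the strongest pragmatic understanding, and shows that GPT-4 reaches average human-level performance with chain-of-thought prompting.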

ABSTRACT

Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context -- incorporating its pragmatics. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate four categories of widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), models in three of these categories perform close to random. However, LLMs instruction-tuned at the example-level perform significantly better. These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.

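Motivation and objectives

Emphasizes the importance of implicature understanding as a crucial and still under-evaluated aspect of communication.
Designs an implicature resolution task with a robust evaluation protocol that covers humans and multiple model classes.
Evaluates zero-shot and few-shot performance.
Identifies which fine-tuning strategies best promote pragmatic understanding in LLMs.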

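Proposed method

Defines a binary implicature resolution task using a dataset of naturally occurring conversational implicatures.
Evaluates models across four groups: base pretrained models, dialogue fine-tuned models, benchmark-level instruction-tuned models, and example-level instruction-tuned models.
Uses zero-shot and few-shot (k = 1, 5) prompting with six prompt templates to test prompt sensitivity.
Applies in-context prompting and chain-of-thought prompting to assess the effects of scale and reasoning.
Compares model performance against human annotators (86.2% on average).
Assesses whether improvements hold across model sizes and prompt types.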
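
The zero-shot and few-shot setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Example` class, the `TEMPLATE` wording, and the `build_prompt` helper are all hypothetical names introduced here, and the single template stands in for the six templates used in the study, whose wording is not reproduced in this review.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    """One implicature item (hypothetical container, not from the paper)."""
    utterance: str    # the question that was asked
    response: str     # the indirect reply
    implicature: str  # the implied answer: "yes" or "no"


# Assumed template wording, for illustration only.
TEMPLATE = (
    'Question: "{utterance}"\n'
    'Response: "{response}"\n'
    "Implied answer (yes or no): {answer}"
)


def format_example(ex: Example, with_answer: bool) -> str:
    """Render one item; the answer slot is left empty for the test item."""
    answer = ex.implicature if with_answer else ""
    text = TEMPLATE.format(
        utterance=ex.utterance, response=ex.response, answer=answer
    )
    return text.rstrip()


def build_prompt(test: Example, shots: Sequence[Example] = (), k: int = 0) -> str:
    """Build a k-shot prompt: k solved examples followed by the unsolved test item."""
    if k > len(shots):
        raise ValueError("not enough in-context examples for the requested k")
    blocks = [format_example(s, with_answer=True) for s in shots[:k]]
    blocks.append(format_example(test, with_answer=False))
    return "\n\n".join(blocks)
```

With `k=0` this produces a zero-shot prompt that ends at the empty answer slot. With `k=1` or `k=5` the same test item is preceded by solved examples, so prompt sensitivity can be probed by swapping `TEMPLATE` while holding the items fixed.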

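Experimental results

Research questions

RQ1: Can LLMs interpret conversational implicatures, and how does their performance compare with that of humans?
RQ2: How do different fine-tuning strategies (base, dialogue FT, benchmark IT, example IT) affect pragmatic understanding?
RQ3: What effect do few-shot learning and chain-of-thought prompting have on implicature resolution?
RQ4: What effect does model size have on implicature resolution within each fine-tuning category?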

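Key findings

Example-level instruction-tuned models consistently outperform every other model group on implicature resolution.
GPT-4 with chain-of-thought prompting reaches average human-level performance (~86.5%, against a human average of 86.2%).
Base models and models not tuned at the example level mostly remain close to random, where chance is 50% on this binary task (about 60% zero-shot, and in many cases around 50-60%).
The benefit of scale is most pronounced for example-level instruction-tuned models; some base models show size-related gains while others plateau.
Chain-of-thought prompting improves performance for several models; with GPT-4 in particular, it reaches human-level performance in this setting.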
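
To make the comparison against chance and the human average concrete, here is a small scoring sketch. It assumes that accuracy is computed per template and then averaged, which this review does not spell out. The function names and the sample numbers in the usage note are hypothetical. Only the 86.2% human average comes from the review, and the 50% chance level follows from the task being binary.

```python
from statistics import mean
from typing import Dict, Sequence

HUMAN_AVERAGE = 86.2  # human average accuracy (%) reported in this review
CHANCE = 50.0         # chance level (%) for a binary yes/no task


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Percentage of predictions that match the gold yes/no labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def summarize(per_template_accuracy: Sequence[float]) -> Dict[str, float]:
    """Average accuracy over templates and relate it to chance and humans."""
    avg = mean(per_template_accuracy)
    return {
        "mean": avg,
        "above_chance": avg - CHANCE,
        "gap_to_human": HUMAN_AVERAGE - avg,
    }
```

For instance, calling `summarize` on made-up per-template accuracies of 60.0, 50.0, and 55.0 gives a mean of 55.0, which is 5 points above chance and roughly 31 points below the human average.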

This review was generated by AI and checked by a human editor.