QUICK REVIEW

[Paper Review] ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Ning Bian, Xianpei Han|arXiv (Cornell University)|2023. 03. 29.

Topic Modeling|Citations: 47

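One-Line Summary

This paper evaluates ChatGPT and other LLMs on 11 commonsense QA datasets, examining their ability to answer questions, to understand and accurately recall the necessary knowledge, and to use that knowledge in reasoning. It finds that ChatGPT is a knowledgeable but inexperienced solver with limited ability to selectively apply its knowledge.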

ABSTRACT

Large language models (LLMs) have made significant progress in NLP. However, their ability to memorize, represent, and leverage commonsense knowledge has been a well-known pain point. In this paper, we specifically focus on ChatGPT, a widely used and easily accessible LLM, and ask the following questions: (1) Can ChatGPT effectively answer commonsense questions? (2) Is ChatGPT aware of the underlying commonsense knowledge for answering a specific question? (3) Is ChatGPT knowledgeable in commonsense? (4) Can ChatGPT effectively leverage commonsense for answering questions? We conduct a series of experiments on 11 datasets to evaluate ChatGPT's commonsense abilities, including answering commonsense questions, identifying necessary knowledge, generating knowledge descriptions, and using knowledge descriptions to answer questions again. Experimental results show that: (1) ChatGPT can achieve good QA accuracies in commonsense tasks, while still struggling with certain domains of datasets. (2) ChatGPT is knowledgeable, and can accurately generate most of the commonsense knowledge using knowledge prompts. (3) Despite its knowledge, ChatGPT is an inexperienced commonsense problem solver, which cannot precisely identify the needed commonsense for answering a specific question. These findings raise the need to explore improved mechanisms for effectively incorporating commonsense into LLMs like ChatGPT, such as better instruction following and commonsense guidance.

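Research Motivation and Goals

Evaluate whether GPT models can accurately answer commonsense questions across diverse domains.
Check whether GPT models are aware of the knowledge needed to answer a question and can enumerate it.
Assess whether GPT models can recall and describe the commonsense knowledge a question requires.
Investigate whether GPT models can improve their reasoning by using generated knowledge as context.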

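Proposed Method

Uses 11 commonsense QA datasets spanning the general, physical, social, science, event, numerical, prototypical, and temporal domains.
Compares GPT-3 (davinci), GPT-3.5 (text-davinci-003), and ChatGPT, using 4-shot prompts for GPT-3 and zero-shot prompts for GPT-3.5 and ChatGPT.
Evaluates QA accuracy on each dataset.
Asks the model to describe the knowledge needed to answer each question and evaluates the precision and recall of those descriptions.
Tests knowledge utilization by having ChatGPT answer the questions again with the generated knowledge supplied as context.
Analyzes the correlation between knowledge accuracy and answer accuracy.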
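
The precision/recall step above can be illustrated with a small set-based calculation. This is a minimal sketch, not the paper's evaluation code: it assumes each question comes with a set of knowledge items the model listed and a set that annotators judged necessary. The function name and the example items are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 between two collections of knowledge items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: items a model listed vs. items judged necessary by annotators.
model_listed = {"ice melts above 0 C", "water is a liquid", "metal conducts heat"}
annotated_needed = {"ice melts above 0 C", "water is a liquid"}

p, r, f1 = precision_recall_f1(model_listed, annotated_needed)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=1.00 f1=0.80
```

In this toy case the model lists everything that is needed (recall 1.00) but also one irrelevant item, which lowers precision. That is the kind of imprecise knowledge identification the review describes.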

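Experimental Results

Research Questions

RQ1: Can GPT models effectively answer commonsense questions across diverse domains?
RQ2: Are GPT models knowledgeable in commonsense, and can they generate the relevant knowledge when given knowledge prompts?
RQ3: Are GPT models aware of the underlying knowledge needed to answer a specific question?
RQ4: Can GPT models use commonsense knowledge provided in context to improve their answers?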

Key Results

Dataset	Domain	GPT-3	GPT-3.5	ChatGPT
CommonsenseQA	General	38	81	74
OpenBookQA	General	22	65	73
WSC	General	46	78	78
PIQA	Physical	48	77	78
Social IQA	Social	36	71	62
ARC	Science	27	88	94
QASC	Science	25	75	74
HellaSWAG	Event	19	61	67
NumerSense	Numerical	45	63	79
ProtoQA	Prototypical	67.3	84.6	94.2
MC-TACO	Temporal	20	53	52
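
The table can be summarized with a few lines of Python. The sketch below copies the values from the table as they appear and assumes they are QA accuracies in percent. The helper names `mean_scores` and `lowest` are illustrative, and the unweighted mean across datasets is a convenience for this example, not a figure reported in the review.

```python
# Values copied from the results table: (domain, GPT-3, GPT-3.5, ChatGPT).
RESULTS = {
    "CommonsenseQA": ("General", 38, 81, 74),
    "OpenBookQA": ("General", 22, 65, 73),
    "WSC": ("General", 46, 78, 78),
    "PIQA": ("Physical", 48, 77, 78),
    "Social IQA": ("Social", 36, 71, 62),
    "ARC": ("Science", 27, 88, 94),
    "QASC": ("Science", 25, 75, 74),
    "HellaSWAG": ("Event", 19, 61, 67),
    "NumerSense": ("Numerical", 45, 63, 79),
    "ProtoQA": ("Prototypical", 67.3, 84.6, 94.2),
    "MC-TACO": ("Temporal", 20, 53, 52),
}
MODELS = ("GPT-3", "GPT-3.5", "ChatGPT")


def mean_scores(results):
    """Unweighted mean score per model across all datasets."""
    n = len(results)
    return {
        model: sum(row[i + 1] for row in results.values()) / n
        for i, model in enumerate(MODELS)
    }


def lowest(results, model, k=3):
    """The k datasets on which the given model scores lowest."""
    col = MODELS.index(model) + 1
    ranked = sorted(results.items(), key=lambda item: item[1][col])
    return [(name, row[0], row[col]) for name, row in ranked[:k]]


means = mean_scores(RESULTS)
for model, value in means.items():
    print(f"{model}: {value:.1f}")
# prints: GPT-3: 35.8, GPT-3.5: 72.4, ChatGPT: 75.0 (one per line)

for name, domain, score in lowest(RESULTS, "ChatGPT"):
    print(f"{name} ({domain}): {score}")
# prints: MC-TACO (Temporal): 52, Social IQA (Social): 62, HellaSWAG (Event): 67
```

ChatGPT's three lowest scores fall in the temporal, social, and event domains.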

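GPT models achieve good QA accuracy on commonsense tasks but struggle with certain types of knowledge, particularly in the social, event, and temporal domains.
ChatGPT is knowledgeable and can accurately generate most commonsense knowledge when given knowledge prompts.
ChatGPT is an inexperienced commonsense problem solver: it cannot precisely identify the knowledge needed for a given question.
GPT models have limited ability to improve their answers by using generated knowledge as context; adding the generated knowledge descriptions yields mixed effects or no significant gain.
The quality of the generated necessary knowledge (knowledge F1) correlates strongly with overall answer accuracy (Pearson 0.77).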
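
For readers unfamiliar with the Pearson coefficient mentioned in the last finding, here is a minimal sketch of how such a correlation could be computed. The input numbers are hypothetical placeholders, not data from the paper, so the result will not reproduce the reported 0.77. The function `pearson` is written out by hand for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-dataset values, NOT taken from the paper.
knowledge_f1 = [0.40, 0.55, 0.60, 0.70, 0.85]
answer_accuracy = [0.50, 0.62, 0.58, 0.75, 0.90]

r = pearson(knowledge_f1, answer_accuracy)
print(f"Pearson r = {r:.2f}")
```

A value near 1 means that higher knowledge F1 tends to go with higher answer accuracy. A correlation alone does not show that better knowledge identification causes better answers.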


This review was generated by AI and reviewed by a human editor.