QUICK REVIEW

[Paper Review] Stance Detection on Social Media with Fine-Tuned Large Language Models

İlker Gül, Rémi Lebret | arXiv (Cornell University) | 2024-04-18

Sentiment Analysis and Opinion Mining | Citations: 11

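One-Line Summary

This paper fine-tunes ChatGPT, LLaMa-2, and Mistral-7B on several public datasets and evaluates stance detection in zero-shot, few-shot, and fine-tuned settings; the fine-tuned LLMs, notably ChatGPT-ft and the LLaMa-2 and Mistral variants, perform strongly.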

ABSTRACT

Stance detection, a key task in natural language processing, determines an author's viewpoint based on textual analysis. This study evaluates the evolution of stance detection methods, transitioning from early machine learning approaches to the groundbreaking BERT model, and eventually to modern Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While ChatGPT's closed-source nature and associated costs present challenges, the open-source models like LLaMa-2 and Mistral-7B offer an encouraging alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2, and Mistral-7B using several publicly available datasets. Subsequently, to provide a comprehensive comparison, we assess the performance of these models in zero-shot and few-shot learning scenarios. The results underscore the exceptional ability of LLMs in accurately detecting stance, with all tested models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B demonstrate remarkable efficiency and potential for stance detection, despite their smaller sizes compared to ChatGPT. This study emphasizes the potential of LLMs in stance detection and calls for more extensive research in this field.

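Research Motivation and Goals

Assess how stance detection methods for social media have progressed from traditional ML to BERT and then to LLMs.
Evaluate fine-tuned LLMs (ChatGPT, LLaMa-2, Mistral-7B) on stance detection datasets.
Compare zero-shot, few-shot, and full fine-tuning performance across a range of targets and topics.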

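Proposed Method

Fine-tune ChatGPT, LLaMa-2 (7B/13B), and Mistral-7B on the SemEval-2016, P-Stance, and Twitter Stance 2020 datasets, with LoRA applied to the open-source models.
Train on A100 GPUs in BF16 for 3 epochs with a learning rate of 3e-4, a batch size of 128, and a warmup covering 10% of the data.
Evaluate zero-shot and few-shot prompting with the instruction-tuned variants for comparison.
Use dataset-specific prompt templates, with the prompts documented in the appendix.
Report F_avg and F1-macro across targets as the main metrics.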
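The two metrics named above can be sketched in a few lines. This is a minimal illustration, assuming F_avg follows the SemEval-2016 Task 6 convention (the mean of the F1 scores for the favor and against classes) and that F1-macro averages F1 over every label present in the gold data. The label strings and function names are hypothetical and not taken from the paper.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score for a single class, treating it as the positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f_avg(gold: Sequence[str], pred: Sequence[str],
          favor: str = "FAVOR", against: str = "AGAINST") -> float:
    """Mean of the favor and against F1 scores (assumed SemEval-2016 convention)."""
    return (f1_for_label(gold, pred, favor) + f1_for_label(gold, pred, against)) / 2


def f1_macro(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in the gold data."""
    labels = sorted(set(gold))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Toy data, invented for illustration only.
gold = ["FAVOR", "FAVOR", "AGAINST", "AGAINST", "NONE", "NONE"]
pred = ["FAVOR", "AGAINST", "AGAINST", "AGAINST", "NONE", "FAVOR"]
print(round(f_avg(gold, pred), 3), round(f1_macro(gold, pred), 3))
```

Under this convention the neutral class still affects F_avg indirectly: a neutral tweet misclassified as favor counts as a false positive for the favor class.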

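Experimental Results

Research Questions

RQ1: How do fine-tuned LLMs compare with traditional baselines for stance detection on social media datasets?
RQ2: How do training size and prompting strategy (zero-shot, few-shot, fine-tuning) affect stance detection performance?
RQ3: Which targets (political figures and topics) show the largest improvement from fine-tuning on SemEval-2016, P-Stance, and Twitter Stance 2020?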

Key Results

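SemEval-2016 results by target (FM = Feminist Movement, HC = Hillary Clinton, LA = Legalization of Abortion, A = Atheism, CC = Climate Change is a Real Concern, DT = Donald Trump; "-" = not reported)
Model	FM	HC	LA	A	CC	DT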
BiCond	61.4	59.8	54.5	-	-	59.0
MemNet	57.8	60.3	61.0	-	-	-
AoA	60.0	58.2	62.4	-	-	-
TAN	55.8	65.4	63.7	59.3	53.5	-
ASGCN	58.7	64.3	62.9	-	-	58.7
AT-JSS-Lex	61.5	68.3	68.4	69.2	59.2	-
TPDG	67.3	73.4	74.7	-	-	63.0
TR-Tweet+COT	70.6	78.7	63.8	72.9	54.1	-
COLA	69.1	75.9	71.0	62.3	64.0	71.2
ChatGPT-ft	79.7	83.4	72.6	81.3	86.2	70.4
LLaMa-2-7b-ft	73.3	84.2	71.2	78.9	69.8	72.0
LLaMa-2-13b-ft	76.0	86.5	72.5	76.9	80.4	70.9
Mistral-7b-ft	78.7	85.0	76.0	74.7	71.8	68.6
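
Reading the table column by column shows that no single model leads on every target. The sketch below transcribes the table and picks the top score per target. The names `TARGETS`, `SCORES`, and `best_per_target` are illustrative and not from the paper, and `None` stands in for cells shown as "-".

```python
# Scores transcribed from the table above; None marks cells shown as "-".
TARGETS = ["FM", "HC", "LA", "A", "CC", "DT"]
SCORES = {
    "BiCond":         [61.4, 59.8, 54.5, None, None, 59.0],
    "MemNet":         [57.8, 60.3, 61.0, None, None, None],
    "AoA":            [60.0, 58.2, 62.4, None, None, None],
    "TAN":            [55.8, 65.4, 63.7, 59.3, 53.5, None],
    "ASGCN":          [58.7, 64.3, 62.9, None, None, 58.7],
    "AT-JSS-Lex":     [61.5, 68.3, 68.4, 69.2, 59.2, None],
    "TPDG":           [67.3, 73.4, 74.7, None, None, 63.0],
    "TR-Tweet+COT":   [70.6, 78.7, 63.8, 72.9, 54.1, None],
    "COLA":           [69.1, 75.9, 71.0, 62.3, 64.0, 71.2],
    "ChatGPT-ft":     [79.7, 83.4, 72.6, 81.3, 86.2, 70.4],
    "LLaMa-2-7b-ft":  [73.3, 84.2, 71.2, 78.9, 69.8, 72.0],
    "LLaMa-2-13b-ft": [76.0, 86.5, 72.5, 76.9, 80.4, 70.9],
    "Mistral-7b-ft":  [78.7, 85.0, 76.0, 74.7, 71.8, 68.6],
}


def best_per_target(scores, targets):
    """Return {target: (model, score)} for the highest reported score per column."""
    best = {}
    for i, target in enumerate(targets):
        # Skip models with no reported score for this target.
        reported = [(model, row[i]) for model, row in scores.items() if row[i] is not None]
        best[target] = max(reported, key=lambda pair: pair[1])
    return best


for target, (model, score) in best_per_target(SCORES, TARGETS).items():
    print(f"{target}: {model} ({score})")
```

On these numbers, ChatGPT-ft leads on FM, A, and CC, while the open-source models lead on HC (LLaMa-2-13b-ft), LA (Mistral-7b-ft), and DT (LLaMa-2-7b-ft).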

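On SemEval-2016, the fine-tuned LLMs score substantially higher than the baselines: ChatGPT-ft reaches 79.7 on FM and LLaMa-2-13b-ft reaches 86.5 on HC.
On P-Stance, ChatGPT-ft achieves the highest F_avg: 81.8 for Bernie, 89.7 for Biden, and 91.9 for Trump.
On Twitter Stance 2020, ChatGPT-ft achieves an F1-macro of 85.1 (Biden) and 85.6 (Trump).
Moving from zero-shot or few-shot prompting to fine-tuning yields a substantial gain; for example, LLaMa-2-7b improves on FM from 51.6 (zero-shot) to 73.3 (fine-tuned).
The training-size experiments show that for some targets, 70% of the data yields performance nearly on par with full training (e.g., LLaMa-2-7b on HC).
Open-source LoRA-tuned LLMs deliver stance detection performance that approaches or exceeds existing benchmarks, pointing to a cost-efficient route to accurate analysis.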
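
The zero-shot and few-shot settings compared above differ only in whether labeled examples are placed in the prompt. The sketch below illustrates that difference. The template wording, the `LABELS` set, and the `build_prompt` helper are all hypothetical; the paper's actual dataset-specific prompts are in its appendix and are not reproduced in this review.

```python
from typing import List, Optional, Tuple

# Assumed label set for illustration; datasets may use different labels.
LABELS = ("FAVOR", "AGAINST", "NONE")


def build_prompt(tweet: str, target: str,
                 examples: Optional[List[Tuple[str, str]]] = None) -> str:
    """Build a stance prompt: zero-shot if `examples` is empty, few-shot otherwise."""
    lines = [
        f"Classify the stance of the tweet toward the target '{target}'.",
        f"Answer with one of: {', '.join(LABELS)}.",
        "",
    ]
    # Few-shot: prepend labeled (tweet, stance) demonstrations.
    for example_tweet, example_label in examples or []:
        lines += [f"Tweet: {example_tweet}", f"Stance: {example_label}", ""]
    # The query tweet comes last, with the label left for the model to fill in.
    lines += [f"Tweet: {tweet}", "Stance:"]
    return "\n".join(lines)


# Invented example text, not drawn from any of the datasets.
zero_shot = build_prompt("We need action on emissions now.", "Climate Change")
few_shot = build_prompt(
    "We need action on emissions now.",
    "Climate Change",
    examples=[("The science is settled.", "FAVOR"), ("It is all a hoax.", "AGAINST")],
)
print(few_shot)
```

Fine-tuning, by contrast, updates model weights on the labeled training set instead of placing examples in the prompt.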


This review was generated by AI and reviewed by a human editor.