QUICK REVIEW

[논문 리뷰] OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference

Yow-Fu Liou, Yu-Chien Tang|arXiv (Cornell University)|2026. 01. 19.

Topic Modeling인용 수 0

한 줄 요약

OI-Bench은 MCQA에서 오도하는 지시를 다섯 번째 옵션으로 주입하는 벤치마크를 도입하여 LLM의 지시 간섭에 대한 민감도를 정량화하고, 16가지 지시 유형에 걸쳐 12개 모델을 평가하며, 사후 학습 조정을 통한 완화를 탐구합니다.

ABSTRACT

Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by directive signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional option containing a misleading directive, leveraging standardized choice structure and scalable evaluation. We construct OI-Bench, a benchmark of 3,000 questions spanning knowledge, reasoning, and commonsense tasks, with 16 directive types covering social compliance, bonus framing, threat framing, and instructional interference. This setting combines manipulation of the choice interface with directive-based interference, enabling systematic assessment of model susceptibility. We evaluate 12 LLMs to analyze attack success rates, behavioral responses, and further investigate mitigation strategies ranging from inference-time prompting to post-training alignment. Experimental results reveal substantial vulnerabilities and heterogeneous robustness across models. OI-Bench is expected to support more systematic evaluation of LLM robustness to directive interference within choice-based interfaces.

연구 동기 및 목표

MCQA 인터페이스 내에서 지시 간섭에 대한 LLM의 민감도에 대한 체계적 평가의 필요성을 촉진한다.
옵션 조작과 지시 기반 간섭을 혼합한 벤치마크(OI-Bench)를 개발한다.
다중 지시 유형을 사용하여 지식, 추론 및 상식 작업 전반에 걸친 모델 취약성을 정량화한다.
주입 효과를 완화하기 위한 방어 전략으로 방어적 프롬프트, 안전 정렬 모델, 그리고 사후 학습 정렬을 포함한 전략을 탐구한다.

제안 방법

작업과 무관한 주입 옵션 E를 MCQA에 추가하고, Social Compliance, Bonus Framing, Threat Framing, Instructional Interference를 포함하는 지시를 담는다.
기존 데이터셋(MMLU, LogiQA, HellaSwag)에서 사실 지식, 논리적 추론 및 상식 서사를 아우르는 3,000문항 벤치마크를 구축한다.
평가 지표를 정의한다: Standard Accuracy, Injected Accuracy, Attack Success Rate, and Accuracy Drop.
네 가지 주입 카테고리 하에서 여러 계열의 12개 LLM을 평가하여 ASR(Attack Success Rate)과 견고성을 분석한다.
Defensive Prompting, Safety-Aligned 모델, 그리고 Direct Preference Optimization(DPO) 및 PPO를 통한 사후 학습 정렬을 포함한 방어 전략을 평가한다.
주입 옵션에 대한 모델의 주의 집중을 분석하고, 주입 옵션을 서로 다른 위치로 옮김으로써 위치 바이어스 실험을 수행한다.

Figure 1: Option injection in MCQA. A question-irrelevant option $E$ with a misleading directive can flip the model’s decision.

실험 결과

연구 질문

RQ1오도하는 옵션 E를 추가하는 것이 서로 다른 LLM 및 작업 도메인에서 MCQA 성능에 어떤 영향을 미치는가?
RQ2어떤 지시 유형 및 범주가 모델 의사결정을 가장 강하게 방해하며, 이 효과가 모델 간에 얼마나 다양한가?
RQ3프롬프팅, 안전 가드레일, 또는 사후 학습 정렬을 통한 완화가 기본 정확도를 해치지 않으면서 주입 취약성을 줄일 수 있는가?
RQ4주입된 옵션의 위치가 지시 간섭에 대한 민감도에 어떤 역할을 하는가?
RQ5고성능 모델이 반드시 주입된 지시에 대해 더 큰 강건성을 보이는가?

주요 결과

위협 프레이밍은 모델 전반에서 가장 큰 악화를 야기하며, 가장 높은 Attack Success Rate와 정확도 하락을 보인다.
평균적으로 주입 옵션 E는 정확도를 감소시키고 오류율을 증가시키며, 모델과 작업에 따라 변동성을 보인다.
Override 기반 지시(Override Penalty/Override Bonus)는 특히 파괴적이며 손실 형식화와 명시적 재정의를 민감하게 한다.
방어적 프롬프트와 안전 가드 모델은 제한적 완화를 제공하는 반면, 사후 학습 정렬(DPO 및 PPO) 방법은 공격 성공률 감소에 더 유망하고 때로는 표준 정확도를 유지하거나 향상시키기도 한다.
주의 분석은 PPO가 깊은 계층에서 주입 옵션에 대한 과도한 주의를 감소시키킴을 보여주며, 정렬 미세 조정 하에서 추론 역학이 달라짐을 시사한다.
주입 옵션을 앞으로 배치(순열)하면 취약성이 증가하여 MCQA에서 강한 위치 편향 효과를 나타낸다.

Figure 2: Standard accuracy vs E-option attack success rate on OI-Bench. We report each model’s Standard Accuracy (y-axis), and Attack Success Rate (ASR) (x-axis), averaged across all 16 injected prompts (4 prompt families) and further averaged over MMLU, LogiQA, and HellaSwag. Models in the top-lef

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.