QUICK REVIEW
[Paper Review] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding
Sungmok Jung, Yeonkyoung So|arXiv (Cornell University)|Jan 8, 2026
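ONE-SENTENCE SUMMARY
This paper introduces Thunder-KoNUBench, a sentence-level benchmark for Korean negation that follows the distribution of negation observed in corpora. It evaluates 47 LLMs on the benchmark and shows that fine-tuning on it improves negation understanding, with cloze-style supervision outperforming symbol-style supervision.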
ABSTRACT
Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.
RESEARCH MOTIVATION AND OBJECTIVES
- Quantify how negation in Korean degrades LLM performance, and establish a benchmark that reflects the distribution of negation in Korean corpora.
- Characterize Korean negation types and sentence structures to inform benchmark design.
- Evaluate a wide range of LLMs on negation understanding and analyze effects of model size and instruction tuning.
- Investigate supervised fine-tuning strategies to improve Korean negation understanding and contextual comprehension.
PROPOSED METHOD
- Perform corpus-based analysis of Korean negation to characterize distribution of negation types and sentence structures.
- Define standard and local negation in Korean and categorize negation phenomena (standard negation, local negation, contradiction, paraphrase).
- Construct Thunder-KoNUBench as a 4,784-item multiple-choice dataset reflecting empirical Korean negation distributions and categories.
- Evaluate 47 LLMs in cloze and symbol MCQA settings, zero-shot and few-shot, using LM Evaluation Harness.
- Apply supervised fine-tuning with Low-Rank Adaptation (LoRA) on Thunder-KoNUBench training data to study SFT effects.
- Compare cloze vs. symbol formats to assess supervision signal richness for learning negation.
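The difference between the cloze and symbol MCQA settings can be illustrated with a minimal sketch. It assumes the common likelihood-based setup in which each candidate continuation is scored and the highest-scoring one is selected. The function names, the prompt template, and the `toy_loglikelihood` scorer are hypothetical stand-ins for illustration; they are not taken from the paper or from LM Evaluation Harness.

```python
from typing import Callable, List, Tuple

LETTERS = "ABCDEFGH"  # option labels used in the symbol format

Request = Tuple[str, str]  # (context, continuation)


def build_cloze_requests(question: str, options: List[str]) -> List[Request]:
    """Cloze format: the options are NOT shown in the prompt.
    The model scores each full option text as a continuation."""
    context = f"{question}\nAnswer:"
    return [(context, f" {opt}") for opt in options]


def build_symbol_requests(question: str, options: List[str]) -> List[Request]:
    """Symbol format: the options are listed with labels in the prompt.
    The model scores only the label letter as a continuation."""
    listed = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))
    context = f"{question}\n{listed}\nAnswer:"
    return [(context, f" {LETTERS[i]}") for i in range(len(options))]


def pick_answer(requests: List[Request],
                loglikelihood: Callable[[str, str], float]) -> int:
    """Return the index of the continuation with the highest score."""
    scores = [loglikelihood(ctx, cont) for ctx, cont in requests]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_loglikelihood(context: str, continuation: str) -> float:
    """Hypothetical scorer that simply prefers shorter continuations.
    A real evaluation would query a language model here."""
    return -float(len(continuation))


if __name__ == "__main__":
    # Placeholder item, not drawn from Thunder-KoNUBench.
    question = "Which sentence negates the original sentence?"
    options = ["a fairly long candidate sentence", "a short one"]
    cloze = build_cloze_requests(question, options)
    symbol = build_symbol_requests(question, options)
    print(cloze[0])
    print(symbol[0])
    print(pick_answer(cloze, toy_loglikelihood))
```

In the cloze variant the training or scoring target is the whole answer sentence, while in the symbol variant it is a single label token. This is one plausible reading of why the two formats could provide supervision signals of different richness.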
EXPERIMENTAL RESULTS
RESEARCH QUESTIONS
- RQ1: How is negation distributed in Korean corpora, and how does it manifest in sentence structure across main and dependent clauses?
- RQ2: Do LLMs exhibit performance degradation when processing negation in Korean, and how do model size and tuning influence this?
- RQ3: Can Thunder-KoNUBench effectively measure Korean negation understanding and guide improvements via supervised fine-tuning?
- RQ4: Is cloze-style generation-based supervision more effective than symbol-based choice supervision for learning Korean negation?
KEY FINDINGS
- LLMs, including Korean and non-Korean models, show performance degradation when required to reason with negation in Korean.
- Larger models generally perform better on Thunder-KoNUBench, although performance does not increase monotonically with size around 8–12B parameters.
- Instruction tuning can improve overall performance in symbol format but may degrade cloze-based performance for Korean, indicating format bias.
- Supervised fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.
- Cloze-style fine-tuning yields larger gains than symbol-style fine-tuning for negation tasks, suggesting generation-based supervision is more effective.
This review was generated by AI and reviewed by human editors.