QUICK REVIEW

[Paper Review] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding

Sungmok Jung, Yeonkyoung So | arXiv (Cornell University) | Jan 8, 2026

Topic Modeling | Cited by 0

One-Sentence Summary

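This paper introduces Thunder-KoNUBench, a sentence-level Korean negation benchmark aligned with corpus distributions, evaluates 47 LLMs on it, and shows that fine-tuning on the benchmark improves negation understanding, with cloze-style supervision outperforming symbol-style supervision.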

ABSTRACT

Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.

Research Motivation and Objectives

Quantify how Korean negation affects LLM performance and establish a benchmark that reflects the distribution of negation in Korean corpora.
Characterize Korean negation types and sentence structures to inform benchmark design.
Evaluate a wide range of LLMs on negation understanding and analyze effects of model size and instruction tuning.
Investigate supervised fine-tuning strategies to improve Korean negation understanding and contextual comprehension.

Proposed Method

Perform a corpus-based analysis of Korean negation to characterize the distribution of negation types and sentence structures.
Define standard and local negation in Korean and categorize negation phenomena (standard negation, local negation, contradiction, paraphrase).
Construct Thunder-KoNUBench as a 4,784-item multiple-choice dataset reflecting empirical Korean negation distributions and categories.
Evaluate 47 LLMs in cloze and symbol MCQA settings, both zero-shot and few-shot, using the LM Evaluation Harness.
Apply supervised fine-tuning with Low-Rank Adaptation (LoRA) on Thunder-KoNUBench training data to study SFT effects.
Compare cloze vs. symbol formats to assess supervision signal richness for learning negation.
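The difference between the cloze and symbol MCQA settings can be sketched as follows. This is a minimal illustration of how such formats are commonly built, not the paper's actual prompts: the prompt templates, the function names, and the `toy_score` scorer (a stand-in for a model's log-likelihood of a continuation given a context) are all assumptions made for the example.

```python
import string
from typing import Callable, List

# Hypothetical scorer type: (context, continuation) -> score, higher is better.
Scorer = Callable[[str, str], float]


def symbol_prompt(question: str, options: List[str]) -> str:
    """Symbol format: options are shown with letter labels; the answer is a letter."""
    labeled = [f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join([question, *labeled, "Answer:"])


def cloze_prompt(question: str) -> str:
    """Cloze format: options are not shown; each option text is scored as a continuation."""
    return f"{question}\nAnswer:"


def pick_best(context: str, continuations: List[str], score: Scorer) -> int:
    """Return the index of the highest-scoring continuation (first one wins ties)."""
    scores = [score(context, c) for c in continuations]
    return max(range(len(scores)), key=scores.__getitem__)


def predict_symbol(question: str, options: List[str], score: Scorer) -> int:
    # Candidates are the letter labels, scored after the labeled option list.
    labels = [f" {string.ascii_uppercase[i]}" for i in range(len(options))]
    return pick_best(symbol_prompt(question, options), labels, score)


def predict_cloze(question: str, options: List[str], score: Scorer) -> int:
    # Candidates are the full option texts, scored directly after the question.
    return pick_best(cloze_prompt(question), [f" {o}" for o in options], score)


def toy_score(context: str, continuation: str) -> float:
    """Hypothetical stand-in for a model log-likelihood; it simply prefers short text."""
    return -float(len(continuation))
```

In a real evaluation, `toy_score` would be replaced by a model-based log-likelihood. The point of the sketch is that the symbol format asks the model to rank letter labels, while the cloze format asks it to rank the option texts themselves.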

Experimental Results

Research Questions

RQ1: How is negation distributed in Korean corpora, and how does it manifest in sentence structure across main and dependent clauses?
RQ2: Do LLMs exhibit performance degradation when processing negation in Korean, and how do model size and tuning influence this?
RQ3: Can Thunder-KoNUBench effectively measure Korean negation understanding and guide improvements via supervised fine-tuning?
RQ4: Is cloze-style generation-based supervision more effective than symbol-based choice supervision for learning Korean negation?

Key Findings

LLMs, including Korean and non-Korean models, show performance degradation when required to reason with negation in Korean.
Larger models generally perform better on Thunder-KoNUBench, but some non-monotonic behavior appears around 8–12B parameters.
Instruction tuning can improve overall performance in symbol format but may degrade cloze-based performance for Korean, indicating format bias.
Supervised fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.
Cloze-style fine-tuning yields larger gains than symbol-style fine-tuning for negation tasks, suggesting generation-based supervision is more effective.
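One plausible reading of why generation-based supervision carries a richer signal is that a cloze target spans the whole answer text, whereas a symbol target is a single label. The sketch below illustrates that contrast under stated assumptions: the `SFTExample` structure, the prompt templates, the placeholder item, and the whitespace-based unit count are hypothetical and are not taken from the paper or its training setup.

```python
import string
from dataclasses import dataclass
from typing import List


@dataclass
class SFTExample:
    prompt: str  # text the model conditions on (typically excluded from the loss)
    target: str  # text the training loss is computed over


def make_symbol_example(question: str, options: List[str], answer_idx: int) -> SFTExample:
    """Symbol-style supervision: the target is only the letter of the correct option."""
    labeled = [f"{string.ascii_uppercase[i]}. {o}" for i, o in enumerate(options)]
    prompt = "\n".join([question, *labeled, "Answer:"])
    return SFTExample(prompt, " " + string.ascii_uppercase[answer_idx])


def make_cloze_example(question: str, options: List[str], answer_idx: int) -> SFTExample:
    """Cloze-style supervision: the target is the full text of the correct option."""
    return SFTExample(f"{question}\nAnswer:", " " + options[answer_idx])


def supervised_units(example: SFTExample) -> int:
    """Crude proxy for supervised length: whitespace-separated units in the target.

    A real tokenizer would give different counts; this only shows the contrast.
    """
    return len(example.target.split())


# Placeholder item for illustration only; it is not drawn from Thunder-KoNUBench.
question = "Which sentence negates: 'the door was opened'?"
options = ["the door was not opened", "the door was opened"]

symbol_ex = make_symbol_example(question, options, answer_idx=0)
cloze_ex = make_cloze_example(question, options, answer_idx=0)
```

Here the symbol target is a single label while the cloze target contains the negated sentence itself, so the loss in the cloze case is computed over the negation marker and its surrounding words.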


This review was generated by AI and reviewed by human editors.