QUICK REVIEW

[Paper Review] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding

Sungmok Jung, Yeonkyoung So | arXiv (Cornell University) | Jan 8, 2026
Topic Modeling | Cited by: 0
One-Sentence Summary

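This paper introduces Thunder-KoNUBench, a sentence-level Korean negation benchmark that follows the distribution of negation observed in corpora. It evaluates 47 LLMs on the benchmark and shows that fine-tuning on it improves negation understanding, with cloze-style supervision outperforming symbol-style supervision.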

ABSTRACT

Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.

Motivation and Objectives

  • Quantify how Korean negation affects LLM performance and establish a benchmark that reflects the distribution of negation in Korean.
  • Characterize Korean negation types and sentence structures to inform benchmark design.
  • Evaluate a wide range of LLMs on negation understanding and analyze effects of model size and instruction tuning.
  • Investigate supervised fine-tuning strategies to improve Korean negation understanding and contextual comprehension.

Proposed Method

  • Perform corpus-based analysis of Korean negation to characterize distribution of negation types and sentence structures.
  • Define standard and local negation in Korean and categorize negation phenomena (standard negation, local negation, contradiction, paraphrase).
  • Construct Thunder-KoNUBench as a 4,784-item multiple-choice dataset reflecting empirical Korean negation distributions and categories.
  • Evaluate 47 LLMs in cloze and symbol MCQA settings, zero-shot and few-shot, using LM Evaluation Harness.
  • Apply supervised fine-tuning with Low-Rank Adaptation (LoRA) on Thunder-KoNUBench training data to study SFT effects.
  • Compare cloze vs. symbol formats to assess supervision signal richness for learning negation.
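The difference between the cloze and symbol MCQA settings can be sketched as follows. This is a minimal illustration, not the paper's code or its prompt templates: the function names, the `Answer:` prompt layout, the letter labels, and the `score_fn` callback are all assumptions made for this example. It assumes each candidate continuation is scored by a model (for instance by log-likelihood, as multiple-choice tasks are commonly scored in LM Evaluation Harness) and the highest-scoring candidate is selected.

```python
import string
from typing import Callable, List, Sequence

# Hypothetical option labels; the actual labels used by the benchmark may differ.
LETTERS = string.ascii_uppercase


def build_symbol_prompt(question: str, choices: Sequence[str]) -> str:
    """Symbol format: list every option with a letter, then ask for the letter."""
    lines = [question]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def symbol_candidates(choices: Sequence[str]) -> List[str]:
    """Continuations to score in symbol format: only the option letters."""
    return [f" {LETTERS[i]}" for i in range(len(choices))]


def build_cloze_prompt(question: str) -> str:
    """Cloze format: options are not shown; the model completes the answer text."""
    return f"{question}\nAnswer:"


def cloze_candidates(choices: Sequence[str]) -> List[str]:
    """Continuations to score in cloze format: the full text of each option."""
    return [f" {choice}" for choice in choices]


def pick_answer(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> int:
    """Return the index of the candidate with the highest score.

    score_fn(prompt, continuation) is a placeholder for a model call,
    e.g. the log-likelihood of the continuation given the prompt.
    """
    scores = [score_fn(prompt, candidate) for candidate in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch the symbol format asks the model to score a single letter token per option, while the cloze format asks it to score the full option text, which is one way to read the summary's point about the richness of the supervision signal.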

Experimental Results

Research Questions

  • RQ1: How is negation distributed in Korean corpora, and how does it manifest in sentence structure across main and dependent clauses?
  • RQ2: Do LLMs exhibit performance degradation when processing negation in Korean, and how do model size and tuning influence this?
  • RQ3: Can Thunder-KoNUBench effectively measure Korean negation understanding and guide improvements via supervised fine-tuning?
  • RQ4: Is cloze-style generation-based supervision more effective than symbol-based choice supervision for learning Korean negation?

Key Findings

  • LLMs, including Korean and non-Korean models, show performance degradation when required to reason with negation in Korean.
  • Larger models generally perform better on Thunder-KoNUBench, but some non-monotonic behavior appears around 8–12B parameters.
  • Instruction tuning can improve overall performance in symbol format but may degrade cloze-based performance for Korean, indicating format bias.
  • Supervised fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.
  • Cloze-style fine-tuning yields larger gains than symbol-style fine-tuning for negation tasks, suggesting generation-based supervision is more effective.


This review was generated by AI and reviewed by human editors.