QUICK REVIEW

[Paper Review] Preferences for Idiomatic Language are Acquired Slowly -- and Forgotten Quickly: A Case Study on Swedish

Jenny Kunz|arXiv (Cornell University)|Feb 3, 2026

Natural Language Processing Techniques | Citations: 0

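One-line summary

This paper shows that language models develop a preference for idiomatic Swedish only slowly during pretraining, and rapidly lose it under instruction tuning on translated data.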

ABSTRACT

In this study, we investigate how language models develop preferences for idiomatic as compared to linguistically acceptable Swedish, both during pretraining and when adapting a model from English to Swedish. To do so, we train models on Swedish from scratch and by fine-tuning English-pretrained models, probing their preferences at various checkpoints using minimal pairs that differ in linguistic acceptability or idiomaticity. For linguistic acceptability, we adapt existing benchmarks into a minimal-pair format. To assess idiomaticity, we introduce two novel datasets: one contrasting conventionalized idioms with plausible variants, and another contrasting idiomatic Swedish with Translationese. Our findings suggest that idiomatic competence emerges more slowly than other linguistic abilities, including grammatical and lexical correctness. While longer training yields diminishing returns for most tasks, idiom-related performance continues to improve, particularly in the largest model tested (8B). However, instruction tuning on data machine-translated from English -- the common approach for languages with little or no native instruction data -- causes models to rapidly lose their preference for idiomatic language.

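Motivation and objectives

Investigate how language models acquire idiomatic Swedish, as opposed to merely linguistically acceptable Swedish, both during pretraining and when adapted from English to Swedish.
Examine how continued pretraining and training from scratch each affect idiomatic competence.
Assess how instruction tuning on machine-translated data affects preferences for idiomatic Swedish.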

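Proposed method

Train 135M-parameter SmolLM models on Swedish data, both from scratch and via continued pretraining.
Probe idiomaticity and linguistic acceptability with minimal-pair benchmarks, including new Swedish idiom datasets and a contrast between idiomatic Swedish and Translationese.
Evaluate the models at multiple checkpoints, using per-token perplexity on each minimal pair to determine which sentence of the pair the model prefers.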
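
The per-token perplexity comparison in the last step can be sketched as follows. This is a minimal illustration and not the paper's evaluation code: the function names, the scorer interface (a callable returning per-token natural-log probabilities), and the toy unigram table standing in for a real language model are all assumptions, and the placeholder tokens "a", "b", "c" are not real benchmark items.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer interface: maps a sentence to per-token log-probabilities
# (natural log). A real setup would obtain these from a language model checkpoint.
TokenLogProbs = Callable[[str], Sequence[float]]


def per_token_perplexity(log_probs: Sequence[float]) -> float:
    """Return exp of the mean negative log-likelihood per token."""
    if not log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(log_probs) / len(log_probs))


def prefers_first(score: TokenLogProbs, preferred: str, dispreferred: str) -> bool:
    """True if the model assigns lower per-token perplexity to the preferred sentence."""
    return per_token_perplexity(score(preferred)) < per_token_perplexity(score(dispreferred))


def minimal_pair_accuracy(score: TokenLogProbs, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (preferred, dispreferred) pairs on which the model picks the preferred one."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(prefers_first(score, good, bad) for good, bad in pairs)
    return hits / len(pairs)


# Toy stand-in for a model (assumption): a fixed unigram table over placeholder tokens.
TOY_UNIGRAM = {"a": 0.5, "b": 0.25, "c": 0.125}


def toy_scorer(sentence: str) -> List[float]:
    """Score whitespace-separated tokens; unseen tokens get a small fixed probability."""
    return [math.log(TOY_UNIGRAM.get(tok, 0.01)) for tok in sentence.split()]
```

Running `minimal_pair_accuracy` with the same pairs against scorers built from successive checkpoints would give one accuracy value per checkpoint, which is the kind of curve the review describes. Normalizing by token count keeps a sentence from being penalized merely for being longer than its counterpart.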

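Experimental results

Research questions

RQ1 How quickly and how well do language models acquire idiomatic preferences, compared with general linguistic acceptability?
RQ2 How does instruction tuning on machine-translated data affect a model's idiomatic preferences?

Key findings

Across models, idiomatic expressions are acquired more slowly than lexical and syntactic correctness.
English pretraining raises overall performance and supports the gradual acquisition of idiomatic language.
Instruction tuning on translated data sharply reduces idiomatic preferences, while general linguistic acceptability remains comparatively stable.
Larger or more capable models (AI Sweden LLaMA 8B) show stronger gains in idiomatic competence under continued pretraining.
The Translationese contrast is learned poorly and is likely to degrade further under translation-based tuning, which indicates that idiomaticity is vulnerable to this kind of tuning.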

This review was generated by AI and checked by a human editor.