QUICK REVIEW

[Paper Review] Impact of Pretraining Term Frequencies on Few-Shot Reasoning

Yasaman Razeghi, Robert L. Logan|arXiv (Cornell University)|Feb 15, 2022

Topic Modeling | Citations: 52

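One-line summary

This paper shows that few-shot numerical reasoning in GPT-based models depends strongly on how frequently the test terms appear in the pretraining data, highlighting the limits of generalization beyond pretraining statistics.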

ABSTRACT

Pretrained Language Models (LMs) have demonstrated ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between the model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above 70% (absolute) more accurate on the top 10% frequent terms in comparison to the bottom 10%. Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.

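Motivation and objectives

Evaluate whether few-shot numerical reasoning in language models generalizes beyond terms that appear frequently in the pretraining data.
Quantify how strongly accuracy on arithmetic and unit-conversion tasks correlates with term frequency in the pretraining data.
Develop a metric (the performance gap) that measures reliance on pretraining term frequency in reasoning tasks.
Analyze how model size and the number of shots affect the frequency-driven performance gap.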

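Proposed method

Define the frequency of test terms in the pretraining corpus (unigrams, bigrams, and term+y co-occurrences).
Construct numerical reasoning datasets expressed as natural-language prompts (arithmetic, operation inference, and time-unit conversion).
For each term-frequency definition, compute the performance gap Delta between the top-10% and bottom-10% frequency groups.
Evaluate several EleutherAI GPT models (GPT-J-6B, GPT-Neo-1.3B, GPT-Neo-2.7B) under few-shot prompting (k = 0, 2, 4, 8, 16).
Analyze results across tasks and term-frequency definitions to assess the robustness of reasoning and the extent of memorization.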
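
The frequency counting and the top-versus-bottom gap described in the steps above can be illustrated with a small sketch. This is not the authors' code: the function names, the token-window definition of co-occurrence, the default window size, and the choice to average per-term accuracies within each group are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def unigram_counts(tokens: List[str]) -> Counter:
    """Count how often each token appears in a (toy) tokenized corpus."""
    return Counter(tokens)


def cooccurrence_count(tokens: List[str], a: str, b: str, window: int = 5) -> int:
    """Count occurrences of `a` that have `b` within `window` tokens on either side.

    The window-based definition and its default size are assumptions of this sketch.
    """
    hits = 0
    for i, tok in enumerate(tokens):
        if tok != a:
            continue
        lo, hi = max(0, i - window), i + window + 1
        # Neighbouring tokens around position i, excluding position i itself.
        neighbours = tokens[lo:i] + tokens[i + 1:hi]
        if b in neighbours:
            hits += 1
    return hits


def performance_gap(
    term_freq: Dict[str, int],
    term_acc: Dict[str, float],
    frac: float = 0.1,
) -> float:
    """Mean accuracy of the most frequent `frac` of terms minus that of the least frequent.

    `term_acc[t]` is assumed to hold the average accuracy over test instances
    that contain term `t`; every key of `term_freq` must also be in `term_acc`.
    """
    ranked = sorted(term_freq, key=term_freq.get)  # ascending frequency
    n = max(1, int(len(ranked) * frac))            # group size, at least one term
    bottom, top = ranked[:n], ranked[-n:]

    def mean_acc(terms: List[str]) -> float:
        return sum(term_acc[t] for t in terms) / len(terms)

    return mean_acc(top) - mean_acc(bottom)
```

A positive return value from performance_gap means the model is more accurate on instances built from frequent terms. In practice the counts would come from the pretraining corpus rather than a toy token list, and the same gap would be recomputed for each frequency definition.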

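Experimental results

Research questions

RQ1: How does pretraining term frequency correlate with few-shot numerical reasoning accuracy?
RQ2: Does the accuracy gap persist across different model sizes and numbers of shots?
RQ3: To what extent does unigram term frequency (a single number) affect reasoning performance?
RQ4: In few-shot prompting, can pattern matching biased toward the pretraining data be distinguished from genuine reasoning?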

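Key findings

Model accuracy correlates strongly with the frequency of instance terms in the pretraining data, producing large gaps between high- and low-frequency terms (in some cases more than 70% absolute between the top 10% and bottom 10%).
Arithmetic tasks show large performance gaps, most notably multiplication, suggesting reliance on pretraining statistics rather than robust reasoning.
Operation inference and time-unit conversion tasks also show marked frequency-based gaps, even as more shots raise overall accuracy.
Smaller models show the same trend with smaller gaps, suggesting that model size amplifies reliance on pretraining data.
The results suggest that many few-shot reasoning evaluations may reflect memorization or pattern matching rather than genuinely generalizable reasoning.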


This review was generated by AI and checked by a human editor.