QUICK REVIEW

[Paper Review] A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities

Faiz Ghifari Haznitrama, Faeyza Rishad Ardi|arXiv (Cornell University)|Mar 3, 2026

Neurobiology of Language and Bilingualism | Citations: 0

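One-Line Summary

Summary: The paper proposes NeuroCognition, a multimodal benchmark built on three neuropsychological tests (RPM, SWM, and WCST). It evaluates LLM cognitive abilities beyond what standard benchmarks cover, shows that models are strong on text but weak on images and complex tasks, and reports how its scores correlate with existing benchmarks.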

ABSTRACT

Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with simple, trivial tasks for humans. This is because current benchmarks focus on task completion, failing to probe the foundational cognitive abilities that highlight these behaviors. We address this by introducing the NeuroCognition benchmark, grounded in three adapted neuropsychological tests: Raven's Progressive Matrices (abstract relational reasoning), Spatial Working Memory (maintenance and systematic search), and the Wisconsin Card Sorting Test (cognitive flexibility). Our evaluation reveals that while models perform strongly on text, their performance degrades for images and with increased complexity. Furthermore, we observe that complex reasoning is not universally beneficial, whereas simple, human-like strategies yield partial gains. We also find that NeuroCognition correlates positively with standard general-capability benchmarks, while still measuring distinct cognitive abilities beyond them. Overall, NeuroCognition emphasizes where current LLMs align with human-like intelligence and where they lack core adaptive cognition, showing the potential to serve as a verifiable, scalable source for improving LLMs.

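Research Motivation and Objectives

Repurpose established neuropsychological tests as a scalable, multimodal benchmark for LLMs.
Characterize how current LLMs perform on abstract reasoning, working memory, and cognitive flexibility.
Evaluate how modality (text vs. image) and task complexity affect performance.
Test whether simple human-like strategies (note-taking, hints) help LLMs.
Explore the relationship between NeuroCognition and standard general-capability benchmarks.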

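Proposed Method

Adapt Raven's Progressive Matrices (RPM), in text and image formats, to test abstract relational reasoning.
Adapt Spatial Working Memory (SWM) to measure maintenance and systematic search across varying difficulty levels and modalities.
Adapt the Wisconsin Card Sorting Test (WCST) to evaluate cognitive flexibility and rule switching under controlled ambiguity.
Introduce performance metrics including accuracy, S_swm, S_wcst, and an error-type analysis (illegal, no-box, repeated).
Incorporate human-like strategies such as pattern hints and note-taking to assess the effect of cognitive offloading.
Run a factor analysis over 156 LLMs across 10 benchmarks to assess a general capability factor (g).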
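
The rule-switching idea behind the WCST item above can be sketched as a toy environment. This is a minimal illustration, not the paper's implementation. The class name `WCSTEnv`, the `switch_after` threshold, and the win-stay/lose-shift agent are all hypothetical. For brevity the agent guesses the sorting dimension directly instead of matching a card to reference cards.

```python
import random
from dataclasses import dataclass, field

# The three classic WCST sorting dimensions.
DIMENSIONS = ("color", "shape", "number")


@dataclass
class WCSTEnv:
    """Toy WCST: a hidden sorting rule that changes after a streak of correct sorts."""

    switch_after: int = 5  # hypothetical: consecutive correct sorts before the rule changes
    seed: int = 0
    rule: str = field(init=False)
    streak: int = field(init=False, default=0)
    switches: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)
        self.rule = self.rng.choice(DIMENSIONS)

    def step(self, guess: str) -> bool:
        """Return feedback for one sort; silently switch the rule after a full streak."""
        correct = guess == self.rule
        self.streak = self.streak + 1 if correct else 0
        if self.streak >= self.switch_after:
            # Pick a different rule without telling the agent.
            self.rule = self.rng.choice([d for d in DIMENSIONS if d != self.rule])
            self.streak = 0
            self.switches += 1
        return correct


def win_stay_lose_shift(env: WCSTEnv, trials: int) -> int:
    """Baseline agent: keep the current dimension while correct, move to the next on error."""
    guess = DIMENSIONS[0]
    n_correct = 0
    for _ in range(trials):
        if env.step(guess):
            n_correct += 1
        else:
            guess = DIMENSIONS[(DIMENSIONS.index(guess) + 1) % len(DIMENSIONS)]
    return n_correct


env = WCSTEnv(switch_after=3, seed=1)
n_correct = win_stay_lose_shift(env, trials=60)
print(f"correct sorts: {n_correct}/60, rule switches: {env.switches}")
```

An agent that keeps sorting by the old rule after negative feedback would score far worse here, which is the kind of perseveration a flexibility test is meant to expose.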

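Experimental Results

Research Questions

RQ1: Do LLMs show distinct cognitive abilities, as measured by NeuroCognition, beyond general task performance?
RQ2: How do modality (text vs. image) and task complexity affect performance on RPM, SWM, and WCST?
RQ3: Do simple human-like strategies improve LLM performance on neuropsychological tasks?
RQ4: How does NeuroCognition relate to standard general-capability benchmarks?
RQ5: Is there evidence of a unified general factor (g) across diverse LLM benchmarks?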

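Key Findings

LLMs perform strongly on text tasks, but performance degrades on image tasks and as task complexity increases.
Explicitly enhanced reasoning is not always beneficial; in some cases, simple human-like strategies yield partial gains.
NeuroCognition correlates positively with standard benchmarks while capturing distinct cognitive abilities beyond them.
Factor analysis reveals a single latent general capability (g) that explains about 75% of the variance across the 10 benchmarks, whereas NeuroCognition targets different cognitive principles.
Note-taking and other cognitive offloading techniques have variable effects, with more consistent gains on WCST than on RAM-based tasks.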
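
To make the factor-analysis finding above concrete, the following sketch estimates how much variance a single dominant component explains in a models-by-benchmarks score table. It is an approximation under stated assumptions, not the paper's procedure. It uses the top eigenvalue of the correlation matrix (a PCA-style proxy for a one-factor model), and the scores are synthetic, generated with a hypothetical loading of 0.85 on one latent factor. The function names are illustrative.

```python
import math
import random


def correlation_matrix(scores):
    """Pearson correlation matrix of the columns (benchmarks) of a row-major table."""
    n, k = len(scores), len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in scores) / n)
        for j in range(k)
    ]
    z = [[(row[j] - means[j]) / stds[j] for j in range(k)] for row in scores]
    return [[sum(r[a] * r[b] for r in z) / n for b in range(k)] for a in range(k)]


def top_eigenvalue(matrix, iters=500):
    """Largest eigenvalue of a symmetric positive semi-definite matrix via power iteration."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    value = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        value = math.sqrt(sum(x * x for x in w))
        v = [x / value for x in w]
    return value


def first_component_share(scores):
    """Share of total variance carried by the first component."""
    corr = correlation_matrix(scores)
    # The trace of a correlation matrix equals the number of benchmarks.
    return top_eigenvalue(corr) / len(corr)


# Synthetic data only: 156 "models" x 10 "benchmarks" driven by one latent factor.
rng = random.Random(0)
LOADING = 0.85  # hypothetical loading of every benchmark on the latent factor
scores = []
for _ in range(156):
    g = rng.gauss(0, 1)
    scores.append(
        [LOADING * g + math.sqrt(1 - LOADING**2) * rng.gauss(0, 1) for _ in range(10)]
    )

share = first_component_share(scores)
print(f"variance explained by the first component: {share:.2f}")
```

A share near 1 would mean the benchmarks are nearly interchangeable, while a share near 1/10 would mean they measure unrelated things. A benchmark that loads weakly on this component is a candidate for measuring something the general factor misses.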

This review was generated by AI and checked by a human editor.