QUICK REVIEW

[Paper Review] A Systematic Assessment of OpenAI o1-Preview for Higher Order Thinking in Education

Ehsan Latif, Yifan Zhou | arXiv (Cornell University) | Oct 11, 2024

Open Education and E-Learning | Cited by 5

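One-line summary

This paper systematically evaluates OpenAI o1-preview across 14 dimensions of higher-order thinking, compares its results with human benchmarks using established instruments, and discusses the educational implications.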

ABSTRACT

As artificial intelligence (AI) continues to advance, it demonstrates capabilities comparable to human intelligence, with significant potential to transform education and workforce development. This study evaluates OpenAI o1-preview's ability to perform higher-order cognitive tasks across 14 dimensions, including critical thinking, systems thinking, computational thinking, design thinking, metacognition, data literacy, creative thinking, abstract reasoning, quantitative reasoning, logical reasoning, analogical reasoning, and scientific reasoning. We used validated instruments like the Ennis-Weir Critical Thinking Essay Test and the Biological Systems Thinking Test to compare the o1-preview's performance with human performance systematically. Our findings reveal that o1-preview outperforms humans in most categories, achieving 150% better results in systems thinking, computational thinking, data literacy, creative thinking, scientific reasoning, and abstract reasoning. However, compared to humans, it underperforms by around 25% in logical reasoning, critical thinking, and quantitative reasoning. In analogical reasoning, both o1-preview and humans achieved perfect scores. Despite these strengths, the o1-preview shows limitations in abstract reasoning, where human psychology students outperform it, highlighting the continued importance of human oversight in tasks requiring high-level abstraction. These results have significant educational implications, suggesting a shift toward developing human skills that complement AI, such as creativity, abstract reasoning, and critical thinking. This study emphasizes the transformative potential of AI in education and calls for a recalibration of educational goals, teaching methods, and curricula to align with an AI-driven world.

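Motivation and objectives

Evaluate the higher-order thinking abilities of OpenAI o1-preview across 14 cognitive domains.
Compare o1-preview's performance with human benchmarks using validated instruments.
Identify o1-preview's strengths, limitations, and implications for education and assessment.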

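Proposed method

Select a validated assessment instrument for each cognitive domain (e.g., the Ennis-Weir Critical Thinking Essay Test, the Biological Systems Thinking Test, Bebras, The Village of Abeesee, and the Lake Urmia Vignette).
Prompt o1-preview using a one-shot or otherwise limited prompting strategy tailored to each instrument.
Compute and compare mean scores and performance differences between o1-preview and the human benchmarks across domains.
Analyze the results to identify areas of strength (where o1-preview outperforms humans) and weakness (where its performance falls short).
Incorporate instrument-specific metrics such as accuracy, mean score, and percentile to ground the comparisons.
Explain the architectural characteristics and training rationale of o1-preview that relate to its improved higher-order thinking.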
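
The score-comparison step can be illustrated with a minimal sketch. It assumes each domain reduces to one model percentage and one human percentage; the `ScorePair` class and both gap functions are hypothetical names introduced here, not taken from the paper. The three score pairs are the ones quoted in this review's key findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorePair:
    """One domain's scores (hypothetical container, not from the paper)."""
    domain: str
    model_pct: float  # o1-preview score, in percent
    human_pct: float  # human benchmark score, in percent


def absolute_gap(pair: ScorePair) -> float:
    """Model score minus human score, in percentage points."""
    return pair.model_pct - pair.human_pct


def relative_gap(pair: ScorePair) -> float:
    """Model-human gap expressed as a percentage of the human score."""
    return 100.0 * (pair.model_pct - pair.human_pct) / pair.human_pct


# Score pairs quoted in this review's key findings.
pairs = [
    ScorePair("systems thinking (some tests)", 100.0, 48.0),
    ScorePair("computational thinking (Bebras/algorithmic tasks)", 96.15, 61.7),
    ScorePair("critical thinking", 81.25, 87.6),
]

for p in pairs:
    verdict = "above" if absolute_gap(p) > 0 else "below"
    print(
        f"{p.domain}: {absolute_gap(p):+.2f} points, "
        f"{relative_gap(p):+.1f}% relative ({verdict} the human benchmark)"
    )
```

The two functions give different numbers for the same pair: the critical thinking scores differ by 6.35 percentage points, which is about 7.2% of the human average. This review does not say which definition underlies the percentage differences reported in the abstract, so the sketch computes both rather than assuming one.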

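Experimental results

Research questions

RQ1: Can OpenAI o1-preview demonstrate higher-order thinking across the defined cognitive domains?
RQ2: In each domain, how does o1-preview's performance on established thinking assessments compare with human performance?
RQ3: In which domains is the alignment between AI and human reasoning highest or lowest, and what educational implications follow?
RQ4: Which architectural or training factors of o1-preview are associated with the observed strengths and limitations?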

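Key findings

o1-preview outperforms humans in systems thinking, computational thinking, data literacy, creative thinking, scientific reasoning, and abstract reasoning (e.g., 100% versus 48% for humans on some systems thinking tests, and 96.15% versus 61.7% on Bebras/algorithmic tasks).
o1-preview falls slightly behind in logical reasoning, critical thinking, and quantitative reasoning (e.g., critical thinking: 81.25% versus a human average of 87.6%; logical reasoning: a 25% gap on some tasks).
The analysis suggests that humans excel at abstract reasoning and at certain high-level reasoning tasks where o1-preview lags (e.g., psychology students outperform the model on Raven's matrices).
On analogical reasoning tasks, both o1-preview and humans achieved perfect scores with the prompts provided.
Metacognition and several calibration metrics show that o1-preview improves on GPT-4o in self-assessment and decision-making in mathematical contexts, although direct human comparisons are limited.
Overall, the study emphasizes the complementary roles of AI and humans: AI shows strengths in many domains, while limitations remain in abstract reasoning and certain quantitative tasks.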

This review was generated by AI and checked by a human editor.