QUICK REVIEW

[Paper Review] METIS: Mentoring Engine for Thoughtful Inquiry & Solutions

Abhinav Rajeev Kumar, Dhruv Trehan | arXiv (Cornell University) | Jan 19, 2026

Intelligent Tutoring Systems and Adaptive Learning | Citations: 0

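One-line summary

METIS is a tool-augmented, stage-aware AI mentor that guides students from idea generation to a published paper. It outperforms Claude Sonnet 4.5 and is reasonably competitive with GPT-5. In single-turn and multi-turn evaluations, it shows its largest gains in the document-grounded drafting stages.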

ABSTRACT

Many students lack access to expert research mentorship. We ask whether an AI mentor can move undergraduates from an idea to a paper. We build METIS, a tool-augmented, stage-aware assistant with literature search, curated guidelines, methodology checks, and memory. We evaluate METIS against GPT-5 and Claude Sonnet 4.5 across six writing stages using LLM-as-a-judge pairwise preferences, student-persona rubrics, short multi-turn tutoring, and evidence/compliance checks. On 90 single-turn prompts, LLM judges preferred METIS to Claude Sonnet 4.5 in 71% and to GPT-5 in 54%. Student scores (clarity/actionability/constraint-fit; 90 prompts x 3 judges) are higher across stages. In multi-turn sessions (five scenarios/agent), METIS yields slightly higher final quality than GPT-5. Gains concentrate in document-grounded stages (D-F), consistent with stage-aware routing and grounding; failure modes include premature tool routing, shallow grounding, and occasional stage misclassification.

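Motivation and objectives

Provide a stage-aware workflow and evaluation framework for AI-based research mentoring from idea to paper.
Build a tool-augmented mentor with literature search, guidelines, methodology checks, and memory that supports learners across multiple weeks.
Empirically compare METIS with GPT-5 and Claude Sonnet 4.5 on single-turn and multi-turn tasks, using LLM-judge pairwise preferences and student rubrics.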

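Proposed method

A stage-aware agent architecture with a stage detector that routes to tools (Research Guidelines, Literature Search, Methodology Checks, memory).
Two self-explanation blocks in every response (Intuition, Why this is principled) that expose the reasoning and its justification.
Grounding through retrieval-augmented generation over arXiv/OpenReview sources, with evaluation against real citations.
Stage-based evaluation over six writing stages A-F, from Pre idea to Final, each with matching prompts.
LLM-judge pairwise preferences and student-perspective rubrics to assess performance and learner satisfaction.
Open materials (prompts, logs, scripts) for reproducibility.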
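The stage-aware routing described above can be sketched roughly as follows. This is a minimal illustration, not the METIS implementation: the stage labels A-F, the tool names, and the two self-explanation blocks come from the review, while the keyword rules in `detect_stage`, the stage-to-tool table `STAGE_TOOLS`, and the `Reply` type are hypothetical stand-ins.

```python
from dataclasses import dataclass

# Stages A-F and the tool names come from the review. The stage-to-tool table
# below is an illustrative assumption, not the routing table used by METIS.
STAGE_TOOLS = {
    "A": ["research_guidelines"],
    "B": ["research_guidelines", "literature_search"],
    "C": ["literature_search", "methodology_checks"],
    "D": ["attachment_search", "methodology_checks"],
    "E": ["attachment_search", "literature_search"],
    "F": ["attachment_search", "research_guidelines"],
}


@dataclass
class Reply:
    stage: str
    tools: list[str]
    intuition: str        # first self-explanation block
    why_principled: str   # second self-explanation block


def detect_stage(message: str, has_attachment: bool) -> str:
    """Toy stage detector: keyword rules stand in for the real classifier."""
    text = message.lower()
    if has_attachment:
        # Assumption: an attached draft signals a document-grounded stage (D-F).
        if "camera-ready" in text or "final" in text:
            return "F"
        if "revise" in text or "feedback" in text:
            return "E"
        return "D"
    if "experiment" in text or "method" in text:
        return "C"
    if "related work" in text or "literature" in text:
        return "B"
    return "A"


def route(message: str, has_attachment: bool = False) -> Reply:
    """Pick tools for the detected stage and attach both self-explanations."""
    stage = detect_stage(message, has_attachment)
    tools = STAGE_TOOLS[stage]
    return Reply(
        stage=stage,
        tools=tools,
        intuition=f"The message looks like stage {stage}.",
        why_principled=f"Stage {stage} is served by: {', '.join(tools)}.",
    )


if __name__ == "__main__":
    print(route("Please look at my draft", has_attachment=True))
```

Keeping detection and routing as separate functions mirrors the failure modes the review lists: a wrong `detect_stage` result corresponds to stage misclassification, while a poorly chosen table entry corresponds to premature tool routing.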

Figure 1: METIS architecture. Stage detector and tool router select tools (Research Guidelines, web/document search, attachment search, methodology checks) based on writing stage. The agent synthesizes a reply and surfaces two self-explanations (Intuition, Why this is principled), plus next steps.

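Experimental results

Research questions

RQ1: Can an AI mentor move undergraduates from an initial idea to conference-paper-level work?
RQ2: Do stage-aware routing and document grounding improve mentoring quality relative to strong chat-based baselines?
RQ3: Where do failure modes arise in tool routing, grounding, and stage classification, and how can they be mitigated?

Key findings

In single-turn LLM-judge preferences, METIS was preferred over Claude Sonnet 4.5 in 71% of comparisons and over GPT-5 in 54%.
Student-perspective scores (clarity, actionability, constraint fit, increased confidence, and similar criteria) were higher for METIS than for the baselines across stages.
In multi-turn sessions, METIS reached slightly higher final quality than GPT-5 and succeeded in fewer turns in some scenarios.
Gains are largest in the document-grounded stages (D-F), consistent with a strong effect from grounding and stage routing.
Common failure modes include premature tool routing, shallow grounding, and occasional stage misclassification.
The evaluation comprises 90 single-turn prompts (15 per stage), five multi-turn scenarios per system, human-like judges, and 95% confidence intervals.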

Figure 2: LLM-judge pairwise preferences across stages ( $n{=}15$ prompts/stage; ties $\leq 8\%$ excluded). METIS wins $71\%$ vs Claude Sonnet 4.5 and $54\%$ vs GPT-5 overall; error bars show Wilson 95% CIs.
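The Wilson 95% intervals mentioned in the Figure 2 caption can be computed with a few lines of Python. The sketch below assumes each judgment is recorded as a label, drops ties before counting, and uses the standard Wilson score formula with z = 1.96. The label strings and the counts in the usage example are hypothetical; the review does not report per-comparison tallies.

```python
import math


def win_rate_excluding_ties(prefs: list[str], system: str = "metis") -> tuple[int, int]:
    """Return (wins, decided) after dropping judgments labelled 'tie'."""
    decided = [p for p in prefs if p != "tie"]
    wins = sum(1 for p in decided if p == system)
    return wins, len(decided)


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion wins/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical tally: 64 wins out of 90 decided comparisons (about 71%).
    low, high = wilson_interval(64, 90)
    print(f"win rate {64 / 90:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes, which matters for per-stage results based on 15 prompts.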


This review was generated by AI and checked by a human editor.
