QUICK REVIEW

[Paper Review] Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation

Maja Stahl, Leon Biermann|arXiv (Cornell University)|Apr 24, 2024

Natural Language Processing Techniques | Cited by 12

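One-Sentence Summary

This paper examines prompting strategies that have an LLM score an essay and generate feedback on it jointly. It finds that tackling the two tasks together can improve automated essay scoring (AES), and that the generated feedback is of high quality even though scoring has only a limited effect on it.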

ABSTRACT

Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically-generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback remains low ultimately.

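Motivation and Objectives

Enable automatic, individualized essay feedback that can complement or replace manual teacher feedback.
Explore zero-shot and few-shot prompting strategies for jointly scoring essays and generating feedback with LLMs.
Evaluate how different prompt designs affect automated essay scoring (AES) performance and the helpfulness of the feedback.
Examine whether scoring affects feedback quality, and vice versa, and identify the most effective prompting setup.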

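Proposed Method

Systematically varies three prompting elements: the prompt pattern (base vs. persona), the task instruction type (scoring, feedback, or both in sequence), and in-context learning (zero-shot, one-shot, few-shot).
Uses Mistral-7B-Instruct-v0.2 as the main LLM where possible, with greedy decoding when generating scores and feedback in JSON format.
Compares the prompt patterns and task instructions, in zero-shot and one-shot/few-shot settings, against baselines such as AES-Prompt (Tao et al., 2022) and R2BERT.
Evaluates scoring on the ASAP dataset (8 essay sets) with quadratic weighted kappa (QWK), using cross-validation splits.
Evaluates feedback generation with automatic helpfulness scores from LLMs and with manual annotation, comparing the persona-based prompt patterns.
Investigates how solving scoring and feedback generation jointly affects scoring performance and the perceived helpfulness of the feedback.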
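
The three points of variation listed above can be illustrated with a small sketch. It is not the authors' code: the instruction wording, the label names, the JSON keys `score` and `feedback`, the default persona, and the number of few-shot examples are all assumptions made for illustration, and the number of combinations the sketch produces says nothing about how many configurations the paper evaluated.

```python
import json
from itertools import product

# Labels follow the summary above; the paper's actual prompt texts are NOT reproduced.
PROMPT_PATTERNS = ["base", "persona"]
TASK_INSTRUCTIONS = ["scoring", "feedback", "scoring->feedback", "feedback->scoring"]
# Number of in-context examples per setting; 3 for few-shot is an assumed value.
ICL_SETTINGS = {"zero-shot": 0, "one-shot": 1, "few-shot": 3}

# Hypothetical instruction wording, one entry per task instruction type.
INSTRUCTIONS = {
    "scoring": "Score the essay.",
    "feedback": "Give feedback on the essay.",
    "scoring->feedback": "First score the essay, then give feedback on it.",
    "feedback->scoring": "First give feedback on the essay, then score it.",
}


def build_prompt(pattern, task, n_examples, essay, examples=(),
                 persona="educational researcher"):
    """Assemble one prompt string from the three varied elements."""
    parts = []
    if pattern == "persona":
        parts.append(f"Act as the following persona: {persona}.")
    parts.append(INSTRUCTIONS[task])
    parts.append('Answer as JSON using the keys "score" and/or "feedback".')
    # In-context examples: (essay text, expected answer dict) pairs.
    for ex_essay, ex_answer in list(examples)[:n_examples]:
        parts.append(
            f"Example essay:\n{ex_essay}\nExample answer:\n{json.dumps(ex_answer)}"
        )
    parts.append(f"Essay:\n{essay}")
    return "\n\n".join(parts)


def parse_response(text):
    """Parse a JSON reply; return None when the output is not a JSON object."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


# Every combination of pattern, task instruction, and in-context setting.
configs = list(product(PROMPT_PATTERNS, TASK_INSTRUCTIONS, ICL_SETTINGS))
```

Returning `None` for unparseable replies keeps malformed model outputs visible to the caller instead of silently assigning them a score, which matters when the scoring metric is computed over all essays.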

Figure 1: Exemplary student essay on library censorship from the ASAP dataset (Hamner et al., 2012) along with feedback and essay score generated by one of the methods evaluated in this paper. Explicit connections of the feedback to essay parts are color-coded.

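Experimental Results

Research Questions

RQ1: Under zero-shot and few-shot prompting, can LLMs score essays reliably and generate helpful overall and trait-specific feedback?
RQ2: Does solving essay scoring and feedback generation jointly improve AES performance compared with scoring alone?
RQ3: How do different prompt patterns, task instructions, and in-context learning regimes affect the quality and helpfulness of the generated feedback?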

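Key Findings

Handling essay scoring and feedback generation jointly improves AES performance for some prompt configurations (in some cases a scoring-focused variant achieves the highest average QWK).
Prompt patterns with persona roles (e.g., Educational Researcher, Teaching Assistant) generally score better than the base pattern.
Feedback-first prompting (e.g., Feedback → Scoring, or Feedback_dCoT → Scoring) tends to yield higher scoring performance than scoring-first variants.
For feedback generation, persona-based prompts (especially Educational Researcher and Creative Writing Mentor) receive higher automatic helpfulness scores, and the manual evaluation also tends to judge their feedback helpful.
In the manual evaluation, the feedback is judged helpful for improving student writing, but essay scoring has only a limited effect on the helpfulness of the feedback.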
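
The QWK values behind these findings measure agreement between predicted and human scores, penalizing each disagreement by the squared distance between the two ratings. The following is a minimal pure-Python sketch of the standard definition, not the authors' evaluation script. The function name and its `min_rating`/`max_rating` parameters are illustrative choices, and returning 1.0 when the expected disagreement is zero is an assumed convention. scikit-learn's `cohen_kappa_score(..., weights="quadratic")` computes the same statistic.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Quadratic weighted kappa between two lists of integer ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two distinct rating levels")

    # Observed confusion matrix: rows are true ratings, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum distance.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Expected count if the two raters were independent.
            expected = hist_true[i] * hist_pred[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters always gave the same single rating (assumed convention).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

Because the score ranges differ between the ASAP essay sets, the rating bounds are passed in explicitly rather than inferred from the data, so that a set in which no essay happens to receive the extreme scores is still weighted over its full range.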

Figure 2: Overview of the main points of variation in our approach to predict a score and to generate feedback for a student essay: (a) Prompt pattern: Use of the base pattern or persona-specific pattern; (b) Task instruction type: Tasks to be tackled and their ordering; (c) In-context learning appr

This review was generated by AI and checked by a human editor.