QUICK REVIEW

[Paper Review] Automated Essay Scoring based on Two-Stage Learning

Jiawei Liu, Yang Xu|arXiv (Cornell University)|Jan 23, 2019

Adversarial Robustness in Machine Learning|References: 16|Citations: 45

Summary in Brief

This paper proposes the Two-Stage Learning Framework (TSLF). It combines semantic, coherence, and prompt-relevance signals from deep models with handcrafted features and uses XGBoost for final scoring, which makes it strongly robust to adversarial inputs on ASAP.

ABSTRACT

Current state-of-art feature-engineered and end-to-end Automated Essay Score (AES) methods are proven to be unable to detect adversarial samples, e.g. the essays composed of permuted sentences and the prompt-irrelevant essays. Focusing on the problem, we develop a Two-Stage Learning Framework (TSLF) which integrates the advantages of both feature-engineered and end-to-end AES models. In experiments, we compare TSLF against a number of strong baselines, and the results demonstrate the effectiveness and robustness of our models. TSLF surpasses all the baselines on five-eighths of prompts and achieves new state-of-the-art average performance when without negative samples. After adding some adversarial essays to the original datasets, TSLF outperforms the feature-engineered and end-to-end baselines to a great extent, and shows great robustness.

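Motivation and Objectives

Improve AES by leveraging both handcrafted features and deep semantic representations.
Detect adversarial AES inputs, such as essays with permuted sentences and prompt-irrelevant essays.
Combine the Stage 1 scores with feature-engineered features in a boosting model to improve robustness and accuracy.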

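Proposed Method

Sentence embeddings are derived from a pretrained BERT model; each sentence vector is computed by averaging the penultimate-layer hidden states.
In Stage 1, LSTM-based encoders compute three scores, the semantic score Se, the coherence score Ce, and the prompt-relevant score Pe, each trained with an MSE loss.
In Stage 2, Se, Ce, and Pe are concatenated with the handcrafted features and fed to an XGBoost regression model, which outputs the final score.
Grammar Error Correction (GEC) and spell checking are introduced as part of the handcrafted feature pipeline.
During training, ASAP scores are normalized to the 0 to 1 range; at test time, predictions are scaled back to the original range.
The Stage 1 components are trained with Adam, and early stopping is applied to the boosting stage.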
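
The data-handling steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`mean_pool`, `normalize_score`, `denormalize_score`, `stage_two_features`) are hypothetical, the token vectors are toy numbers standing in for BERT penultimate-layer hidden states, the score range 2 to 12 is chosen only as an example, and clipping and rounding on scale-back are assumptions of this sketch. The LSTM encoders and the XGBoost regressor are omitted.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token-level hidden states into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def normalize_score(score: float, low: float, high: float) -> float:
    """Min-max scale a raw score into the 0 to 1 range used for training."""
    return (score - low) / (high - low)


def denormalize_score(pred: float, low: float, high: float) -> int:
    """Scale a prediction back to the original score range.

    Clipping to the 0 to 1 range and rounding to an integer are
    assumptions of this sketch.
    """
    clipped = min(max(pred, 0.0), 1.0)
    return int(round(low + clipped * (high - low)))


def stage_two_features(se: float, ce: float, pe: float,
                       handcrafted: Sequence[float]) -> List[float]:
    """Concatenate the three Stage 1 scores with handcrafted features.

    The resulting row is what a boosted regressor (XGBoost in the paper)
    would consume in Stage 2; the regressor itself is not shown.
    """
    return [se, ce, pe, *handcrafted]


# Toy usage with made-up numbers.
sentence_vec = mean_pool([[1.0, 2.0], [3.0, 4.0]])      # [2.0, 3.0]
target = normalize_score(7, low=2, high=12)             # 0.5
final = denormalize_score(0.5, low=2, high=12)          # 7
row = stage_two_features(0.61, 0.58, 0.70, [312.0, 4.0])
```

Keeping the Stage 1 scores as ordinary columns next to the handcrafted features means the Stage 2 model can weigh a low coherence or prompt-relevance score against surface features such as length.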

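Experimental Results

Research Questions

RQ1: Can integrating deep encoded features with handcrafted features improve AES performance over fully end-to-end or purely feature-based methods?
RQ2: Can the coherence and prompt-relevance signals detect adversarial inputs such as essays with permuted sentences and prompt-irrelevant essays?
RQ3: Does combining Se, Ce, and Pe with handcrafted features in a boosting model yield robust performance under adversarial conditions?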

Key Findings

Scores are quadratic weighted kappa (QWK) per ASAP prompt; higher is better.

Model	prompt1	prompt2	prompt3	prompt4	prompt5	prompt6	prompt7	prompt8	Average
EASE(SVR)	0.781	0.621	0.630	0.749	0.782	0.771	0.727	0.534	0.699
EASE(BLRR)	0.761	0.606	0.621	0.742	0.784	0.775	0.730	0.617	0.705
CNN	0.804	0.656	0.637	0.762	0.752	0.765	0.750	0.680	0.726
LSTM	0.808	0.697	0.689	0.805	0.818	0.827	0.811	0.598	0.756
CNN+LSTM	0.821	0.688	0.694	0.805	0.807	0.819	0.808	0.644	0.761
TSLF-1	0.757	0.698	0.725	0.796	0.810	0.783	0.727	0.544	0.730
TSLF-2	0.808	0.718	0.693	0.698	0.771	0.720	0.722	0.616	0.718
TSLF-ALL	0.852	0.736	0.731	0.801	0.823	0.792	0.762	0.684	0.773
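
As a quick sanity check on the table above, the following sketch recomputes the TSLF-ALL average and lists the prompts on which TSLF-ALL beats every baseline. The numbers are copied from the table; the variable names and the choice to treat the five non-TSLF rows as the baselines are assumptions of this illustration.

```python
# Per-prompt scores (prompt1..prompt8) copied from the table above.
baselines = {
    "EASE(SVR)":  [0.781, 0.621, 0.630, 0.749, 0.782, 0.771, 0.727, 0.534],
    "EASE(BLRR)": [0.761, 0.606, 0.621, 0.742, 0.784, 0.775, 0.730, 0.617],
    "CNN":        [0.804, 0.656, 0.637, 0.762, 0.752, 0.765, 0.750, 0.680],
    "LSTM":       [0.808, 0.697, 0.689, 0.805, 0.818, 0.827, 0.811, 0.598],
    "CNN+LSTM":   [0.821, 0.688, 0.694, 0.805, 0.807, 0.819, 0.808, 0.644],
}
tslf_all = [0.852, 0.736, 0.731, 0.801, 0.823, 0.792, 0.762, 0.684]


def average(scores):
    """Mean of the per-prompt scores, rounded to three decimals."""
    return round(sum(scores) / len(scores), 3)


# Best baseline score for each prompt (column-wise maximum).
best_baseline = [max(column) for column in zip(*baselines.values())]

# 1-based prompt numbers where TSLF-ALL is strictly better than all baselines.
wins = [i + 1
        for i, (ours, best) in enumerate(zip(tslf_all, best_baseline))
        if ours > best]

print("TSLF-ALL average:", average(tslf_all))  # 0.773
print("Prompts won:", wins)                    # [1, 2, 3, 5, 8]
```

Averages recomputed this way for the baseline rows can differ from the table in the third decimal, presumably because the published averages were computed before the per-prompt values were rounded.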

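TSLF-ALL outperforms the baselines on five of the eight prompts and achieves the best average performance on ASAP when no adversarial samples are added.
Thanks to the coherence and prompt-relevant signals, TSLF-ALL remains more robust than the baselines when adversarial samples are added.
In the ablation study, using the last hidden state for the LSTM-based scores performs better than using the mean of the hidden states.
Grammar features obtained with GEC, together with comprehensive handcrafted features, improve AES effectiveness more than spell checking alone.
Under adversarial inputs, the end-to-end and feature-based baselines fail to maintain performance, whereas TSLF-ALL shows strong robustness.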
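
To make the adversarial setting concrete, here is a hedged sketch of how the two kinds of adversarial essays named in the abstract (permuted sentences and prompt-irrelevant essays) might be constructed. The naive period-based sentence splitter, the function names, and the seeded shuffle are all assumptions of this illustration; this review does not describe the paper's actual construction procedure.

```python
import random
from typing import Dict, List, Tuple


def split_sentences(essay: str) -> List[str]:
    """Naive splitter on '.', for illustration only."""
    return [s.strip() + "." for s in essay.split(".") if s.strip()]


def permute_sentences(essay: str, seed: int = 0) -> str:
    """Return the essay with its sentences in a different order.

    Essays with fewer than two distinct sentences are returned unchanged,
    because no reordering would alter them.
    """
    sentences = split_sentences(essay)
    if len(set(sentences)) < 2:
        return essay
    rng = random.Random(seed)
    shuffled = sentences[:]
    while shuffled == sentences:  # reshuffle until the order really changes
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def prompt_irrelevant_pairs(
        essays_by_prompt: Dict[int, List[str]]) -> List[Tuple[int, str]]:
    """Pair each essay with a prompt id it was NOT written for."""
    prompt_ids = sorted(essays_by_prompt)
    if len(prompt_ids) < 2:
        raise ValueError("need essays from at least two prompts")
    pairs = []
    for idx, pid in enumerate(prompt_ids):
        other = prompt_ids[(idx + 1) % len(prompt_ids)]  # next prompt, cyclic
        for essay in essays_by_prompt[pid]:
            pairs.append((other, essay))
    return pairs


essay = ("Computers help people learn. They also connect families. "
         "Some people disagree.")
adversarial = permute_sentences(essay, seed=1)
pairs = prompt_irrelevant_pairs(
    {1: ["Essay about computers."], 2: ["Essay about libraries."]})
```

A permuted essay keeps exactly the same words and surface statistics as the original, which is why length- or vocabulary-based features alone cannot flag it and a separate coherence signal is useful.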

This review was generated by AI and checked by a human editor.