QUICK REVIEW

[Paper Review] Performance Comparison of Large Language Models on VNHSGE English Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard

Xuan-Quy Dao|arXiv (Cornell University)|Jul 5, 2023

Topic Modeling | Citations: 26

One-Line Summary

BingChat achieves the highest accuracy on Vietnam's VNHSGE English dataset (92.4% on average), ahead of Bard (86%) and ChatGPT (79.2%). All three LLMs outperform Vietnamese students in average English proficiency.

ABSTRACT

This paper presents a performance comparison of three large language models (LLMs), namely OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard, on the VNHSGE English dataset. The performance of BingChat, Bard, and ChatGPT (GPT-3.5) is 92.4%, 86%, and 79.2%, respectively. The results show that BingChat is better than ChatGPT and Bard. Therefore, BingChat and Bard can replace ChatGPT while ChatGPT is not yet officially available in Vietnam. The results also indicate that BingChat, Bard and ChatGPT outperform Vietnamese students in English language proficiency. The findings of this study contribute to the understanding of the potential of LLMs in English language education. The remarkable performance of ChatGPT, BingChat, and Bard demonstrates their potential as effective tools for teaching and learning English at the high school level.

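Motivation and Objectives

Evaluate how three leading LLMs perform on the VNHSGE English dataset, which targets the Vietnamese high school level.
Compare LLM performance with the results of Vietnamese students to assess relative proficiency.
Explore the potential uses and impact of LLMs in teaching and learning English in Vietnam.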

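Proposed Method

Answer 250 multiple-choice questions (MCQs) from the VNHSGE English dataset (2019–2023) using zero-shot prompting.
Format the prompt so that the output is structured: the chosen option (A–D) and an explanation.
Grade each answer against the ground-truth solution with a binary evaluation function G (correct or incorrect).
Compute the bounds LLM_B (best case) and LLM_W (worst case) across ChatGPT, BingChat, and Bard.
Analyze stability across years and report the aggregate score (AVG) for each model.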
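
The grading and bounding steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`grade`, `accuracy`, `bounds`), the toy answer key, and the placeholder model names are all hypothetical. It also assumes that LLM_B counts a question as correct when at least one model answers it correctly and that LLM_W counts it only when every model does, which is one natural reading of "best case" and "worst case" across models.

```python
from typing import Dict, List, Tuple


def grade(answer: str, truth: str) -> int:
    """Binary grading function G: 1 if the chosen option matches the key, else 0."""
    return int(answer.strip().upper() == truth.strip().upper())


def accuracy(answers: List[str], key: List[str]) -> float:
    """Percentage of questions graded as correct for one model."""
    return 100.0 * sum(grade(a, t) for a, t in zip(answers, key)) / len(key)


def bounds(model_answers: Dict[str, List[str]], key: List[str]) -> Tuple[float, float]:
    """Return (LLM_B, LLM_W) in percent under the any-correct / all-correct assumption."""
    per_question = [
        [grade(answers[i], key[i]) for answers in model_answers.values()]
        for i in range(len(key))
    ]
    best = 100.0 * sum(max(row) for row in per_question) / len(key)
    worst = 100.0 * sum(min(row) for row in per_question) / len(key)
    return best, worst


# Toy data (hypothetical, not taken from the dataset).
key = ["A", "C", "B", "D"]
toy = {
    "model_x": ["A", "C", "B", "A"],
    "model_y": ["A", "B", "B", "D"],
    "model_z": ["A", "C", "D", "A"],
}

for name, answers in toy.items():
    print(name, accuracy(answers, key))
print("LLM_B, LLM_W:", bounds(toy, key))
```

Under this reading, LLM_B can never fall below the best single model and LLM_W can never exceed the weakest one.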

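Experimental Results

Research Questions

RQ1: How do ChatGPT, BingChat, and Bard perform on the VNHSGE English dataset at the Vietnamese high school level?
RQ2: How do these LLMs compare with the English proficiency of Vietnamese students?
RQ3: What potential do LLMs have for teaching and learning English in Vietnam?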

Key Findings

Model	2019	2020	2021	2022	2023	Average
ChatGPT	76	86	76	80	78	79.2
BingChat	92	96	86	94	94	92.4
Bard	82	94	82	86	86	86
LLM_W	66	82	68	74	70	72
LLM_B	96	100	94	96	100	97.2
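
As a quick sanity check, the average column can be recomputed from the yearly figures in the table. The sketch below copies the yearly accuracies (in percent) from the table; the dictionary and function names are illustrative. It also reports the min-to-max spread per row as a rough indicator of year-to-year variation.

```python
# Yearly accuracy (%) for 2019-2023, copied from the table above.
scores = {
    "ChatGPT": [76, 86, 76, 80, 78],
    "BingChat": [92, 96, 86, 94, 94],
    "Bard": [82, 94, 82, 86, 86],
    "LLM_W": [66, 82, 68, 74, 70],
    "LLM_B": [96, 100, 94, 96, 100],
}


def summarize(yearly):
    """Return (mean, max - min) for one row of yearly accuracies."""
    return sum(yearly) / len(yearly), max(yearly) - min(yearly)


summary = {name: summarize(values) for name, values in scores.items()}
for name, (avg, spread) in summary.items():
    print(f"{name:8s} avg={avg:.1f} spread={spread}")
```

The recomputed means match the table (79.2, 92.4, 86, 72, 97.2). The spreads are 10 points for ChatGPT and BingChat, 12 for Bard, 16 for LLM_W, and 6 for LLM_B.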

BingChat achieves the highest average accuracy of the three models (AVG 92.4%); the best-case bound LLM_B averages 97.2%.
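ChatGPT averages 79.2% over 2019–2023.
Bard averages 86% over 2019–2023.
LLM_W (the worst case across models) averages 72%.
On the 10-point English score spectrum presented, all three LLMs outperform Vietnamese students (average LLM scores are about 7.92–9.24, while the Vietnamese AVS is about 3.8–5.84 depending on the year).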
The results suggest that LLM performance is stable across years, although BingChat varies more from year to year.
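
The comparison on the 10-point scale can be reproduced with simple arithmetic. The sketch below assumes the 10-point score is the accuracy percentage divided by 10, which is consistent with the figures quoted (79.2% and 92.4% map to 7.92 and 9.24). It uses only the reported AVS range, since per-year student averages are not listed here; all names are illustrative.

```python
# Average accuracy (%) per model, as reported in the findings.
AVG_ACCURACY = {"ChatGPT": 79.2, "BingChat": 92.4, "Bard": 86.0}

# Reported range of the Vietnamese students' average score (AVS), 10-point scale.
AVS_RANGE = (3.8, 5.84)


def to_ten_point(accuracy_pct: float) -> float:
    """Assumed conversion: 10-point score = accuracy percentage / 10."""
    return round(accuracy_pct / 10, 2)


llm_scores = {model: to_ten_point(acc) for model, acc in AVG_ACCURACY.items()}

# Margin of each model over the highest reported student average.
margins = {model: round(score - AVS_RANGE[1], 2) for model, score in llm_scores.items()}

for model, score in llm_scores.items():
    print(f"{model:8s} score={score:.2f} margin over best AVS={margins[model]:.2f}")
```

Even against the upper end of the AVS range (5.84), the lowest-scoring model, ChatGPT, leads by roughly 2 points.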

This review was generated by AI and checked by a human editor.