QUICK REVIEW

[Paper Review] VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models

Dao Xuan-Quy, Ngoc-Bich Le | arXiv (Cornell University) | May 20, 2023

Topic Modeling | Cited by 40

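One-line summary

This paper introduces VNHSGE, a dataset for evaluating large language models across nine subjects, comprising over 19,000 multiple-choice questions and 300 literary essays. It includes both text and image data and reports benchmark results for ChatGPT and BingChat.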

ABSTRACT

The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, though, especially in the areas of mathematics, physics, chemistry, and biology. The VNHSGE dataset seeks to provide an adequate benchmark for assessing the abilities of LLMs with its wide-ranging coverage and variety of activities. We intend to promote future developments in the creation of LLMs by making this dataset available to the scientific community, especially in resolving LLMs' limits in disciplines involving mathematics and the natural sciences.

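Motivation and objectives

Build a benchmark dataset from the Vietnamese National High School Graduation Examination (VNHSGE) and comparable tests.
Cover nine subjects (mathematics, literature, English, physics, chemistry, biology, history, geography, and civic education) across a variety of question types.
Provide 300 literary essays and over 19,000 multiple-choice questions for evaluating LLMs.
Enable comparison between LLMs (e.g., ChatGPT, BingChat) and Vietnamese students to identify gaps and strengths.
Offer bilingual Vietnamese–English versions and formats to promote broad accessibility and evaluation.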

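Proposed method

Collect official and reference questions from VMET exams (2019–2023) and similar tests.
Convert all material (formulas, tables, images) to text, keep images in a separate image folder, and convert content to LaTeX where needed.
Provide Word and JSON formats in a Vietnamese version (VNHSGE-V) and an English version (VNHSGE-E), using GPT-4/ChatGPT for translation.
Include detailed step-by-step explanations and answers written by qualified teachers rather than crowd workers.
Translate and format the data into an LLM-compatible form that supports both text-only input and input combined with images.
Evaluate LLM performance using ChatGPT and BingChat, and compare it with the score distribution of Vietnamese students.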
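
The evaluation step above (scoring multiple-choice answers per subject, then placing a model's score within a student score distribution) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the JSON field names (`subject`, `question`, `choices`, `answer`), the helper functions, and all sample values are assumptions made up for the example, and the dataset's actual schema may differ.

```python
import json
from collections import defaultdict

# Hypothetical records: the field names ("subject", "question", "choices",
# "answer") are assumptions for illustration, not the dataset's actual schema.
SAMPLE_JSON = """
[
  {"subject": "physics", "question": "Q1 text", "choices": ["A", "B", "C", "D"], "answer": "B"},
  {"subject": "physics", "question": "Q2 text", "choices": ["A", "B", "C", "D"], "answer": "D"},
  {"subject": "history", "question": "Q3 text", "choices": ["A", "B", "C", "D"], "answer": "A"}
]
"""


def accuracy_by_subject(records, predictions):
    """Return {subject: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record, predicted in zip(records, predictions):
        total[record["subject"]] += 1
        # Normalize the predicted letter so "a" and " A " both match "A".
        if predicted.strip().upper() == record["answer"]:
            correct[record["subject"]] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


def fraction_of_students_at_or_below(model_score, student_scores):
    """Share of student scores that the model score matches or exceeds."""
    if not student_scores:
        raise ValueError("student_scores must not be empty")
    return sum(s <= model_score for s in student_scores) / len(student_scores)


records = json.loads(SAMPLE_JSON)
model_answers = ["B", "A", "a"]  # placeholder model outputs, not real results
per_subject = accuracy_by_subject(records, model_answers)
```

Reporting accuracy per subject rather than as a single aggregate keeps the contrast between the humanities and the mathematics and science subjects visible, which is the comparison the review emphasizes.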

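Experimental results

Research questions

RQ1: How do LLMs perform across the nine subject areas of the VNHSGE benchmark?
RQ2: Do LLMs reach human-level performance in literature, English, history, geography, and civic education, and where do they lag in mathematics and the sciences?
RQ3: What are the strengths and limitations of current LLMs when handling Vietnamese high school exam content?
RQ4: Can VNHSGE be used to guide future LLM development, particularly in mathematics and the natural sciences?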

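Key findings

ChatGPT and BingChat reach human-level performance in literature, English, history, geography, and civic education.
LLMs still fall short of humans on mathematics, physics, chemistry, and biology tasks.
The dataset offers broad coverage and diverse tasks, supporting robust benchmarking of LLMs on real Vietnamese exams.
The bilingual (Vietnamese–English) version facilitates cross-lingual evaluation and comparison across models.
Questions come with explanations and step-by-step solutions that support error analysis and reasoning improvement.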

This review was generated by AI and checked by a human editor.