QUICK REVIEW

[Paper Digest] VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models

Dao Xuan-Quy, Ngoc-Bich Le|arXiv (Cornell University)|May 20, 2023

Topic Modeling|Cited by 40

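One-Sentence Summary

The paper introduces the VNHSGE dataset for evaluating large language models across nine subjects. It contains over 19,000 multiple-choice questions and 300 literary essays, combines text with accompanying images, and reports baseline results for ChatGPT and BingChat.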

ABSTRACT

The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, though, especially in the areas of mathematics, physics, chemistry, and biology. The VNHSGE dataset seeks to provide an adequate benchmark for assessing the abilities of LLMs with its wide-ranging coverage and variety of activities. We intend to promote future developments in the creation of LLMs by making this dataset available to the scientific community, especially in resolving LLMs' limits in disciplines involving mathematics and the natural sciences.

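Motivation and Objectives

Build a benchmark dataset from the Vietnamese National High School Graduation Examination (VNHSGE) and comparable exams.
Cover nine subjects (mathematics, literature, English, physics, chemistry, biology, history, geography, civic education) with diverse question types.
Provide 300 literary essays and over 19,000 multiple-choice questions for evaluating LLMs across tasks.
Enable comparison between LLMs (e.g., ChatGPT, BingChat) and Vietnamese students to identify gaps and strengths.
Offer bilingual Vietnamese–English versions and formats for broad accessibility and evaluation.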

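Proposed Method

Collect official and sample exam questions from VMET (2019–2023) and similar exams.
Convert all materials (formulas, tables, images) to text, using LaTeX where needed, and store images in a separate image folder.
Provide Word and JSON formats; produce a Vietnamese version (VNHSGE-V) and an English version (VNHSGE-E), with translation done by GPT-4/ChatGPT.
Include detailed step-by-step solutions and explanations written by qualified teachers rather than crowdworkers.
Translate and format the data for LLM compatibility, supporting both text-only and image-augmented inputs.
Evaluate LLM performance using ChatGPT and BingChat, and compare it with the score distribution of Vietnamese students.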
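
The evaluation step above can be illustrated with a minimal sketch of per-subject scoring for multiple-choice questions. The record fields (`subject`, `answer`, `prediction`) and the function name `accuracy_by_subject` are assumptions made for illustration; they are not the dataset's actual JSON schema or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Return {subject: fraction of records whose prediction matches the answer key}.

    Each record is a dict with hypothetical keys "subject", "answer",
    and "prediction" (option letters such as "A".."D").
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        subject = rec["subject"]
        total[subject] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[subject] += 1
    return {s: correct[s] / total[s] for s in total}


# Toy records for illustration only; they are not taken from the dataset.
sample = [
    {"subject": "history", "answer": "A", "prediction": "a"},
    {"subject": "history", "answer": "C", "prediction": "C"},
    {"subject": "physics", "answer": "B", "prediction": "D"},
    {"subject": "physics", "answer": "D", "prediction": "D "},
]

print(accuracy_by_subject(sample))
```

Keeping the scoring keyed by subject makes it straightforward to report results separately for, say, the social-science and natural-science subjects.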

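Experimental Results

Research Questions

RQ1: How do LLMs perform across the nine subject areas of the VNHSGE benchmark?
RQ2: Do LLMs reach human-level performance in literature, English, history, geography, and civic education? Where do they fall behind (e.g., mathematics and the natural sciences)?
RQ3: What are the strengths and limitations of current LLMs in handling Vietnamese high school exam content?
RQ4: Can VNHSGE guide future LLM development, particularly in mathematics and the natural sciences?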

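Key Findings

ChatGPT and BingChat reach human-level performance in literature, English, history, geography, and civic education.
LLMs still trail humans on mathematics, physics, chemistry, and biology tasks.
The dataset's broad coverage and task variety make benchmarking LLMs on real-world Vietnamese exams more robust.
The bilingual (Vietnamese–English) versions facilitate cross-lingual evaluation and comparison between models.
Questions come with explanations and step-by-step solutions, supporting error analysis and improvements in reasoning.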
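
One way to make the human-level comparison in the findings above concrete is to place a model's exam score within a student score distribution. The sketch below assumes scores on a 10-point scale and a histogram of student counts per score; the function name `share_at_or_below` and all numbers are made up for illustration and are not results from the paper.

```python
def share_at_or_below(model_score, score_counts):
    """Return the fraction of students whose score is <= model_score.

    score_counts maps a score (assumed 10-point scale) to the number
    of students who obtained that score.
    """
    total = sum(score_counts.values())
    if total == 0:
        raise ValueError("score_counts is empty")
    at_or_below = sum(n for score, n in score_counts.items() if score <= model_score)
    return at_or_below / total


# Made-up student histogram for illustration only (100 students in total).
toy_counts = {4.0: 10, 5.0: 20, 6.0: 30, 7.0: 25, 8.0: 10, 9.0: 5}

# Hypothetical model result: 28 of 40 questions correct, rescaled to 10 points.
model_score = 10 * 28 / 40

print(model_score, share_at_or_below(model_score, toy_counts))
```

A share near or above one half would suggest the model scores at least as well as a typical student on that subject, although the threshold for "human level" is a judgment call rather than something this sketch defines.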

This digest was generated by AI and reviewed by a human editor.