QUICK REVIEW

[Paper Review] Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE

Valdemar Švábenský, Brendan Flanagan|arXiv (Cornell University)|Feb 19, 2026
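Online Learning and Analytics | Cited by 0
One-Sentence Summary

TL;DR: This paper presents a systematic survey (2020–2024) of open learning analytics datasets published alongside papers at LAK, EDM, and AIED. It identifies 172 unique datasets and provides practical guidelines and an annotated dataset inventory to promote open data practices.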

ABSTRACT

Open datasets play a crucial role in three research domains that intersect data science and education: learning analytics, educational data mining, and artificial intelligence in education. Researchers in these domains apply computational methods to analyze data from educational contexts, aiming to better understand and improve teaching and learning. Providing open datasets alongside research papers supports reproducibility, collaboration, and trust in research findings. It also provides individual benefits for authors, such as greater visibility, credibility, and citation potential. Despite these advantages, the availability of open datasets and the associated practices within the learning analytics research communities, especially at their flagship conference venues, remain unclear. We surveyed available datasets published alongside research papers in learning analytics. We manually examined 1,125 papers from three flagship conferences (LAK, EDM, and AIED) over the past five years. We discovered, categorized, and analyzed 172 datasets used in 204 publications. Our study presents the most comprehensive collection and analysis of open educational datasets to date, along with the most detailed categorization. Of the 172 datasets identified, 143 were not captured in any prior survey of open data in learning analytics. We provide insights into the datasets' context, analytical methods, use, and other properties. Based on this survey, we summarize the current gaps in the field. Furthermore, we list practical recommendations, advice, and 8-item guidelines under the acronym PRACTICE with a checklist to help researchers publish their data. Lastly, we share our original dataset: an annotated inventory detailing the discovered datasets and the corresponding publications. We hope these findings will support further adoption of open data practices in learning analytics communities and beyond.

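Motivation and Objectives

  • Assess the availability and characteristics of open datasets used in cutting-edge learning analytics (LA) research between 2020 and 2024.
  • Identify data gaps and the challenges that hinder open data practices in LA research.
  • Develop practical guidelines to help researchers publish and reuse open datasets in LA.
  • Provide an annotated inventory of the discovered datasets and related publications to improve reproducibility and reuse.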

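Proposed Method

  • Conduct a practice-oriented systematic survey of LA datasets following a PRISMA-style process.
  • Review all papers (n = 1,125) published at the three flagship LA venues (LAK, EDM, AIED) from 2020 to 2024.
  • Screen for eligible papers, including only those that use a dataset and make that dataset obtainable.
  • Distinguish candidate papers from candidate datasets, and included papers from included datasets.
  • Extract and record dataset information in a structured inventory, cross-checked by multiple authors.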
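The screening and inventory steps above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' actual tooling: the `Paper` record, its field names, and the toy entries are all hypothetical. It shows why papers and datasets are counted separately: one dataset can be used by several papers, and one paper can use several datasets.

```python
from dataclasses import dataclass

VENUES = {"LAK", "EDM", "AIED"}
YEARS = range(2020, 2025)  # 2020 through 2024 inclusive


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one screened paper (field names are assumptions)."""
    paper_id: str
    venue: str
    year: int
    uses_dataset: bool
    open_dataset_ids: tuple = ()  # IDs of datasets the paper makes obtainable


def is_included(paper):
    """Eligibility: in-scope venue and year, uses data, and that data is obtainable."""
    return (
        paper.venue in VENUES
        and paper.year in YEARS
        and paper.uses_dataset
        and len(paper.open_dataset_ids) > 0
    )


def build_inventory(papers):
    """Map each included dataset ID to the set of included papers that use it."""
    inventory = {}
    for paper in papers:
        if not is_included(paper):
            continue
        for dataset_id in paper.open_dataset_ids:
            inventory.setdefault(dataset_id, set()).add(paper.paper_id)
    return inventory


def count_included_papers(inventory):
    """Count distinct papers across all datasets (a paper may use several)."""
    return len(set().union(*inventory.values())) if inventory else 0


# Toy records, invented purely for illustration.
toy_papers = [
    Paper("p1", "LAK", 2021, True, ("ds_a",)),
    Paper("p2", "EDM", 2023, True, ("ds_a", "ds_b")),
    Paper("p3", "AIED", 2022, True, ()),        # uses data, but none is obtainable
    Paper("p4", "LAK", 2019, True, ("ds_c",)),  # outside the 2020-2024 window
    Paper("p5", "EDM", 2020, False, ()),        # does not use a dataset
]
toy_inventory = build_inventory(toy_papers)
print(len(toy_inventory), "datasets used in", count_included_papers(toy_inventory), "papers")
```

Keying the inventory by dataset rather than by paper keeps the two counts independent, which is what lets a survey report figures such as "172 datasets used in 204 publications".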
Figure 1. PRISMA flow diagram. Generated using the tool by Haddaway et al. (2022).

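Results

Research Questions

  • RQ1: What open datasets exist in cutting-edge LA research, and what are their characteristics?
  • RQ2: In which contexts or domains are LA datasets lacking or missing (research gaps)?
  • RQ3: What best-practice guidelines can help LA researchers adopt open data practices?
  • RQ4: How can researchers benefit from and reuse the released dataset inventory and related materials?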

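Key Findings

  • Across LAK, EDM, and AIED (2020–2024), the survey identified 172 unique open datasets used in 204 publications.
  • Of these 172 datasets, 143 were not covered by prior surveys, making this study the most comprehensive survey of LA datasets to date.
  • The paper offers insights into the datasets' context, analytical methods, use, and other properties.
  • It proposes practical recommendations and the 8-item PRACTICE guidelines to help researchers publish and share data.
  • It shares an annotated dataset inventory and related materials (analysis code, structured citations) to support reproducibility.
  • It discusses reproducibility and open-science considerations, including barriers such as privacy, de-identification, and access control.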
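The headline counts can be put in proportion with a few lines of arithmetic. The sketch below uses only the numbers reported in the abstract. The variable names are illustrative, and the last ratio rests on an assumption flagged in the comment.

```python
# Counts as reported in the abstract.
papers_examined = 1125
publications_with_open_data = 204
datasets_total = 172
datasets_new = 143

# Datasets that some prior survey had already captured.
datasets_in_prior_surveys = datasets_total - datasets_new

# Share of identified datasets that no prior survey had captured.
share_new = datasets_new / datasets_total

# Assumption: all 204 publications are among the 1,125 examined papers.
share_papers_with_open_data = publications_with_open_data / papers_examined

print(f"Previously surveyed datasets: {datasets_in_prior_surveys}")
print(f"Share of datasets new to this survey: {share_new:.1%}")
print(f"Share of examined papers using open datasets: {share_papers_with_open_data:.1%}")
```

Under that assumption, roughly one in five examined papers used an openly available dataset, while more than four in five of the identified datasets had not been catalogued before.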
Figure 2. Distributions of dataset frequency across educational topics and levels of students.


This review was generated by AI and reviewed by a human editor.