QUICK REVIEW

[Paper Review] EEG Synthetic Data Generation Using Probabilistic Diffusion Models

Giulio Tosato, Cesare Maria Dalbagno|arXiv (Cornell University)|Mar 6, 2023

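EEG and Brain-Computer Interfaces | Cited by 8

One-Sentence Summary

The paper proposes generating synthetic EEG data with a denoising diffusion probabilistic model trained on electrode-frequency distribution maps, in order to augment the training data for emotion classification, and shows that the synthetic data can improve classifier accuracy.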

ABSTRACT

Electroencephalography (EEG) plays a significant role in the Brain Computer Interface (BCI) domain, due to its non-invasive nature, low cost, and ease of use, making it a highly desirable option for widespread adoption by the general public. This technology is commonly used in conjunction with deep learning techniques, the success of which is largely dependent on the quality and quantity of data used for training. To address the challenge of obtaining sufficient EEG data from individual participants while minimizing user effort and maintaining accuracy, this study proposes an advanced methodology for data augmentation: generating synthetic EEG data using denoising diffusion probabilistic models. The synthetic data are generated from electrode-frequency distribution maps (EFDMs) of emotionally labeled EEG recordings. To assess the validity of the synthetic data generated, both a qualitative and a quantitative comparison with real EEG data were successfully conducted. This study opens up the possibility for an open-source, accessible and versatile toolbox that can process and generate data in both time and frequency dimensions, regardless of the number of channels involved. Finally, the proposed methodology has potential implications for the broader field of neuroscience research by enabling the creation of large, publicly available synthetic EEG datasets without privacy concerns.

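Research Motivation and Goals

Alleviate the scarcity of EEG-BCI data and meet the need for high-quality synthetic data.
Develop a diffusion-based method for generating EEG-like samples from EFDMs.
Create an open-source toolbox that can process EEG data in both the time and frequency domains.
Assess whether the synthetic data provide information beyond the original dataset.
Evaluate classifier performance in two settings: real data only, and real data combined with synthetic data.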

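Proposed Method

Adapt OpenAI's improved diffusion model to generate 128-channel, 128x128 EFDM-derived images.
Build electrode-frequency distribution maps (EFDMs) from the STFT of the EEG data, with frequencies capped at 100 Hz.
Train a classifier on real data in PyTorch with CrossEntropyLoss, then evaluate the effect of data augmentation.
Train the diffusion model with diffusion_steps=1000 and a linear noise schedule; image_size=128; batch_size=32.
Assess whether the diffusion-generated data merely reproduce the real data or augment them, by comparing classifier performance against unseen real data.
Provide a toolbox hosted on GitHub and discuss potential future optimizations.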
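
The EFDM construction step above can be sketched as follows. This is a minimal illustration and not the authors' toolbox code: the function name `efdm`, the Hann window, the 50% overlap, the time-averaging of STFT magnitudes, and the min-max scaling are all assumptions made for the example. Only the STFT basis and the 100 Hz cap come from the summary. The example signal (32 channels, 256 Hz, a 10 Hz sine plus noise) is synthetic.

```python
import numpy as np

def efdm(eeg, fs, nperseg=256, fmax=100.0):
    """Hypothetical helper: eeg has shape (n_channels, n_samples).

    Returns the kept frequencies and a (n_channels, n_freqs) map in [0, 1].
    """
    hop = nperseg // 2                      # assumed 50% overlap
    window = np.hanning(nperseg)            # assumed Hann window
    n_frames = 1 + (eeg.shape[1] - nperseg) // hop
    # Cut every channel into overlapping windowed frames: (channels, frames, nperseg)
    frames = np.stack(
        [eeg[:, i * hop : i * hop + nperseg] for i in range(n_frames)], axis=1
    ) * window
    spec = np.abs(np.fft.rfft(frames, axis=-1))    # STFT magnitudes
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    keep = freqs <= fmax                           # cap at 100 Hz, as in the summary
    m = spec[:, :, keep].mean(axis=1)              # assumed: average over time frames
    m = (m - m.min()) / (m.max() - m.min() + 1e-12)  # assumed: min-max scaling
    return freqs[keep], m

# Synthetic stand-in for a recording: 32 channels, 4 s at 256 Hz, 10 Hz rhythm plus noise.
rng = np.random.default_rng(0)
fs = 256
t = np.arange(4 * fs) / fs
eeg = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal((32, t.size))
freqs, m = efdm(eeg, fs)
print(m.shape, freqs[np.argmax(m[0])])
```

The resulting electrodes-by-frequency matrix would still need to be resized or tiled to the 128x128 image size the diffusion model expects; how the paper does this is not stated in this summary.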

Figure 1: Progressive addition of Gaussian noise.
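
The progressive noising shown in Figure 1 corresponds to the forward process of a denoising diffusion probabilistic model. The sketch below assumes the standard closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with a linear beta schedule over 1000 steps. The schedule endpoints 1e-4 and 0.02 are the common DDPM defaults and are an assumption here; the summary only states that the schedule is linear. The random 128x128 array stands in for an EFDM image.

```python
import numpy as np

T = 1000                                   # diffusion_steps from the summary
betas = np.linspace(1e-4, 0.02, T)         # assumed endpoints (common DDPM defaults)
alphas_cumprod = np.cumprod(1.0 - betas)   # cumulative signal retention, alpha-bar_t

def q_sample(x0, t, rng):
    """Draw x_t from q(x_t | x_0) in one shot, returning the noise that was used."""
    noise = rng.standard_normal(x0.shape)
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise, noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(128, 128))   # stand-in for one 128x128 EFDM image
for t in (0, 499, 999):
    xt, _ = q_sample(x0, t, rng)
    corr = np.corrcoef(x0.ravel(), xt.ravel())[0, 1]
    print(f"t={t:4d}  alpha_bar={alphas_cumprod[t]:.5f}  corr(x0, xt)={corr:.3f}")
```

The correlation with the clean image falls toward zero as t approaches the final step, which is the behaviour the figure illustrates.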

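Experimental Results

Research Questions

RQ1: Do diffusion-generated EEG samples contain information that is absent from the original training data?
RQ2: Does augmenting with synthetic samples improve classifier performance compared with using real data only?
RQ3: Can the diffusion model generate novel EEG-like samples rather than simply memorizing the training set?
RQ4: How effective is the EFDM-based data representation for diffusion-based EEG synthesis?
RQ5: What are the practical implications and limitations of diffusion-based EEG data augmentation?

Key Findings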

Classifier Type	Max Average Accuracy (%)
Original	91.434
Augmented 40 epochs	92.634
Augmented 60 epochs	92.984

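A classifier trained on real data reaches an average accuracy above 90% on synthetic samples.
Augmenting the real data with diffusion-generated samples raises the maximum average accuracy to 92.634% (40 epochs) and 92.984% (60 epochs).
Mixed training (real plus synthetic data) consistently outperforms models trained on real data only.
The evidence indicates that the synthetic data contain information beyond the original dataset, which supports their use for data augmentation.
With the diffusion model trained for 60 epochs, performance exceeds that of training on real data only.
Synthetic data can be shared without privacy concerns because they are not samples taken directly from individuals.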
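
To put the reported accuracies in perspective, the short calculation below derives the absolute gain in percentage points and the relative reduction in error rate from the three table values. The numbers are taken from the table above; the error-reduction measure is an added illustration and is not a metric reported in the summary.

```python
# Maximum average accuracies (%) from the results table.
results = {
    "Original": 91.434,
    "Augmented 40 epochs": 92.634,
    "Augmented 60 epochs": 92.984,
}

baseline = results["Original"]
gains = {}
for name, acc in results.items():
    if name == "Original":
        continue
    gain = acc - baseline                             # absolute gain, percentage points
    error_reduction = 100 * gain / (100 - baseline)   # share of baseline error removed, %
    gains[name] = (round(gain, 3), round(error_reduction, 1))
    print(f"{name}: +{gain:.3f} points, {error_reduction:.1f}% fewer errors")
```

Under this reading, the gains are 1.200 and 1.550 percentage points, which correspond to removing roughly 14% and 18% of the baseline classifier's errors.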

Figure 2: Progressive subtraction of Gaussian noise.
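
The denoising direction in Figure 2 corresponds to the learned reverse process. As a hedged reference, the standard DDPM form of one reverse step is given below, where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\epsilon_\theta$ is the network's noise prediction. The fixed variance $\sigma_t^2$ is a simplifying assumption here; the improved diffusion model can also learn the variance, and the summary does not say which setting the paper uses.

```latex
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 I\right),
\qquad
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}
\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
```

Sampling starts from pure Gaussian noise $x_T$ and applies this step for $t = T, \dots, 1$ to obtain a synthetic EFDM image.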


This review was generated by AI and checked by human editors.