QUICK REVIEW

[Paper Review] Uncovering bias in the PlantVillage dataset

Mehmet A. Noyan|arXiv (Cornell University)|Jun 9, 2022

Smart Agriculture and AI|Cited by 22

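One-Sentence Summary

By training a model on only 8 pixels taken from each image's background, this study exposes a substantial bias in the PlantVillage dataset: the model reaches 49.0% accuracy on the held-out test set, far above the 2.6% expected from random guessing, which points to background/capture bias. The study also shows that removing the background does not fully eliminate the bias, and it discusses mitigation strategies.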

ABSTRACT

We report our investigation on the use of the popular PlantVillage dataset for training deep learning based plant disease detection models. We trained a machine learning model using only 8 pixels from the PlantVillage image backgrounds. The model achieved 49.0% accuracy on the held-out test set, well above the random guessing accuracy of 2.6%. This result indicates that the PlantVillage dataset contains noise correlated with the labels and deep learning models can easily exploit this bias to make predictions. Possible approaches to alleviate this problem are discussed.

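Research Motivation and Goals

Assess whether the PlantVillage dataset contains bias that machine learning models can exploit.
Quantify how much background information contributes to disease classification, using an 8-pixel background subset.
Compare whether the bias persists in related datasets when the background is removed or manipulated.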

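Proposed Method

Create PlantVillage_8px by extracting 8 pixels from each image (the four corners and the midpoints of the four edges).
Train a random forest classifier with default hyperparameters on PlantVillage_8px, using an 80/20 train/test split.
Compare performance against the random-guessing baseline (100/38 ≈ 2.6%).
Extend the analysis to PlantVillage_blur, PlantVillage_fg_blur, and PlantVillage_bg_blur to assess the effect of background/capture bias.
Apply the same evaluation to MNIST_8px, which has 10 classes, as a control.
Discuss the sources of bias and their implications for dataset design and model evaluation.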
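
The 8-pixel extraction step can be sketched roughly as follows. This is a minimal illustration and not the author's code: the names `extract_8px` and `flatten_features` are hypothetical, and the exact coordinates (integer division for the midpoints) and the 3-channel RGB layout are assumptions, since the summary only states that the four corners and four edge midpoints are used. The function accepts any image indexable as `image[row][col]`, such as a nested list or a NumPy array.

```python
def extract_8px(image):
    """Return the 4 corner pixels and 4 edge-midpoint pixels of an image.

    `image` is anything indexable as image[row][col], e.g. nested lists
    or a NumPy array of shape (H, W, C). The exact indices are an
    assumption; the source names only "corners" and "edge midpoints".
    """
    height = len(image)
    width = len(image[0])
    last_row, last_col = height - 1, width - 1
    mid_row, mid_col = height // 2, width // 2

    coords = [
        (0, 0), (0, last_col), (last_row, 0), (last_row, last_col),  # corners
        (0, mid_col), (last_row, mid_col),  # top and bottom edge midpoints
        (mid_row, 0), (mid_row, last_col),  # left and right edge midpoints
    ]
    return [image[r][c] for r, c in coords]


def flatten_features(pixels):
    """Flatten the 8 pixels into one feature vector (24 values for RGB)."""
    return [int(channel) for pixel in pixels for channel in pixel]


# Tiny synthetic 3x5 "image" whose pixels encode their own coordinates.
demo = [[(r, c, 0) for c in range(5)] for r in range(3)]
features = flatten_features(extract_8px(demo))
print(len(features))  # 24
```

Feature vectors built this way could then be passed to a classifier. One plausible setup, consistent with the steps above but not confirmed as the author's exact code, is scikit-learn's `RandomForestClassifier()` with default arguments after a `train_test_split(..., test_size=0.2)`.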

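Experimental Results

Research Questions

RQ1: Does the PlantVillage dataset contain bias that allows high accuracy from background/capture information alone?
RQ2: Does removing image backgrounds eliminate the bias in models derived from PlantVillage?
RQ3: When PlantVillage is compared with an unbiased dataset such as MNIST, how does background manipulation affect the bias?
RQ4: What are the practical implications for dataset design and for reporting model performance in plant disease detection?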

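Main Findings

The model trained on PlantVillage_8px reaches 49.0% accuracy on the test set, far above random guessing (2.6%).
On MNIST_8px, the same model reaches 11.7% accuracy, close to random guessing (10%).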
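With the background removed (PlantVillage_fg_blur), the bias is similar to that of the dataset with backgrounds (reported figures: 11.7%, 10.0%, and 10.8%, versus 2.6%).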
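Capture bias affects both the foreground and the background, so removing background information does not fully eliminate the bias in PlantVillage.
Augmenting PlantVillage with field data may introduce new bias if the data sources differ (for example, field versus laboratory images), so it does not address the underlying bias problem.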
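
To make the gap between the two 8-pixel results concrete, the accuracies can be expressed as a multiple of the chance level. The sketch below is illustrative only: the helper names are hypothetical, the accuracies and class counts are the ones quoted in this review, and "chance" is assumed to mean uniform random guessing over the classes (100/38 for PlantVillage, 100/10 for MNIST).

```python
def chance_accuracy(n_classes):
    """Accuracy in percent expected from uniform random guessing."""
    return 100.0 / n_classes


def lift_over_chance(accuracy_pct, n_classes):
    """How many times better than uniform random guessing a result is."""
    return accuracy_pct / chance_accuracy(n_classes)


# Figures quoted in this review: (test accuracy in %, number of classes).
results = {
    "PlantVillage_8px": (49.0, 38),
    "MNIST_8px": (11.7, 10),
}

for name, (acc, n_classes) in results.items():
    print(
        f"{name}: chance={chance_accuracy(n_classes):.1f}%, "
        f"lift={lift_over_chance(acc, n_classes):.2f}x"
    )
# PlantVillage_8px: chance=2.6%, lift=18.62x
# MNIST_8px: chance=10.0%, lift=1.17x
```

Under these assumptions, 8 border pixels are close to uninformative on MNIST but carry roughly 18 times the chance-level signal on PlantVillage, which is the contrast the findings describe.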


This review was generated by AI and reviewed by a human editor.