QUICK REVIEW

[Paper Review] Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach

Babak Naderi, Tobias Hoßfeld | arXiv (Cornell University) | Mar 25, 2020

Speech and Audio Processing | References: 12 | Cited by: 10

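One-Sentence Summary

This study examines how the number of crowdsourced votes affects the reliability and validity of subjective speech quality assessment conducted according to ITU-T Rec. P.808. In three crowdsourcing experiments on different platforms, covering three speech datasets, the authors compare crowdsourced MOS values against a laboratory gold standard and find that 60 votes per condition provide sufficient reliability and validity, with only marginal gains beyond that threshold.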

ABSTRACT

The subjective quality of transmitted speech is traditionally assessed in a controlled laboratory environment according to ITU-T Rec. P.800. In turn, with crowdsourcing, crowdworkers participate in a subjective online experiment using their own listening device, and in their own working environment. Despite such less controllable conditions, the increased use of crowdsourcing micro-task platforms for quality assessment tasks has pushed a high demand for standardized methods, resulting in ITU-T Rec. P.808. This work investigates the impact of the number of judgments on the reliability and the validity of quality ratings collected through crowdsourcing-based speech quality assessments, as an input to ITU-T Rec. P.808. Three crowdsourcing experiments on different platforms were conducted to evaluate the overall quality of three different speech datasets, using the Absolute Category Rating procedure. For each dataset, the Mean Opinion Scores (MOS) are calculated using differing numbers of crowdsourcing judgements. Then the results are compared to MOS values collected in a standard laboratory experiment, to assess the validity of crowdsourcing approach as a function of number of votes. In addition, the reliability of the average scores is analyzed by checking inter-rater reliability, gain in certainty, and the confidence of the MOS. The results provide a suggestion on the required number of votes per condition, and allow to model its impact on validity and reliability.

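Motivation and Objectives

Assess how the number of votes affects reliability and validity in crowdsourced speech quality assessment.
Compare crowdsourced results, collected following ITU-T Rec. P.808, against a laboratory gold standard.
Determine the minimum number of votes per condition needed for reliable and valid MOS estimates.
Evaluate inter-rater reliability, confidence interval width, and correlation with laboratory data as functions of the number of votes.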

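Proposed Method

Conduct three crowdsourcing experiments on Amazon Mechanical Turk, Prolific, and a German platform, following the ITU-T Rec. P.808 procedure.
Rate the quality of three ITU-T P.863 datasets (401, 501, 701) using Absolute Category Rating (ACR).
Use repeated-sampling simulation to obtain MOS values for different numbers of votes per condition (n = 25 to 200).
Compute Spearman's rank correlation coefficient and root mean square error (RMSE) between crowdsourced and laboratory MOS to assess validity.
Compute confidence interval widths with non-parametric bootstrapping to quantify uncertainty.
Assess inter-rater reliability (IRR) via Spearman's rank correlation between each individual rater and the group mean per condition.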
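
The resampling analysis described in this list can be illustrated with a small simulation. The sketch below is not the authors' code: the vote data are synthetic (integer ACR votes drawn around hypothetical laboratory MOS values), and the helper names `ranks`, `spearman`, `rmse`, and `bootstrap_ci_width` are made up for this example. It draws n votes per condition, then reports Spearman's rank correlation and RMSE against the laboratory MOS, plus the mean width of a percentile bootstrap confidence interval.

```python
import math
import random
import statistics


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(x, y):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(statistics.fmean((a - b) ** 2 for a, b in zip(x, y)))


def bootstrap_ci_width(votes, rng, n_boot=500, alpha=0.05):
    """Width of the percentile bootstrap CI for the mean of `votes`."""
    means = sorted(
        statistics.fmean(rng.choices(votes, k=len(votes))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return hi - lo


# Synthetic data (assumption): 10 conditions, 200 integer votes each on a 1-5 scale.
rng = random.Random(0)
lab_mos = [1.5 + 3.0 * i / 9 for i in range(10)]
votes = [
    [min(5, max(1, round(rng.gauss(m, 1.0)))) for _ in range(200)]
    for m in lab_mos
]

for n in (25, 60, 200):
    subsets = [rng.sample(v, n) for v in votes]  # n votes per condition
    mos = [statistics.fmean(s) for s in subsets]
    width = statistics.fmean(bootstrap_ci_width(s, rng) for s in subsets)
    print(f"n={n:3d}  rho={spearman(mos, lab_mos):.3f}  "
          f"RMSE={rmse(mos, lab_mos):.3f}  mean CI width={width:.3f}")
```

In a real analysis the draw for each n would be repeated many times and the metrics averaged; a single draw per n is used here only to keep the sketch short.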

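Experimental Results

Research Questions

RQ1: How does the number of votes per condition affect the validity of crowdsourced MOS relative to the laboratory gold standard?
RQ2: What is the minimum number of votes needed for stable and reliable MOS estimates in crowdsourced speech quality assessment?
RQ3: How do confidence interval width and inter-rater reliability change as the number of votes increases?
RQ4: To what extent do dataset-specific differences (such as language or distortion type) affect the number of votes needed for reliable results?
RQ5: Does data cleaning (such as removing workers with low accuracy or hearing impairment) significantly improve reliability and validity?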

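Key Findings

Spearman's rank correlation between crowdsourced and laboratory MOS ranges from 0.89 to 0.97, indicating high validity for all datasets.
The RMSE between crowdsourced and laboratory MOS drops from 0.48 to 0.32 for each of the three datasets (401, 501, 701), with negligible improvement beyond 60 votes.
Beyond 60 votes per condition, the confidence interval width falls below 0.4 and stabilizes; reaching W(n) < 0.3 requires at least 115 votes.
Inter-rater reliability (IRR) levels off after 60–100 votes, with no notable gain thereafter.
Dataset 501 shows a lower correlation (0.89), possibly due to a language mismatch (Swiss German material rated by crowdworkers from Germany), whereas dataset 701 has the highest IRR (0.777), possibly because of stricter data cleaning.
A first-order mapping reduces the bias for dataset 401, lowering the RMSE to 0.17, which suggests that post-processing can further improve validity.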
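
To make the first-order mapping concrete, here is a minimal sketch that fits a linear function from crowdsourced MOS to laboratory MOS by ordinary least squares and compares the RMSE before and after mapping. The MOS values are hypothetical and the function names `fit_first_order` and `rmse` are illustrative; the summary does not specify how the authors fitted their mapping, so least squares is an assumption.

```python
import math
import statistics


def fit_first_order(x, y):
    """Least-squares fit of y ~ a * x + b; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def rmse(x, y):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(statistics.fmean((p - q) ** 2 for p, q in zip(x, y)))


# Hypothetical per-condition MOS values, not taken from the paper.
lab_mos = [1.4, 2.0, 2.7, 3.1, 3.8, 4.3]
crowd_mos = [1.9, 2.3, 2.8, 3.2, 3.6, 4.0]

a, b = fit_first_order(crowd_mos, lab_mos)
mapped = [a * m + b for m in crowd_mos]

print(f"mapping: lab ~ {a:.3f} * crowd + {b:.3f}")
print(f"RMSE before mapping: {rmse(crowd_mos, lab_mos):.3f}")
print(f"RMSE after mapping:  {rmse(mapped, lab_mos):.3f}")
```

Because the identity mapping is itself a first-order function, a least-squares fit on the same data can never increase the RMSE; how much it helps depends on how systematic the bias between crowd and laboratory scores is.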

This review was generated by AI and checked by a human editor.