QUICK REVIEW

[Paper Review] WRENCH: A Comprehensive Benchmark for Weak Supervision

Jieyu Zhang, Yue Yu | arXiv (Cornell University) | Sep 23, 2021

Machine Learning and Data Classification | References: 104 | Cited by: 39

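One-Sentence Summary

WRENCH is a standardized benchmark platform for evaluating weak supervision methods. It comprises 22 real-world datasets, a range of weak supervision sources (real, synthetic, and procedurally generated), and a modular framework, and it is used to compare more than 120 method variants across classification and sequence tagging.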

ABSTRACT

Recent Weak Supervision (WS) approaches have had widespread success in easing the bottleneck of labeling training data for machine learning by synthesizing labels from multiple potentially noisy supervision sources. However, proper measurement and analysis of these approaches remain a challenge. First, datasets used in existing works are often private and/or custom, limiting standardization. Second, WS datasets with the same name and base data often vary in terms of the labels and weak supervision sources used, a significant "hidden" source of evaluation variance. Finally, WS studies often diverge in terms of the evaluation protocol and ablations used. To address these problems, we introduce a benchmark platform, WRENCH, for thorough and standardized evaluation of WS approaches. It consists of 22 varied real-world datasets for classification and sequence tagging; a range of real, synthetic, and procedurally-generated weak supervision sources; and a modular, extensible framework for WS evaluation, including implementations for popular WS methods. We use WRENCH to conduct extensive comparisons over more than 120 method variants to demonstrate its efficacy as a benchmark platform. The code is available at https://github.com/JieyuZ2/wrench.

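Motivation and Goals

Address the lack of a standardized benchmark for weak supervision (WS) by providing a diverse, public benchmark platform.
Enable comprehensive evaluation of WS methods across datasets, supervision sources, and evaluation protocols.
Use procedural and synthetic generators to analyze how the properties of weak supervision affect the performance of WS methods.
Provide a modular codebase with standardized evaluation scripts and baselines to ease future comparisons.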

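Proposed Method

Introduce 22 real-world datasets for classification and sequence tagging, covering diverse domains and labeling functions (LFs).
Provide procedural and synthetic LF generators to systematically explore LF properties (accuracy, propensity, correlation, data dependency).
Provide a unified, extensible Python framework with implementations of popular WS methods and standardized evaluation metrics.
Support more than 120 method variants by combining label models (LM), end models (EM), and joint models with soft or hard labels.
Provide baseline methods for classification and sequence tagging tasks (e.g., MV, DS, DP, MeTaL, FS, HMM, CHMM, ConNet, and BERT variants).
Demonstrate the platform's utility for comparing WS methods and running ablations through extensive experiments across datasets.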
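
To make the label-model step and the soft/hard label distinction concrete, the following is a minimal sketch of majority voting (MV) over a toy label matrix. It is not the WRENCH API: the function names, the `ABSTAIN = -1` convention, the uniform fallback for uncovered examples, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: an LF that does not fire returns -1


def majority_vote_soft(votes, n_classes):
    """Turn one example's LF votes into a probability vector (soft label)."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    total = sum(counts.values())
    if total == 0:
        # No LF fired on this example: fall back to a uniform distribution.
        return [1.0 / n_classes] * n_classes
    return [counts.get(c, 0) / total for c in range(n_classes)]


def to_hard(soft):
    """Collapse a soft label to a hard label (ties go to the lowest class index)."""
    return max(range(len(soft)), key=lambda c: soft[c])


# Toy label matrix: 4 examples x 3 LFs, binary task (hypothetical data).
L = [
    [1, 1, ABSTAIN],
    [0, 1, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [1, 0, ABSTAIN],
]

soft_labels = [majority_vote_soft(row, n_classes=2) for row in L]
hard_labels = [to_hard(p) for p in soft_labels]

for row, soft, hard in zip(L, soft_labels, hard_labels):
    print(row, soft, hard)
```

In a two-stage setup, either `soft_labels` or `hard_labels` would then serve as training targets for an end model; the soft version keeps the disagreement between LFs that the hard version discards.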

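Experimental Results

Research Questions

RQ1: How does standardizing WS benchmarks affect fair comparison across methods and datasets?
RQ2: How do different properties of weak supervision sources (accuracy, propensity, correlation, data dependency) affect the performance of WS methods?
RQ3: How do two-stage (label model + end model) and one-stage (joint) WS methods perform across different tasks and data domains?
RQ4: To what extent does the choice of end model (e.g., a fine-tuned language model) affect WS results, compared with using the label model alone?
RQ5: What guidelines can be offered for choosing LF types and evaluation protocols to obtain robust WS results?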

Key Findings

Dataset	Metric	Best EM (gold labels)	Best Top-1 (EM/LM)	Best Top-2 (EM/LM)	Best Top-3 (EM/LM)	Notes
IMDb	Acc.	R	RC	MeTaL	RC	Top methods vary by dataset
Yelp	Acc.	R	RC	FS	RC	Soft labels beneficial in some cases
Youtube	Acc.	B	MV	MV	RC	End-model choices matter
SMS	F1	B	WMV	MeTaL	WMV	Soft labels often help
AGNews	Acc.	R	DS	MV	WMV	Dataset-dependent results
TREC	Acc.	R	DP	MeTaL	DP	LF types influence outcomes
Spouse	F1	–	FS	MeTaL	MV	Gold unavailable for training labels
CDR	F1	R	MeTaL	DP	DP	Dataset-specific performance
SemEval	Acc.	B	DP	MV	DP	Weak signals vary by dataset
ChemProt	Acc.	B	DP	MV	MV	LF quality varies
Commercial*	F1	MLP	MV	MV	MV	Non-textual data with features only
Tennis Rally*	F1	LR	FS	MeTaL	FS	Procedural LFs affect results
Basketball*	F1	MLP	FS	WMV	DP	LF quality impacts end models
Census*	F1	MLP	MeTaL	MeTaL	MeTaL	Correlations matter
CoNLL-03	Avg F1	–	LSTM-CNN (Gold)	BERT	ConNet	Sequence tagging baselines

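No single WS method consistently outperforms the others across all datasets, which highlights that performance depends on the task and the LFs.
Fine-tuning large pre-trained language models usually yields strong end-model performance on text data, often surpassing approaches that use the label model alone.
Soft labels tend to improve end-model performance over hard labels, especially as the end model becomes deeper.
LF quality, coverage, and dependency significantly affect the effectiveness of WS; noisy or sparse LFs widen the gap between weak supervision and gold labels.
The procedural LF generators reveal that LF correlation and data dependency have a substantial effect on the relative strengths of label models.
Sequence tagging results show that dependency-aware models (e.g., HMM/CHMM) generally outperform MV, although on some datasets simpler methods do better depending on LF coverage.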
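
The LF properties discussed in these findings can be illustrated with a small simulation. The sketch below generates one synthetic LF with a chosen accuracy and propensity (taken here to mean the probability that the LF fires) and then measures its empirical coverage and accuracy. It is an illustrative assumption, not the generator shipped with WRENCH: the function names and the `ABSTAIN = -1` convention are hypothetical, and LF correlation and data dependency are deliberately not modeled.

```python
import random

ABSTAIN = -1  # assumed convention for "LF did not fire"


def synthetic_lf(y_true, n_classes, accuracy, propensity, rng):
    """Simulate one LF: fire with probability `propensity`; when it fires,
    return the true label with probability `accuracy`, else a wrong label."""
    out = []
    for y in y_true:
        if rng.random() >= propensity:
            out.append(ABSTAIN)
        elif rng.random() < accuracy:
            out.append(y)
        else:
            wrong = [c for c in range(n_classes) if c != y]
            out.append(rng.choice(wrong))
    return out


def lf_stats(lf_out, y_true):
    """Empirical coverage and accuracy on the covered examples for one LF."""
    fired = [(v, y) for v, y in zip(lf_out, y_true) if v != ABSTAIN]
    coverage = len(fired) / len(y_true)
    acc = sum(v == y for v, y in fired) / len(fired) if fired else float("nan")
    return coverage, acc


rng = random.Random(0)  # fixed seed so the run is repeatable
y = [rng.randrange(2) for _ in range(20000)]  # hypothetical binary ground truth
votes = synthetic_lf(y, n_classes=2, accuracy=0.8, propensity=0.3, rng=rng)
coverage, acc = lf_stats(votes, y)
print(f"coverage={coverage:.3f} accuracy={acc:.3f}")
```

Sweeping `accuracy` and `propensity` over a grid and feeding the resulting votes to different label models is one simple way to probe how noisy or sparse LFs change the outcome.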


This review was generated by AI and reviewed by human editors.