QUICK REVIEW

[Paper Digest] TransTab: Learning Transferable Tabular Transformers Across Tables

Zifeng Wang, Jimeng Sun|arXiv (Cornell University)|May 19, 2022

Machine Learning in Healthcare|Cited by 37

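One-Sentence Summary

TransTab introduces a transferable tabular Transformer that handles tables with variable columns by encoding columns and cells as tokens, enabling supervised, incremental, transfer, and zero-shot learning across tables, combined with vertical-partition contrastive pretraining.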

ABSTRACT

Tabular data (or tables) are the most widely used data format in machine learning (ML). However, ML models often assume the table structure keeps fixed in training and testing. Before ML modeling, heavy data cleaning is required to merge disparate tables with different columns. This preprocessing often incurs significant data waste (e.g., removing unmatched columns and samples). How to learn ML models from multiple tables with partially overlapping columns? How to incrementally update ML models as more columns become available over time? Can we leverage model pretraining on multiple distinct tables? How to train an ML model which can predict on an unseen table? To answer all those questions, we propose to relax fixed table structures by introducing a Transferable Tabular Transformer (TransTab) for tables. The goal of TransTab is to convert each sample (a row in the table) to a generalizable embedding vector, and then apply stacked transformers for feature encoding. One methodology insight is combining column description and table cells as the raw input to a gated transformer model. The other insight is to introduce supervised and self-supervised pretraining to improve model performance. We compare TransTab with multiple baseline methods on diverse benchmark datasets and five oncology clinical trial datasets. Overall, TransTab ranks 1.00, 1.00, 1.78 out of 12 methods in supervised learning, feature incremental learning, and transfer learning scenarios, respectively; and the proposed pretraining leads to 2.3% AUC lift on average over the supervised learning.

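Motivation and Objectives

Address the data waste and inefficiency incurred when merging tables with partially overlapping columns.
Develop a model for variable-column tables that have no fixed structure.
Enable cross-table transfer learning, incremental feature updates, and zero-shot inference.
Propose a pretraining paradigm that uses multiple tables to improve tabular prediction.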
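
To make the data-waste point concrete, the following minimal sketch counts how many cells survive when several tables are reduced to the columns they all share, which is what a fixed-schema model requires. The table layouts, column names, row counts, and the helper names `shared_columns` and `cell_retention` are invented for illustration and do not come from the paper.

```python
def shared_columns(tables):
    """Return the columns present in every table.

    A model that assumes one fixed schema can only use these columns
    once the tables are merged.
    """
    column_sets = [set(t["columns"]) for t in tables]
    return set.intersection(*column_sets)


def cell_retention(tables):
    """Fraction of all cells that remain if only shared columns are kept."""
    shared = shared_columns(tables)
    total = sum(len(t["columns"]) * t["n_rows"] for t in tables)
    kept = sum(len(shared) * t["n_rows"] for t in tables)
    return kept / total


# Hypothetical toy tables with partially overlapping columns.
tables = [
    {"columns": ["age", "gender", "weight"], "n_rows": 100},
    {"columns": ["age", "gender", "height", "smoker"], "n_rows": 50},
]

print(sorted(shared_columns(tables)))
print(cell_retention(tables))
```

On this toy input only `age` and `gender` are shared, so 300 of the 500 cells (60%) would be kept and the rest discarded before training.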

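Proposed Method

Featurize table inputs by converting each cell into token-level embeddings that incorporate the column description.
Use vertical partitions of the columns to enable scalable self-supervised contrastive learning (VPCL).
Encode features robustly with gated Transformer layers that apply a token-level gating mechanism.
Train with a supervised loss or with contrastive learning (self-supervised or supervised VPCL) to learn transferable representations.
Support four scenarios: cross-table transfer learning, incremental learning, pretraining plus fine-tuning, and zero-shot inference across tables.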
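
The first two steps of the method can be sketched in plain Python: rendering cells together with their column descriptions, and splitting the columns vertically so that views of the same row can act as positive pairs for contrastive learning. This is a simplified illustration under assumptions, not the TransTab implementation. The `column: value` text format, the contiguous near-equal partitions, the helper names, and the sample rows are all hypothetical, and the embedding, gating, and loss computation are omitted.

```python
from itertools import combinations


def vertical_partitions(columns, k):
    """Split a list of column names into k contiguous, non-empty groups."""
    if not 1 <= k <= len(columns):
        raise ValueError("k must be between 1 and the number of columns")
    base, extra = divmod(len(columns), k)
    groups, start = [], 0
    for i in range(k):
        # The first `extra` groups take one additional column.
        size = base + (1 if i < extra else 0)
        groups.append(columns[start:start + size])
        start += size
    return groups


def row_to_text(row, columns):
    """Render the selected cells as 'column: value' strings.

    Pairing each value with its column description gives rows from
    tables with different schemas one shared input format. Columns
    that are missing from the row are skipped.
    """
    return [
        f"{col}: {row[col]}"
        for col in columns
        if col in row and row[col] is not None
    ]


def positive_pairs(rows, columns, k):
    """Yield pairs of views taken from the SAME row.

    In a self-supervised contrastive setup these pairs would be the
    positives; views drawn from different rows would be negatives.
    """
    groups = vertical_partitions(columns, k)
    for row in rows:
        views = [row_to_text(row, group) for group in groups]
        for view_a, view_b in combinations(views, 2):
            yield view_a, view_b


# Made-up sample rows, for illustration only.
columns = ["age", "gender", "tumor stage", "prior chemo"]
rows = [
    {"age": 63, "gender": "female", "tumor stage": "II", "prior chemo": "yes"},
    {"age": 51, "gender": "male", "tumor stage": "III", "prior chemo": "no"},
]

pairs = list(positive_pairs(rows, columns, k=2))
for view_a, view_b in pairs:
    print(view_a, "<->", view_b)
```

With `k=2` each row yields one positive pair; a larger `k` yields more pairs per row. In this sketch a supervised variant would differ only in the pairing rule, treating views from rows that share a label as positives.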

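Experimental Results

Research Questions

RQ1: Can TransTab learn from multiple tables with partially overlapping columns?
RQ2: Does TransTab support adding columns without retraining from scratch?
RQ3: How does VPCL pretraining on vertical column partitions improve transfer and zero-shot performance?
RQ4: Does TransTab outperform the baselines in the supervised, transfer, and zero-shot/tabular pretraining settings?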

Key Findings

Method	N00041119	N00174655	N00312208	N00079274	N00694382	Rank (Std)
TransTab	0.6408	0.9428	0.7770	0.7281	0.7648	1.00(0.00)

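TransTab achieves the highest performance on the clinical trial mortality datasets in both the supervised learning and cross-table transfer settings (ranked at the top in Tables 2 to 4).
In the feature incremental setting, TransTab substantially outperforms the baselines by using all available features.
The cross-table transfer experiments show that TransTab benefits from pretraining on one table and fine-tuning on another, outperforming the baselines.
The zero-shot experiments show that TransTab matches or exceeds the supervised baselines and, in most cases, surpasses the transfer baselines without any additional fine-tuning.
Vertical-partition contrastive learning (VPCL) improves fine-tuning performance relative to plain supervised pretraining and standard self-supervised methods.
Pretraining on unrelated tabular data yields limited fine-tuning gains, whereas VPCL provides consistent gains on the datasets studied.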


This digest was generated by AI and reviewed by a human editor.