QUICK REVIEW

[Paper Review] Geometric Dataset Distances via Optimal Transport

David Alvarez-Melis, Nicolò Fusi | arXiv (Cornell University) | Feb 7, 2020

Domain Adaptation and Few-Shot Learning | 42 references | 65 citations

TL;DR

Introduces a model-agnostic, training-free distance between datasets using optimal transport by modelling labels as distributions over features; shows correlation with transfer learning difficulty across tasks and modalities.

ABSTRACT

The notion of task similarity is at the core of various machine learning paradigms, such as domain adaptation and meta-learning. Current methods to quantify it are often heuristic, make strong assumptions on the label sets across the tasks, and many are architecture-dependent, relying on task-specific optimal parameters (e.g., require training a model on each dataset). In this work we propose an alternative notion of distance between datasets that (i) is model-agnostic, (ii) does not involve training, (iii) can compare datasets even if their label sets are completely disjoint and (iv) has solid theoretical footing. This distance relies on optimal transport, which provides it with rich geometry awareness, interpretable correspondences and well-understood properties. Our results show that this novel distance provides meaningful comparison of datasets, and correlates well with transfer learning hardness across various experimental settings and datasets.

Motivation & Objective

Motivate and formalize a distance between datasets that is independent of a specific predictor or training on each dataset.
Propose a practical OT-based framework that compares joint distributions of features and labels even when label sets are disjoint.
Offer scalable algorithmic techniques to compute the distance on large datasets.
Empirically validate that the proposed distance correlates with transfer learning performance across domains and modalities.

Proposed method

Define a joint feature-label space and lift the distance to distributions over this space via optimal transport.
Model each label as a distribution over features and represent these via Gaussian approximations (mean and covariance) to enable analytic Wasserstein computations.
Compute a ground metric combining feature distance with label-distribution distance (Wasserstein between Gaussians).
Use entropy-regularized OT (Sinkhorn) to compute the dataset distance at scale, both for the general formulation and for its Gaussian-approximation variant.
Precompute label-to-label distances to speed up the global OT problem and employ online batch statistics for scalability.
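
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`psd_sqrt`, `gaussian_w2_sq`, `label_stats`, `ground_cost`, `sinkhorn_cost`, `otdd_sketch`), the squared-Euclidean feature cost, the uniform sample weights, and the `rel_reg` regularization heuristic are all hypothetical choices made for the example.

```python
import numpy as np


def psd_sqrt(S):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def gaussian_w2_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    r = psd_sqrt(S1)
    cross = psd_sqrt(r @ S2 @ r)
    bures = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sum((m1 - m2) ** 2) + max(bures, 0.0))


def label_stats(X, y):
    """Mean and covariance of the features of each label (Gaussian summary)."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]  # assumes at least two samples per label
        stats[c] = (Xc.mean(axis=0), np.atleast_2d(np.cov(Xc, rowvar=False)))
    return stats


def ground_cost(Xa, ya, Xb, yb):
    """Pairwise cost: squared feature distance plus label-to-label W2^2."""
    sa, sb = label_stats(Xa, ya), label_stats(Xb, yb)
    la, ia = np.unique(ya, return_inverse=True)
    lb, ib = np.unique(yb, return_inverse=True)
    # Label-to-label table, computed once and reused for every sample pair.
    W = np.array([[gaussian_w2_sq(*sa[p], *sb[q]) for q in lb] for p in la])
    feat = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(axis=-1)
    return feat + W[ia][:, ib]


def sinkhorn_cost(C, reg, n_iter=200):
    """Transport cost of the entropy-regularized plan with uniform marginals."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / reg)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return float((P * C).sum())


def otdd_sketch(Xa, ya, Xb, yb, rel_reg=0.05):
    """Illustrative dataset distance; rel_reg scales the regularizer by max cost."""
    C = ground_cost(Xa, ya, Xb, yb)
    return sinkhorn_cost(C, rel_reg * C.max())
```

Calling `otdd_sketch(Xa, ya, Xb, yb)` on two labeled feature arrays returns a non-negative number that grows as the features or the per-label Gaussians move apart. Because the label sets are compared only through their feature statistics, `ya` and `yb` do not need to share any label values. The plain (non-log-domain) Sinkhorn loop is kept for brevity and may underflow for very small regularization.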

Experimental results

Research questions

RQ1: Can a principled, training-free distance between datasets be defined that handles disjoint label sets and leverages dataset geometry?
RQ2: Does an OT-based dataset distance predict transfer learning performance across diverse tasks and data modalities?
RQ3: Is it feasible to scale OT-based dataset distances to large real-world datasets with practical computation times?
RQ4: How well do Gaussian-approximated label distributions approximate the true label-conditional feature distributions for distance computation?

Key findings

The proposed OT-based dataset distance (OTDD) defines a valid metric between datasets in the space of feature-label distributions.
Representing label-conditional feature distributions as Gaussians yields closed-form Wasserstein distances between labels and a scalable variant of the distance (d_OTN), which is exact when the label-conditional distributions are Gaussian or elliptical.
Empirical results show strong correlation between OTDD and transferability across MNIST variants, USPS, EMNIST, Fashion-MNIST, Tiny-ImageNet, CIFAR-10, and NLP datasets.
OTDD can guide data augmentation choices by predicting which transformations improve transferability.
In text classification with BERT embeddings, OTDD also correlates with transferability, illustrating its applicability to NLP.
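
For reference, the closed form behind the Gaussian finding above is the standard expression for the 2-Wasserstein distance between two Gaussians. The notation is an assumption made for this note and is not taken from the review: \mu_i and \Sigma_i stand for the mean and covariance of the features under label i.

```latex
W_2^2\big(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\big)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2
      - 2\big(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\Big)
```

When the two covariances are equal, the trace term vanishes and the distance reduces to the squared Euclidean distance between the means.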


This review was created by AI and reviewed by human editors.