QUICK REVIEW

[Paper Review] Entity Resolution and Federated Learning get a Federated Resolution

Richard Nock, Stephen Hardy | arXiv (Cornell University) | Mar 11, 2018

Data Quality and Management | 18 references | 75 citations

TL;DR

The paper formally analyzes how entity-resolution (ER) errors affect learning in vertically partitioned federated learning, derives bounds showing that large-margin classifiers are robust to such errors, and shows experimentally that class-aware ER, which targets cross-class matching mistakes, improves downstream learning even when ER relies on noisy data.

ABSTRACT

Consider two data providers, each maintaining records of different feature sets about common entities. They aim to learn a linear model over the whole set of features. This problem of federated learning over vertically partitioned data includes a crucial upstream issue: entity resolution, i.e. finding the correspondence between the rows of the datasets. It is well known that entity resolution, just like learning, is mistake-prone in the real world. Despite the importance of the problem, there has been no formal assessment of how errors in entity resolution impact learning. In this paper, we provide a thorough answer to this question, answering how optimal classifiers, empirical losses, margins and generalisation abilities are affected. While our answer spans a wide set of losses (going beyond proper, convex, or classification calibrated), it brings simple practical arguments to upgrade entity resolution as a preprocessing step to learning. One of these suggests that entity resolution should be aimed at controlling or minimizing the number of matching errors between examples of distinct classes. In our experiments, we modify a simple token-based entity resolution algorithm so that it indeed aims at avoiding matching rows belonging to different classes, and perform experiments in the setting where entity resolution relies on noisy data, which is very relevant to real world domains. Notably, our approach covers the case where one peer does not have classes, or a noisy record of classes. Experiments display that using the class information during entity resolution can buy significant uplift for learning at little expense from the complexity standpoint.

Motivation & Objective

Motivation: federated learning with vertically partitioned data where different parties hold different feature sets for common entities.
Objective: quantify how entity-resolution mistakes affect optimal classifiers, empirical losses, margins, and generalization.
Goal: provide actionable guidance to upgrade entity resolution as a preprocessing step to learning, especially focusing on cross-class matching errors.
Scope: develop theoretical bounds for ridge-regularized and Taylor losses and relate them to practical ER strategies.

Proposed method

Model the data as vertically partitioned peers with a shared set of entities and a permutation-based representation of entity-resolution mistakes.
Use ridge-regularized losses and Taylor losses to capture a broad class of learning objectives.
Derive bounds on the deviation between the ideal classifier and the one learned from error-prone data under an $(\varepsilon,\tau)$-accurate permutation.
Introduce key parameters (delta_theta, delta_P, delta_S) to summarize the effect of ER errors and dataset properties.
Prove immunity of large-margin classifiers to ER errors under certain conditions, linking margin to error tolerance.
Provide experimental validation by modifying a token-based ER algorithm to use class information and testing it on fifteen UCI domains.
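
The permutation view of ER mistakes can be made concrete with a small simulation. The sketch below is illustrative only and is not the authors' code: it assumes synthetic Gaussian features, a ridge-regularized square loss as a stand-in for the losses the paper covers, and a hypothetical helper `swap_permutation` that composes random row swaps on the second peer's data. It compares the classifier drift caused by within-class swaps with the drift caused by cross-class swaps.

```python
import numpy as np

def ridge_fit(X, y, gamma=1.0):
    """Closed-form minimizer of the ridge-regularized square loss."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

def swap_permutation(y, n_swaps, rng, cross_class):
    """Compose n_swaps random transpositions of row indices (hypothetical helper).

    cross_class=False: every swap exchanges two rows with the same label.
    cross_class=True:  every swap exchanges a positive row and a negative row.
    """
    perm = np.arange(len(y))
    pos = np.flatnonzero(y > 0)
    neg = np.flatnonzero(y < 0)
    for _ in range(n_swaps):
        if cross_class:
            i, j = rng.choice(pos), rng.choice(neg)
        else:
            group = pos if rng.random() < 0.5 else neg
            i, j = rng.choice(group, size=2, replace=False)
        perm[[i, j]] = perm[[j, i]]
    return perm

# Assumed synthetic setup: peer A holds X_a and the labels, peer B holds X_b.
rng = np.random.default_rng(0)
n, d_a, d_b = 2000, 5, 5
X_a = rng.standard_normal((n, d_a))
X_b = rng.standard_normal((n, d_b))
y = np.sign(np.hstack([X_a, X_b]) @ np.ones(d_a + d_b))

# Classifier learned from perfectly resolved data.
theta_ideal = ridge_fit(np.hstack([X_a, X_b]), y)

perms, drift = {}, {}
for name, cross in [("within_class", False), ("cross_class", True)]:
    perms[name] = swap_permutation(y, n_swaps=300, rng=rng, cross_class=cross)
    # ER mistakes: peer B's rows are joined to peer A's rows in the wrong order.
    X_noisy = np.hstack([X_a, X_b[perms[name]]])
    drift[name] = float(np.linalg.norm(ridge_fit(X_noisy, y) - theta_ideal))
    print(name, "drift:", round(drift[name], 3))
```

In this toy setup, within-class swaps leave the label-weighted feature sum `X.T @ y` unchanged and only perturb the cross-peer covariance, which is one way to see why cross-class mistakes are the ones worth controlling.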

Experimental results

Research questions

RQ1: How do errors in entity resolution impact the optimal classifier, loss, and generalization in vertical federated learning?
RQ2: Can large-margin classification provide immunity to entity-resolution mistakes, and under what conditions?
RQ3: What ER design choices (especially cross-class matching errors) most influence downstream learning performance?
RQ4: Does incorporating class information into entity resolution yield significant learning gains when data is noisy or partially labeled?
RQ5: How do theoretical bounds translate to practical ER algorithms and real-world datasets?

Key findings

Theoretical bounds show that the drift between ideal and learned classifiers due to ER mistakes scales with the number of permutation steps and ER error magnitude, but diminishes with larger sample sizes when certain boundedness conditions hold.
Immunity to ER mistakes at a given margin is achievable for large-margin classifiers, with immunity improving as sample size grows and cross-class errors are controlled.
Learning with Taylor losses and ridge regularization can align with convex Taylor losses near optima, facilitating analysis and practical optimization.
Experiments indicate that incorporating class information into token-based ER yields substantial improvements over class-agnostic ER, sometimes matching results from ideally entity-resolved data.
Key ER-design insight: minimizing cross-class matching errors (rho=0) yields the strongest bounds and learning robustness.
The analysis highlights a small set of control parameters (delta_theta, delta_P, delta_S) that largely drive ER impact on learning.
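
The class-aware ER idea in the findings above can be sketched as follows. This is a minimal illustration and not the algorithm used in the paper: the review does not describe the token-based matcher in detail, so the sketch assumes whitespace tokenization, Jaccard similarity, and greedy one-to-one matching. It also assumes both peers hold class labels, whereas the paper additionally covers a peer with missing or noisy classes. All function names and records are hypothetical.

```python
def tokens(record):
    """Lowercased whitespace tokens of a record string."""
    return set(record.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_match(left, right, left_classes=None, right_classes=None):
    """Greedy one-to-one matching by descending Jaccard similarity.

    When both class lists are given, candidate pairs whose classes differ
    are discarded, so the matcher cannot make a cross-class mistake.
    """
    class_aware = left_classes is not None and right_classes is not None
    left_tok = [tokens(r) for r in left]
    right_tok = [tokens(r) for r in right]
    candidates = []
    for i, a in enumerate(left_tok):
        for j, b in enumerate(right_tok):
            if class_aware and left_classes[i] != right_classes[j]:
                continue
            candidates.append((jaccard(a, b), i, j))
    # Highest similarity first; ties broken by index for determinism.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
    match, used = {}, set()
    for _, i, j in candidates:
        if i not in match and j not in used:
            match[i] = j
            used.add(j)
    return match

# Toy records; the true correspondence is left[i] <-> right[i].
left = ["john smith sydney", "jon smith sydney"]
right = ["jon smith sydney", "jon smith sidney"]  # noisy copies
left_classes = [+1, -1]
right_classes = [+1, -1]

agnostic = greedy_match(left, right)
aware = greedy_match(left, right, left_classes, right_classes)
print("class-agnostic:", agnostic)
print("class-aware:   ", aware)
```

On these toy records the noisy copy of the first entity is token-identical to the second entity, so the class-agnostic matcher pairs rows across classes, while the class filter recovers the true correspondence. The filter only removes candidate pairs, which is consistent with the abstract's remark that using class information adds little complexity.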

This review was created by AI and reviewed by human editors.