QUICK REVIEW

[Paper Review] Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption

Stephen Hardy, Wilko Henecka|arXiv (Cornell University)|Nov 29, 2017

Privacy-Preserving Technologies in Data|References: 29|Citations: 429

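ONE-LINE SUMMARY

This paper proposes a privacy-preserving three-party solution for vertically partitioned data. Two data providers jointly train a logistic regression model using privacy-preserving entity resolution and additively homomorphic encryption, and the paper analyses how entity resolution errors affect learning.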

ABSTRACT

Consider two data providers, each maintaining private records of different feature sets about common entities. They aim to learn a linear model jointly in a federated setting, namely, data is local and a shared model is trained from locally computed updates. In contrast with most work on distributed learning, in this scenario (i) data is split vertically, i.e. by features, (ii) only one data provider knows the target variable and (iii) entities are not linked across the data providers. Hence, to the challenge of private learning, we add the potentially negative consequences of mistakes in entity resolution. Our contribution is twofold. First, we describe a three-party end-to-end solution in two phases ---privacy-preserving entity resolution and federated logistic regression over messages encrypted with an additively homomorphic scheme---, secure against a honest-but-curious adversary. The system allows learning without either exposing data in the clear or sharing which entities the data providers have in common. Our implementation is as accurate as a naive non-private solution that brings all data in one place, and scales to problems with millions of entities with hundreds of features. Second, we provide what is to our knowledge the first formal analysis of the impact of entity resolution's mistakes on learning, with results on how optimal classifiers, empirical losses, margins and generalisation abilities are affected. Our results bring a clear and strong support for federated learning: under reasonable assumptions on the number and magnitude of entity resolution's mistakes, it can be extremely beneficial to carry out federated learning in the setting where each peer's data provides a significant uplift to the other.

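MOTIVATION AND OBJECTIVES

Enable learning from vertically partitioned data held by two providers without disclosing raw data or the mapping of common entities.
Develop a privacy-preserving entity resolution protocol that aligns records across providers while keeping identifiers secret.
Realise secure federated logistic regression using additively homomorphic encryption and a third-party coordinator.
Provide a formal analysis of how entity resolution errors affect optimal classifiers, losses, margins, and generalisation.
Demonstrate scalability to datasets with millions of entities and hundreds of features while keeping accuracy close to that of a centralised non-private solution.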

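PROPOSED METHOD

An end-to-end three-party pipeline is proposed, in which a coordinator (C) runs privacy-preserving entity resolution and secure logistic regression.
Cryptographic long-term keys (CLKs), a Bloom-filter-based encoding, are used to link entities across parties privately, with candidate matches scored by Dice similarity.
The learning process is encrypted with an additively homomorphic scheme (e.g. Paillier), so that gradients and updates are computed without disclosing raw data.
A loss approximation based on a Taylor series (the Taylor loss) is adopted, which makes it possible to compute gradients under encryption and to evaluate a hold-out loss for early stopping.
An encrypted mask is built into the learning process so that entity resolution results can be used without being revealed.
Secure federated SGD (with a particular focus on SAG) is implemented over vertically partitioned features, ensuring that only encrypted exchanges are sent to the coordinator.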
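The linkage step above can be illustrated with a small sketch. It encodes a name into Bloom filter bit positions using keyed hashes over character bigrams and compares two encodings with the Dice coefficient, 2|A ∩ B| / (|A| + |B|). This is a simplified illustration, not the paper's implementation: the function names, the filter size (1024), the number of hashes per bigram (4), and the shared secret are all assumptions chosen for readability.

```python
import hashlib
import hmac


def bigrams(text):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(text, secret, size=1024, hashes=4):
    """Return the set of Bloom filter bit positions set for `text`.

    Each bigram is hashed `hashes` times with HMAC-SHA256 under a secret
    shared by the two data providers (hypothetical parameters).
    """
    bits = set()
    for gram in bigrams(text):
        for k in range(hashes):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"shared-linkage-key"  # assumption: agreed between providers only
similar = dice(encode("John Smith", secret), encode("Jon Smith", secret))
different = dice(encode("John Smith", secret), encode("Maria Garcia", secret))
```

Names that share most of their bigrams set mostly the same bits, so `similar` comes out much higher than `different`. The party that compares the encodings sees only bit positions, never the names themselves.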
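To see why a Taylor approximation helps with encrypted training, consider the logistic loss log(1 + exp(-z)) with margin z = y * theta.x and labels y in {-1, +1}. Its second-order expansion around z = 0 is log 2 - z/2 + z^2/8, whose gradient is a polynomial in the data. Additively homomorphic schemes can add ciphertexts and multiply a ciphertext by a plaintext value, which is the kind of arithmetic this gradient needs. The sketch below works on plaintext numbers only, and its function names are illustrative assumptions rather than anything taken from the paper's code.

```python
import math


def logistic_loss(z):
    """Exact logistic loss log(1 + exp(-z)) for margin z = y * theta.x."""
    return math.log1p(math.exp(-z))


def taylor_loss(z):
    """Second-order Taylor expansion around z = 0: log 2 - z/2 + z^2/8."""
    return math.log(2.0) - 0.5 * z + 0.125 * z * z


def taylor_gradient(theta, x, y):
    """Gradient of taylor_loss(y * theta.x) with respect to theta.

    With y in {-1, +1} we have y * y == 1, so the gradient reduces to
    (theta.x / 4 - y / 2) * x, with no exponentials involved.
    """
    score = sum(t * xi for t, xi in zip(theta, x))
    coeff = 0.25 * score - 0.5 * y
    return [coeff * xi for xi in x]


# Compare the approximation with the exact loss at a small margin.
gap = abs(taylor_loss(0.5) - logistic_loss(0.5))
grad = taylor_gradient([0.0, 0.0], [1.0, 2.0], 1)
```

The approximation is tight for small margins and drifts as |z| grows, so a sketch like this is mainly useful for checking how far the surrogate departs from the exact loss on a given feature scaling.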

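EXPERIMENTAL RESULTS

Research Questions

RQ1: How does privacy-preserving entity resolution affect the accuracy of the jointly trained model compared with a centralised non-private solution?
RQ2: When entity resolution errors occur, what formal bounds can be established on the deviation of the optimal classifier?
RQ3: Under what conditions does the learned classifier remain robust to entity resolution errors, particularly for large-margin examples?
RQ4: How do convergence and generalisation of secure federated logistic regression behave under entity resolution errors?
RQ5: How well does the proposed system scale to datasets with millions of entities and hundreds of features while preserving privacy?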

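Key Findings

The end-to-end system learns as accurately as a naive non-private solution that brings all data together, and it scales to very large problems.
The study provides the first formal analysis of how entity resolution errors affect learning, including bounds on the deviation of the classifier and the effects on empirical loss and generalisation.
Under reasonable assumptions, large-margin examples remain correctly classified despite entity resolution errors, which demonstrates robustness.
When entity resolution errors are few, the effect on generalisation is not significant, and the empirical loss converges at a rate that depends on three penalties: the optimal classifier, the resolution errors, and class statistics.
The results support federated learning, which substantially improves classification accuracy when the data partners' features are complementary, and they justify privacy-preserving collaboration.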

This review was generated by AI and checked by a human editor.