QUICK REVIEW

[Paper Review] Substra: a framework for privacy-preserving, traceable and collaborative Machine Learning

Mathieu Galtier, Camille Marini|arXiv (Cornell University)|Oct 25, 2019

Privacy-Preserving Technologies in Data | Cited by 35

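One-Sentence Summary

Substra provides a decentralized, privacy-preserving framework for collaborative machine learning in which data stays on local nodes, computations are orchestrated through a distributed ledger, and model assets are governed by explicit permissions.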

ABSTRACT

Machine learning is promising, but it often needs to process vast amounts of sensitive data which raises concerns about privacy. In this white-paper, we introduce Substra, a distributed framework for privacy-preserving, traceable and collaborative Machine Learning. Substra gathers data providers and algorithm designers into a network of nodes that can train models on demand but under advanced permission regimes. To guarantee data privacy, Substra implements distributed learning: the data never leave their nodes; only algorithms, predictive models and non-sensitive metadata are exchanged on the network. The computations are orchestrated by a Distributed Ledger Technology which guarantees traceability and authenticity of information without needing to trust a third party. Although originally developed for Healthcare applications, Substra is not data, algorithm or programming language specific. It supports many types of computation plans including parallel computation plan commonly used in Federated Learning. With appropriate guidelines, it can be deployed for numerous Machine Learning use-cases with data or algorithm providers where trust is limited.

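Motivation and Objectives

Advance privacy protection and collaboration in ML when data is sensitive or distributed.
Propose a framework that enables collaborative model training while data remains on its owner's node.
Provide an auditable platform for ML workflows that requires no trusted third party, built on permissioned assets and a ledger.
Show how computation plans enable flexible, scalable federated or consortium-style training for ML tasks.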

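Proposed Method

Define four asset types (Objectives, Datasets, Algorithms, Models), each with explicit metadata and interoperability conventions.
Apply permission regimes to the processing and downloading of assets, enforced by smart contracts on a private distributed ledger.
Orchestrate ML computations within a computation plan as train tuples and test tuples, enabling sequential or parallel training and averaging.
Use a decentralized network of nodes, backed by a Hyperledger Fabric ledger, to track operations and enforce permissions.
Support model composition and transfer learning through a trunk-head architecture and modular computation plans.
Offer three interfaces (web, CLI, Python SDK) for creating assets, managing permissions, and running computation plans.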
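
To make the interplay between assets, permissions and train tuples more concrete, here is a minimal sketch in plain Python. It is not the Substra SDK: the names Asset, TrainTuple, can_process and schedule, as well as the simplified "process" permission set, are hypothetical and chosen only for illustration. The sketch assumes that a train tuple runs on the node that owns its dataset (since data never leaves that node) and that the algorithm must be processable there. Test tuples and download permissions are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    key: str
    kind: str                          # "objective", "dataset", "algorithm" or "model"
    owner: str                         # node that registered the asset
    process: frozenset = frozenset()   # other nodes allowed to use it in a computation


@dataclass(frozen=True)
class TrainTuple:
    key: str
    algo: Asset
    dataset: Asset
    in_models: tuple = ()              # keys of train tuples whose output model is the input


def can_process(asset, node):
    """Hypothetical permission rule: the owner, or any node listed in `process`."""
    return node == asset.owner or node in asset.process


def schedule(train_tuples):
    """Order train tuples so each runs after its inputs, checking permissions first."""
    done, order = set(), []
    pending = list(train_tuples)
    while pending:
        ready = [t for t in pending if all(k in done for k in t.in_models)]
        if not ready:
            raise ValueError("unresolved or cyclic in_models dependency")
        for t in ready:
            worker = t.dataset.owner   # assumption: training runs where the data lives
            if not can_process(t.algo, worker):
                raise PermissionError(f"{worker} may not process {t.algo.key}")
            order.append((t.key, worker))
            done.add(t.key)
        pending = [t for t in pending if t.key not in done]
    return order


# A two-step sequential plan: train on node-b, then continue training on node-c.
algo = Asset("algo-1", "algorithm", owner="node-a",
             process=frozenset({"node-b", "node-c"}))
data_b = Asset("data-b", "dataset", owner="node-b")
data_c = Asset("data-c", "dataset", owner="node-c")
plan = [
    TrainTuple("tt2", algo, data_c, in_models=("tt1",)),
    TrainTuple("tt1", algo, data_b),
]
execution_order = schedule(plan)
```

Tuples with no mutual dependency become ready in the same pass, which is how a parallel plan would appear in this toy scheduler. In the framework described by the paper, the permission check is performed by smart contracts on the ledger rather than by local code.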

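Experimental Results

Research Questions

RQ1: How can ML models be trained on distributed, private data without transferring raw data between parties?
RQ2: In a collaborative setting, can a private distributed ledger provide traceability and authenticity for ML computations?
RQ3: Which asset and permission abstractions are sufficient to manage complex federated ML tasks across multiple organizations?
RQ4: How can computation plans support sequential, parallel, and hybrid federated training and evaluation while preserving privacy?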
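
The traceability property behind RQ2 can be illustrated with a toy hash-chained log. This is only a sketch of the tamper-evidence idea and not how Hyperledger Fabric or Substra is implemented: there is no consensus, no smart contract and no network here, and the function names append_entry and verify and the metadata fields are hypothetical. Each entry stores non-sensitive metadata together with the hash of the previous entry, so any later modification breaks the chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _digest(prev_hash, metadata):
    # Canonical JSON (sorted keys) so the same content always hashes identically.
    payload = json.dumps({"prev": prev_hash, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(ledger, metadata):
    """Append non-sensitive metadata, linking it to the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry = {"prev": prev_hash, "metadata": metadata,
             "hash": _digest(prev_hash, metadata)}
    ledger.append(entry)
    return entry


def verify(ledger):
    """Return True only if every entry matches its hash and links to its predecessor."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["prev"], entry["metadata"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append_entry(ledger, {"event": "train", "tuple": "tt1", "worker": "node-b"})
append_entry(ledger, {"event": "train", "tuple": "tt2", "worker": "node-c"})
intact = verify(ledger)
```

Only identifiers and event types are recorded in this sketch; raw data and model weights stay off the log, in line with the paper's statement that only non-sensitive metadata is exchanged for traceability.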

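Key Findings

Substra enables remote, privacy-preserving ML in which data never leaves its owner's node.
A private distributed ledger with smart contracts enforces asset permissions before computation and records the non-sensitive metadata needed for traceability.
Computation plans enable flexible federated training patterns, including sequential, parallel, and averaging steps.
The assets (Objectives, Datasets, Algorithms, Models) and their permissions support collaborative use cases: data/algorithm collaboration, data consortia, and training/evaluation collaboration.
A trunk combined with private heads supports model composition and transfer learning while protecting data privacy.
The framework is open source, designed to be agnostic to data, algorithm, and programming language, and offers multiple interaction interfaces.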
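
The parallel-then-average pattern mentioned in the findings can be sketched as follows. This is an illustrative toy, not the aggregation used by Substra: the model is a single weight w for y ~ w * x, the functions local_train, average and federated_round are hypothetical, and the average is unweighted (weighting by sample count is a common alternative). The point it shows is that each node computes on its own data and only the updated weight is shared.

```python
def local_train(w, data, lr=0.1, epochs=50):
    """Gradient descent on mean squared error for y ~ w * x, using one node's data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def average(weights):
    """Unweighted mean of the locally trained weights."""
    return sum(weights) / len(weights)


def federated_round(w, node_datasets):
    # Each node starts from the same weight and trains independently (parallel step).
    # Only the resulting weight leaves the node; the (x, y) pairs stay local.
    local_weights = [local_train(w, data) for data in node_datasets.values()]
    return average(local_weights)


# Two nodes holding private samples that both follow y = 3 * x.
node_datasets = {
    "node-b": [(1.0, 3.0), (2.0, 6.0)],
    "node-c": [(0.5, 1.5), (1.5, 4.5)],
}
w = 0.0
for _ in range(5):
    w = federated_round(w, node_datasets)
```

In this example w approaches 3 after a few rounds. A sequential plan would instead pass the weight from one node to the next without an averaging step.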

This review was generated by AI and reviewed by a human editor.