QUICK REVIEW

[Paper Digest] Is a Green Screen Really Necessary for Real-Time Portrait Matting?

Zhanghan Ke, Kaican Li | arXiv (Cornell University) | Nov 24, 2020

Image Enhancement Techniques | References: 46 | Cited by: 45

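One-Sentence Summary

This paper presents MODNet, a lightweight, real-time portrait matting network that produces high-quality alpha mattes from a single image, with no green screen or trimap. It jointly optimizes several sub-objectives under explicit constraints and adds a self-supervised adaptation strategy and a one-frame delay trick; the authors report inference at 63 FPS and better results than prior trimap-free methods on real-world images and videos.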

ABSTRACT

For portrait matting without the green screen, existing works either require auxiliary inputs that are costly to obtain or use multiple models that are computationally expensive. Consequently, they are unavailable in real-time applications. In contrast, we present a light-weight matting objective decomposition network (MODNet), which can process portrait matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. Moreover, since trimap-free methods usually suffer from the domain shift problem in practice, we introduce (1) a self-supervised strategy based on sub-objectives consistency to adapt MODNet to real-world data and (2) a one-frame delay trick to smooth the results when applying MODNet to portrait video sequence. MODNet is easy to be trained in an end-to-end style. It is much faster than contemporaneous matting methods and runs at 63 frames per second. On a carefully designed portrait matting benchmark newly proposed in this work, MODNet greatly outperforms prior trimap-free methods. More importantly, our method achieves remarkable results in daily photos and videos. Now, do you really need a green screen for real-time portrait matting?
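To make concrete what an alpha matte buys you in place of a green screen, here is a minimal sketch of standard alpha compositing: each output pixel is a blend of the portrait and a new background, weighted by the matte. The function name `composite` and the toy arrays are illustrative assumptions, not part of MODNet.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend foreground over background with a per-pixel alpha matte.

    foreground, background: float arrays of shape (H, W, 3) in [0, 1].
    alpha: float array of shape (H, W) in [0, 1]; 1 means pure foreground.
    """
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return a * foreground + (1.0 - a) * background

# Toy 1x2 image: the left pixel is fully foreground, the right is half-transparent.
fg = np.ones((1, 2, 3))    # white portrait pixels
bg = np.zeros((1, 2, 3))   # black replacement background
alpha = np.array([[1.0, 0.5]])
out = composite(fg, bg, alpha)
```

In practice `alpha` would come from the matting network, and soft values between 0 and 1 are what preserve hair strands and other semi-transparent boundaries.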

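Motivation and Goals

Remove the dependence on a green screen or costly auxiliary inputs in real-time portrait matting.
Address the domain shift that trimap-free methods commonly suffer when applied to real-world images.
Develop a lightweight, single-model solution that supports real-time inference in video applications.
Improve the robustness and consistency of matting results on unconstrained, everyday photos and videos.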

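Proposed Method

Design a multi-task learning framework that explicitly and jointly optimizes correlated sub-objectives (coarse semantic estimation, fine detail prediction, and their fusion into the final matte) under explicit constraints.
Introduce a self-supervised consistency loss over the sub-objective predictions so that the model adapts to real-world data without ground-truth annotations.
Apply a one-frame delay trick during video inference to smooth temporal inconsistencies and improve visual quality.
Train the whole network end to end using only single input images and their corresponding alpha mattes.
Use a lightweight architecture for fast inference, with a reported speed of 63 frames per second.
Build a new portrait matting benchmark for evaluating performance on real-world, unconstrained data.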
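The one-frame delay trick can be sketched as follows. The sketch assumes the trick waits for the next frame, then treats a pixel as flicker when its two neighbouring frames agree with each other but the current frame disagrees with both, and replaces that pixel with the neighbours' average. The function name, the exact rule, and the `threshold` value are illustrative assumptions rather than details confirmed by this summary.

```python
import numpy as np

def one_frame_delay(prev_alpha, curr_alpha, next_alpha, threshold=0.1):
    """Smooth flicker in curr_alpha using its two neighbouring frames.

    All inputs are float arrays of the same shape with values in [0, 1].
    threshold is a hypothetical tolerance, not a value taken from the paper.
    """
    neighbours_agree = np.abs(prev_alpha - next_alpha) <= threshold
    differs_from_prev = np.abs(curr_alpha - prev_alpha) > threshold
    differs_from_next = np.abs(curr_alpha - next_alpha) > threshold
    flicker = neighbours_agree & differs_from_prev & differs_from_next

    smoothed = curr_alpha.copy()
    # Replace flickering pixels with the mean of the two neighbouring frames.
    smoothed[flicker] = 0.5 * (prev_alpha[flicker] + next_alpha[flicker])
    return smoothed

# Toy example: the first pixel drops to 0.2 for a single frame, the second is stable.
prev_a = np.array([[1.0, 0.0]])
curr_a = np.array([[0.2, 0.0]])
next_a = np.array([[1.0, 0.0]])
result = one_frame_delay(prev_a, curr_a, next_a)
```

Because the correction needs the following frame, output lags the input by one frame; the per-pixel comparison itself is cheap, which is consistent with the claim that smoothing does not cost real-time speed.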

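Experimental Results

Research Questions

RQ1: Can a single lightweight deep learning model perform real-time portrait matting without a green screen or trimap?
RQ2: In trimap-free portrait matting, how can the domain shift between training data and real-world data be mitigated?
RQ3: Which techniques improve the temporal consistency of portrait matting on video sequences without adding computational cost?
RQ4: Can end-to-end training of a multi-objective network outperform cascaded or multi-stage models?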

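Key Findings

MODNet runs at 63 frames per second, which the authors report as much faster than contemporaneous matting methods.
On the newly proposed portrait matting benchmark, MODNet outperforms prior trimap-free methods in both quantitative metrics and visual quality.
The self-supervised consistency strategy mitigates domain shift, letting the model generalize to real-world photos without additional annotation.
The one-frame delay trick improves the temporal smoothness of video matting while preserving real-time inference speed.
MODNet produces high-quality alpha mattes on everyday photos and videos, demonstrating practical usability without a green screen.
End-to-end training of the objective-decomposition network yields better results than multi-stage trimap-free pipelines, while avoiding the auxiliary inputs that other methods require.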
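The self-supervised consistency idea can be illustrated with a simplified sketch. It assumes the network emits three predictions for an unlabeled image (a coarse semantic map, a boundary detail map, and the final matte) and penalizes disagreement among them. The average-pooling downsampler, the L1 distances, and all names here are assumptions made for illustration; the loss used in the paper may be formulated differently.

```python
import numpy as np

def avg_pool(x, factor):
    """Downsample a 2-D array by averaging non-overlapping factor x factor blocks."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def consistency_loss(semantic, detail, alpha, boundary_mask, factor=2):
    """Penalize disagreement among three predictions for one unlabeled image.

    semantic: coarse foreground map of shape (H / factor, W / factor).
    detail, alpha, boundary_mask: arrays of shape (H, W).
    """
    # The downsampled matte should match the coarse semantic prediction.
    semantic_term = np.mean(np.abs(avg_pool(alpha, factor) - semantic))
    # Inside the boundary region, the matte should match the detail prediction.
    detail_term = np.mean(boundary_mask * np.abs(alpha - detail))
    return semantic_term + detail_term

# Toy 4x4 case: fully consistent predictions versus a detail map that disagrees.
alpha_pred = np.ones((4, 4))
semantic_pred = np.ones((2, 2))
mask = np.ones((4, 4))
loss_consistent = consistency_loss(semantic_pred, np.ones((4, 4)), alpha_pred, mask)
loss_inconsistent = consistency_loss(semantic_pred, np.zeros((4, 4)), alpha_pred, mask)
```

No ground-truth matte appears anywhere in the loss, which is what allows adaptation on unlabeled real-world images.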

This digest was generated by AI and reviewed by a human editor.