QUICK REVIEW

[Paper Review] Silent Data Corruptions at Scale

Harish Dattatraya Dixit, Sneha Pendharkar|arXiv (Cornell University)|Feb 22, 2021
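Radiation Effects in Electronics | References: 21 | Cited by: 68
One-Sentence Summary

This paper analyzes silent data corruptions (SDCs) in datacenter CPUs, presents a real-world debugging case study, and discusses detection and mitigation strategies at scale.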

ABSTRACT

Silent Data Corruption (SDC) can have a negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that lead to SDCs. We discuss a real-world example of silent data corruption within a datacenter application. We provide the debug flow followed to root-cause and triage faulty instructions within a CPU using a case study, as an illustration of how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in hundreds of CPUs detected for these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.

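Motivation and Objectives

  • Identify the defect types introduced during silicon manufacturing that lead to SDCs.
  • Use a real-world case study to demonstrate how SDCs propagate to the application layer.
  • Describe the debugging workflow and tools used to root-cause SDCs across a fleet of systems.
  • Outline hardware and software strategies for reducing SDC risk in production.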

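Proposed Method

  • Classifies silicon defects into categories: device errors, early-life failures, degradation, and end-of-life wear-out.
  • Analyzes a real-world Spark-based application to show how SDC propagation leads to missing data and possible data loss.
  • Details a multi-language reproducer workflow, from Scala to Java bytecode to assembly, for root-cause analysis.
  • Proposes best-practice guidelines for assembling deterministic reproducers and for instruction-level debugging.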
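
A deterministic reproducer of the kind listed above can be sketched as a loop that replays fixed inputs and compares each result against a value recorded on known-good hardware. The following is a minimal illustration, not the paper's tooling: `GOLDEN`, `int_power`, and `run_reproducer` are hypothetical names, and the integer-power workload is only a stand-in for whatever computation is under suspicion.

```python
from typing import Callable, List, Tuple

Args = Tuple[int, int]

# Hypothetical golden table: (input, expected output) pairs that would be
# recorded on known-good hardware. Plain integer powers are used here.
GOLDEN: List[Tuple[Args, int]] = [
    ((2, 10), 1024),
    ((3, 5), 243),
    ((7, 3), 343),
]


def int_power(args: Args) -> int:
    """Stand-in workload: exact integer exponentiation."""
    base, exponent = args
    return base ** exponent


def run_reproducer(
    compute: Callable[[Args], int],
    golden: List[Tuple[Args, int]],
    iterations: int = 1000,
) -> List[Tuple[Args, int, int]]:
    """Replay every golden input `iterations` times.

    Returns a list of (input, expected, actual) for each mismatch, so the
    failing inputs can be narrowed down further.
    """
    mismatches = []
    for _ in range(iterations):
        for args, expected in golden:
            actual = compute(args)
            if actual != expected:
                mismatches.append((args, expected, actual))
    return mismatches


if __name__ == "__main__":
    bad = run_reproducer(int_power, GOLDEN)
    print("mismatches:", len(bad))
```

On a healthy machine the mismatch list is empty. In practice such a loop would also need to be pinned to the suspect core and would use the exact inputs that failed in production, both of which are outside this sketch.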

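Experimental Results

Research Questions

  • RQ1: Which categories of silicon and manufacturing defects cause silent data corruptions in datacenter CPUs?
  • RQ2: How do SDCs propagate from hardware through the software stack to cause application-level failures?
  • RQ3: Which debugging workflows and tools enable root-cause analysis of SDCs at scale?
  • RQ4: Which software and hardware detection and fault-tolerance strategies can mitigate SDCs across a large fleet?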

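Key Findings

  • SDCs in datacenter CPUs occur at a higher rate than traditional soft-error FIT models predict, and they are reproducible at scale.
  • A real-world case shows that SDCs can cause missing or corrupted data in decompression and data-processing workflows.
  • Debugging at scale requires cross-language reproducers and assembly-level tracing to identify the faulty instructions.
  • Mitigations include hardware protection, targeted testing, detection mechanisms, and fault-tolerant software design.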
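
One form that the software fault tolerance mentioned above could take is an end-to-end integrity check around compression. The sketch below is an assumption-laden illustration rather than the mitigation the paper deploys: `Packed`, `pack`, `unpack_verified`, and `CorruptionDetected` are hypothetical names, and Python's standard `zlib` module stands in for whatever codec an application actually uses.

```python
import zlib
from typing import NamedTuple


class Packed(NamedTuple):
    """Compressed payload plus metadata recorded before compression."""
    blob: bytes
    crc32: int
    length: int


class CorruptionDetected(Exception):
    """Raised when decompressed data does not match the recorded metadata."""


def pack(data: bytes) -> Packed:
    # Record the checksum and size of the ORIGINAL bytes, not of the blob.
    return Packed(zlib.compress(data), zlib.crc32(data), len(data))


def unpack_verified(packed: Packed) -> bytes:
    data = zlib.decompress(packed.blob)
    # A silently wrong size or content shows up as a metadata mismatch.
    if len(data) != packed.length or zlib.crc32(data) != packed.crc32:
        raise CorruptionDetected(
            f"expected {packed.length} bytes, got {len(data)}"
        )
    return data


if __name__ == "__main__":
    original = b"silent data corruption " * 100
    assert unpack_verified(pack(original)) == original
    print("round trip verified")
```

A check like this only helps if verification does not share the fault: if the checksum is computed and verified on the same defective core, a corruption can still slip through, so a real design would more plausibly verify on a different core or machine.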


This review was generated by AI and reviewed by human editors.