QUICK REVIEW

[Paper Review] Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery

Yu Cai, Saugata Ghose|arXiv (Cornell University)|Nov 28, 2017
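Advanced Data Storage Technologies | References: 114 | Cited by: 53
One-Sentence Summary

This chapter surveys the reliability challenges of NAND-flash-based SSDs, presents experimental characterization data from MLC and TLC devices, and reviews mitigation and data recovery techniques for extending SSD lifetime.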

ABSTRACT

NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: (1) effective process technology scaling; and (2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to (1) fewer electrons in the flash memory cell floating gate to represent the data; and (2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells. In this chapter, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including (1) cell-to-cell interference mitigation; (2) optimal multi-level cell sensing; (3) error correction using state-of-the-art algorithms and methods; and (4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve into the future.

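Motivation and Goals

  • Motivate the need for reliability improvements in NAND flash storage, given that process scaling and multi-level cells increase error rates.
  • Characterize the root causes of flash errors using experimental data from state-of-the-art MLC and TLC devices.
  • Describe and quantify mitigation techniques, including interference mitigation, optimal sensing, ECC, and data recovery procedures.
  • Explain how controller-level policies (garbage collection, wear leveling, bad block management) extend SSD lifetime.
  • Outline future directions for the reliability of flash memory and related storage.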

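Proposed Approach

  • Review the architecture and organization of contemporary SSDs, relating reliability mechanisms to system components.
  • Present experimental characterization data from real NAND flash devices to motivate mitigation techniques.
  • Describe a set of mitigation techniques: cell-to-cell interference mitigation, optimal multi-level cell sensing, ECC schemes, and data recovery methods.
  • Explain the bus/host interface and controller responsibilities (FTL, garbage collection, wear leveling) in the context of reliability.
  • Detail data path protection and metadata protection strategies within the SSD controller.
  • Discuss bad block management and superpage-level parity as reliability strategies.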
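
As a rough illustration of the superpage-level parity idea listed above, the following minimal sketch assumes a simple RAID-5-style XOR parity page computed across the data pages of one superpage. The summary does not specify the exact scheme used in the chapter, so the function names (`xor_pages`, `build_parity`, `recover_page`) and the page sizes are hypothetical.

```python
def xor_pages(pages):
    """Return the bytewise XOR of a non-empty list of equally sized pages."""
    if not pages:
        raise ValueError("at least one page is required")
    size = len(pages[0])
    if any(len(p) != size for p in pages):
        raise ValueError("all pages must have the same size")
    out = bytearray(size)
    for page in pages:
        for i, b in enumerate(page):
            out[i] ^= b
    return bytes(out)


def build_parity(data_pages):
    """Compute one parity page over the data pages of a superpage (assumed XOR scheme)."""
    return xor_pages(data_pages)


def recover_page(surviving_pages, parity):
    """Rebuild a single lost page from the surviving data pages and the parity page."""
    return xor_pages(list(surviving_pages) + [parity])


if __name__ == "__main__":
    pages = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # toy 2-byte "pages"
    parity = build_parity(pages)
    # Pretend pages[1] became uncorrectable and rebuild it from the rest.
    rebuilt = recover_page([pages[0], pages[2]], parity)
    print(rebuilt == pages[1])
```

A single XOR parity page of this kind can rebuild only one lost page per superpage; tolerating more simultaneous failures would need a stronger code.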

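Experimental Results

Research Questions

  • RQ1: What are the root sources of errors in the NAND flash memory used in SSDs?
  • RQ2: How do errors manifest in real-world MLC and TLC NAND devices, and what quantitative data supports this?
  • RQ3: Which mitigation techniques can effectively reduce or tolerate flash errors in SSDs?
  • RQ4: How is data recovery performed when error correction fails?
  • RQ5: Which future directions and technologies may affect SSD reliability and lifetime?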

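Key Findings

  • As process technology scales and cells store more bits (MLC/TLC), NAND flash reliability degrades and raw error rates rise.
  • Controller-level techniques such as wear leveling, garbage collection optimization, and bad block management are central to extending SSD lifetime.
  • ECC (BCH/LDPC) and CRC checks are essential for correcting and verifying data after reads at high raw error rates.
  • Data scrambling reduces data-dependent error patterns, and encryption (SED) provides additional data security without compromising reliability.
  • Data path and metadata protection inside the controller reduce SRAM/DRAM errors and keep host data and mappings consistent.
  • Superpage-level parity and other RAID-like strategies provide additional robustness against block-level errors.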
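
To make the data scrambling finding above more concrete, here is a minimal sketch that assumes scrambling is done by XORing the data with a pseudo-random keystream from a linear-feedback shift register (LFSR). The 16-bit polynomial, the seed value, and the function names are illustrative assumptions and are not taken from the chapter; real controllers may derive the seed differently, for example from the page address.

```python
def lfsr_keystream(seed, length):
    """Generate `length` keystream bytes from a 16-bit Fibonacci LFSR.

    Taps correspond to x^16 + x^14 + x^13 + x^11 + 1, a common textbook
    example polynomial (an assumption, not the chapter's choice).
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero, or the LFSR stays at zero")
    out = bytearray()
    for _ in range(length):
        byte = 0
        for _ in range(8):
            bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
            byte = (byte << 1) | (state & 1)
        out.append(byte)
    return bytes(out)


def scramble(data, seed):
    """XOR data with the keystream; applying it twice restores the input."""
    keystream = lfsr_keystream(seed, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))


if __name__ == "__main__":
    page = bytes(64)                 # worst case: a page of all zeros
    stored = scramble(page, 0xACE1)  # what would be written to flash
    print(stored != page)            # the stored pattern is no longer uniform
    print(scramble(stored, 0xACE1) == page)
```

Because XOR is its own inverse, the same routine serves for both scrambling on write and descrambling on read, as long as the controller can reproduce the seed.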


This review was generated by AI and checked by a human editor.