QUICK REVIEW

[Paper Review] Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives

Yu Cai, Saugata Ghose|arXiv (Cornell University)|Jun 27, 2017
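Advanced Data Storage Technologies | References: 34 | Cited by: 32
One-Sentence Summary

This paper presents a comprehensive analysis of error sources in NAND flash-based SSDs and describes advanced mitigation and recovery techniques that improve reliability and extend device lifetime. It characterizes cell-to-cell interference, retention errors, and read noise, and evaluates state-of-the-art solutions, including optimized multi-level cell sensing, advanced error-correcting codes, and data recovery strategies, which significantly improve the reliability of MLC and TLC flash devices.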

ABSTRACT

NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: (1) effective process technology scaling, and (2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to (1) fewer electrons in the flash memory cell (floating gate) to represent the data and (2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells. In this article, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including (1) cell-to-cell interference mitigation, (2) optimal multi-level cell sensing, (3) error correction using state-of-the-art algorithms and methods, and (4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve into the future.

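Motivation and Objectives

  • Identify and characterize the main sources of data errors in modern MLC and TLC NAND flash memory, including cell-to-cell interference and retention effects.
  • Assess how process technology scaling and multi-level cell technology degrade flash memory reliability.
  • Analyze and quantify the effectiveness of existing error mitigation and data recovery techniques on real flash devices.
  • Provide a systematic framework for improving SSD reliability through advanced error handling at the controller level.
  • Guide the design of future flash memory systems by identifying key reliability challenges and scalable solutions.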

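Proposed Methods

  • Perform extensive experimental characterization of state-of-the-art MLC and TLC NAND flash devices, measuring error rates under a variety of conditions.
  • Mitigate cell-to-cell interference through optimized programming and read algorithms that reduce the disturbance between adjacent cells.
  • Apply optimal multi-level cell sensing, which improves read accuracy by placing read reference voltages where adjacent threshold voltage distributions overlap least.
  • Apply advanced error-correcting code (ECC) algorithms, including LDPC and polar codes, to correct data with high error rates.
  • Design data recovery mechanisms that reconstruct data from redundant information and pattern analysis when ECC fails.
  • Quantify the reliability improvement by measuring the bit error rate (BER) and raw error rate (RER) under each technique.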
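
The idea behind optimal multi-level cell sensing can be illustrated with a small sketch. It is not the paper's algorithm: it assumes, purely for illustration, that two adjacent cell states have Gaussian threshold voltage distributions and are equally likely, and it grid-searches for the read reference voltage that minimizes the misread probability. The function names and all numeric values are hypothetical.

```python
import math


def prob_above(v, mean, sigma):
    """P(X > v) for X ~ N(mean, sigma^2)."""
    return 0.5 * math.erfc((v - mean) / (sigma * math.sqrt(2.0)))


def prob_below(v, mean, sigma):
    """P(X < v) for X ~ N(mean, sigma^2)."""
    return 0.5 * math.erfc((mean - v) / (sigma * math.sqrt(2.0)))


def raw_error_rate(v_ref, low, high):
    """Misread probability for two equally likely adjacent states.

    low and high are (mean, sigma) pairs; Gaussian shapes are an assumption.
    """
    low_read_as_high = prob_above(v_ref, *low)
    high_read_as_low = prob_below(v_ref, *high)
    return 0.5 * (low_read_as_high + high_read_as_low)


def best_read_reference(low, high, steps=2000):
    """Grid-search the reference voltage between the two state means."""
    mu_lo, mu_hi = low[0], high[0]
    candidates = [mu_lo + (mu_hi - mu_lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda v: raw_error_rate(v, low, high))


if __name__ == "__main__":
    low_state = (1.0, 0.20)    # illustrative volts, not measured data
    fresh_high = (2.0, 0.20)
    aged_high = (1.8, 0.25)    # assumed shift and widening after retention loss

    v_fresh = best_read_reference(low_state, fresh_high)
    v_aged = best_read_reference(low_state, aged_high)
    print("fresh reference:", v_fresh)
    print("aged reference:", v_aged)
    print("aged RER at fresh reference:", raw_error_rate(v_fresh, low_state, aged_high))
    print("aged RER at tuned reference:", raw_error_rate(v_aged, low_state, aged_high))
```

In this toy model, a reference voltage tuned for fresh cells is no longer the best choice once the upper distribution has shifted and widened, which is why a controller benefits from re-tuning its read references over the device lifetime.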

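Experimental Results

Research Questions

  • RQ1: What are the dominant error sources in modern MLC and TLC NAND flash memory, and how do they evolve with process technology scaling and multi-level cell technology?
  • RQ2: How do cell-to-cell interference and retention errors affect flash data integrity over time?
  • RQ3: How effective are different error mitigation techniques (such as read optimization and ECC) relative to one another at reducing the bit error rate?
  • RQ4: How can data be recovered when error correction fails, and what reliability gain do such recovery mechanisms provide?
  • RQ5: What are the key design trade-offs and scalability challenges of implementing advanced error handling in SSD controllers?

Key Findings

  • In high-density flash memory, cell-to-cell interference significantly increases the error rate, especially in TLC devices, where the error rate can be up to 10x higher than in MLC under similar conditions.
  • Optimal multi-level cell sensing improves threshold voltage resolution and reduces the bit error rate by up to 50% compared with conventional read methods.
  • Advanced LDPC-based ECC schemes lower the residual error rate to below 1e-15 and continue to operate reliably even after 1,000 program/erase cycles.
  • Data recovery techniques that use redundant information and pattern recognition can recover up to 95% of the data when ECC fails, significantly improving overall system robustness.
  • Combining interference mitigation, advanced sensing, and strong ECC reduces the overall raw error rate by more than 90% compared with baseline flash operation.
  • The study shows that the reliability degradation caused by process technology scaling can be effectively mitigated, allowing flash-based SSDs to reach a practical lifetime of more than 10 years under typical workloads.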
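
To make the relationship between raw bit error rate and post-ECC failure rate concrete, here is a minimal sketch. It uses a simplified bounded-distance model, which is an assumption and not taken from the paper: a codeword of n bits fails when more than t bits are in error, with independent bit errors at rate p. That model fits hard-decision codes such as BCH; LDPC decoders have no fixed t, so treat the numbers as illustrative only. All names and parameter values are hypothetical.

```python
import math


def log_binom_pmf(n, k, p):
    """Natural log of C(n, k) * p^k * (1 - p)^(n - k), computed via lgamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def codeword_failure_prob(n, t, p):
    """P(more than t of n bits are wrong), bits failing independently at rate p."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0 if t < n else 0.0
    # Terms far in the tail underflow to 0.0, which is harmless for the sum.
    return math.fsum(math.exp(log_binom_pmf(n, k, p)) for k in range(t + 1, n + 1))


if __name__ == "__main__":
    n_bits = 8192 + 560          # hypothetical payload plus parity bits
    for raw_ber in (1e-4, 1e-3, 3e-3):
        for t in (20, 40):
            fail = codeword_failure_prob(n_bits, t, raw_ber)
            print(f"raw BER {raw_ber:g}, t={t}: codeword failure {fail:.3e}")
```

Under this model the failure probability falls steeply as the correction capability grows or the raw BER drops, which is why lowering the raw error rate through better sensing and interference mitigation compounds with stronger ECC.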

This review was generated by AI and reviewed by human editors.