QUICK REVIEW

[Paper Review] A Data-driven Prognostic Architecture for Online Monitoring of Hard Disks Using Deep LSTM Networks.

Sanchita Basak, Saptarshi Sengupta|arXiv (Cornell University)|Oct 21, 2018
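Traffic Prediction and Management Techniques|References 23|Cited by 1
One-Sentence Summary

This paper proposes a bi-layered, data-driven architecture built on deep Long Short-Term Memory (LSTM) networks for predicting the Remaining Useful Life (RUL) of hard disks in cloud backend servers. Using online data streams, effective feature extraction, and robust preprocessing, the model predicts whether a disk will fail within the critical ten-day window with an average precision of 0.8435.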

ABSTRACT

With the advent of pervasive cloud computing technologies, service reliability and availability are becoming major concerns, especially as we start to integrate cyber-physical systems with the cloud networks. A number of smart and connected community systems such as emergency response systems utilize cloud networks to analyze real-time data streams and provide context-sensitive decision support. Improving overall system reliability requires us to study all the aspects of the end-to-end of this distributed system, including the backend data servers. In this paper, we describe a bi-layered prognostic architecture for predicting the Remaining Useful Life (RUL) of components of backend servers, especially those that are subjected to degradation. We show that our architecture is especially good at predicting the remaining useful life of hard disks. A Deep LSTM Network is used as the backbone of this fast, data-driven decision framework and dynamically captures the pattern of the incoming data. In the article, we discuss the architecture of the neural network and describe the mechanisms to choose the various hyper-parameters. We describe the challenges faced in extracting effective training sets from highly unorganized and class-imbalanced big data and establish methods for online predictions with extensive data pre-processing, feature extraction and validation through test sets with unknown remaining useful lives of the hard disks. Our algorithm performs especially well in predicting RUL near the critical zone of a device approaching failure. The proposed architecture is able to predict whether a disk is going to fail in next ten days with an average precision of 0.8435. In future, we will extend this architecture to learn and predict the RUL of the edge devices in the end-to-end distributed systems of smart communities, taking into consideration context-sensitive external features such as weather.

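Motivation and Objectives

  • Improve system reliability in cloud-based cyber-physical systems by predicting hard disk failures early.
  • Address the challenge of unorganized, class-imbalanced big data when training reliable RUL prediction models.
  • Develop a dynamic, online prognostic framework that provides real-time decision support in distributed cloud environments.
  • Extend the approach to edge devices in smart communities, incorporating context-sensitive external factors such as weather (planned as future work).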

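Proposed Method

  • Designs a bi-layered architecture with a deep LSTM network as its core component, which learns temporal patterns from streaming disk telemetry.
  • Applies extensive preprocessing to handle the unorganized, class-imbalanced dataset and improve the quality of the training sets.
  • Uses feature extraction to turn raw disk health indicators into representations the LSTM model can learn from.
  • Tunes hyperparameters systematically to optimize model performance and generalization.
  • Validates on test sets with unknown RUL, enabling real-time online prediction through continuous inference on incoming data streams.
  • Trains and validates the model on real-world disk failure data, focusing on accurate prediction as a disk approaches failure.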
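
As a rough illustration of the preprocessing steps listed above, the sketch below slices one disk's daily telemetry into fixed-length windows, labels each window with its RUL in days, and balances the critical and non-critical classes by random undersampling. The window length of 5, the 10-day critical threshold applied to training labels, the function names `make_windows` and `undersample`, the toy features, and the choice of undersampling are all assumptions made for this example; the summary does not state which windowing or rebalancing scheme the authors used.

```python
import random
from collections import Counter

def make_windows(daily_features, window=5, critical_days=10):
    """Slice one disk's daily feature rows into fixed-length windows.

    daily_features: list of feature vectors, oldest first. The last row is
    assumed (for this sketch) to be the day the disk failed.
    Returns a list of (window_rows, rul_days, is_critical) tuples.
    """
    samples = []
    last_day = len(daily_features) - 1
    for end in range(window - 1, len(daily_features)):
        rows = daily_features[end - window + 1 : end + 1]
        rul = last_day - end  # days remaining after the window's final day
        samples.append((rows, rul, rul <= critical_days))
    return samples

def undersample(samples, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    critical = [s for s in samples if s[2]]
    healthy = [s for s in samples if not s[2]]
    minority, majority = sorted([critical, healthy], key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Toy disk: 40 days of two made-up features, failing on the last day.
disk = [[float(day), float(day % 3)] for day in range(40)]
windows = make_windows(disk, window=5, critical_days=10)
balanced = undersample(windows)
class_counts = Counter(is_critical for _, _, is_critical in balanced)
print(len(windows), dict(class_counts))
```

Windows built this way could be fed to a sequence model such as an LSTM. Other rebalancing choices, for example oversampling the failure windows or weighting classes in the loss, would slot into the same place as `undersample`.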

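Experimental Results

Research Questions

  • RQ1: Can a deep LSTM-based architecture effectively predict the Remaining Useful Life (RUL) of hard disks in real-time cloud storage systems?
  • RQ2: How can unorganized, class-imbalanced big data from disk monitoring be turned into effective training sets for RUL prediction?
  • RQ3: How well does the proposed model predict RUL within the critical ten-day failure window?
  • RQ4: How does the architecture support online, dynamic prediction in distributed cyber-physical systems?
  • RQ5: Can the framework be extended to include external contextual factors, such as weather, for failure prediction on edge devices?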

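Key Findings

  • The proposed deep LSTM-based architecture predicts whether a hard disk will fail within the next ten days with an average precision of 0.8435.
  • The model performs especially well in the critical zone near failure, where early detection matters most for system reliability.
  • Effective data preprocessing and feature extraction substantially improve the model's robustness on unorganized, class-imbalanced real-world datasets.
  • The architecture supports real-time online prediction, making it suitable for deployment in production cloud environments.
  • The framework is extensible and could in future integrate external contextual features, such as weather, for edge device monitoring.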
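
To make the headline metric concrete, here is a minimal sketch of how precision for a "fails within ten days" alarm could be computed from true and predicted RUL values. The function name `precision_at_horizon`, the rule of raising an alarm when the predicted RUL is at most ten days, and the sample numbers are assumptions for illustration only; the summary does not describe how the reported 0.8435 was averaged, so this shows plain precision, TP / (TP + FP), rather than the authors' exact evaluation.

```python
def precision_at_horizon(true_rul, predicted_rul, horizon_days=10):
    """Precision of the 'fails within horizon_days' alarm derived from RUL values."""
    true_positive = 0
    false_positive = 0
    for actual, predicted in zip(true_rul, predicted_rul):
        if predicted <= horizon_days:      # the model raises an alarm
            if actual <= horizon_days:     # the disk really was that close to failure
                true_positive += 1
            else:
                false_positive += 1
    alarms = true_positive + false_positive
    # With no alarms raised, precision is undefined; this sketch returns 0.0.
    return true_positive / alarms if alarms else 0.0

# Made-up RUL values in days (not the paper's data).
true_rul = [3, 8, 25, 40, 9, 12]
predicted_rul = [4, 11, 9, 35, 7, 10]
precision = precision_at_horizon(true_rul, predicted_rul)
print(f"precision = {precision:.2f}")  # prints "precision = 0.50"
```

In this toy data the model raises four alarms, two of which correspond to disks that truly had ten or fewer days left, giving a precision of 0.5.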

This review was generated by AI and reviewed by a human editor.