QUICK REVIEW

[Paper Review] Communication-avoiding micro-architecture to compute Xcorr scores for peptide identification

Sumesh Kumar, Fahad Saeed|arXiv (Cornell University)|Jul 31, 2021

Advanced Proteomics Techniques and Applications | References: 21 | Cited by: 2

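One-Sentence Summary

The paper proposes a communication-optimized FPGA micro-architecture that accelerates Xcorr scoring for peptide identification by caching complete experimental spectra in on-chip block RAM and using a dedicated peptide broadcast bus for two-way data reuse. The design cuts DRAM accesses by 600x and runs 24x faster than the CPU-based Crux implementation on a 3.6 GHz Intel i7-4970 processor with 16 GB of memory.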

ABSTRACT

Database algorithms play a crucial part in systems biology studies by identifying proteins from mass spectrometry data. Many of these database search algorithms incur huge computational costs by computing similarity scores for each pair of sparse experimental spectrum and candidate theoretical spectrum vectors. Modern MS instrumentation techniques which are capable of generating high-resolution spectrometry data require comparison against an enormous search space, further emphasizing the need of efficient accelerators. Recent research has shown that the overall cost of scoring, and deducing peptides is dominated by the communication costs between different hierarchies of memory and processing units. However, these communication costs are seldom considered in accelerator-based architectures leading to inefficient DRAM accesses, and poor data-utilization due to irregular memory access patterns. In this paper, we propose a novel communication-avoiding micro-architecture to compute cross-correlation based similarity score by utilizing efficient local cache, and peptide pre-fetching to minimize DRAM accesses, and a custom-designed peptide broadcast bus to allow input reuse. An efficient bus arbitration scheme was designed, and implemented to minimize synchronization cost and exploit parallelism of processing elements. Our simulation results show that the proposed micro-architecture performs on average 24x better than a CPU implementation running on a 3.6 GHz Intel i7-4970 processor with 16GB memory.

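Motivation and Objectives

Address the high cost of communication between levels of the memory hierarchy when computing Xcorr scores in mass spectrometry proteomics.
Minimize DRAM accesses by caching complete experimental spectra and reusing inputs efficiently.
Design a dedicated peptide broadcast bus and a bus arbitration scheme to reduce synchronization overhead and increase parallelism.
Deliver a substantial speedup over CPU-based implementations such as Crux on memory-bound workloads.
Demonstrate scalability and efficiency in accelerating the memory-intensive dot-product computation at the core of SEQUEST.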
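As a rough illustration of the dot-product computation mentioned above, the sketch below follows the commonly used "fast Xcorr" formulation: the experimental spectrum is preprocessed once by subtracting the mean of its shifted copies (conventionally offsets of up to 75 bins in either direction), after which each candidate score is a sparse dot product. The function names, the unit-intensity theoretical peaks, and the binning are assumptions made for this example; they are not taken from the paper.

```python
def preprocess_spectrum(intensities, max_offset=75):
    """Subtract from each bin the mean of its neighbours within max_offset bins.

    Bins outside the spectrum are treated as zero (assumption for this sketch).
    """
    n = len(intensities)
    # Prefix sums make each window sum an O(1) lookup.
    prefix = [0.0]
    for value in intensities:
        prefix.append(prefix[-1] + value)

    preprocessed = []
    for i in range(n):
        lo = max(0, i - max_offset)
        hi = min(n, i + max_offset + 1)
        # Sum over the window, excluding the zero-offset bin itself.
        window_sum = prefix[hi] - prefix[lo] - intensities[i]
        preprocessed.append(intensities[i] - window_sum / (2 * max_offset))
    return preprocessed


def xcorr(preprocessed, theoretical_bins):
    """Sparse dot product: the theoretical spectrum is a list of bin indices,
    each with unit intensity (hypothetical representation)."""
    return sum(preprocessed[b] for b in theoretical_bins
               if 0 <= b < len(preprocessed))


# Toy spectrum with a single peak, using a 1-bin offset window for readability.
pre = preprocess_spectrum([0.0, 0.0, 10.0, 0.0, 0.0], max_offset=1)
score = xcorr(pre, [2])
```

Because the preprocessing depends only on the experimental spectrum, it is done once per spectrum, and the preprocessed vector is the data that is worth keeping close to the compute units while many candidate peptides are scored against it.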

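Proposed Method

Implement a 2 kB block RAM cache that holds a complete experimental spectrum, reducing redundant DRAM accesses.
Pre-sort the peptide database to support binary search and prefetching of candidate peptides, improving input locality.
Design a dedicated peptide broadcast bus so that all processing elements reuse the same copy of the peptide data, avoiding repeated memory accesses.
Use a first-come, first-served (FCFS) bus arbitration scheme to minimize synchronization delay between processing elements.
Integrate the micro-architecture with the host CPU through PCIe DMA to transfer data from main memory to the FPGA's external memory.
Optimize dot-product performance by minimizing data movement and maximizing data reuse at every level.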
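The effect of pre-sorting and broadcasting can be illustrated with a small software model. The sketch below is a toy, not the paper's hardware: it assumes a peptide database sorted by mass, uses binary search to find each spectrum's candidate window, and compares how many peptide fetches are needed when every processing element fetches its own candidates versus when each peptide is fetched once and shared. All names and the fetch-counting model are hypothetical; bus arbitration and timing are not modelled.

```python
from bisect import bisect_left, bisect_right


def candidate_range(sorted_masses, precursor_mass, tolerance):
    """Return [lo, hi) indices of peptides whose mass lies within
    precursor_mass +/- tolerance, found by binary search."""
    lo = bisect_left(sorted_masses, precursor_mass - tolerance)
    hi = bisect_right(sorted_masses, precursor_mass + tolerance)
    return lo, hi


def count_peptide_fetches(sorted_masses, precursor_masses, tolerance):
    """Compare peptide fetch counts with and without sharing.

    Each entry of precursor_masses stands for one processing element
    holding one cached experimental spectrum (assumption for this sketch).
    """
    ranges = [candidate_range(sorted_masses, m, tolerance)
              for m in precursor_masses]

    # Without sharing: every processing element fetches its own candidates.
    private_fetches = sum(hi - lo for lo, hi in ranges)

    # With a broadcast: each needed peptide is fetched once for everyone.
    needed = set()
    for lo, hi in ranges:
        needed.update(range(lo, hi))
    broadcast_fetches = len(needed)

    return private_fetches, broadcast_fetches


masses = [100.0, 200.0, 300.0, 400.0, 500.0]
private, shared = count_peptide_fetches(masses, [250.0, 300.0], tolerance=60.0)
```

In this toy model, the gap between the two counts grows as the mass windows of the cached spectra overlap more, which is the situation the pre-sorted database is meant to create.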

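Experimental Results

Research Questions

RQ1: Can a custom micro-architecture reduce DRAM access overhead in Xcorr scoring for peptide identification?
RQ2: How effective is on-chip caching of experimental spectra at minimizing communication cost in memory-bound proteomics workloads?
RQ3: To what extent does a dedicated peptide broadcast bus improve data reuse and reduce synchronization overhead when candidate peptides are processed in parallel?
RQ4: How do changes in cache size and the number of processing elements affect communication and computation bottlenecks?
RQ5: Can the proposed architecture deliver a speedup well beyond CPU-based software such as Crux while remaining scalable?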

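Key Findings

The proposed micro-architecture achieves an average 24x speedup over Crux running on a 3.6 GHz Intel i7-4970 CPU with 16 GB of memory.
A 2 kB on-chip cache reduces the average number of DRAM accesses by 600x compared with a design that has no cache.
Average I/O time per processing element falls from 1.01 s with a 512 B cache to 0.86 ms with a 2 kB cache, a reduction of roughly 1170x.
With caches smaller than 2 kB, synchronization wait time grows exponentially as processing elements are added; with 2 kB and 4 kB caches it holds steady at 2.2 ms.
The system achieves near-linear speedup up to 16 processing elements, with total processing time falling as processing elements are added.
The design scales well and remains efficient up to 32 processing elements, with consistent performance gains across different precursor mass window tolerances.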
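The I/O reduction factor quoted above can be checked directly from the two reported times. The short sketch below does only that arithmetic; the helper name is hypothetical.

```python
def reduction_factor(before_seconds, after_seconds):
    """Ratio of the earlier time to the later time."""
    return before_seconds / after_seconds


# 1.01 s with a 512 B cache versus 0.86 ms with a 2 kB cache.
io_reduction = reduction_factor(1.01, 0.86e-3)
```

The ratio comes out at a little over 1170, consistent with the rounded figure reported in the findings.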


This review was generated by AI and reviewed by human editors.