QUICK REVIEW

[Paper Review] Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM

Vivek Seshadri, Donghyuk Lee|arXiv (Cornell University)|Nov 30, 2016

Parallel Computing and Optimization Techniques | References: 36 | Cited by: 54

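One-Sentence Summary

Buddy-RAM proposes a mechanism that performs functionally complete bulk bitwise operations (AND, OR, NOT) directly inside DRAM by exploiting the analog behavior of sense amplifiers and triple-row activation. Compared with conventional approaches, it improves throughput by 10.9X–25.6X and reduces energy consumption by 25.1X–59.5X, and it speeds up real workloads such as database queries and set operations by up to 7.0X.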

ABSTRACT

Bitwise operations are an important component of modern day programming. Many widely-used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available memory bandwidth. We propose Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip. Buddy consists of two components. First, simultaneous activation of three DRAM rows that are connected to the same set of sense amplifiers enables us to perform bitwise AND and OR operations. Second, the inverters present in each sense amplifier enables us to perform bitwise NOT operations, with modest changes to the DRAM array. These two components make Buddy functionally complete. Our implementation of Buddy largely exploits the existing DRAM structure and interface, and incurs low overhead (1% of DRAM chip area). Our evaluations based on SPICE simulations show that, across seven commonly-used bitwise operations, Buddy provides between 10.9X–25.6X improvement in raw throughput and 25.1X–59.5X reduction in energy consumption. We evaluate three real-world data-intensive applications that exploit bitwise operations: 1) bitmap indices, 2) BitWeaving, and 3) bitvector-based implementation of sets. Our evaluations show that Buddy significantly outperforms the state-of-the-art.

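Motivation and Goals

Address the performance bottleneck that limited memory bandwidth imposes on bulk bitwise operations in modern systems.
Eliminate the inefficiency of repeatedly moving large bit vectors between the processor and main memory just to perform bitwise operations.
Use existing hardware components to perform high-throughput, low-energy bitwise operations directly inside DRAM.
Design a low-cost, practical solution that integrates cleanly into the standard DRAM architecture and interface.
Demonstrate the applicability and performance benefits of the mechanism across diverse data-intensive workloads.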

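Proposed Method

Simultaneously activate three DRAM rows that share the same set of sense amplifiers to compute a bitwise majority function; the initial state of a control row selects between AND and OR.
Implement bitwise NOT with the inverters already present in each sense amplifier, by connecting a dual-contact 2T-1C cell to the inputs of both inverters in the sense amplifier.
Combine the Buddy-AND/OR and Buddy-NOT components with RowClone to obtain functional completeness for all bitwise operations.
Restrict triple-row activation to a predefined set of three rows per subarray, which minimizes area and control complexity and avoids fully replicating the address bus and decoder.
Establish reliability under process mismatch through SPICE simulations that verify correct operation across multiple process corners.
Preserve backward compatibility by using standard DRAM commands and interfaces, minimizing changes to existing memory controllers and system software.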
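
The majority-based construction in the list above can be sketched in software. The following is a minimal logical model only, assuming each DRAM row is represented as a Python integer of fixed width; it does not model charge sharing, timing, or RowClone copies. The names `maj`, `buddy_and`, `buddy_or`, `buddy_not`, `buddy_xor`, and `ROW_WIDTH` are hypothetical and do not come from the paper.

```python
# Logical sketch of triple-row activation (assumption: a row is an int bit vector).
ROW_WIDTH = 8                      # hypothetical row width in bits
ALL_ONES = (1 << ROW_WIDTH) - 1    # control row initialized to all 1s
ALL_ZEROS = 0                      # control row initialized to all 0s


def maj(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows: each result bit is 1 if at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def buddy_and(a: int, b: int) -> int:
    """With the control row at 0, the majority reduces to AND."""
    return maj(a, b, ALL_ZEROS)


def buddy_or(a: int, b: int) -> int:
    """With the control row at 1, the majority reduces to OR."""
    return maj(a, b, ALL_ONES)


def buddy_not(a: int) -> int:
    """Model of the inverter path: flip every bit within the row width."""
    return ~a & ALL_ONES


def buddy_xor(a: int, b: int) -> int:
    """XOR built only from the three primitives, illustrating functional completeness."""
    return buddy_or(buddy_and(a, buddy_not(b)), buddy_and(buddy_not(a), b))
```

Because AND, OR, and NOT together are functionally complete, any other bitwise operation (XOR here) can be composed from them, at the cost of additional row operations.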

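Experimental Results

Research Questions

RQ1: Can bitwise operations be executed efficiently inside DRAM using existing analog circuitry, without introducing significant area or power overhead?
RQ2: How can the analog behavior of sense amplifiers be used to implement row-level logic operations such as AND, OR, and NOT?
RQ3: How much performance and energy efficiency is gained by executing bulk bitwise operations inside DRAM rather than off-chip?
RQ4: How well does the mechanism scale across diverse workloads such as database queries, set operations, and DNA sequence analysis?
RQ5: To what extent can the mechanism be integrated into existing DRAM architectures with minimal modification and cost?

Key Findings

Compared with a state-of-the-art SIMD baseline, Buddy-RAM improves raw throughput by 10.9X to 25.6X across seven common bitwise operations.
Compared with conventional approaches, the energy consumed by bitwise operations drops by 25.1X to 59.5X.
Database queries that use bitmap indices run 6.0X faster when accelerated with Buddy-RAM.
BitWeaving, a technique for fast database scans, achieves an average speedup of 7.0X with Buddy-RAM across a range of scan parameters.
Set operations (intersection, union, difference) run 3.0X faster than conventional implementations.
The mechanism adds only 1% area overhead to the DRAM chip, which supports its low cost and practical feasibility.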
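
To make the set-operation workload concrete, here is a minimal sketch of how a bitvector-based set maps intersection, union, and difference onto bulk AND, OR, and AND-NOT. It runs on the CPU with Python integers and is only an illustration of the data layout; the helper names `to_bitvector` and `from_bitvector` and the universe size are assumptions, not part of the paper's evaluation.

```python
def to_bitvector(elements, universe_size: int) -> int:
    """Encode a set of integers in [0, universe_size) as a bit vector (bit i set if i is present)."""
    bits = 0
    for e in elements:
        if not 0 <= e < universe_size:
            raise ValueError(f"element {e} is outside the universe")
        bits |= 1 << e
    return bits


def from_bitvector(bits: int) -> list:
    """Decode a bit vector back into a sorted list of elements."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


UNIVERSE = 16                      # hypothetical universe size
MASK = (1 << UNIVERSE) - 1

a = to_bitvector({1, 3, 5, 7}, UNIVERSE)
b = to_bitvector({3, 4, 5, 6}, UNIVERSE)

intersection = a & b               # one bulk AND
union = a | b                      # one bulk OR
difference = a & (~b & MASK)       # NOT followed by AND

print(from_bitvector(intersection))
print(from_bitvector(union))
print(from_bitvector(difference))
```

Each set operation touches every bit of both vectors exactly once, which is why such workloads are bound by memory bandwidth on conventional systems and are natural candidates for in-DRAM execution.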

This review was generated by AI and checked by human editors.