QUICK REVIEW

[Paper Digest] Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks

Charith Mendis, Alex Renda|arXiv (Cornell University)|Aug 21, 2018

Parallel Computing and Optimization Techniques | References: 44 | Cited by: 49

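One-Sentence Summary

Ithemal uses a hierarchical multiscale LSTM to predict basic block throughput from assembly sequences, achieving higher accuracy than hand-written models (IACA, llvm-mca) while remaining fast and portable across multiple microarchitectures.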

ABSTRACT

Predicting the number of clock cycles a processor takes to execute a block of assembly instructions in steady state (the throughput) is important for both compiler designers and performance engineers. Building an analytical model to do so is especially complicated in modern x86-64 Complex Instruction Set Computer (CISC) machines with sophisticated processor microarchitectures in that it is tedious, error prone, and must be performed from scratch for each processor generation. In this paper we present Ithemal, the first tool which learns to predict the throughput of a set of instructions. Ithemal uses a hierarchical LSTM-based approach to predict throughput based on the opcodes and operands of instructions in a basic block. We show that Ithemal is more accurate than state-of-the-art hand-written tools currently used in compiler backends and static machine code analyzers. In particular, our model has less than half the error of state-of-the-art analytical models (LLVM's llvm-mca and Intel's IACA). Ithemal is also able to predict these throughput values just as fast as the aforementioned tools, and is easily ported across a variety of processor microarchitectures with minimal developer effort.

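Motivation and Objectives

Address the need for accurate throughput estimation in compiler optimization and performance engineering.
Propose a data-driven approach that predicts basic block throughput without relying on hand-crafted microarchitecture models.
Develop a hierarchical multiscale RNN architecture that learns instruction embeddings and predicts throughput.
Demonstrate portability across the Haswell, Ivy Bridge, and Skylake microarchitectures.
Provide an open-source implementation for the community.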

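Proposed Method

Canonicalize the input assembly into a tokenized, structured representation of instructions and operands.
Use token-level and instruction-level LSTMs to embed the tokens and the per-instruction representations, producing one embedding per instruction.
Use a hierarchical RNN to aggregate the instruction embeddings into a basic block embedding, and predict throughput with a final linear layer.
Train the model end to end on a large dataset of basic blocks labeled with measured throughput on the target CPU.
Compare accuracy and speed against the hand-written models IACA and llvm-mca across multiple microarchitectures.
Explore architectural variants, including a DAG-RNN and a token-level RNN, to assess performance gains.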
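
The two-level pipeline described above (tokens to instruction embedding, instruction embeddings to block embedding, then a linear layer) can be sketched in plain Python. This is only an illustration under stated assumptions: the delimiter tokens `<S>`, `<D>`, `<E>`, the class names `TinyRNN` and `HierarchicalEstimator`, and the embedding size are hypothetical, the weights are random and untrained, and a plain tanh recurrent cell stands in for the LSTM to keep the example dependency-free.

```python
import math
import random

# Hypothetical delimiter tokens separating opcode, sources, and destinations.
SRC, DST, END = "<S>", "<D>", "<E>"


def tokenize(opcode, sources, destinations):
    """Flatten one instruction into a token list: opcode, sources, destinations."""
    return [opcode, SRC, *sources, DST, *destinations, END]


class TinyRNN:
    """Plain tanh recurrent cell with random weights (a stand-in for an LSTM)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                     for _ in range(hid_dim)]
        self.w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(hid_dim)]
                      for _ in range(hid_dim)]

    def run(self, inputs):
        """Consume a sequence of vectors and return the final hidden state."""
        h = [0.0] * self.hid_dim
        for x in inputs:
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hi for w, hi in zip(self.w_hid[j], h))
                )
                for j in range(self.hid_dim)
            ]
        return h


class HierarchicalEstimator:
    """Token-level RNN feeds an instruction-level RNN, then a linear output."""

    def __init__(self, vocab, dim=8, seed=0):
        rng = random.Random(seed)
        self.embed = {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}
        self.token_rnn = TinyRNN(dim, dim, rng)  # tokens -> instruction embedding
        self.instr_rnn = TinyRNN(dim, dim, rng)  # instructions -> block embedding
        self.out_w = [rng.uniform(-0.5, 0.5) for _ in range(dim)]  # linear layer
        self.out_b = 0.0

    def embed_instruction(self, tokens):
        return self.token_rnn.run([self.embed[t] for t in tokens])

    def predict(self, block):
        instr_vecs = [self.embed_instruction(tokens) for tokens in block]
        block_vec = self.instr_rnn.run(instr_vecs)
        return sum(w * v for w, v in zip(self.out_w, block_vec)) + self.out_b


# A made-up two-instruction basic block, used only to exercise the code.
block = [
    tokenize("mov", ["rax"], ["rbx"]),
    tokenize("add", ["rbx", "rcx"], ["rcx"]),
]
vocab = sorted({tok for instr in block for tok in instr})
model = HierarchicalEstimator(vocab)
estimate = model.predict(block)  # untrained, so the value itself is meaningless
```

In a trained system the weights would be fitted end to end against measured throughput; here the point is only the shape of the computation, where each instruction is summarized before the block is.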

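Experimental Results

Research Questions

RQ1: Can a data-driven model predict basic block throughput from assembly sequences more accurately than hand-written analytical models?
RQ2: Does a hierarchical multiscale RNN capture intra-block dependencies and microarchitectural effects better than flat or graph-based architectures?
RQ3: Can the learned model be ported across x86-64 microarchitectures without architecture-specific redesign?
RQ4: Can the predictor match the near-real-time speed of existing tools while retaining higher accuracy?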

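Key Findings

Ithemal predicts more accurately than IACA and llvm-mca on Haswell, Ivy Bridge, and Skylake (average error of roughly 0.079–0.089 across the three microarchitectures).
The model's Spearman and Pearson correlations with ground truth exceed those of the baselines, which makes its rankings more useful to an optimizer.
Throughput estimation is about as fast as llvm-mca and IACA and far faster than empirical measurement, making it a drop-in replacement in toolchains.
The method is portable across microarchitectures: retraining for a new CPU requires no architecture-specific redesign and still outperforms the hand-written models.
The architectural variants (DAG-RNN, token-level RNN) indicate that the hierarchical LSTM offers the best trade-off between accuracy and modeling cost.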
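
The accuracy and correlation figures in the findings above can be made concrete with a small sketch. It assumes that "average error" means the mean of |predicted - measured| / measured, which this digest does not state explicitly, and the throughput numbers below are made up purely for illustration. The helper names are hypothetical.

```python
import math


def mean_relative_error(predicted, measured):
    """Mean of |p - m| / m over all blocks (assumed definition of average error)."""
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Made-up measured and predicted cycle counts for four basic blocks.
measured = [100.0, 200.0, 400.0, 800.0]
predicted = [110.0, 190.0, 420.0, 760.0]

error = mean_relative_error(predicted, measured)
rank_corr = spearman(predicted, measured)
linear_corr = pearson(predicted, measured)
```

The rank correlation matters for an optimizer because choosing between two candidate code sequences only requires that the faster one is ranked as faster, even when the absolute cycle estimates are somewhat off.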

This digest was generated by AI and reviewed by a human editor.