QUICK REVIEW

[Paper Review] RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model

Yao Lu, Shang Liu|arXiv (Cornell University)|Aug 10, 2023

Topic: Ferroelectric and Negative Capacitance Devices|Cited by 12

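One-Sentence Summary

RTLLM is an open-source benchmark of 30 design RTL generation tasks specified in natural language, paired with a self-planning prompting technique that improves LLM performance. It evaluates generated designs on syntax, functionality, and design quality.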

ABSTRACT

Inspired by the recent success of large language models (LLMs) like ChatGPT, researchers start to explore the adoption of LLMs for agile hardware design, such as generating design RTL based on natural-language instructions. However, in existing works, their target designs are all relatively simple and in a small scale, and proposed by the authors themselves, making a fair comparison among different LLM solutions challenging. In addition, many prior works only focus on the design correctness, without evaluating the design qualities of generated design RTL. In this work, we propose an open-source benchmark named RTLLM, for generating design RTL with natural language instructions. To systematically evaluate the auto-generated design RTL, we summarized three progressive goals, named syntax goal, functionality goal, and design quality goal. This benchmark can automatically provide a quantitative evaluation of any given LLM-based solution. Furthermore, we propose an easy-to-use yet surprisingly effective prompt engineering technique named self-planning, which proves to significantly boost the performance of GPT-3.5 in our proposed benchmark.

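Motivation and Goals

Provide a fair, scalable benchmark for generating RTL from natural language, covering syntax, functionality, and design quality.
Enable automated evaluation of any LLM-based RTL generation solution against human-designed RTL.
Introduce self-planning prompt engineering to improve LLM performance on RTL code generation.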

Proposed Method

Define three progressive evaluation goals: syntax, functionality, and design quality.
Assemble 30 diverse RTL designs, each with a natural-language description file, a testbench, and a human-written reference design in HDL.
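Use automated tools to synthesize and simulate the generated RTL and compare it against the reference.
Propose self-planning, a two-step prompting technique in which the model first produces a plan (reasoning steps plus precautions against syntax errors) and then generates the RTL.
Evaluate five LLM configurations (GPT-3.5, GPT-4, the model of Thakur et al. 2023, StarCoder, and GPT-3.5/4 with self-planning).
Provide post-synthesis ground-truth baselines and automated design-quality metrics (area, power, timing).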
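
The two ideas in the list above, two-step self-planning and the three progressive goals, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `llm` callable, the prompt wording, and the three checker callables (`synthesize`, `simulate`, `measure_quality`) are all hypothetical stand-ins for whatever model API and EDA tools are actually used.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


def self_planning_generate(llm: LLM, description: str) -> str:
    """Two-step prompting: request a plan first, then RTL that follows the plan.

    The prompt wording below is an assumption made for illustration only.
    """
    plan_prompt = (
        "You will implement the following hardware design in Verilog.\n"
        f"{description}\n"
        "Do not write code yet. List the reasoning steps you would follow "
        "and the syntax errors you should take care to avoid."
    )
    plan = llm(plan_prompt)

    code_prompt = (
        f"{description}\n\n"
        f"Plan to follow:\n{plan}\n\n"
        "Now write the complete Verilog module."
    )
    return llm(code_prompt)


@dataclass
class Evaluation:
    syntax_ok: bool
    function_ok: bool
    quality: Optional[dict]  # e.g. {"area": ..., "power": ..., "timing": ...}


def evaluate(
    rtl: str,
    synthesize: Callable[[str], bool],
    simulate: Callable[[str], bool],
    measure_quality: Callable[[str], dict],
) -> Evaluation:
    """Apply the three goals in order; a later goal is checked only if the
    earlier one passed, which is what makes them progressive."""
    if not synthesize(rtl):      # syntax goal: does the design synthesize?
        return Evaluation(False, False, None)
    if not simulate(rtl):        # functionality goal: does it pass the testbench?
        return Evaluation(True, False, None)
    return Evaluation(True, True, measure_quality(rtl))  # design quality goal
```

Passing the checkers in as callables keeps the sketch independent of any particular synthesis tool or simulator, so each can be replaced by a wrapper around a real tool without changing the gating logic.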

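Experimental Results

Research Questions

RQ1: Under a standardized benchmark, how well do LLMs generate correct Verilog/VHDL/Chisel RTL from natural-language descriptions?
RQ2: To what extent does prompt engineering, self-planning in particular, improve the syntax correctness and functional correctness of generated RTL?
RQ3: How do LLM-generated designs compare with human-designed references in synthesis metrics (PPA) and functional validity?
RQ4: Which LLMs and prompting strategies deliver the best overall RTL design quality across diverse design types and scales?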

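Key Findings

GPT-4 achieves the highest syntax correctness (81%) and functional correctness (15/30), the best among the evaluated models.
GPT-3.5 with self-planning improves markedly over plain GPT-3.5 (73% syntax, 14/30 functionality), approaching GPT-4's performance.
Self-planning notably improves RTL generation accuracy on several designs relative to GPT-3.5 without planning.
The academic models (Thakur et al., StarCoder) trail the commercial LLMs on both syntax and functionality under RTLLM.
The benchmark enables automated evaluation of syntax, functionality, and design quality against reference designs across 30 diverse RTL tasks.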
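
The findings report syntax as a percentage and functionality as a count out of 30. One plausible way to arrive at numbers of that shape is sketched below. The aggregation rule is an assumption: this review does not state the protocol, so the sketch assumes each design is attempted several times, that the syntax rate is taken over all attempts, and that a design counts as functional if at least one attempt passes. The `toy` data is invented for illustration and is not taken from the paper.

```python
def summarize(results):
    """Aggregate per-attempt outcomes into benchmark-level numbers.

    results: {design_name: [(syntax_ok, function_ok), ...]}, one tuple per attempt.
    Returns (syntax_rate, functional_designs, total_designs).
    """
    attempts = [a for trials in results.values() for a in trials]
    if not attempts:
        raise ValueError("no attempts to summarize")
    # Assumed rule: syntax rate is the fraction of all attempts that pass.
    syntax_rate = sum(s for s, _ in attempts) / len(attempts)
    # Assumed rule: a design is functional if any single attempt passes.
    functional_designs = sum(
        any(f for _, f in trials) for trials in results.values()
    )
    return syntax_rate, functional_designs, len(results)


# Invented toy data: three designs, two attempts each.
toy = {
    "adder": [(True, True), (True, False)],
    "fifo": [(True, False), (False, False)],
    "fsm": [(False, False), (False, False)],
}
rate, passed, total = summarize(toy)
print(f"syntax {rate:.0%}, functionality {passed}/{total}")
# prints "syntax 50%, functionality 1/3"
```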

This review was generated by AI and checked by a human editor.