QUICK REVIEW

[Paper Review] RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model

Yao Lu, Shang Liu|arXiv (Cornell University)|Aug 10, 2023

Ferroelectric and Negative Capacitance Devices | Cited by 12

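One-line summary

RTLLM is an open-source benchmark of 30 tasks for generating RTL designs from natural-language descriptions, accompanied by a self-planning prompting technique that improves LLM performance. It supports evaluation of syntax, functionality, and design quality.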

ABSTRACT

Inspired by the recent success of large language models (LLMs) like ChatGPT, researchers start to explore the adoption of LLMs for agile hardware design, such as generating design RTL based on natural-language instructions. However, in existing works, their target designs are all relatively simple and in a small scale, and proposed by the authors themselves, making a fair comparison among different LLM solutions challenging. In addition, many prior works only focus on the design correctness, without evaluating the design qualities of generated design RTL. In this work, we propose an open-source benchmark named RTLLM, for generating design RTL with natural language instructions. To systematically evaluate the auto-generated design RTL, we summarized three progressive goals, named syntax goal, functionality goal, and design quality goal. This benchmark can automatically provide a quantitative evaluation of any given LLM-based solution. Furthermore, we propose an easy-to-use yet surprisingly effective prompt engineering technique named self-planning, which proves to significantly boost the performance of GPT-3.5 in our proposed benchmark.

Motivation and objectives

Provide a fair, scalable benchmark for RTL generation from natural language that covers syntax, functionality, and design quality.
Enable automatic evaluation of any LLM-based RTL generation solution against ground-truth hand-crafted RTLs.
Introduce self-planning prompt engineering to improve LLM performance in RTL code generation.

Proposed method

Define three progressive evaluation goals: syntax, functionality, and design quality.
Assemble 30 diverse RTL designs, each with a natural-language description, a testbench, and a hand-crafted ground-truth RTL implementation.
Use automated tooling to synthesize, simulate, and compare generated RTL against references.
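Propose self-planning, a two-step prompting technique: the LLM first writes a plan containing reasoning steps and advice on avoiding syntax errors, then generates the RTL from the original description together with that plan.
Evaluate four LLMs (GPT-3.5, GPT-4, the model of Thakur et al. 2023, and StarCoder), and additionally GPT-3.5/4 with self-planning.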
Provide ground-truth baselines and automatic metrics for design quality (area, power, timing) after synthesis.
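
The two-step self-planning flow listed above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the prompt wording, the function name `self_planning_generate`, and the stand-in model `fake_llm` are all assumptions introduced here, and a real setup would replace `fake_llm` with a call to an actual model.

```python
from typing import Callable, List


def self_planning_generate(description: str, llm: Callable[[str], str]) -> str:
    """Two-step prompting: ask for a plan first, then ask for RTL using that plan.

    The prompt text below is illustrative wording, not the paper's prompts.
    """
    # Step 1: request reasoning steps and advice on avoiding syntax errors.
    plan_prompt = (
        "You will later write Verilog RTL for the design described below.\n"
        "Do not write code yet. List the reasoning steps you would follow "
        "and the syntax errors you should avoid.\n\n"
        f"Design description:\n{description}"
    )
    plan = llm(plan_prompt)

    # Step 2: request the RTL, supplying both the description and the plan.
    code_prompt = (
        "Write Verilog RTL for the design described below, following the plan.\n\n"
        f"Design description:\n{description}\n\nPlan:\n{plan}"
    )
    return llm(code_prompt)


# Hypothetical stand-in model that records prompts, so the sketch runs offline.
prompts_seen: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    if len(prompts_seen) == 1:
        return "1. Declare ports. 2. Use non-blocking assignments in clocked blocks."
    return "module xor1(input a, input b, output y); assign y = a ^ b; endmodule"


rtl = self_planning_generate("A 1-bit XOR gate.", fake_llm)
```

The point of the second call is that the model sees its own plan next to the original description, so the plan acts as extra context instead of replacing the task statement.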

Experimental results

Research questions

RQ1: How well do LLMs generate correct Verilog/VHDL/Chisel RTL from natural-language descriptions under a standardized benchmark?
RQ2: To what extent can prompt engineering, especially self-planning, improve syntax correctness and functional correctness of generated RTL?
RQ3: How do LLM-generated designs compare to human-crafted references in terms of synthesis metrics (PPA) and functional validity?
RQ4: Which LLMs and prompting strategies yield the best overall RTL design quality across diverse design types and scales?

Key findings

GPT-4 achieves the highest syntax correctness (81%) and functionality correctness (15/30) among evaluated models.
GPT-3.5 with self-planning substantially improves over plain GPT-3.5 (73% syntax, 14/30 functionality), approaching GPT-4 performance.
Self-planning significantly enhances RTL generation accuracy for several designs compared to GPT-3.5 without planning.
Academic models (Thakur et al., StarCoder) show lower performance than commercial LLMs in both syntax and functionality under RTLLM.
The benchmark enables automatic evaluation of syntax, functionality, and design quality against ground-truth designs across 30 diverse RTL tasks.
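
To make the three progressive goals and the reported metrics concrete, the sketch below shows one way per-design outcomes could be aggregated into a syntax success rate, a functional pass count, and PPA ratios against a reference. It assumes one attempt per design; the class `DesignResult`, the helper names, the design names, and every number in the sample data are hypothetical and are not taken from the paper or from the RTLLM tooling.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DesignResult:
    name: str
    syntax_ok: bool      # generated RTL was accepted by the synthesis tool
    function_ok: bool    # generated RTL passed the reference testbench
    ppa: Optional[Dict[str, float]] = None  # post-synthesis metrics, if any


def highest_goal(result: DesignResult) -> str:
    """Return the most advanced goal reached; each goal requires the previous one."""
    if not result.syntax_ok:
        return "none"
    if not result.function_ok:
        return "syntax"
    if result.ppa is None:
        return "functionality"
    return "design quality"


def ppa_ratio(generated: Dict[str, float], reference: Dict[str, float]) -> Dict[str, float]:
    """Ratio of generated to reference metric; values below 1.0 mean smaller than the reference."""
    return {key: generated[key] / reference[key] for key in reference}


def summarize(results: List[DesignResult]) -> Dict[str, float]:
    total = len(results)
    syntax_passes = sum(r.syntax_ok for r in results)
    functional_passes = sum(r.syntax_ok and r.function_ok for r in results)
    return {
        "syntax_rate": syntax_passes / total,
        "functional": functional_passes,
        "total": total,
    }


# Hypothetical sample data for three designs.
results = [
    DesignResult("design_a", True, True, {"area": 120.0, "power": 0.5, "delay": 1.0}),
    DesignResult("design_b", True, False),
    DesignResult("design_c", False, False),
]
reference_ppa = {"area": 100.0, "power": 0.5, "delay": 2.0}

summary = summarize(results)
ratios = ppa_ratio(results[0].ppa, reference_ppa)
```

Design quality is only computed for `design_a` here because comparing PPA is meaningful only once a design is functionally correct, which mirrors the progressive ordering of the three goals.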

This review was generated by AI and checked by a human editor.