QUICK REVIEW

[Paper Review] VeriGen: A Large Language Model for Verilog Code Generation

Shailja Thakur, Baleegh Ahmad|arXiv (Cornell University)|Jul 28, 2023

Ferroelectric and Negative Capacitance Devices | Cited by 9

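One-Sentence Summary

The paper fine-tunes open-source LLMs on Verilog data to generate Verilog code. The fine-tuned CodeGen-16B matches or exceeds GPT-3.5-turbo at generating functionally correct Verilog and improves substantially over its pre-trained counterpart.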

ABSTRACT

In this study, we explore the capability of Large Language Models (LLMs) to automate hardware design by generating high-quality Verilog code, a common language for designing and modeling digital systems. We fine-tune pre-existing LLMs on Verilog datasets compiled from GitHub and Verilog textbooks. We evaluate the functional correctness of the generated Verilog code using a specially designed test suite, featuring a custom problem set and testing benches. Here, our fine-tuned open-source CodeGen-16B model outperforms the commercial state-of-the-art GPT-3.5-turbo model with a 1.1% overall increase. Upon testing with a more diverse and complex problem set, we find that the fine-tuned model shows competitive performance against state-of-the-art gpt-3.5-turbo, excelling in certain scenarios. Notably, it demonstrates a 41% improvement in generating syntactically correct Verilog code across various problem categories compared to its pre-trained counterpart, highlighting the potential of smaller, in-house LLMs in hardware design automation.

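Research Motivation and Goals

Advance the automation of hardware design by using LLMs to generate Verilog code.
Build a large Verilog-oriented training corpus from GitHub and Verilog textbooks.
Fine-tune several open-source LLMs on the Verilog corpus and evaluate the quality of the code they generate.
Develop an automated evaluation pipeline that includes Verilog compilation and unit-test checks.
Release the training and evaluation scripts and the model checkpoints as open-source resources.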

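Proposed Method

Curate a Verilog training corpus from GitHub (about 50k files) and Verilog textbooks (PDFs), yielding a dataset of roughly 400 MB.
Fine-tune five pre-trained LLMs (345M to 16B parameters) on the Verilog corpus using multi-GPU training with the DeepSpeed ZeRO-3 strategy.
Evaluate the models on two problem sets, Set I (hand-designed Verilog challenges) and Set II (problems extended from HDLBits), together with custom testbenches.
Compile the generated Verilog with Icarus Verilog and verify functional correctness through unit tests and HDLBits-based simulation.
Compare the fine-tuned models against large commercial LLMs (GPT-3.5-turbo, GPT-4, PALM2) and open-source baselines, and analyze the effects of temperature and prompt wording.
Report that the fine-tuned CodeGen-16B performs best among the evaluated models and is competitive with much larger commercial LLMs.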
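
The compile-then-simulate check described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released pipeline: it assumes the Icarus Verilog tools `iverilog` and `vvp` are on the PATH, and the function names, file names, and the `pass_marker` string that a testbench would print on success are all hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def classify_completion(design_src, testbench_src,
                        pass_marker="ALL TESTS PASSED", timeout_s=30):
    """Return 'compile_error', 'test_fail', or 'pass' for one completion.

    pass_marker is a hypothetical string the testbench prints on success.
    """
    if shutil.which("iverilog") is None or shutil.which("vvp") is None:
        raise RuntimeError("Icarus Verilog (iverilog, vvp) not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        design = Path(tmp) / "design.v"
        bench = Path(tmp) / "tb.v"
        binary = Path(tmp) / "sim.vvp"
        design.write_text(design_src)
        bench.write_text(testbench_src)
        try:
            # Step 1: compile the design together with its testbench.
            compiled = subprocess.run(
                ["iverilog", "-o", str(binary), str(design), str(bench)],
                capture_output=True, text=True, timeout=timeout_s)
            if compiled.returncode != 0:
                return "compile_error"
            # Step 2: simulate and look for the success marker.
            sim = subprocess.run(
                ["vvp", str(binary)],
                capture_output=True, text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # A hung compile or simulation is counted as a failure here.
            return "test_fail"
        return "pass" if pass_marker in sim.stdout else "test_fail"


def summarize(outcomes):
    """Fraction of completions that compile and that pass their tests."""
    total = len(outcomes)
    if total == 0:
        return {"compiles": 0.0, "passes": 0.0}
    compiled = sum(o != "compile_error" for o in outcomes)
    passed = sum(o == "pass" for o in outcomes)
    return {"compiles": compiled / total, "passes": passed / total}
```

Keeping compilation and simulation as separate outcomes mirrors the distinction the review draws between syntactically correct and functionally correct Verilog.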

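Experimental Results

Research Questions

RQ1: How do baseline LLMs perform on the Verilog generation sets?
RQ2: Does fine-tuning improve Verilog generation performance?
RQ3: Are larger LLMs with more parameters better suited to Verilog generation?
RQ4: How do variations in the problem description affect the quality and correctness of completions?
RQ5: How do fine-tuned LLMs compare with GPT-3.5-turbo, GPT-4, PALM2, and Claude across difficulty levels?
RQ6: At which difficulty levels do large LLMs excel, and where do they need improvement to match the best model?
RQ7: Does including diverse training data (such as textbooks) improve model performance?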

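Key Findings

Fine-tuned LLMs outperform their pre-trained versions on all datasets.
The fine-tuned CodeGen-16B MSV outperforms the other evaluated LLMs, including some larger models.
The fine-tuned model improves by 41% over its pre-trained version in generating syntactically correct Verilog code.
In the best configuration, the fine-tuned model achieves an overall improvement of 1.1% over GPT-3.5-turbo.
The evaluation shows that larger parameter counts generally correlate with more completions passing the testbenches and synthesis checks.
Prompt detail and sampling temperature both affect performance: higher temperatures hurt, while lower temperatures and more completions per prompt lead to better results.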
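
This review does not state which metric was used to count successful completions. One metric commonly used for code generation is pass@k, and the sketch below assumes it purely for illustration: with n completions sampled per prompt, of which c pass the tests, it estimates the chance that at least one of k randomly chosen completions passes. It also shows why sampling more completions per prompt tends to raise the reported score.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: completions sampled for the prompt
    c: completions that passed the tests
    k: number of completions assumed to be drawn
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 completions of which 3 pass, `pass_at_k(10, 3, 1)` is about 0.3 while `pass_at_k(10, 3, 5)` is about 0.917.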


This review was generated by AI and checked by a human editor.