QUICK REVIEW

[Paper Review] Exploring Software Naturalness through Neural Language Models

Luca Buratti, Saurabh Pujar | arXiv (Cornell University) | Jun 22, 2020

Software Engineering Research | References: 37 | Cited by: 54

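One-Sentence Summary

This paper pre-trains a transformer language model (C-BERT) on raw C code, using different tokenizers and pre-training strategies, and then evaluates it on AST node tagging and vulnerability identification without relying on compiler-derived features.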

ABSTRACT

The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language models can discover AST features automatically. To achieve this, we introduce a sequence labeling task that directly probes the language model's understanding of the AST. Our results show that transformer-based language models achieve high accuracy in the AST tagging task. Furthermore, we evaluate our model on a software vulnerability identification task. Importantly, we show that our approach obtains vulnerability identification results comparable to graph-based approaches that rely heavily on compilers for feature extraction.

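Research Motivation and Goals

Test whether transformer-based language models can learn AST-like structure from raw code without being given AST features.
Assess how the tokenization strategy affects learning and downstream task performance on C code.
Evaluate the language-model approach to vulnerability identification against graph-based, compiler-dependent methods.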

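Proposed Method

Pre-train a BERT-like transformer (C-BERT) from scratch on 100 open-source C repositories.
Explore three tokenization strategies: character (Char), KeyChar (Char plus C keywords), and SentencePiece (SPE).
Use three training objectives: MLM, Whole Word Masking (WWM) for stronger masking of strings, and task-specific fine-tuning objectives.
Introduce an AST node tagging task that uses Clang-derived labels to probe the model's understanding of token_kind and cursor_kind.
Fine-tune the model on vulnerability identification (VI) and compare it with graph-based baselines such as GGNN.
Evaluate on the FFmpeg and QEMU datasets, using fixed-width 250-token windows and sliding-window aggregation for long inputs.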
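
The Char and KeyChar tokenizers and the 250-token sliding windows listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword set is a small hypothetical subset of C keywords, the stride of 125 is an assumed value, and taking the maximum window score is an assumed aggregation rule, since this review does not state how window predictions are combined.

```python
import re

# Assumption: a small illustrative subset of C keywords, not the paper's list.
C_KEYWORDS = {"int", "return", "if", "else", "while", "for", "char", "void"}

# Identifier-like runs, or any single character (including newlines).
PIECE_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|.", flags=re.DOTALL)


def char_tokenize(code):
    """Char tokenizer: every character is its own token."""
    return list(code)


def keychar_tokenize(code):
    """KeyChar sketch: C keywords stay whole, everything else splits into characters."""
    tokens = []
    for piece in PIECE_RE.findall(code):
        if piece in C_KEYWORDS:
            tokens.append(piece)
        else:
            tokens.extend(piece)
    return tokens


def sliding_windows(tokens, width=250, stride=125):
    """Cut a token sequence into fixed-width windows; a short input gives one window.

    The stride is an assumed value chosen for illustration.
    """
    if len(tokens) <= width:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + width])
        if start + width >= len(tokens):
            break
        start += stride
    return windows


def aggregate(window_scores):
    """Assumed aggregation rule: keep the highest per-window vulnerability score."""
    return max(window_scores)
```

With this sketch, `keychar_tokenize("int x;")` yields four tokens because `int` is kept whole, while `char_tokenize("int x;")` yields six. An identifier such as `integer` is still split into characters, because only exact keyword matches are preserved.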

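Experimental Results

Research Questions

RQ1: Can a transformer language model trained directly on raw C source code discover AST-like features without explicit structural information?
RQ2: How do the choice of tokenizer (Char, KeyChar, SPE) and the pre-training objective affect learning of the syntactic and semantic aspects of code?
RQ3: How does the language-model approach perform on vulnerability identification compared with graph-based, compiler-dependent methods?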

Key Findings

Model	Tokenizer	Acc_FFmpeg	F1_FFmpeg	Acc_QEMU	F1_QEMU
Char MLM	Char	94.96	95.71	71.68	80.53
BiLSTM SPE	SPE	94.69	95.52	74.12	81.58
KeyChar MLM	KeyChar	95.68	96.58	66.20	76.19
Char MLM (C-BERT)	Char	97.10	97.72	81.06	87.43
C-BERT SPE MLM	SPE	97.72	98.29	81.11	87.79
KeyChar MLM	KeyChar	97.73	98.31	80.78	87.49

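Transformers trained on raw C code with the different tokenizers outperform the BiLSTM baselines on AST cursor_kind tagging across datasets.
C-BERT with SPE generally reaches high accuracy (97.72–97.73) and F1 (98.29–98.31) on FFmpeg, with strong results on QEMU (81.11 accuracy, 87.79 F1).
Character-based tokenization combined with Whole Word Masking gives strong results, and WWM generally improves VI performance and reduces out-of-vocabulary (OOV) problems.
On AST cursor_kind tagging, C-BERT with Char or SPE tokenization achieves the best scores, and FFmpeg is consistently easier than QEMU.
On VI, C-BERT models pre-trained with MLM outperform the Naive, BiLSTM, CNN, and GGNN baselines on both the full and the reduced datasets, and WWM helps the Char and KeyChar variants.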
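
The difference between plain MLM and Whole Word Masking over character tokens, referenced in the findings above, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's procedure: a "word" is taken to be an identifier-like run of characters, the masking rate is a free parameter, and the usual BERT refinement of sometimes substituting random or unchanged tokens is omitted.

```python
import random
import re

MASK = "[MASK]"

# Assumption: a "word" is an identifier-like run; any other character stands alone.
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|.", flags=re.DOTALL)


def mlm_mask(tokens, rate, rng):
    """Plain MLM: each token is masked independently with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def wwm_mask(code, rate, rng):
    """Whole-word masking over character tokens.

    Each word is selected with probability `rate`; when a word is selected,
    every one of its characters is replaced by the mask token.
    """
    out = []
    for word in WORD_RE.findall(code):
        chosen = rng.random() < rate
        out.extend(MASK if chosen else ch for ch in word)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    code = "return count;"
    print(mlm_mask(list(code), 0.3, rng))  # characters masked independently
    print(wwm_mask(code, 0.3, rng))        # words masked as whole units
```

Under plain MLM on characters, a partly masked identifier can often be completed from its remaining letters. Masking the whole word removes that shortcut, which is one plausible reading of why WWM would help the Char and KeyChar variants.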

This review was generated by AI and checked by a human editor.