QUICK REVIEW

[Paper Digest] Leveraging AI to optimize website structure discovery during Penetration Testing

Diego Antonelli, Roberta Cascella|arXiv (Cornell University)|Jan 18, 2021

Web Application Security Vulnerabilities | References: 27 | Cited by: 4

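One-sentence summary

This paper proposes an AI-assisted dirbusting method that clusters wordlist entries by semantic meaning and applies an intelligent next-word strategy, reducing the number of HTTP requests needed to find valid paths and achieving up to a 50% performance improvement over traditional brute-force techniques across eight web applications.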

ABSTRACT

Dirbusting is a technique used to brute force directories and file names on web servers while monitoring HTTP responses, in order to enumerate server contents. Such a technique uses lists of common words to discover the hidden structure of the target website. Dirbusting typically relies on response codes as discovery conditions to find new pages. It is widely used in web application penetration testing, an activity that allows companies to detect website vulnerabilities. Dirbusting techniques are both time- and resource-consuming, and innovative approaches have never been explored in this field. We hence propose an advanced technique to optimize the dirbusting process by leveraging Artificial Intelligence. More specifically, we use semantic clustering techniques in order to organize wordlist items in different groups according to their semantic meaning. The created clusters are used in an ad-hoc implemented next-word intelligent strategy. This paper demonstrates that the usage of clustering techniques outperforms the commonly used brute force methods. Performance is evaluated by testing eight different web applications. Results show a performance increase that is up to 50% for each of the conducted experiments.
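
The brute-force baseline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' tool: the names dirbust, fake_fetch_status and FAKE_SITE are hypothetical, the target is simulated with a dictionary rather than real HTTP requests, and treating any status other than 404 as a discovery is an assumption about the discovery condition.

```python
from typing import Callable, Iterable


def dirbust(wordlist: Iterable[str], fetch_status: Callable[[str], int]) -> list[str]:
    """Try every candidate word as a path and keep those that do not return 404."""
    found = []
    for word in wordlist:
        path = "/" + word.strip("/")
        # Assumed discovery condition: any response code other than 404.
        if fetch_status(path) != 404:
            found.append(path)
    return found


# Hypothetical stand-in for a web server: a fixed map of path -> status code.
FAKE_SITE = {"/admin": 301, "/login": 200, "/backup": 403}


def fake_fetch_status(path: str) -> int:
    """Return the simulated status code for a path (404 when unknown)."""
    return FAKE_SITE.get(path, 404)


discovered = dirbust(["images", "admin", "login", "test", "backup"], fake_fetch_status)
print(discovered)
```

Every word costs one request regardless of how likely it is to exist, which is the inefficiency the paper targets. In an authorized test, fake_fetch_status would be replaced by a function that issues a real request and returns the response code.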

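Research motivation and objectives

Address the inefficiency of traditional dirbusting in black-box penetration testing, which stems from using wordlists in random or heuristic order.
Reduce the time and resources spent brute-forcing web directory structures by exploiting the semantic meaning of wordlist entries.
Show that AI-based semantic clustering outperforms standard random or sequential wordlist approaches at discovering hidden web application paths.
Provide a scalable, reusable framework that uses natural language processing and embedding-based clustering to optimize website structure discovery.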

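Proposed method

Build a unified wordlist by extracting absolute paths from the Docker containers of eight different web applications.
Vectorize words with a pretrained sentence embedding model (Universal Sentence Encoder) and compute the semantic similarity between path components.
Group semantically related paths (such as 'admin', 'login', 'dashboard') into clusters with a hierarchical clustering algorithm to guide the search order.
Apply an intelligent next-word strategy that prioritizes clusters by semantic relevance, cutting redundant or low-probability requests.
Compare the method against standard random brute-force dirbusting, using the response code (anything other than 404) as the discovery criterion.
Run experiments on eight real-world web applications, repeating each test 30 times to support statistical significance.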
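
The clustering and next-word steps above can be sketched as follows, under several loudly stated assumptions. The 2-D vectors in EMBEDDINGS are invented stand-ins for Universal Sentence Encoder outputs, single-linkage merging with a cosine-distance threshold is one possible form of hierarchical clustering, and the next-word rule (stay in a cluster after a hit, send it to the back of the queue after a miss) is a guess at what such a strategy could look like. None of these details are confirmed by this digest, and all function names are hypothetical.

```python
import math
from collections import deque

# Toy 2-D vectors standing in for sentence embeddings (hypothetical values).
EMBEDDINGS = {
    "admin": (0.9, 0.1), "login": (0.8, 0.2), "dashboard": (0.85, 0.15),
    "css": (0.1, 0.9), "js": (0.15, 0.85), "images": (0.2, 0.8),
}


def cosine_distance(a, b):
    """1 - cosine similarity of two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def single_linkage_clusters(words, threshold):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    until the smallest inter-cluster distance exceeds the threshold."""
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_distance(EMBEDDINGS[a], EMBEDDINGS[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


def cluster_guided_scan(clusters, is_valid):
    """Assumed next-word rule: keep probing a cluster while it yields hits;
    after a miss, move the rest of that cluster to the back of the queue."""
    queue = deque(deque(c) for c in clusters)
    requests, found = 0, []
    while queue:
        cluster = queue.popleft()
        while cluster:
            word = cluster.popleft()
            requests += 1
            if is_valid(word):
                found.append((requests, word))
            else:
                break
        if cluster:
            queue.append(cluster)
    return found


words = ["css", "admin", "js", "login", "images", "dashboard"]
# Sort for a deterministic order in this demo.
clusters = sorted(sorted(c) for c in single_linkage_clusters(words, threshold=0.05))
valid = {"css", "js", "images"}  # simulated site that only serves static assets
hits = cluster_guided_scan(clusters, lambda w: w in valid)
print(clusters)
print(hits)
```

In this toy run the first miss on 'admin' pushes the whole admin-like cluster back, so the three valid words are all found by the fourth request. A plain alphabetical pass over the same six words would need five. The example is far too small to say anything about the gains reported in the paper; it only shows the mechanism.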

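Experimental results

Research questions

RQ1: Can semantically clustered wordlists improve the efficiency of dirbusting in web application penetration testing?
RQ2: Compared with traditional random or sequential brute-force methods, how does a dirbusting strategy based on semantic clustering perform in terms of the number of requests needed to discover all valid paths?
RQ3: By how much can the clustering method reduce the number of HTTP requests needed to discover hidden or private web application paths?
RQ4: Does the performance gain vary across web application frameworks, and if so, why?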

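Key findings

Across all eight tested web applications, the semantic clustering method improved valid-path discovery performance by up to 50% over brute force.
The cluster-based method needed only about half as many requests as random brute force to detect nearly all valid URLs.
Gains were largest for applications with low or fragmented wordlist coverage, such as Bodgeit, Bricks and DVWS.
Joomla saw a smaller improvement (about 2,000 fewer requests) because its wordlist coverage is high (4,672 of 8,367 words), which lowers the relative benefit of semantic prioritization.
The curve of valid requests discovered over time rises more steeply for the semantic clustering method, indicating faster convergence to complete path discovery.
The results indicate that a semantic understanding of the wordlist enables smarter, more efficient exploration of web application structure during penetration testing.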
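
The Joomla figures can be made concrete with a short calculation. It assumes the two numbers mean that 4,672 of the 8,367 wordlist entries are valid Joomla paths, and that the random baseline tries each word once in a uniformly random order. Neither reading is confirmed by this digest. The variable names are illustrative.

```python
from fractions import Fraction

wordlist_size = 8367  # total words in the wordlist (from the Joomla finding)
valid_paths = 4672    # words assumed to correspond to valid Joomla paths

# Share of the wordlist that is valid for this application.
coverage = valid_paths / wordlist_size

# Expected position of the LAST valid word in a uniformly random ordering
# of N words containing k valid ones: k * (N + 1) / (k + 1).
expected_random = Fraction(valid_paths * (wordlist_size + 1), valid_paths + 1)

# A perfect ordering would try every valid word first.
best_case = valid_paths
max_saving = wordlist_size - best_case

print(f"coverage: {coverage:.1%}")
print(f"expected requests, random order: {float(expected_random):.1f}")
print(f"largest possible saving: {max_saving} requests")
```

Under these assumptions about 56% of the wordlist is valid, so even a perfect ordering could save at most 3,695 requests, which is less than half of the list. That fits the observation that the relative benefit of prioritization shrinks when coverage is high.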

This digest was generated by AI and reviewed by a human editor.