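[Paper Summary] Pushan: Trace-Free Deobfuscation of Virtualization-Obfuscated Binaries
Pushan introduces a trace-free deobfuscation framework that recovers complete CFGs without path-feasibility solving and decompiles virtualization-obfuscated binaries into C pseudocode. It outperforms prior trace-based approaches and scales to commercial-strength obfuscation.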
In the ever-evolving battle against malware, binary obfuscation techniques are a formidable barrier to effective analysis by both human security analysts and automated systems. In particular, virtualization or VM-based obfuscation is one of the strongest protection mechanisms that evade automated analysis. Despite widespread use of virtualization, existing automated deobfuscation techniques suffer from three major drawbacks. First, they only work on execution traces, which prevents them from recovering all logic in an obfuscated binary. Second, they depend on dynamic symbolic execution, which is expensive and does not scale in practice. Third, they cannot generate "well-formed" code, which prevents existing binary decompilers from generating human-friendly output. This paper introduces PUSHAN, a novel and generic technique for deobfuscating virtualization-obfuscated binaries while overcoming the limitations of existing techniques. PUSHAN is trace-free and avoids path-constraint accumulation by using VPC-sensitive, constraint-free symbolic emulation to recover a complete CFG of the virtualized function. It is the first approach that also decompiles the protected code into high-quality C pseudocode to enable effective analysis. Crucially, PUSHAN circumvents reliance on path satisfiability, a known NP-hard problem that hampers scalability. We evaluate PUSHAN on more than 1,000 binaries, including targets protected by academic state of the art (Tigress) and commercial-strength obfuscators VMProtect and Themida. PUSHAN successfully deobfuscates these binaries, retrieves their complete CFGs, and decompiles them to C pseudocode. We further demonstrate applicability by analyzing a previously unanalyzed VMProtect-obfuscated malware sample from VirusTotal, where our decompiled output enables LLM-assisted code simplification, reuse, and program understanding.
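Motivation and Goals
- Identify and address the limitations of existing VM deobfuscation techniques: incomplete CFGs, poor scalability, and output that is hard to read.
- Propose a trace-free CFG recovery method that uses VPC-sensitive (virtual program counter), constraint-free symbolic emulation to recover the complete CFG.
- Produce semantically meaningful decompiled C pseudocode that supports both manual and automated analysis.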
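Proposed Method
- Build a three-stage pipeline: (1) VPC-sensitive CFG recovery based on constraint-free symbolic emulation; (2) semantics-preserving CFG simplification; (3) decompilation into C-like pseudocode.
- Use VPC sensitivity to separate the VM interpreter logic from the original program logic.
- Use constraint-free symbolic emulation to enumerate indirect jump targets without path-feasibility checks.
- Recover edges missed by the initial emulation through symbolization and iterative CFG refinement.
- Apply semantic simplification to prune VM artifacts and expose higher-level logic.
- Improve decompiler integration so that the output is readable C pseudocode with correctly annotated function boundaries and stack tracking.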
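To make the idea of VPC-keyed, feasibility-free exploration concrete, here is a minimal sketch in Python. It is not Pushan's implementation: the toy bytecode format, the opcode names (`push`, `jz`, `jmp`, `ret`), and the helper functions `successors` and `recover_cfg` are all hypothetical and chosen only for illustration. The sketch keys CFG nodes by the virtual program counter and follows every syntactically possible successor of a branch without asking a solver whether the path is feasible.

```python
from collections import deque

# Hypothetical toy bytecode, indexed by virtual program counter (VPC).
# This format is an assumption for illustration, not a real obfuscator's.
PROGRAM = {
    0: ("push", 5),
    1: ("jz", 4),    # conditional: fall through to 2, or jump to 4
    2: ("push", 1),
    3: ("jmp", 5),
    4: ("push", 2),
    5: ("ret",),
}


def successors(vpc, program):
    """Return every possible next VPC, with no feasibility check."""
    op = program[vpc]
    if op[0] == "ret":
        return []
    if op[0] == "jmp":
        return [op[1]]
    if op[0] == "jz":
        # Both targets are kept even if one is unreachable at run time.
        return [vpc + 1, op[1]]
    return [vpc + 1]


def recover_cfg(program, entry=0):
    """Worklist exploration that keys CFG nodes by VPC."""
    nodes = {entry}
    edges = set()
    work = deque([entry])
    while work:
        vpc = work.popleft()
        for nxt in successors(vpc, program):
            edges.add((vpc, nxt))
            if nxt not in nodes:
                nodes.add(nxt)
                work.append(nxt)
    return nodes, edges


nodes, edges = recover_cfg(PROGRAM)
```

Because each VPC is visited once and no path constraints accumulate, the cost of this toy exploration grows with the number of bytecode locations rather than the number of paths. The trade-off is over-approximation: the edge from VPC 1 to VPC 4 is recorded even though a concrete run of this toy program would never take it.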
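Experimental Results
Research Questions
- RQ1: Can trace-free CFG recovery recover complete control flow from commercial-strength obfuscation?
- RQ2: Does constraint-free symbolic emulation scale to large real-world binaries and avoid the path-satisfiability bottleneck?
- RQ3: Can the recovered CFGs be decompiled into high-quality C pseudocode that is suitable for malware analysis?
- RQ4: How does Pushan perform on diverse datasets compared with existing state-of-the-art trace-based deobfuscation techniques?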
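Key Findings
- Pushan recovers a VPC-sensitive flat CFG that contains both the VM interpreter logic and the original program logic.
- On 17 of 28 binaries obfuscated with VMProtect, Themida, or Tigress, the recovered CFG reaches 100% similarity; the remaining binaries also show high similarity.
- On Tigress-generated VMs, Pushan fully analyzes and deobfuscates 988 of 1,000 binaries, far more than the prior state of the art, which recovered only 68 complete CFGs.
- Pushan succeeds on five CTF challenges, and its decompiled output reveals the embedded flags.
- Pushan enables end-to-end analysis of virtualization-protected code, as demonstrated on a real VMProtect-obfuscated sample from VirusTotal.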
This summary was generated by AI and reviewed by a human editor.