QUICK REVIEW

[Paper Review] Content-based data leakage detection using extended fingerprinting

Yuri Shapira, Bracha Shapira|arXiv (Cornell University)|Feb 8, 2013

Advanced Malware Detection Techniques|References 38|Cited by 25

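One-sentence summary

This paper proposes an extended fingerprinting technique based on sorted k-skip-n-grams to strengthen content-based data leakage detection. By isolating the core confidential content, the method reduces false alarms triggered by non-confidential text and is more robust to rephrasing and to previously unseen documents, which improves detection of intentional data exfiltration.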

ABSTRACT

Protecting sensitive information from unauthorized disclosure is a major concern of every organization. As an organization's employees need to access such information in order to carry out their daily work, data leakage detection is both an essential and challenging task. Whether caused by malicious intent or an inadvertent mistake, data loss can result in significant damage to the organization. Fingerprinting is a content-based method used for detecting data leakage. In fingerprinting, signatures of known confidential content are extracted and matched with outgoing content in order to detect leakage of sensitive content. Existing fingerprinting methods, however, suffer from two major limitations. First, fingerprinting can be bypassed by rephrasing (or minor modification) of the confidential content, and second, usually the whole content of the document is fingerprinted (including non-confidential parts), resulting in false alarms. In this paper we propose an extension to the fingerprinting approach that is based on sorted k-skip-n-grams. The proposed method is able to produce a fingerprint of the core confidential content which ignores non-relevant (non-confidential) sections. In addition, the proposed fingerprint method is more robust to rephrasing and can also be used to detect a previously unseen confidential document and therefore provide better detection of intentional leakage incidents.

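Research motivation and objectives

Address the limitations of conventional fingerprinting in data leakage detection, in particular false alarms caused by non-confidential content.
Overcome the fragility of existing methods when confidential content is rephrased or slightly modified.
Detect confidential documents that are not in the original database, so that intentional leaks can be identified.
Develop a more robust and precise content-based fingerprinting technique that improves the accuracy of data leakage detection.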

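Proposed method

The method extracts fingerprints from confidential content using sorted k-skip-n-grams, focusing only on the relevant, sensitive fragments.
A preprocessing step filters out irrelevant or non-confidential parts of a document to isolate the core confidential content.
The k-skip-n-grams are sorted during fingerprint extraction, which makes the fingerprint more robust to word-order changes and rephrasing.
Matching still succeeds when the confidential content has been rephrased or slightly modified.
Comparing fingerprints against a database of known sensitive content patterns supports detection of previously unseen confidential documents.
Excluding non-confidential sections during fingerprint generation lowers the false alarm rate.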
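
The idea of a sorted k-skip-n-gram fingerprint can be illustrated with a minimal Python sketch. This review does not give the paper's tokenization, parameter values, or scoring rule, so everything here is an assumption made for illustration: the function names (`tokenize`, `sorted_skip_grams`, `fingerprint`, `match_score`), the defaults `n=3` and `k=2`, the use of plain set difference to drop grams that also occur in non-confidential text, and the overlap ratio used as a score. It should not be read as the authors' implementation.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sorted_skip_grams(tokens, n=3, k=2):
    """Return the set of n-grams that skip at most k tokens in total.

    The tokens inside each gram are sorted, so two grams built from the
    same words in a different order collapse to the same tuple.
    """
    grams = set()
    for start in range(len(tokens) - n + 1):
        first = tokens[start]
        # The remaining n-1 tokens come from the next n+k-1 positions,
        # which bounds the total number of skipped tokens by k.
        rest = tokens[start + 1 : start + n + k]
        for combo in combinations(rest, n - 1):
            grams.add(tuple(sorted((first,) + combo)))
    return grams


def fingerprint(confidential_text, non_confidential_texts=(), n=3, k=2):
    """Fingerprint a confidential text, dropping grams seen in non-confidential text.

    Set difference is an assumed stand-in for the filtering step described above.
    """
    grams = sorted_skip_grams(tokenize(confidential_text), n, k)
    for text in non_confidential_texts:
        grams -= sorted_skip_grams(tokenize(text), n, k)
    return grams


def match_score(outgoing_text, fp, n=3, k=2):
    """Fraction of fingerprint grams that also appear in the outgoing text."""
    if not fp:
        return 0.0
    outgoing = sorted_skip_grams(tokenize(outgoing_text), n, k)
    return len(fp & outgoing) / len(fp)
```

In this sketch, a copy of a short confidential sentence with its words reordered produces the same sorted grams and therefore a full match, while grams shared with boilerplate text never enter the fingerprint and cannot trigger an alarm. A real deployment would also need a decision threshold on the score, which is not specified here.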

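Experimental results

Research questions

RQ1: Can a fingerprinting method be designed that reduces false alarms by excluding non-confidential content from the analysis?
RQ2: How can fingerprinting be made more robust to rephrasing or minor modification of confidential content?
RQ3: Can the method detect leakage of confidential documents that are not in the original database?
RQ4: How much does using sorted k-skip-n-grams improve detection accuracy compared with conventional fingerprinting?
RQ5: Can the proposed method raise recall for intentional leakage incidents while still maintaining high precision?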
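
RQ5 is framed in terms of precision and recall. As a reminder of how these two measures relate to false alarms and missed leaks, here is a small Python sketch. The function name `precision_recall` and the document identifiers in the usage note are hypothetical and are not results from the paper.

```python
def precision_recall(flagged, leaked):
    """Compute precision and recall for a set of flagged documents.

    flagged: identifiers the detector raised an alarm for.
    leaked:  identifiers that really contained confidential content.
    """
    flagged, leaked = set(flagged), set(leaked)
    true_positives = len(flagged & leaked)
    # Precision falls as false alarms grow; recall falls as leaks are missed.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(leaked) if leaked else 0.0
    return precision, recall
```

For example, with made-up identifiers, flagging `{"d1", "d2", "d3"}` when the leaked set is `{"d1", "d2", "d4", "d5"}` gives two true positives, one false alarm, and two missed leaks.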

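Key findings

By excluding non-confidential content from fingerprint generation, the proposed method substantially reduces false alarms and improves detection precision.
Using sorted k-skip-n-grams substantially increases robustness to rephrasing and minor text modifications.
By recognizing structural and semantic similarity, the method can detect previously unseen confidential documents.
The method maintains high detection accuracy even when the confidential content has been rephrased or modified.
Compared with conventional fingerprinting, the system performs better at identifying intentional data leakage incidents.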

This review was generated by AI and checked by a human editor.