Skip to main content
QUICK REVIEW

[论文解读] Deceiving Google's Perspective API Built for Detecting Toxic Comments

Hossein Hosseini, Sreeram Kannan|arXiv (Cornell University)|Feb 27, 2017
Adversarial Robustness in Machine Learning参考文献 10被引用 132
一句话总结

论文展示了对 Google’s Perspective 毒性检测器的对抗扰动,显著降低高毒性短语的毒性分数,并展示检测器对误报及其他弱点的易感性。

ABSTRACT

Social media platforms provide an environment where people can freely engage in discussions. Unfortunately, they also enable several problems, such as online harassment. Recently, Google and Jigsaw started a project called Perspective, which uses machine learning to automatically detect toxic language. A demonstration website has been also launched, which allows anyone to type a phrase in the interface and instantaneously see the toxicity score [1]. In this paper, we propose an attack on the Perspective toxic detection system based on the adversarial examples. We show that an adversary can subtly modify a highly toxic phrase in a way that the system assigns significantly lower toxicity score to it. We apply the attack on the sample phrases provided in the Perspective website and show that we can consistently reduce the toxicity scores to the level of the non-toxic phrases. The existence of such adversarial examples is very harmful for toxic detection systems and seriously undermines their usability.

研究动机与目标

  • Motivate the need for robust toxic content detection in online platforms.
  • Demonstrate that Perspective can be deceived by subtle text perturbations while preserving toxicity.
  • Characterize the detector’s false alarm rate and robustness to random misspellings.
  • Discuss potential defense strategies for improving robustness of toxic language detection systems.

提出的方法

  • Formulate adversarial examples in text by perturbing toxic words (e.g., inserting dots, spaces, or misspellings).
  • Query Perspective with original and perturbed phrases to compare toxicity scores in a black-box setting.
  • Show transferability of perturbations across different phrases.
  • Present qualitative and quantitative demonstrations using sample phrases from Perspective’s demo site.

实验结果

研究问题

  • RQ1Can small textual perturbations reduce Perspective’s toxicity scores for inherently toxic phrases in a black-box setting?
  • RQ2Do perturbations cause high false positives on benign phrases?
  • RQ3What perturbation patterns (dot insertions, spacing, misspellings) are most effective and do perturbations transfer across phrases?
  • RQ4What defenses might mitigate adversarial manipulation of toxicity scores?

主要发现

  • Adversarial perturbations consistently reduce toxicity scores for highly toxic phrases to levels of non-toxic phrases.
  • Perturbations such as inserting dots between letters, adding spaces, or misspelling words are effective across multiple examples.
  • The same perturbations often transfer to other phrases, enabling an attacker to build a reusable perturbation dictionary.
  • The Perspective system exhibits a false-alarm tendency, assigning high toxicity to apparently benign phrases after perturbations.
  • The system shows robustness to random misspellings but remains vulnerable to targeted perturbations and potential poisoning through user feedback.

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。