QUICK REVIEW

[Paper Review] Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

Jingyuan Xie, Wenjie Wang|arXiv (Cornell University)|Feb 28, 2026

Adversarial Robustness in Machine Learning | Citations: 0

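One-Sentence Summary

Summary: This paper proposes an attack that poisons rationales during supervised fine-tuning (SFT) of medical LLMs, shows that few-shot poisoned rationales can stealthily degrade performance on a targeted medical topic, and finds that correct samples of the same topic mitigate the effect.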

ABSTRACT

Supervised fine-tuning (SFT) is essential for the development of medical large language models (LLMs), yet prior poisoning studies have mainly focused on detectable backdoor attacks. We propose a novel poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model performance on targeted medical topics. Results showed that knowledge overwriting was ineffective, while rationale poisoning caused a significant decline in the accuracy of the target subject, as long as no correct samples of the same subject appeared in the dataset. A minimum number and ratio of poisoned samples were needed to carry out an effective and stealthy attack, which was more efficient and accurate than catastrophic forgetting. We demonstrate through this study the risk of SFT-stage poisoning, hoping to spur more studies of defense in the sensitive medical domain.

Motivation and Objectives

Motivate and formalize a poisoning threat during SFT in medical LLMs.
Show that poisoning rationales, not simple knowledge overwriting, degrades target reasoning.
Identify minimum poisoned sample numbers and ratios for effective, stealthy attacks.
Compare poisoning with catastrophic forgetting to assess efficiency and stealth.

Proposed Method

Use MedQA (Simplified Chinese) as the evaluation dataset for fever-related questions.
Inject poisoned fever-related QAs containing faulty rationales into the few-shot training set.
Generate correct fever-related and non-fever QAs, with rationales, to control for forgetting.
Fine-tune Qwen3-4B-Base with LoRA on a GPU and evaluate the impact of poisoning.
Evaluate fever-related vs. non-fever accuracy to measure the targeted effect and stealth.
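The final evaluation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the record keys (`topic`, `answer`, `prediction`) and the helpers `accuracy_by_topic` and `attack_profile` are hypothetical names, and each evaluation item is assumed to be already tagged as fever-related or not.

```python
from collections import defaultdict


def accuracy_by_topic(records):
    """Return {topic: accuracy} for a list of evaluated multiple-choice items.

    Each record is a dict with hypothetical keys:
      'topic'      -- e.g. 'fever' or 'non-fever'
      'answer'     -- gold option label
      'prediction' -- option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["topic"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["topic"]] += 1
    return {t: correct[t] / total[t] for t in total}


def attack_profile(baseline, poisoned, target="fever"):
    """Compare per-topic accuracy before and after fine-tuning.

    Returns (target_drop, collateral): the accuracy drop on the target
    topic and the largest drop on any other topic.
    """
    target_drop = baseline[target] - poisoned[target]
    other_drops = [baseline[t] - poisoned[t] for t in baseline if t != target]
    collateral = max(other_drops) if other_drops else 0.0
    return target_drop, collateral
```

Under this reading, a targeted and stealthy outcome would show a large `target_drop` alongside a `collateral` value near zero, whereas broad catastrophic forgetting would raise both.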

Experimental Results

Research Questions

RQ1: Can rationale poisoning during SFT degrade reasoning on a target medical topic more effectively than simple knowledge overwriting?
RQ2: What is the minimum number and ratio of poisoned samples required to meaningfully degrade fever-related accuracy?
RQ3: How do correct samples from the target subject influence the success of rationale poisoning?
RQ4: Is rationale poisoning more efficient and stealthy than catastrophic forgetting via knowledge injection?
RQ5: How does reasoning depth (shallow vs. deep) affect forgetting and poisoning efficacy?

Key Findings

Rationale poisoning significantly degrades fever-related accuracy (8.2% drop) with 125 poisoned samples and 1,300 correct samples (8.8% poison ratio).
Correct samples of the target subject can offset poisoning effects, reducing the attack’s impact when present.
Knowledge overwriting poisoning failed to degrade fever-related accuracy, highlighting the need to poison reasoning rather than simple mappings.
Deep reasoning in poisoned rationales caused more catastrophic forgetting than shallow reasoning, which motivated the choice of shallow reasoning for attacks.
An effective attack requires a minimum poison count and ratio; beyond a certain point, adding more poisoned samples yields diminishing returns or reduces stealth.
Compared to injecting correct knowledge, rationale poisoning can achieve targeted forgetting with far fewer poisoned samples, indicating higher efficiency but potential detectability if overdone.
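The 8.8% poison ratio quoted in the first finding can be reproduced from the sample counts. The sketch below assumes the ratio is defined as poisoned samples divided by all training samples; the review does not state the definition, and `poison_ratio` is a hypothetical helper name.

```python
def poison_ratio(n_poisoned, n_correct):
    """Fraction of the fine-tuning set that is poisoned (assumed definition)."""
    return n_poisoned / (n_poisoned + n_correct)


# Counts reported in the first finding: 125 poisoned, 1,300 correct.
ratio = poison_ratio(125, 1300)
print(f"{ratio:.1%}")  # prints 8.8%
```

Dividing by the correct samples only (125 / 1,300) would give about 9.6% instead, so the total-sample definition is the one consistent with the reported figure.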

This review was generated by AI and checked by a human editor.