QUICK REVIEW

[Paper Review] Cross-Modal Robustness Transfer (CMRT): Training Robust Speech Translation Models Using Adversarial Text

Abderrahmane Issam, Yusuf Can Semerci | arXiv (Cornell University) | 2026-02-12

Adversarial Robustness in Machine Learning | Citations: 0

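One-Line Summary

CMRT enables adversarial robustness to transfer from text to speech in end-to-end speech translation (E2E-ST), improving robustness on speech without any adversarial speech data, with an average gain of more than 3 BLEU points across four language pairs.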

ABSTRACT

End-to-End Speech Translation (E2E-ST) has seen significant advancements, yet current models are primarily benchmarked on curated, "clean" datasets. This overlooks critical real-world challenges, such as morphological robustness to inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality. Our method eliminates the requirement for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.

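Motivation and Objectives

Motivates robustness in E2E-ST: handling inflectional morphological variation and non-native speech.
Proposes the Cross-Modal Robustness Transfer framework for transferring adversarial robustness from text to speech.
Aligns speech and text representations so that robustness can transfer across modalities.
Provides a text-only adversarial fine-tuning stage that requires no adversarial speech data.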

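Proposed Method

Two-stage CMRT training: the first stage, CMRT-TR, uses WACO (word-aligned contrastive learning) and Mixup to build strong speech-text semantic alignment and to produce mixed representations.
CMRT-TR optimizes the ST and MT objectives together with a contrastive loss that aligns the modalities, and adds a symmetric KL divergence term for Mixup compatibility.
The second stage, CMRT-FN, fine-tunes the model while injecting adversarial text embeddings into the speech manifold (adversarial Mixup), and uses an asymmetric KL term to align the outputs for clean and adversarial inputs.
Speech-MORPHEUS extends MORPHEUS to speech: it generates inflectional perturbations, renders them as speech input via TTS, and is used for robustness evaluation.
The architecture consists of a speech encoder (HuBERT/mHuBERT) and a translation encoder-decoder, trained with MT and ST losses and augmented with cross-modal objectives.
The final objective combines the ST loss, MT loss, CTR (contrastive) loss, adversarial Mixup loss, and KL regularization term, with weights λ_ctr and λ_kl.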
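The difference between the symmetric and asymmetric KL terms, and the way the loss terms are weighted, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy distributions, the unit weights on the ST, MT, and Mixup terms, and the direction of the asymmetric KL are not taken from the paper, which this review does not quote in that detail.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Average of both KL directions; unchanged when p and q are swapped."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def combined_loss(l_st, l_mt, l_ctr, l_mix, l_kl, lambda_ctr=1.0, lambda_kl=1.0):
    """Weighted sum of the loss terms.

    Assumption: ST, MT, and Mixup terms carry unit weight; only the
    contrastive and KL terms are scaled, by lambda_ctr and lambda_kl.
    """
    return l_st + l_mt + l_mix + lambda_ctr * l_ctr + lambda_kl * l_kl

clean = [0.5, 0.5]      # toy output distribution for a clean input (made up)
attacked = [0.9, 0.1]   # toy output distribution for a perturbed input (made up)

# The two one-directional values differ, which is what "asymmetric" refers to.
print(kl_divergence(clean, attacked), kl_divergence(attacked, clean))
# The symmetric form gives the same value regardless of argument order.
print(symmetric_kl(clean, attacked), symmetric_kl(attacked, clean))
```

In a real training loop these would be computed over decoder output distributions per token; the scalar version above only shows the arithmetic.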

Figure 1: CMRT aligns speech and text semantic spaces (right). To simulate adversarial speech inflections, clean speech embeddings (e.g., "happiness") are replaced with adversarial text embeddings (e.g., "happy") during robustness fine-tuning. Footnote 2: The example "you bring us happy" is taken from Eav
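The replacement step described in the Figure 1 caption can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes that speech and text have already been aligned at the word level, that each word is represented by an opaque embedding segment, and that each word is swapped independently with a fixed probability. The function name `adversarial_mixup` and the parameter `replace_prob` are hypothetical.

```python
import random

def adversarial_mixup(speech_segments, adv_text_segments, replace_prob=0.5, rng=None):
    """Swap word-level speech segments for adversarial text segments.

    Both inputs are lists with one entry per word (assumed pre-aligned).
    Returns the mixed sequence and a mask marking which words were swapped.
    """
    if len(speech_segments) != len(adv_text_segments):
        raise ValueError("inputs must be aligned word by word")
    rng = rng or random.Random()
    mixed, mask = [], []
    for speech_seg, text_seg in zip(speech_segments, adv_text_segments):
        use_text = rng.random() < replace_prob
        mixed.append(text_seg if use_text else speech_seg)
        mask.append(use_text)
    return mixed, mask

# Toy placeholders standing in for embedding segments (not real embeddings).
speech = ["s(you)", "s(bring)", "s(us)", "s(happiness)"]
adv_text = ["t(you)", "t(bring)", "t(us)", "t(happy)"]

mixed, mask = adversarial_mixup(speech, adv_text, replace_prob=0.5, rng=random.Random(0))
print(mixed, mask)
```

Setting `replace_prob` to 0.0 returns the clean speech sequence unchanged, and 1.0 returns the fully adversarial text sequence, which makes the two extremes easy to check.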

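Experimental Results

Research Questions

RQ1: Can a robust speech translation model be trained to withstand inflectional perturbations without generating adversarial speech data?
RQ2: Can aligning speech and text embeddings effectively transfer text-based adversarial robustness to the speech modality?
RQ3: How does combining WACO and Mixup affect cross-modal alignment and robustness?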

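Key Results

CMRT-FN improves robustness on MORPHEUS-attacked speech by more than 3 BLEU points on average across En-De, En-Ca, En-Ar, and Fr-En.
CMRT-FN outperforms baselines that use no adversarial speech data and is competitive with methods trained on synthetic adversarial speech.
CMRT-FN maintains or improves performance on the clean CoVoST 2 test data while providing greater robustness than some adversarial fine-tuning methods.
Speech-text latent-space alignment (measured by cosine similarity and its correlation with BLEU) is positively correlated with adversarial robustness.
Combining Mixup and WACO yields better semantic alignment than either technique alone, which strengthens robustness transfer.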
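As a rough illustration of the alignment measurement mentioned above, the sketch below averages cosine similarity over paired speech and text vectors. It assumes one pooled vector per utterance (or per word) in each modality; the pooling strategy, the function names, and the toy vectors are assumptions for illustration and are not results or details from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_alignment(speech_vecs, text_vecs):
    """Average cosine similarity over paired speech/text vectors."""
    if len(speech_vecs) != len(text_vecs) or not speech_vecs:
        raise ValueError("need the same non-zero number of speech and text vectors")
    sims = [cosine_similarity(s, t) for s, t in zip(speech_vecs, text_vecs)]
    return sum(sims) / len(sims)

# Made-up 2-d vectors: the first pair is perfectly aligned, the second orthogonal.
speech_vecs = [[1.0, 0.0], [1.0, 0.0]]
text_vecs = [[2.0, 0.0], [0.0, 3.0]]
print(mean_alignment(speech_vecs, text_vecs))
```

A higher average indicates that paired speech and text representations point in more similar directions, which is the property the review reports as correlating with robustness.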

Figure 2: An overview illustration of our proposed method. Since our method is composed of two steps, CMRT-TR (§3.2.1 and §3.2.2) followed by CMRT-FN (§3.2.3), we refer to them as TR and FN respectively in the figure.


This review was generated by AI and reviewed by a human editor.