QUICK REVIEW

[Paper Review] Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

Jordan J. Bird, Ahmad Lotfi|arXiv (Cornell University)|2023. 08. 24.

Speech Recognition and Synthesis | Citations: 9

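One-line summary

This paper builds DEEP-VOICE, a dataset of real and AI-generated speech, statistically analyses audio features, and shows that XGBoost detects speech converted with Retrieval-based Voice Conversion (RVC) with 99.3% accuracy, taking about 0.004 ms to classify one second of audio.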

ABSTRACT

There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address the above emerging issues, the DEEP-VOICE dataset is generated in this study, comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion. Presenting as a binary classification problem of whether the speech is real or AI-generated, statistical analysis of temporal audio features through t-testing reveals that there are significantly different distributions. Hyperparameter optimisation is implemented for machine learning models to identify the source of speech. Following the training of 208 individual machine learning models over 10-fold cross validation, it is found that the Extreme Gradient Boosting model can achieve an average classification accuracy of 99.3% and can classify speech in real-time, at around 0.004 milliseconds given one second of speech. All data generated for this study is released publicly for future research on AI speech detection.

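Motivation and objectives

Establish the need for AI-generated speech detection to prevent privacy breaches and misrepresentation in real-time communication.
Build an original dataset (DEEP-VOICE) containing real and AI-generated speech from eight public figures, using Retrieval-based Voice Conversion (RVC).
Analyse which audio features differ with statistical significance between real and AI-generated speech.
Evaluate several machine learning models, with hyperparameter optimisation, to enable real-time detection.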

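Proposed method

Create the DEEP-VOICE dataset, containing 62 minutes 22 seconds of real speech from eight individuals plus AI-generated speech produced with Retrieval-based Voice Conversion.
Extract 26 audio features from each 1-second segment, including chromagram, MFCC, spectral features, zero-crossing rate (ZCR), and root mean square (RMS) energy.
Balance the classes by undersampling the fake (AI-generated) samples to a 1:1 ratio with the real samples.
Train and evaluate a family of ML models, including XGBoost (best at 330 boosting rounds), Random Forest (310 trees), and KNN (best at 1 neighbour), with hyperparameter optimisation, 10-fold cross-validation, and a random seed of 42.
Report estimated inference time per one second of audio, ranging from 0.004 ms (XGBoost) to 0.057 ms (Random Forest) for the two strongest models.
Evaluate performance with accuracy, precision, recall, F1, Matthews correlation coefficient (MCC), and ROC AUC.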
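
The windowing and balancing steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it computes only two of the 26 features (RMS and ZCR) in plain Python, on a synthetic sine wave rather than real speech. The function names `window_features` and `undersample`, the 8 kHz sample rate, and the test signal are all assumptions made for the example; the review does not say which tooling the paper used.

```python
import math
import random


def window_features(samples, sample_rate):
    """Return one (rms, zcr) pair per full 1-second window of `samples`."""
    features = []
    for start in range(0, len(samples) - sample_rate + 1, sample_rate):
        window = samples[start:start + sample_rate]
        # Root mean square energy of the window.
        rms = math.sqrt(sum(x * x for x in window) / len(window))
        # Zero-crossing rate: fraction of adjacent sample pairs whose sign differs.
        crossings = sum(
            1 for a, b in zip(window, window[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (len(window) - 1)
        features.append((rms, zcr))
    return features


def undersample(majority, minority, seed=42):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


# Synthetic stand-in for audio (assumption): 3.5 s of a 100 Hz sine at 8 kHz.
SAMPLE_RATE = 8000
signal = [
    math.sin(2 * math.pi * 100 * n / SAMPLE_RATE)
    for n in range(int(3.5 * SAMPLE_RATE))
]
feats = window_features(signal, SAMPLE_RATE)  # trailing half second is dropped

# Toy class balancing: 10 "fake" rows reduced to match 4 "real" rows.
fake_rows, real_rows = undersample(list(range(10)), ["a", "b", "c", "d"])
```

Each tuple in `feats` corresponds to one row of the feature table for a single second of audio; a full reproduction would add the chromagram, MFCC, and spectral columns, most likely via an audio library.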

Figure 1: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.
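
The usage pattern in Figure 1 can be sketched as a loop over 1-second chunks. This is a hedged illustration only: `chunk_stream`, `flag_ai_chunks`, and `toy_classifier` are hypothetical names, and the toy classifier simply flags loud chunks so the example runs without a trained model. A real deployment would extract features from each chunk and call the trained classifier instead.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def chunk_stream(samples: Sequence[float], sample_rate: int) -> Iterator[Sequence[float]]:
    """Yield consecutive 1-second chunks, dropping a trailing partial chunk."""
    for start in range(0, len(samples) - sample_rate + 1, sample_rate):
        yield samples[start:start + sample_rate]


def flag_ai_chunks(
    chunks: Iterable[Sequence[float]],
    is_ai_generated: Callable[[Sequence[float]], bool],
) -> List[int]:
    """Return the indices (in seconds) of chunks the classifier flags."""
    return [i for i, chunk in enumerate(chunks) if is_ai_generated(chunk)]


def toy_classifier(chunk: Sequence[float]) -> bool:
    """Hypothetical stand-in for a trained model: flags loud chunks."""
    return max(abs(x) for x in chunk) > 0.5


# Tiny fake "call" at 4 samples per second so the data is easy to read.
SAMPLE_RATE = 4
call_audio = [0.1] * 4 + [0.9] * 4 + [0.2] * 4 + [0.8] * 2
flagged = flag_ai_chunks(chunk_stream(call_audio, SAMPLE_RATE), toy_classifier)

for second in flagged:
    print(f"Warning: audio at second {second} may be AI-generated")
```

Passing the classifier in as a callable keeps the alerting loop independent of the model, so any of the evaluated classifiers could be swapped in.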

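Experimental results

Research questions

RQ1: Can real-time detection distinguish RVC-generated AI speech from real human speech?
RQ2: Which audio features best separate real from AI-generated speech, and how do different ML models perform under real-time constraints?
RQ3: How do hyperparameters affect the accuracy and inference time of real-time detection of DeepFake voice conversion?
RQ4: Is it feasible to deploy a real-time alert system that flags AI-generated speech during phone or video conference calls?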

Key results

XGBoost with 330 boosting rounds achieves 99.3% accuracy under 10-fold cross-validation (precision 0.995, recall 0.991, F1 0.993, MCC 0.986, ROC AUC 0.993).
XGBoost takes 0.004 ms on average to classify one second of audio.
Random Forest with 310 trees achieves 98.89% accuracy (precision 0.995, recall 0.983, F1 0.989, MCC 0.978, ROC AUC 0.989) at 0.057 ms per one second of audio.
KNN (1 neighbour) achieves 81.48% accuracy at 0.143 ms per one second of audio, while QDA reaches 94.8% accuracy with 0.002 ms inference.
Overall, XGBoost and Random Forest generalise well across the cross-validation folds and make real-time detection of RVC-based voice conversion feasible.
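
As a quick sanity check on the figures above, the reported F1 scores can be recomputed from the reported precision and recall, and the inference times can be converted into real-time headroom. This is a rough sketch: metrics averaged over folds need not match the formula exactly, and the summary does not state whether feature extraction time is included in the inference figures. The helper names are made up for this example.

```python
# Values copied from the results above:
# (precision, recall, reported F1, inference ms per 1 s of audio).
reported = {
    "XGBoost": (0.995, 0.991, 0.993, 0.004),
    "Random Forest": (0.995, 0.983, 0.989, 0.057),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def realtime_headroom(ms_per_second_of_audio):
    """Seconds of audio that could be classified per second of compute."""
    return 1000.0 / ms_per_second_of_audio


for name, (p, r, f1, ms) in reported.items():
    print(
        f"{name}: recomputed F1 {f1_score(p, r):.3f} (reported {f1}), "
        f"about {realtime_headroom(ms):,.0f}x faster than real time"
    )
```

Both recomputed F1 values agree with the reported ones to three decimals, and even the slower Random Forest leaves several orders of magnitude of headroom for the classification step.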

Figure 2: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech with Ryan Gosling’s speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.


This review was generated by AI and reviewed by a human editor.