MIST: A Multilingual Dataset and Benchmark for Fine-Grained Audio Inpainting Tampering Localization
Vũ Sơn Tùng
Partial speech inpainting, replacing a few words within a genuine utterance via voice cloning to alter its meaning, is an emerging audio deepfake threat. Unlike fully synthesized speech, the manipulated content constitutes only 2–7% of the utterance, and existing benchmarks are limited to single-region tampering with utterance-level labels. To enable systematic study of this problem, we introduce MIST, a large-scale multilingual dataset spanning six languages with 1–3 independently inpainted word-level segments per utterance, generated through LLM-guided semantic replacement and neural voice cloning, totaling approximately 497k tampered utterances with precise temporal annotations. To establish a benchmark for this task, we further introduce ISA, a backbone-agnostic coarse-to-fine localization framework that recovers all tampered regions without prior knowledge of their count; and SF1@τ, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region counting and boundary precision. Zero-shot experiments confirm that utterance-level classifiers trained on fully synthesized speech fail on MIST (SF1@0.5 ≤ 1.7% across all backbones), while fine-tuning on MIST annotations dramatically improves performance: the Wav2Vec2-AASIST backbone improves from 1.2% to 29.2%, and the strongest backbone (WavLM-AASIST) reaches SF1@0.5 = 33.5%, demonstrating that the primary bottleneck is training data rather than inference methodology. The dataset, code, and evaluation toolkit are publicly available.
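As an illustration of the segment-level metric described in the abstract, the following is a minimal sketch of an F1 score based on temporal IoU matching. The abstract does not specify the matching procedure, so the greedy one-to-one matching, the function names `temporal_iou` and `segment_f1`, and the convention for empty inputs are all assumptions made for this example rather than the authors' definition of SF1@τ.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], truth: List[Segment], tau: float = 0.5) -> float:
    """Segment-level F1 at IoU threshold tau.

    Assumption: predictions and ground-truth segments are matched greedily,
    one-to-one, in order of decreasing IoU. The paper may use a different
    matching scheme.
    """
    if not pred and not truth:
        return 1.0  # assumed convention: nothing to find, nothing predicted

    # Score every (prediction, ground-truth) pair, best overlaps first.
    pairs = sorted(
        ((temporal_iou(p, t), i, j)
         for i, p in enumerate(pred)
         for j, t in enumerate(truth)),
        reverse=True,
    )

    used_pred, used_truth = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < tau:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_truth:
            continue  # each segment may be matched at most once
        used_pred.add(i)
        used_truth.add(j)
        true_pos += 1

    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (|pred| + |truth|)
    return 2 * true_pos / (len(pred) + len(truth))


if __name__ == "__main__":
    truth = [(1.0, 1.5), (3.0, 3.4)]
    pred = [(1.05, 1.5), (2.0, 2.2), (3.3, 3.9)]
    print(segment_f1(pred, truth, tau=0.5))
```

In this toy case only the first prediction overlaps a ground-truth segment with IoU of at least 0.5, so the score is 2·1/(3+2) = 0.4. Both spurious predictions and missed regions lower the score, which is how a metric of this kind can penalize miscounted regions and imprecise boundaries at the same time.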
Published in:
MIST: A Multilingual Dataset and Benchmark for Fine-Grained Audio Inpainting Tampering Localization
Publication date:
2026
Publisher:
IEEE Access
Keywords:
Audio deepfake detection, audio forensics, multi-region tampering localization, partial speech inpainting, speech manipulation dataset, temporal localization, voice cloning