PTIT Knowledge Portal

MIST: A Multilingual Dataset and Benchmark for Fine-Grained Audio Inpainting Tampering Localization

Vũ Sơn Tùng

Partial speech inpainting, replacing a few words within a genuine utterance via voice cloning to alter its meaning, is an emerging audio deepfake threat. Unlike fully synthesized speech, the manipulated content constitutes only 2–7% of the utterance, and existing benchmarks are limited to single-region tampering with utterance-level labels. To enable systematic study of this problem, we introduce MIST, a large-scale multilingual dataset spanning six languages with 1–3 independently inpainted word-level segments per utterance, generated through LLM-guided semantic replacement and neural voice cloning, totaling approximately 497k tampered utterances with precise temporal annotations. To establish a benchmark for this task, we further introduce ISA, a backbone-agnostic coarse-to-fine localization framework that recovers all tampered regions without prior knowledge of their count; and SF1@τ, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region counting and boundary precision. Zero-shot experiments confirm that utterance-level classifiers trained on fully synthesized speech fail on MIST (SF1@0.5 ≤ 1.7% across all backbones), while fine-tuning on MIST annotations dramatically improves performance: the Wav2Vec2-AASIST backbone improves from 1.2% to 29.2%, and the strongest backbone (WavLM-AASIST) reaches SF1@0.5 = 33.5%, demonstrating that the primary bottleneck is training data rather than inference methodology. The dataset, code, and evaluation toolkit are publicly available.
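The abstract describes SF1@τ only as a segment-level F1 based on temporal IoU matching. The following is a minimal sketch of one way such a metric could be computed, not the authors' evaluation toolkit. The greedy one-to-one matching rule, the `(start, end)` segment format in seconds, and the names `temporal_iou` and `segment_f1` are assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); assumed format


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], truth: List[Segment], tau: float = 0.5) -> float:
    """Segment-level F1 with IoU threshold tau (hypothetical SF1@tau sketch).

    Assumption: predictions and ground-truth segments are matched greedily,
    highest IoU first, and each segment can be matched at most once.
    """
    if not pred and not truth:
        return 1.0  # nothing to find and nothing predicted
    # Score every (prediction, ground-truth) pair, best overlaps first.
    pairs = sorted(
        ((temporal_iou(p, t), i, j)
         for i, p in enumerate(pred)
         for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth = set(), set()
    true_positives = 0
    for iou, i, j in pairs:
        if iou < tau:
            break  # remaining pairs overlap too little to count
        if i in used_pred or j in used_truth:
            continue
        used_pred.add(i)
        used_truth.add(j)
        true_positives += 1
    # F1 = 2*TP / (|pred| + |truth|); unmatched predictions and missed
    # regions both lower the score, so region count and boundaries matter.
    return 2 * true_positives / (len(pred) + len(truth))


if __name__ == "__main__":
    truth = [(1.0, 1.5), (3.0, 3.4)]
    pred = [(1.05, 1.5), (2.0, 2.2), (3.3, 3.9)]
    print(segment_f1(pred, truth, tau=0.5))
```

In this made-up example only the first prediction overlaps a true region with IoU of at least 0.5, so one spurious region and one poorly localized region both reduce the score. This illustrates how a single number can penalize miscounted regions and imprecise boundaries together.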

Published in:

MIST: A Multilingual Dataset and Benchmark for Fine-Grained Audio Inpainting Tampering Localization

Publication date:

2026

DOI:

https://ieeexplore.ieee.org/document/11522802/

Publisher:

IEEE Access

Location:

Keywords:

Audio deepfake detection, audio forensics, multi-region tampering localization, partial speech inpainting, speech manipulation dataset, temporal localization, voice cloning

Related papers

LiveNeRF: Efficient face replacement through Neural Radiance Fields integration

Vũ Sơn Tùng

Training dynamics and state taxonomy in deep visual recognition networks

Lã Quang Hải

SiCLIP: An explainable multimodal framework for silicosis diagnosis

Lê Minh Duy

Sepsis detection using biomarkers and machine learning

Vũ Tuấn Anh

NF-DCL: Enhancing video anomaly detection with synthetic normal features and Debiased Contrastive Learning

Nguyễn Thu Nga

A Workflow-Oriented Architecture Integrating Large Language Models for Automated Multi-Platform Content Management

Nguyễn Tất Thắng

Development of an Offline RAG Chatbot for Answering Food Hygiene and Safety Questions Based on Vietnamese Legal Frameworks

Nguyễn Tất Thắng