Transformer Ensemble for Synthesized Speech Detection

Bibliographic Details
Main Authors: Bartusiak, Emily R.; Bhagtani, Kratika; Singh Yadav, Amit Kumar; Delp, Edward J.
Format: Conference Proceeding
Language: English
Description
Summary: As voice synthesis systems and deep learning tools continue to improve, so does the possibility that synthesized speech can be used for nefarious purposes. Methods that determine whether audio signals contain synthesized or authentic speech are needed. In this paper, we investigate three transformers to detect synthesized speech: Compact Convolutional Transformer (CCT), Patchout faSt Spectrogram Transformer (PaSST), and Self-Supervised Audio Spectrogram Transformer (SSAST). We show that each transformer independently detects synthesized speech well. Then, we propose an ensemble of transformers that can provide even better performance. Finally, we explore how much of an audio signal is needed for high synthesized speech detection performance. Evaluated on the ASVspoof2019 dataset, we demonstrate that our transformer ensemble detects synthesized speech from shorter segments of audio signals, even on a highly imbalanced dataset.
ISSN: 2576-2303
DOI: 10.1109/IEEECONF59524.2023.10477041
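
The summary does not say how the three transformers are combined or how audio is shortened, so the following is only an illustrative sketch of one common approach: averaging per-model probabilities and scoring a truncated prefix of the signal. The names `Detector`, `truncate`, `ensemble_score`, and `is_synthesized`, the averaging rule, and the 0.5 threshold are all assumptions made for this example and are not taken from the paper.

```python
from statistics import fmean
from typing import Callable, Sequence

# Hypothetical detector type: maps raw audio samples to an estimated
# probability in [0, 1] that the speech is synthesized.
Detector = Callable[[Sequence[float]], float]


def truncate(samples: Sequence[float], sample_rate: int, seconds: float) -> Sequence[float]:
    """Keep only the first `seconds` of the signal (assumed shortening strategy)."""
    return samples[: int(sample_rate * seconds)]


def ensemble_score(detectors: Sequence[Detector], samples: Sequence[float]) -> float:
    """Average the per-model probabilities (assumed fusion rule)."""
    return fmean(d(samples) for d in detectors)


def is_synthesized(detectors: Sequence[Detector], samples: Sequence[float],
                   threshold: float = 0.5) -> bool:
    """Flag the signal as synthesized when the averaged score reaches the threshold."""
    return ensemble_score(detectors, samples) >= threshold


# Stand-in detectors returning fixed scores; real models (e.g. CCT, PaSST,
# SSAST) would compute a score from the audio instead.
stand_ins = [lambda s: 0.9, lambda s: 0.7, lambda s: 0.4]
clip = truncate([0.0] * 64000, sample_rate=16000, seconds=2.0)
print(len(clip), is_synthesized(stand_ins, clip))
```

On a highly imbalanced dataset, a fixed threshold of 0.5 is unlikely to be the best operating point; in practice it would be chosen on held-out data.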