Transformer Ensemble for Synthesized Speech Detection
Main Authors: | , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Summary: | As voice synthesis systems and deep learning tools continue to improve, so does the possibility that synthesized speech can be used for nefarious purposes. Methods that determine if audio signals contain synthesized or authentic speech are needed. In this paper, we investigate three transformers to detect synthesized speech: Compact Convolutional Transformer (CCT), Patchout faSt Spectrogram Transformer (PaSST), and Self-Supervised Audio Spectrogram Transformer (SSAST). We show that each transformer independently detects synthesized speech well. Then, we propose an ensemble of transformers that can provide even better performance. Finally, we explore how much of an audio signal is needed for high synthesized speech detection performance. Evaluated on the ASVspoof2019 dataset, we demonstrate that our transformer ensemble detects synthesized speech from shorter segments of audio signals, even on a highly imbalanced dataset. |
---|---|
ISSN: | 2576-2303 |
DOI: | 10.1109/IEEECONF59524.2023.10477041 |