Deep4SNet: deep learning for fake speech classification

Bibliographic Details
Published in: Expert Systems with Applications, 2021-12, Vol. 184, p. 115465, Article 115465
Main Authors: Ballesteros, Dora M., Rodriguez-Ortega, Yohanna, Renza, Diego, Arce, Gonzalo
Format: Article
Language: English
Description
Summary:
• Deep4SNet is a text-independent classifier of original/fake speech recordings.
• It is based on a customized deep learning architecture.
• Speech recordings are transformed into histograms to feed the model.
• Experiments are performed on the Deep Voice and Imitation datasets.
• The accuracy of the classifier is over 98%.

Fake speech consists of voice recordings created either by artificial intelligence or by signal processing techniques. Methods for generating fake voice recordings include Deep Voice and Imitation. In Deep Voice, the recordings sound slightly synthesized, whereas in Imitation they sound natural. Detecting fake content is not trivial, given the large number of voice recordings transmitted over the Internet. To detect fake voice recordings produced by Deep Voice and Imitation, we propose a solution based on a Convolutional Neural Network (CNN) that uses image augmentation and dropout. The proposed architecture was trained with 2092 histograms of both original and fake voice recordings and cross-validated with 864 histograms. A further 476 new histograms were used for external validation, on which Precision (P) and Recall (R) were calculated. Detection of fake audio reached P = 0.997, R = 0.997 for Imitation-based recordings, and P = 0.985, R = 0.944 for Deep Voice-based recordings. The global accuracy was 0.985. According to these results, the proposed system is successful in detecting fake voice content.
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2021.115465
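
The summary mentions two steps that a short example may help illustrate: turning a speech recording into a histogram, and scoring a detector with Precision and Recall. The following is a minimal sketch only, not the authors' implementation. The record does not say how the histograms are built, so the function `amplitude_histogram`, its bin count, and the assumed sample range of [-1.0, 1.0] are illustrative assumptions, as is the convention that label 1 means "fake" in `precision_recall`.

```python
from typing import List, Sequence, Tuple


def amplitude_histogram(samples: Sequence[float], bins: int = 16) -> List[int]:
    """Count samples per equal-width bin over [-1.0, 1.0].

    Assumption: samples are normalized amplitudes. The record does not
    describe the exact histogram construction used by Deep4SNet.
    """
    counts = [0] * bins
    for s in samples:
        # Clip out-of-range values so every sample lands in a bin.
        clipped = max(-1.0, min(1.0, s))
        # Map [-1, 1] to a bin index; the value 1.0 goes in the last bin.
        idx = min(int((clipped + 1.0) / 2.0 * bins), bins - 1)
        counts[idx] += 1
    return counts


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive class (1 = fake, by assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fake, flagged as fake
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # original, flagged as fake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fake, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(amplitude_histogram([-1.0, -0.5, 0.0, 0.5, 1.0], bins=4))
    print(precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

In a pipeline like the one summarized above, histograms of this kind would be rendered as images and passed to the CNN, and the metrics would be computed on the held-out external validation set.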