
X-vector anonymization using autoencoders and adversarial training for preserving speech privacy

Bibliographic Details
Published in: Computer Speech & Language, 2022-07, Vol. 74, Article 101351
Main Authors: Perero-Codosero, Juan M., Espinoza-Cuadros, Fernando M., Hernández-Gómez, Luis A.
Format: Article
Language: English
Summary: The rapid increase in web services and mobile apps that collect personal data from users has also increased the risk that their privacy may be severely compromised. In particular, the growing variety of spoken language interfaces and voice assistants, driven by rapid breakthroughs in deep learning, has raised important concerns in the European Union about preserving the privacy of speech data. For instance, an attacker can record users' speech and impersonate them to gain access to systems that rely on voice identification. By extracting speaker, linguistic (e.g., dialect), and paralinguistic (e.g., age) features from a speech signal, speaker profiles can also be built from users with existing technology. To mitigate these weaknesses, in this study we present a speech anonymization method based on autoencoders and adversarial training. Given an utterance, we first extract an x-vector, a powerful utterance-level embedding used in state-of-the-art speaker recognition. This original x-vector is transformed by an autoencoder into a new x-vector in which speaker, gender, and accent information is suppressed through adversarial training. The anonymized speech is finally generated by a neural speech synthesizer driven by the anonymized x-vector, the fundamental frequency, and phoneme information extracted from the input speech. For the evaluation, we followed the VoicePrivacy Challenge framework, in which anonymization (privacy) is measured using automatic speaker verification and preservation of intelligibility is evaluated through automatic speech recognition. Our experimental results show that the proposed method achieves higher privacy than the VoicePrivacy baseline (i.e., a higher speaker verification error) while preserving similar intelligibility of the spoken content (i.e., a similar word error rate).
ISSN: 0885-2308
1095-8363
DOI: 10.1016/j.csl.2022.101351
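
The summary describes a three-stage pipeline: an x-vector is extracted from the input utterance, an autoencoder re-encodes it while adversarial branches suppress speaker, gender, and accent information, and a neural synthesizer regenerates speech from the anonymized x-vector, fundamental frequency, and phoneme features. The following Python/PyTorch sketch illustrates only the adversarial autoencoder stage, using a gradient-reversal layer; the dimensions, layer sizes, loss weights, and attribute classes are assumptions made for illustration and do not reproduce the authors' exact architecture or training setup.

# Hypothetical sketch of an adversarial autoencoder for x-vector anonymization.
# Dimensions, layer sizes, and loss weights are illustrative assumptions only.
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AnonymizingAutoencoder(nn.Module):
    def __init__(self, xvec_dim=512, bottleneck=256,
                 n_speakers=1000, n_genders=2, n_accents=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(xvec_dim, bottleneck), nn.ReLU())
        self.decoder = nn.Linear(bottleneck, xvec_dim)
        # Adversarial branches try to predict the attributes we want to remove.
        self.spk_head = nn.Linear(bottleneck, n_speakers)
        self.gen_head = nn.Linear(bottleneck, n_genders)
        self.acc_head = nn.Linear(bottleneck, n_accents)

    def forward(self, x, lambd=1.0):
        z = self.encoder(x)
        x_anon = self.decoder(z)                 # anonymized x-vector
        z_rev = GradientReversal.apply(z, lambd)  # reversed gradients for the adversaries
        return x_anon, self.spk_head(z_rev), self.gen_head(z_rev), self.acc_head(z_rev)


def training_step(model, x, spk, gen, acc, opt):
    """One optimization step: reconstruct the x-vector while the gradient-reversal
    layer pushes the bottleneck to be uninformative about speaker, gender, and accent."""
    x_anon, spk_logits, gen_logits, acc_logits = model(x)
    ce = nn.functional.cross_entropy
    loss = (nn.functional.mse_loss(x_anon, x)
            + ce(spk_logits, spk) + ce(gen_logits, gen) + ce(acc_logits, acc))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Example usage with random tensors (shapes only, not real x-vectors or labels).
model = AnonymizingAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 512)
spk = torch.randint(0, 1000, (8,))
gen = torch.randint(0, 2, (8,))
acc = torch.randint(0, 10, (8,))
print(training_step(model, x, spk, gen, acc, opt))

In this sketch the decoder output would then feed the speech synthesizer together with F0 and phoneme information, as outlined in the summary; that stage is not shown here.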