
On Generative Spoken Language Modeling from Raw Audio

Bibliographic Details
Published in:Transactions of the Association for Computational Linguistics 2021-01, Vol.9, p.1336-1354
Main Authors: Lakhotia, Kushal, Kharitonov, Eugene, Hsu, Wei-Ning, Adi, Yossi, Polyak, Adam, Bolte, Benjamin, Nguyen, Tu-Anh, Copet, Jade, Baevski, Alexei, Mohamed, Abdelrahman, Dupoux, Emmanuel
Format: Article
Language:English
Summary:We introduce generative spoken language modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across three speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
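The abstract describes a three-stage pipeline: a discrete speech encoder that maps audio to pseudo-text units, a language model over those units, and a decoder back to a waveform. The toy sketch below only illustrates that data flow; every component is a hypothetical stand-in, not the paper's actual models (CPC/wav2vec 2.0/HuBERT encoders, a unit language model, and a unit-to-speech vocoder).

```python
import random

# Toy stand-in for the GSLM baseline pipeline:
#   raw audio -> discrete pseudo-text units -> LM continuation -> waveform.
# All three functions are illustrative placeholders, not the paper's models.

NUM_UNITS = 100  # the paper sweeps unit vocabularies of 50, 100, or 200


def encode(waveform, num_units=NUM_UNITS):
    """Stand-in discrete encoder: map each audio frame to a unit id."""
    return [int(abs(hash(round(x, 2))) % num_units) for x in waveform]


def generate(prompt_units, length, num_units=NUM_UNITS, seed=0):
    """Stand-in 'unit language model': continue the pseudo-text sequence.

    A real system would sample from a trained LM; here we sample uniformly.
    """
    rng = random.Random(seed)
    return prompt_units + [rng.randrange(num_units) for _ in range(length)]


def decode(units):
    """Stand-in decoder: turn unit ids back into a dummy waveform."""
    return [u / NUM_UNITS for u in units]


# End-to-end: encode a prompt, continue it, and synthesize the result.
waveform = [0.1, -0.3, 0.25, 0.0]
units = encode(waveform)
continuation = generate(units, length=8)
audio = decode(continuation)
```

The point of the sketch is the interface, not the models: because the intermediate representation is a discrete unit sequence, the language model never sees text or audio, which is what lets the whole pipeline train without supervision.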
ISSN:2307-387X
DOI:10.1162/tacl_a_00430