Learning a representation of tongue dynamics from unlabeled ultrasound videos


Bibliographic Details
Published in: The Journal of the Acoustical Society of America, 2019-10, Vol. 146 (4), p. 3087
Main Authors: Wang, Hongcui; Roussel, Pierre; Denby, Bruce
Format: Article
Language: English
Description
Summary: Ultrasound imaging of the tongue has been used for decades in studies of speech production and speech motor control, for silent speech interfaces, and in numerous other areas. Despite substantial efforts, however, extraction of reliable features from ultrasound tongue data remains a challenge due to speckle noise and acoustic propagation issues. Recently, Representation Learning has emerged in a variety of fields as a powerful means of generating useful representations of underlying structure in raw, high-dimensional data. In its unsupervised form, Representation Learning discovers structures in unlabelled data, thereby eliminating the need for a time-consuming labelling step. The present work is believed to be the first use of unsupervised Representation Learning to reveal structures related to tongue dynamics in unlabelled ultrasound video. A 3-D Convolutional Neural Network examining a series of unlabelled 60 Hz tongue images is found to accurately predict unseen future images even for large interframe tongue displacements. By comparing the 3DCNN prediction error to that of a simple previous-frame predictor, tongue trajectories containing transitions between regions of acoustic stability can be identified and correlated with formant trajectories in a spectrogram. Prospects for leveraging the tongue dynamic representation for use in subsequent speech processing tasks will be discussed.
ISSN: 0001-4966, 1520-8524
DOI: 10.1121/1.5137727
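
The comparison mentioned in the summary, between a learned predictor's error and that of a simple previous-frame predictor, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean squared pixel error, and the thresholding rule in `flag_motion` are all assumptions made for illustration, and `model_predictions` stands in for the output of any future-frame predictor such as a 3-D CNN.

```python
def mse(a, b):
    """Mean squared pixel error between two equally sized 2-D frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


def compare_predictors(frames, model_predictions):
    """Per-frame errors for a previous-frame baseline and a learned predictor.

    frames[t] is the true frame at time t. model_predictions[t] is a
    hypothetical model's prediction of frames[t]. Index 0 is skipped
    because the baseline needs a previous frame.
    """
    results = []
    for t in range(1, len(frames)):
        baseline_err = mse(frames[t - 1], frames[t])   # "nothing moved" guess
        model_err = mse(model_predictions[t], frames[t])
        results.append((t, baseline_err, model_err))
    return results


def flag_motion(results, threshold):
    """Frame indices whose baseline error exceeds `threshold` (assumed criterion)."""
    return [t for t, baseline_err, _ in results if baseline_err > threshold]


# Toy 2x2 "frames": the image changes abruptly between t=1 and t=2.
frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[4, 4], [4, 4]],
    [[4, 4], [4, 4]],
]
# Made-up model output: it anticipates most of the change at t=2.
model_predictions = [
    None,
    [[0, 0], [0, 0]],
    [[3, 3], [3, 3]],
    [[4, 4], [4, 4]],
]

results = compare_predictors(frames, model_predictions)
moving = flag_motion(results, threshold=1.0)
```

In this toy case the previous-frame baseline has a large error only at the frame where the image changes, while the made-up model's error stays small there. With real data, frames flagged this way could then be lined up against a spectrogram, as the summary describes.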