Self-Motion As Supervision For Egocentric Audiovisual Localization
Main Authors: , , ,
Format: Conference Proceeding
Language: English
Summary: Sound source localization is a key requirement for many assistive applications of augmented reality, such as speech enhancement. In conversational settings, potential sources of interest may be approximated by active speaker detection. However, localizing speakers in crowded, noisy environments is challenging, particularly without extensive ground-truth annotations. Still, people are often able to communicate effectively in these scenarios through orienting behavioral responses, such as head motion and eye gaze, which have been shown to correlate with the directions of auditory sources. In the absence of ground-truth annotations, we propose joint training of egocentric audiovisual localization with behavioral pseudolabels that relate audiovisual stimuli to directional information extracted from future behavior. We evaluate this method as a technique for unsupervised egocentric active speaker localization and compare pseudolabels derived from head and gaze directions against fully supervised alternatives. (A minimal code sketch of the pseudolabel training idea follows this record.)
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447683
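Below is a minimal, hypothetical sketch of the kind of pseudolabel supervision the summary describes: an audiovisual network predicts a distribution over azimuth bins, and the training target is derived from where the wearer's head turns next. It assumes a PyTorch setup; the `AVLocalizer` module, the feature dimensions, the 18-bin azimuth discretization, and the `pseudolabel_from_future_pose` helper are illustrative assumptions, not the paper's actual architecture or label construction.

```python
# Hypothetical sketch: self-motion as supervision for audiovisual localization.
# All module names, shapes, and the binning scheme are illustrative assumptions.
import torch
import torch.nn as nn

N_BINS = 18  # assumed: discretize azimuth into 20-degree bins

class AVLocalizer(nn.Module):
    """Toy audiovisual encoder that predicts a distribution over azimuth bins."""
    def __init__(self, audio_dim=128, video_dim=256, hidden=256):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_net = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, N_BINS)

    def forward(self, audio_feat, video_feat):
        fused = torch.cat([self.audio_net(audio_feat),
                           self.video_net(video_feat)], dim=-1)
        return self.head(fused)  # logits over azimuth bins

def azimuth_to_bin(yaw_rad):
    """Map a relative yaw angle in radians to a discrete azimuth bin."""
    frac = (yaw_rad + torch.pi) / (2 * torch.pi)  # normalize (-pi, pi] to (0, 1]
    return (frac * N_BINS).long().clamp(0, N_BINS - 1)

def pseudolabel_from_future_pose(yaw_now, yaw_future):
    """Behavioral pseudolabel: the direction the wearer orients toward next,
    expressed relative to the current head orientation."""
    rel = torch.atan2(torch.sin(yaw_future - yaw_now),
                      torch.cos(yaw_future - yaw_now))  # wrap to (-pi, pi]
    return azimuth_to_bin(rel)

# One illustrative training step on random stand-in data.
model = AVLocalizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

audio_feat = torch.randn(32, 128)                      # stand-in audio features
video_feat = torch.randn(32, 256)                      # stand-in video features
yaw_now = torch.rand(32) * 2 * torch.pi - torch.pi     # current head yaw
yaw_future = torch.rand(32) * 2 * torch.pi - torch.pi  # yaw a moment later

labels = pseudolabel_from_future_pose(yaw_now, yaw_future)
loss = loss_fn(model(audio_feat, video_feat), labels)
opt.zero_grad()
loss.backward()
opt.step()
print(f"pseudolabel training step, loss = {loss.item():.3f}")
```

Binning azimuth turns localization into a classification problem, and taking the label from a short look-ahead window reflects the observation the summary cites: orienting responses such as head motion and gaze tend to follow, and thus indicate the direction of, sources of interest.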