Self-Motion As Supervision For Egocentric Audiovisual Localization
Main Authors: , , ,
Format: Conference Proceeding
Language: English
Summary: Sound source localization is a key requirement for many assistive applications of augmented reality, such as speech enhancement. In conversational settings, potential sources of interest may be approximated by active speaker detection. However, localizing speakers in crowded, noisy environments is challenging, particularly without extensive ground-truth annotations. Still, people are often able to communicate effectively in these scenarios through orienting behavioral responses, such as head motion and eye gaze, which have been shown to correlate with the directions of auditory sources. In the absence of ground-truth annotations, we propose joint training of egocentric audiovisual localization with behavioral pseudolabels that relate audiovisual stimuli to directional information extracted from future behavior. We evaluate this method as a technique for unsupervised egocentric active speaker localization and compare pseudolabels derived from head and gaze directions against fully supervised alternatives. (A minimal code sketch of the pseudolabel training idea follows this record.)
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447683
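Below is a minimal, hypothetical sketch of the kind of pseudolabel supervision the summary describes: an audiovisual network predicts a distribution over azimuth bins, and the training target is derived from where the wearer's head turns next. It assumes a PyTorch setup; the `AVLocalizer` module, the feature dimensions, the 18-bin azimuth discretization, and the `pseudolabel_from_future_pose` helper are illustrative assumptions, not the paper's actual architecture or label construction.

```python
# Hypothetical sketch: self-motion as supervision for audiovisual localization.
# All module names, shapes, and the binning scheme are illustrative assumptions.
import torch
import torch.nn as nn

N_BINS = 18  # assumed: discretize azimuth into 20-degree bins

class AVLocalizer(nn.Module):
    """Toy audiovisual encoder that predicts a distribution over azimuth bins."""
    def __init__(self, audio_dim=128, video_dim=256, hidden=256):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_net = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, N_BINS)

    def forward(self, audio_feat, video_feat):
        fused = torch.cat([self.audio_net(audio_feat),
                           self.video_net(video_feat)], dim=-1)
        return self.head(fused)  # logits over azimuth bins

def azimuth_to_bin(yaw_rad):
    """Map a relative yaw angle in radians to a discrete azimuth bin."""
    frac = (yaw_rad + torch.pi) / (2 * torch.pi)  # normalize (-pi, pi] to (0, 1]
    return (frac * N_BINS).long().clamp(0, N_BINS - 1)

def pseudolabel_from_future_pose(yaw_now, yaw_future):
    """Behavioral pseudolabel: the direction the wearer orients toward next,
    expressed relative to the current head orientation."""
    rel = torch.atan2(torch.sin(yaw_future - yaw_now),
                      torch.cos(yaw_future - yaw_now))  # wrap to (-pi, pi]
    return azimuth_to_bin(rel)

# One illustrative training step on random stand-in data.
model = AVLocalizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

audio_feat = torch.randn(32, 128)                      # stand-in audio features
video_feat = torch.randn(32, 256)                      # stand-in video features
yaw_now = torch.rand(32) * 2 * torch.pi - torch.pi     # current head yaw
yaw_future = torch.rand(32) * 2 * torch.pi - torch.pi  # yaw a moment later

labels = pseudolabel_from_future_pose(yaw_now, yaw_future)
loss = loss_fn(model(audio_feat, video_feat), labels)
opt.zero_grad()
loss.backward()
opt.step()
print(f"pseudolabel training step, loss = {loss.item():.3f}")
```

Binning azimuth turns localization into a classification problem, and taking the label from a short look-ahead window reflects the observation the summary cites: orienting responses such as head motion and gaze tend to follow, and thus indicate the direction of, sources of interest.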