Lip and Speech Synchronization using Supervised Contrastive Learning and Cross-Modal Attention
Format: Conference Proceeding
Language: English
Online Access: Request full text
Summary: Temporal consistency between audio and video is essential for understanding lip movements and the corresponding speech content. Learning this temporal relationship between video and audio can support classification, detection, and generation tasks. Determining whether the two modalities are temporally consistent, i.e., detecting audio-visual synchronization, is therefore a crucial problem that underpins various downstream tasks. In this work, we learn the temporal correlation between a speaker's speech and the sequence of lip movements in an unconstrained, large-vocabulary setting to identify whether the two are synchronized. We model the frame sequence with an encoder network and apply cross-attention between the frame sequence and the audio to learn the embeddings jointly. We learn temporal synchronization with supervised contrastive learning and a hard negative sampling strategy, which separates the embeddings of the two modalities when their alignments differ while increasing the similarity of temporally aligned pairs. We train our model on LRS2 and on a singing-voice dataset to evaluate synchronization in an unconstrained, natural setting. Extensive quantitative evaluation shows that our method outperforms the state-of-the-art model by more than 1.5% on the Acappella dataset and by 0.5% on LRS2 at a frame count of 5.
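
The record does not include the paper's implementation, but the approach described in the summary (a frame-sequence encoder, cross-attention from video to audio, and supervised contrastive learning with hard negative sampling) can be sketched compactly. The PyTorch snippet below is a minimal illustrative sketch, not the authors' code: the module layout, feature dimensions, temperature, batch shapes, and the `CrossModalSyncModel` / `supervised_contrastive_loss` names are all assumptions made for the example.

```python
# Illustrative sketch only: cross-modal attention between lip-frame and audio
# embeddings, plus a supervised contrastive loss over hard negatives.
# Dimensions, names, and hyperparameters are assumptions, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalSyncModel(nn.Module):
    """Encode a lip-frame sequence, let it attend to the audio, and pool
    into a single joint embedding per clip."""

    def __init__(self, video_dim=512, audio_dim=256, embed_dim=256, num_heads=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)   # frame-sequence encoder (simplified)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)   # audio encoder (simplified)
        # Cross-modal attention: video frames are queries, audio features are keys/values.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, T_video, video_dim); audio_feats: (B, T_audio, audio_dim)
        v = self.video_proj(video_feats)
        a = self.audio_proj(audio_feats)
        attended, _ = self.cross_attn(query=v, key=a, value=a)
        return F.normalize(attended.mean(dim=1), dim=-1)     # (B, embed_dim), L2-normalized


def supervised_contrastive_loss(embeddings, labels, temperature=0.07, num_hard_negatives=8):
    """Supervised contrastive loss in which each anchor contrasts its positives
    (clips with the same sync/off-sync label) against only its hardest negatives."""
    sim = embeddings @ embeddings.t() / temperature           # (B, B) cosine similarities / tau
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=sim.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~eye    # same-label pairs, excluding self
    neg_mask = ~pos_mask & ~eye

    # Hard negative sampling: keep only the most similar (hardest) negatives per anchor.
    neg_sim = sim.masked_fill(~neg_mask, float("-inf"))
    hard_negs, _ = neg_sim.topk(min(num_hard_negatives, B - 1), dim=1)

    total, anchors = 0.0, 0
    for i in range(B):
        pos = sim[i][pos_mask[i]]
        if pos.numel() == 0:
            continue
        negs = hard_negs[i][torch.isfinite(hard_negs[i])]
        denom = torch.logsumexp(torch.cat([pos, negs]), dim=0)
        total = total - (pos - denom).mean()                  # average log-likelihood of positives
        anchors += 1
    return total / max(anchors, 1)


# Example usage with random tensors (batch of 8 clips, 5 video frames each).
model = CrossModalSyncModel()
video = torch.randn(8, 5, 512)
audio = torch.randn(8, 20, 256)
labels = torch.randint(0, 2, (8,))                            # 1 = in sync, 0 = off sync
loss = supervised_contrastive_loss(model(video, audio), labels)
```

In this sketch, the binary sync/off-sync label plays the role of the supervision signal: temporally aligned clips are pulled together while the hardest misaligned clips in the batch are pushed away, which mirrors the hard-negative strategy described in the summary.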
ISSN: 2770-8330
DOI: 10.1109/FG59268.2024.10581985