
KAN-AV dataset for audio-visual face and speech analysis in the wild

Bibliographic Details
Published in: Image and Vision Computing, 2023-12, Vol. 140, p. 104839, Article 104839
Main Authors: Kefalas, Triantafyllos; Fotiadou, Eftychia; Georgopoulos, Markos; Panagakis, Yannis; Ma, Pingchuan; Petridis, Stavros; Stafylakis, Themos; Pantic, Maja
Format: Article
Language: English
Description
Summary: Human-computer interaction is becoming increasingly prevalent in daily life with the adoption of intelligent devices. These devices must be capable of interacting in diverse settings, such as environments with noise, music, and differing illumination and occlusion conditions. They must also interact with a variety of end users across ages and backgrounds. The machine learning community therefore needs in-the-wild multi-modal datasets for developing face and speech analysis models that are applicable in most real-world scenarios. However, most existing audio and audio-visual databases are captured in controlled conditions with few or no age and kinship labels. In this paper, we introduce the KAN-AV dataset, which contains 98 h of audio-visual data from 970 identities across ages. Two thirds of the identities have kin relations in the dataset. The dataset is manually annotated with labels for kinship, age, and gender and is intended to drive future research in face and speech analysis.
• We introduce a large-scale, in-the-wild, audio-visual face and speech dataset.
• It contains 970 identities across ages, with age labels, and kinship labels for two thirds of them.
• We introduce the task of kinship verification with speech.
• We also introduce cross-modal kinship matching (from face to voice and vice versa).
• The dataset is suitable for classification tasks involving age, identity, and kinship.
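The highlights above mention kinship verification with speech and cross-modal kinship matching (face to voice and vice versa). As a rough illustration only, and not part of this record or the paper's method, the Python sketch below shows how such pairs could be scored once per-identity face and voice embeddings are available. The embedding dictionaries, the pair list, and the assumption of a shared face-voice embedding space are all hypothetical placeholders.

```python
# Minimal sketch: score (face, voice) pairs by cosine similarity and
# treat higher scores as evidence of a kin relation. Real systems would
# replace the random toy embeddings with learned face/speaker encoders
# trained to share an embedding space.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def kinship_matching_scores(pairs, face_embeddings, voice_embeddings):
    """Score (face_id, voice_id, is_kin) pairs.

    pairs            : list of (face_id, voice_id, is_kin) tuples
    face_embeddings  : dict mapping identity -> face embedding
    voice_embeddings : dict mapping identity -> voice embedding
    Returns arrays of similarity scores and binary kin labels.
    """
    scores, labels = [], []
    for face_id, voice_id, is_kin in pairs:
        scores.append(cosine_similarity(face_embeddings[face_id],
                                        voice_embeddings[voice_id]))
        labels.append(int(is_kin))
    return np.array(scores), np.array(labels)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for encoder outputs (hypothetical, 128-D).
    faces = {i: rng.normal(size=128) for i in range(4)}
    voices = {i: rng.normal(size=128) for i in range(4)}
    toy_pairs = [(0, 1, True), (0, 2, False), (1, 3, False), (2, 3, True)]
    scores, labels = kinship_matching_scores(toy_pairs, faces, voices)
    # Thresholding the scores yields a kin / non-kin decision for verification.
    print(scores, labels)
```

The same scoring scheme would apply to within-modality kinship verification by using two face (or two voice) embeddings per pair instead of one of each.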
ISSN: 0262-8856, 1872-8138
DOI: 10.1016/j.imavis.2023.104839