SingAvatar: High-fidelity Audio-driven Singing Avatar Synthesis

Bibliographic Details
Main Authors: Ma, Wentao, Tang, Anni, Ling, Jun, Xue, Han, Liao, Huiheng, Zhu, Yunhui, Song, Li
Format: Conference Proceeding
Language: English
Description
Summary: Generating photo-realistic avatars from audio plays an important role in extended reality (XR) and the metaverse. In this paper, we lift the input audio from speech to singing, which has been rarely studied. The significant distinction between singing and talking poses great challenges for adapting talking face generation methods to the singing regime. To address this, we propose a high-fidelity singing avatar synthesis method called SingAvatar. Besides the audio, we incorporate vocal conditions involving phonemes and variance to alleviate the ambiguity of learning the singing-to-face mapping. Concretely, we tailor a two-stage pipeline: singing voice synthesis and portrait generation from the synthesized audio and auxiliary vocal conditions. Further, we curate a fine-grained singing head dataset containing singing videos with synchronized audio and accurate vocal conditions. In experiments, SingAvatar outperforms competing methods regarding audio-mouth synchronization, the naturalness of head movements, and controllability over the results. The code and dataset will be made publicly available.
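
As a rough, hypothetical sketch of the two-stage pipeline the abstract describes (singing voice synthesis followed by portrait generation conditioned on phonemes and variance features), the Python outline below may help a reader picture the data flow. Every name in it is invented for illustration; the paper's actual code has not been released, and the real models are neural networks rather than the stubs shown here.

from dataclasses import dataclass

@dataclass
class VocalConditions:
    """Auxiliary conditions the abstract names: phonemes and variance
    features (assumed here to be per-frame pitch and energy)."""
    phonemes: list[str]
    pitch: list[float]
    energy: list[float]

def synthesize_singing_voice(score_path: str) -> tuple[list[float], VocalConditions]:
    # Stage 1 (hypothetical): render a waveform from a musical score and
    # return it together with frame-aligned vocal conditions.
    n_frames = 100
    waveform = [0.0] * n_frames
    conditions = VocalConditions(
        phonemes=["sil"] * n_frames,
        pitch=[220.0] * n_frames,
        energy=[0.5] * n_frames,
    )
    return waveform, conditions

def generate_portrait_frames(waveform: list[float], conditions: VocalConditions) -> list[object]:
    # Stage 2 (hypothetical): predict mouth shape and head pose per frame,
    # using the vocal conditions to disambiguate the singing-to-face mapping.
    return [None] * len(waveform)

waveform, conditions = synthesize_singing_voice("song.musicxml")
frames = generate_portrait_frames(waveform, conditions)

The only point of the sketch is the interface: stage 2 consumes both the synthesized audio and the stage-1 vocal conditions, rather than the audio alone, which is how the method reduces the ambiguity of the singing-to-face mapping.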
ISSN:1945-788X
DOI:10.1109/ICME57554.2024.10687925