SingAvatar: High-fidelity Audio-driven Singing Avatar Synthesis
Main Authors: (not listed in record)
Format: Conference Proceeding
Language: English
Summary: Generating photo-realistic avatars from audio plays an important role in extended reality (XR) and the metaverse. In this paper, we lift the input audio from speech to singing, which has been rarely studied. The significant distinction between singing and talking poses great challenges for adapting talking face generation methods to the singing regime. To address this, we propose a high-fidelity singing avatar synthesis method called SingAvatar. Besides the audio, we incorporate vocal conditions involving phonemes and variance to alleviate the ambiguity of learning the singing-to-face mapping. Concretely, we tailor a two-stage pipeline: singing voice synthesis and portrait generation from the synthesized audio and auxiliary vocal conditions. Further, we curate a fine-grained singing head dataset containing singing videos with synchronized audio and accurate vocal conditions. In experiments, SingAvatar outperforms competing methods regarding audio-mouth synchronization, the naturalness of head movements, and controllability over the results. The code and dataset will be made publicly available.
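The two-stage pipeline described in the summary can be illustrated with a minimal sketch. All names here (`synthesize_voice`, `generate_portrait`, `sing_avatar_pipeline`, the `VocalConditions` fields) are hypothetical stand-ins, not the paper's actual API; the stage bodies are stubs that only show how the synthesized audio and the auxiliary vocal conditions (phonemes plus variance signals such as pitch and energy) flow from stage one into stage two.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VocalConditions:
    """Auxiliary vocal conditions (hypothetical structure) that help
    disambiguate the singing-to-face mapping."""
    phonemes: List[str]   # frame-level phoneme labels
    pitch: List[float]    # F0 contour (a "variance" feature)
    energy: List[float]   # loudness contour (a "variance" feature)

def synthesize_voice(score: List[str]) -> Tuple[List[float], VocalConditions]:
    """Stage 1 (stub): singing voice synthesis from a phoneme score.
    Returns a placeholder waveform plus the conditions that drive stage 2."""
    n = len(score)
    audio = [0.0] * (n * 4)  # placeholder waveform, 4 samples per phoneme
    conds = VocalConditions(
        phonemes=score,
        pitch=[220.0 + 10.0 * i for i in range(n)],  # dummy F0 contour
        energy=[0.5] * n,                            # dummy loudness
    )
    return audio, conds

def generate_portrait(audio: List[float], conds: VocalConditions) -> List[dict]:
    """Stage 2 (stub): emit one portrait frame per phoneme, conditioned on
    the audio and the phoneme/variance signals from stage 1."""
    return [
        {"phoneme": p, "pitch": f0, "mouth_open": e}
        for p, f0, e in zip(conds.phonemes, conds.pitch, conds.energy)
    ]

def sing_avatar_pipeline(score: List[str]) -> List[dict]:
    """Two-stage pipeline: voice synthesis first, then portrait generation."""
    audio, conds = synthesize_voice(score)
    return generate_portrait(audio, conds)

frames = sing_avatar_pipeline(["s", "i", "ng"])
print(len(frames), frames[0]["phoneme"])  # → 3 s
```

The point of the sketch is the data dependency: the portrait generator never sees raw audio alone, but always the phoneme and variance conditions produced alongside it, which is what the abstract credits with reducing mapping ambiguity.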
ISSN: 1945-788X
DOI: 10.1109/ICME57554.2024.10687925