Pixel-Aligned Multi-View Generation with Depth Guided Decoder
The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to a multi-view version, which consists of a VAE image encoder and a U-Net diffusion model. Specifically, these generation methods usually fix the VAE and finetune only the U-Net. However, the significant downscaling of the latent vectors computed from the input images, combined with independent decoding of each view, leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, which enables the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attention is challenging during inference, as ground-truth depth is usually difficult to obtain and pre-trained depth estimation models struggle to provide accurate depth. Thus, to improve generalization to inaccurate depth when ground-truth depth is missing, we perturb the depth inputs during training. During inference, we employ a rapid multi-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view to 3D reconstruction tasks.
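The depth-truncated epipolar attention described in the abstract restricts each query pixel's attention to a short segment of the epipolar line around a coarse depth estimate, rather than the full line or the whole image. Below is a minimal PyTorch sketch of that idea; the tensor shapes, the helper names (`unproject`, `project`, `depth_truncated_attention`), and the sampling scheme (S hypotheses within +/- delta of the coarse depth) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of depth-truncated epipolar attention (not the paper's code).
import torch
import torch.nn.functional as F

def unproject(uv, depth, K_inv, c2w):
    """Lift source-view pixels (N, 2) with depths (N,) to world points (N, 3)."""
    ones = torch.ones(uv.shape[0], 1)
    rays = (K_inv @ torch.cat([uv, ones], dim=-1).T).T   # (N, 3) camera-space rays
    pts_h = torch.cat([rays * depth[:, None], ones], dim=-1)
    return (c2w @ pts_h.T).T[:, :3]

def project(pts, K, w2c):
    """Project world points (N, 3) into target-view pixel coordinates (N, 2)."""
    ones = torch.ones(pts.shape[0], 1)
    cam = (w2c @ torch.cat([pts, ones], dim=-1).T).T[:, :3]
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

def depth_truncated_attention(q, kv_map, uv, d_coarse,
                              K_inv_src, c2w_src, K_tgt, w2c_tgt,
                              delta=0.1, S=8):
    """q:        (N, C) query features at source-view pixels `uv`
       kv_map:   (C, H, W) feature map of a neighboring target view
       d_coarse: (N,) coarse depth per query pixel (e.g. from a quick NeuS fit)
       Attention is restricted to the epipolar segment within +/- delta of d_coarse."""
    N, C = q.shape
    _, H, W = kv_map.shape
    # S depth hypotheses inside the truncated interval around the coarse depth.
    offsets = torch.linspace(-delta, delta, S)
    depths = (d_coarse[:, None] + offsets[None, :]).reshape(-1)      # (N*S,)
    pts = unproject(uv.repeat_interleave(S, dim=0), depths, K_inv_src, c2w_src)
    uv_tgt = project(pts, K_tgt, w2c_tgt)                            # (N*S, 2)
    # Normalize to [-1, 1] and gather target features along each segment.
    grid = torch.stack([uv_tgt[:, 0] / (W - 1), uv_tgt[:, 1] / (H - 1)], dim=-1)
    grid = (grid * 2 - 1).view(1, N, S, 2)
    kv = F.grid_sample(kv_map[None], grid, align_corners=True)       # (1, C, N, S)
    kv = kv[0].permute(1, 2, 0)                                      # (N, S, C)
    # Plain dot-product attention over only the S truncated samples.
    attn = torch.softmax((kv @ q[:, :, None]).squeeze(-1) / C ** 0.5, dim=-1)
    return (attn[:, :, None] * kv).sum(dim=1)                        # (N, C)
```

Truncating to S samples around the coarse depth, rather than attending over the full epipolar line or the whole target image, is what would keep the memory cost linear in the number of query pixels, consistent with the memory-efficiency claim in the abstract.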
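The abstract also notes that depth inputs are perturbed during training so the decoder generalizes to the inaccurate coarse depth available at inference. A hedged sketch of such an augmentation follows; the particular noise model (a per-sample scale and shift plus per-pixel jitter) is an assumption, since the paper's exact recipe is not given here.

```python
# Hypothetical depth-perturbation augmentation, not the authors' exact recipe.
import torch

def perturb_depth(depth, max_rel_scale=0.05, max_shift=0.02):
    """depth: (B, H, W) ground-truth depth maps used during training.
    Returns depth corrupted by a per-sample scale and shift plus
    per-pixel jitter, so the decoder learns to tolerate coarse depth."""
    b = depth.shape[0]
    scale = 1.0 + (torch.rand(b, 1, 1) * 2 - 1) * max_rel_scale   # global scale error
    shift = (torch.rand(b, 1, 1) * 2 - 1) * max_shift             # global offset error
    jitter = torch.randn_like(depth) * max_rel_scale * depth      # local noise
    return (depth * scale + shift + jitter).clamp(min=1e-3)
```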
Published in: | arXiv.org 2024-08 |
---|---|
Main Authors: | Tang, Zhenggang; Zhuang, Peiye; Wang, Chaoyang; Siarohin, Aliaksandr; Kant, Yash; Schwing, Alexander; Tulyakov, Sergey; Hsin-Ying, Lee |
Format: | Article |
Language: | English |
Subjects: | Decoding; Diffusion layers; Image enhancement; Image reconstruction; Inference; Misalignment; Pixels |
Online Access: | Get full text |
container_title | arXiv.org |
---|---|
creator | Tang, Zhenggang; Zhuang, Peiye; Wang, Chaoyang; Siarohin, Aliaksandr; Kant, Yash; Schwing, Alexander; Tulyakov, Sergey; Hsin-Ying, Lee |
description | The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to a multi-view version, which consists of a VAE image encoder and a U-Net diffusion model. Specifically, these generation methods usually fix the VAE and finetune only the U-Net. However, the significant downscaling of the latent vectors computed from the input images, combined with independent decoding of each view, leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, which enables the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attention is challenging during inference, as ground-truth depth is usually difficult to obtain and pre-trained depth estimation models struggle to provide accurate depth. Thus, to improve generalization to inaccurate depth when ground-truth depth is missing, we perturb the depth inputs during training. During inference, we employ a rapid multi-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view to 3D reconstruction tasks. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-08 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3097624000 |
source | Publicly Available Content (ProQuest) |
subjects | Decoding; Diffusion layers; Image enhancement; Image reconstruction; Inference; Misalignment; Pixels |
title | Pixel-Aligned Multi-View Generation with Depth Guided Decoder |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T20%3A06%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Pixel-Aligned%20Multi-View%20Generation%20with%20Depth%20Guided%20Decoder&rft.jtitle=arXiv.org&rft.au=Tang,%20Zhenggang&rft.date=2024-08-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3097624000%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_30976240003%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3097624000&rft_id=info:pmid/&rfr_iscdi=true |