Pixel-Aligned Multi-View Generation with Depth Guided Decoder
The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to a multi-view version, which consists of a VAE image encoder and a U-Net diffusion model. Specifically, these generation methods usually fix the VAE and finetune only the U-Net. However, the significant downscaling of the latent vectors computed from the input images, combined with independent decoding of each view, leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, which enables the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attention is challenging during inference, as ground-truth depth is usually difficult to obtain and pre-trained depth estimation models struggle to provide accurate depth. Thus, to improve generalization to inaccurate depth when ground-truth depth is missing, we perturb the depth inputs during training. During inference, we employ a rapid multi-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view to 3D reconstruction tasks.
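The depth-truncated epipolar attention described in the abstract restricts each query pixel's attention to a short segment of the epipolar line around a coarse depth estimate, rather than the full line or the whole image. Below is a minimal PyTorch sketch of that idea; the tensor shapes, the helper names (`unproject`, `project`, `depth_truncated_attention`), and the sampling scheme (S hypotheses within +/- delta of the coarse depth) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of depth-truncated epipolar attention (not the paper's code).
import torch
import torch.nn.functional as F

def unproject(uv, depth, K_inv, c2w):
    """Lift source-view pixels (N, 2) with depths (N,) to world points (N, 3)."""
    ones = torch.ones(uv.shape[0], 1)
    rays = (K_inv @ torch.cat([uv, ones], dim=-1).T).T   # (N, 3) camera-space rays
    pts_h = torch.cat([rays * depth[:, None], ones], dim=-1)
    return (c2w @ pts_h.T).T[:, :3]

def project(pts, K, w2c):
    """Project world points (N, 3) into target-view pixel coordinates (N, 2)."""
    ones = torch.ones(pts.shape[0], 1)
    cam = (w2c @ torch.cat([pts, ones], dim=-1).T).T[:, :3]
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

def depth_truncated_attention(q, kv_map, uv, d_coarse,
                              K_inv_src, c2w_src, K_tgt, w2c_tgt,
                              delta=0.1, S=8):
    """q:        (N, C) query features at source-view pixels `uv`
       kv_map:   (C, H, W) feature map of a neighboring target view
       d_coarse: (N,) coarse depth per query pixel (e.g. from a quick NeuS fit)
       Attention is restricted to the epipolar segment within +/- delta of d_coarse."""
    N, C = q.shape
    _, H, W = kv_map.shape
    # S depth hypotheses inside the truncated interval around the coarse depth.
    offsets = torch.linspace(-delta, delta, S)
    depths = (d_coarse[:, None] + offsets[None, :]).reshape(-1)      # (N*S,)
    pts = unproject(uv.repeat_interleave(S, dim=0), depths, K_inv_src, c2w_src)
    uv_tgt = project(pts, K_tgt, w2c_tgt)                            # (N*S, 2)
    # Normalize to [-1, 1] and gather target features along each segment.
    grid = torch.stack([uv_tgt[:, 0] / (W - 1), uv_tgt[:, 1] / (H - 1)], dim=-1)
    grid = (grid * 2 - 1).view(1, N, S, 2)
    kv = F.grid_sample(kv_map[None], grid, align_corners=True)       # (1, C, N, S)
    kv = kv[0].permute(1, 2, 0)                                      # (N, S, C)
    # Plain dot-product attention over only the S truncated samples.
    attn = torch.softmax((kv @ q[:, :, None]).squeeze(-1) / C ** 0.5, dim=-1)
    return (attn[:, :, None] * kv).sum(dim=1)                        # (N, C)
```

Truncating to S samples around the coarse depth, rather than attending over the full epipolar line or the whole target image, is what would keep the memory cost linear in the number of query pixels, consistent with the memory-efficiency claim in the abstract.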
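The abstract also notes that depth inputs are perturbed during training so the decoder generalizes to the inaccurate coarse depth available at inference. A hedged sketch of such an augmentation follows; the particular noise model (a per-sample scale and shift plus per-pixel jitter) is an assumption, since the paper's exact recipe is not given here.

```python
# Hypothetical depth-perturbation augmentation, not the authors' exact recipe.
import torch

def perturb_depth(depth, max_rel_scale=0.05, max_shift=0.02):
    """depth: (B, H, W) ground-truth depth maps used during training.
    Returns depth corrupted by a per-sample scale and shift plus
    per-pixel jitter, so the decoder learns to tolerate coarse depth."""
    b = depth.shape[0]
    scale = 1.0 + (torch.rand(b, 1, 1) * 2 - 1) * max_rel_scale   # global scale error
    shift = (torch.rand(b, 1, 1) * 2 - 1) * max_shift             # global offset error
    jitter = torch.randn_like(depth) * max_rel_scale * depth      # local noise
    return (depth * scale + shift + jitter).clamp(min=1e-3)
```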
Published in: | arXiv.org 2024-08 |
---|---|
Main Authors: | Tang, Zhenggang; Zhuang, Peiye; Wang, Chaoyang; Siarohin, Aliaksandr; Kant, Yash; Schwing, Alexander; Tulyakov, Sergey; Hsin-Ying, Lee |
Format: | Article |
Language: | English |
Subjects: | Decoding; Diffusion layers; Image enhancement; Image reconstruction; Inference; Misalignment; Pixels |
Online Access: | Get full text |
container_title | arXiv.org |
---|---|
creator | Tang, Zhenggang; Zhuang, Peiye; Wang, Chaoyang; Siarohin, Aliaksandr; Kant, Yash; Schwing, Alexander; Tulyakov, Sergey; Hsin-Ying, Lee |
description | The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to a multi-view version, which consists of a VAE image encoder and a U-Net diffusion model. Specifically, these generation methods usually fix the VAE and finetune only the U-Net. However, the significant downscaling of the latent vectors computed from the input images, combined with independent decoding of each view, leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, which enables the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attention is challenging during inference, as ground-truth depth is usually difficult to obtain and pre-trained depth estimation models struggle to provide accurate depth. Thus, to improve generalization to inaccurate depth when ground-truth depth is missing, we perturb the depth inputs during training. During inference, we employ a rapid multi-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view to 3D reconstruction tasks. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-08 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3097624000 |
source | Publicly Available Content (ProQuest) |
subjects | Decoding; Diffusion layers; Image enhancement; Image reconstruction; Inference; Misalignment; Pixels |
title | Pixel-Aligned Multi-View Generation with Depth Guided Decoder |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T20%3A06%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Pixel-Aligned%20Multi-View%20Generation%20with%20Depth%20Guided%20Decoder&rft.jtitle=arXiv.org&rft.au=Tang,%20Zhenggang&rft.date=2024-08-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3097624000%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_30976240003%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3097624000&rft_id=info:pmid/&rfr_iscdi=true |