MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantia...

Bibliographic Details
Main Authors:	Li, Yanghao; Wu, Chao-Yuan; Fan, Haoqi; Mangalam, Karttikeya; Xiong, Bo; Malik, Jitendra; Feichtenhofer, Christoph
Format:	Conference Proceeding
Language:	English
Subjects:	Computer architecture; Deep learning architectures and techniques; Image recognition; Image segmentation; Object detection; Recognition: detection, categorization, retrieval; Representation learning; Transformers; Video analysis and understanding; Visualization