MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantia...

Bibliographic Details
Main Authors: Li, Yanghao; Wu, Chao-Yuan; Fan, Haoqi; Mangalam, Karttikeya; Xiong, Bo; Malik, Jitendra; Feichtenhofer, Christoph
Format: Conference Proceeding
Language: English