Video Swin Transformer

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and tempor...

Bibliographic Details
Main Authors:	Liu, Ze; Ning, Jia; Cao, Yue; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Hu, Han
Format:	Conference Proceeding
Language:	English
Subjects:	Adaptation models; Benchmark testing; Computational modeling; Computer architecture; Image recognition; Solids; Transformers; Video analysis and understanding; Recognition: detection, categorization, retrieval