Video Swin Transformer
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and tempor...
Saved in:
| Main Authors: | , , , , , , |
|---|---|
| Format: | Conference Proceeding |
| Language: | English |
| Subjects: | |
| Online Access: | Request full text |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|