
Motion-Augmented Self-Training for Video Recognition at Smaller Scale

The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than appearance, we strive to train our network using optical flow, but avoid its computation during inference. We propose the first motion-augmented self-training regime, which we call MotionFit. We start with supervised training of a motion model on a small, labeled video collection. With the motion model we generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by learning to predict these pseudo-labels with an appearance model. Moreover, we introduce a multi-clip loss as a simple yet efficient way to improve the quality of the pseudo-labeling, even without additional auxiliary tasks. We also take into consideration the temporal granularity of videos during self-training of the appearance model, which was missed in previous works. As a result, we obtain a strong motion-augmented representation model suited for video downstream tasks like action recognition and clip retrieval. On small-scale video datasets, MotionFit outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7%, and semi-supervised learning by 9%-18% using the same amount of class labels.

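The abstract outlines a two-stage pipeline: a motion model trained on labeled optical flow pseudo-labels a large unlabeled collection, and an appearance model is then trained to predict those pseudo-labels, with a multi-clip loss that pools predictions over several clips of the same video. The sketch below illustrates that loop in PyTorch; it is not the authors' code, and the toy Small3DConvNet backbone, the soft KL-divergence objective, and the clip-averaging form of the multi-clip loss are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of the self-training loop described in the abstract.
# Model classes, tensor shapes and hyper-parameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Small3DConvNet(nn.Module):
    """Tiny stand-in for a 3D CNN backbone (e.g. R(2+1)D or S3D)."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):                      # x: (B, C, T, H, W)
        return self.classifier(self.features(x).flatten(1))

num_classes = 10
motion_model = Small3DConvNet(in_channels=2, num_classes=num_classes)      # optical-flow input
appearance_model = Small3DConvNet(in_channels=3, num_classes=num_classes)  # RGB input

# Step 1 (omitted): supervised training of motion_model on a small labeled
# optical-flow collection; here we treat it as already trained and frozen.

# Step 2: pseudo-label a large unlabeled collection with the motion model and
# train the appearance model to predict those pseudo-labels.
optimizer = torch.optim.SGD(appearance_model.parameters(), lr=1e-2, momentum=0.9)

def pseudo_label(flow_clips):
    """Soft pseudo-labels from the frozen motion model."""
    with torch.no_grad():
        return F.softmax(motion_model(flow_clips), dim=1)

def multi_clip_step(rgb_clips, flow_clips):
    """One update; inputs are (B, num_clips, C, T, H, W).

    Multi-clip idea as read from the abstract: average predictions over
    several clips of the same video before computing the loss, which
    smooths the pseudo-labels (the paper's exact loss may differ)."""
    b, k = rgb_clips.shape[:2]
    targets = pseudo_label(flow_clips.flatten(0, 1)).view(b, k, -1).mean(1)
    logits = appearance_model(rgb_clips.flatten(0, 1)).view(b, k, -1).mean(1)
    loss = F.kl_div(F.log_softmax(logits, dim=1), targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for video clips.
rgb = torch.randn(2, 3, 3, 8, 32, 32)   # (B, clips, C=3, T, H, W)
flow = torch.randn(2, 3, 2, 8, 32, 32)  # (B, clips, C=2, T, H, W)
print(multi_clip_step(rgb, flow))
```

In practice one would swap in a real 3D backbone, precompute optical flow for the pseudo-labeling pass only (so inference needs RGB alone), and add whatever temporal-granularity handling the paper prescribes.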

Bibliographic Details
Published in: arXiv.org, 2021-05
Main Authors: Gavrilyuk, Kirill, Jain, Mihir, Karmanov, Ilia, Snoek, Cees G M
Format: Article
EISSN: 2331-8422
Language: English
Subjects: Artificial neural networks; Collection; Datasets; Downstream effects; Knowledge management; Labels; Optical flow (image analysis); Recognition; Semi-supervised learning; Training
Online Access: Get full text