
Sequential Video VLAD: Training the Aggregation Locally and Temporally

Since characterizing videos simultaneously from spatial and temporal cues has been shown to be crucial for video analysis, the combination of convolutional neural networks and recurrent neural networks, i.e., recurrent convolution networks (RCNs), is a natural framework for learning spatio-temporal video features. In this paper, we develop a novel sequential vector of locally aggregated descriptors (VLAD) layer, named SeqVLAD, which combines a trainable VLAD encoding process and the RCN architecture into a single framework. In particular, sequential convolutional feature maps extracted from successive video frames are fed into the RCNs to learn soft spatio-temporal assignment parameters, so as to aggregate not only the detailed spatial information within individual frames but also the fine motion information across successive frames. Moreover, we improve the gated recurrent unit (GRU) of RCNs by sharing the input-to-hidden parameters, and propose an improved GRU-RCN architecture named shared GRU-RCN (SGRU-RCN). Our SGRU-RCN thus has fewer parameters and is less prone to overfitting. In experiments, we evaluate SeqVLAD on video captioning and video action recognition. Experimental results on the Microsoft Research Video Description Corpus, the Montreal Video Annotation Dataset, UCF101, and HMDB51 demonstrate the effectiveness and good performance of our method.
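To make the aggregation described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch, not the authors' code: per-frame convolutional feature maps are turned into soft cluster assignments by a recurrent convolutional unit whose input-to-hidden convolution is reused by all gates (one reading of how the "shared" GRU-RCN cuts parameters), and assignment-weighted VLAD residuals are accumulated over both space and time. The class name, layer sizes, and tensor shapes are assumptions for illustration only.

# Minimal sketch of the SeqVLAD idea (illustrative, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeqVLADSketch(nn.Module):
    def __init__(self, num_clusters: int = 64, dim: int = 512):
        super().__init__()
        self.K, self.D = num_clusters, dim
        # Cluster centers of the trainable VLAD codebook.
        self.centers = nn.Parameter(torch.randn(num_clusters, dim) * 0.01)
        # Shared input-to-hidden convolution (a single conv reused by every gate,
        # instead of one conv per gate as in a standard GRU-RCN).
        self.w_x = nn.Conv2d(dim, num_clusters, kernel_size=1)
        # Hidden-to-hidden convolutions for the update/reset gates and candidate.
        self.u_z = nn.Conv2d(num_clusters, num_clusters, 3, padding=1)
        self.u_r = nn.Conv2d(num_clusters, num_clusters, 3, padding=1)
        self.u_h = nn.Conv2d(num_clusters, num_clusters, 3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, D, H, W) convolutional feature maps of T successive frames.
        T, D, H, W = feats.shape
        h = feats.new_zeros(self.K, H, W)          # recurrent assignment state
        vlad = feats.new_zeros(self.K, self.D)     # accumulated residuals
        for t in range(T):
            x = self.w_x(feats[t].unsqueeze(0)).squeeze(0)  # shared projection
            z = torch.sigmoid(x + self.u_z(h.unsqueeze(0)).squeeze(0))
            r = torch.sigmoid(x + self.u_r(h.unsqueeze(0)).squeeze(0))
            h_tilde = torch.tanh(x + self.u_h((r * h).unsqueeze(0)).squeeze(0))
            h = (1 - z) * h + z * h_tilde
            # Soft assignment of every spatial location to the K clusters.
            a = torch.softmax(h.reshape(self.K, -1), dim=0)   # (K, H*W)
            d = feats[t].reshape(self.D, -1)                   # (D, H*W)
            # Accumulate assignment-weighted residuals to the centers.
            vlad = vlad + a @ d.t() - a.sum(dim=1, keepdim=True) * self.centers
        # Intra-normalization per cluster, then global L2 normalization,
        # as is common for VLAD-style descriptors.
        vlad = F.normalize(vlad, dim=1)
        return F.normalize(vlad.flatten(), dim=0)


# Example: encode 8 frames of 512-channel 7x7 feature maps into one descriptor.
if __name__ == "__main__":
    frames = torch.randn(8, 512, 7, 7)
    print(SeqVLADSketch()(frames).shape)  # torch.Size([32768]) = 64 * 512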

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2018-10, Vol. 27 (10), p. 4933-4944
Main Authors: Youjiang Xu, Yahong Han, Richang Hong, Qi Tian
Format: Article
Language: English
DOI: 10.1109/TIP.2018.2846664
ISSN: 1057-7149
EISSN: 1941-0042
PMID: 29985134
Publisher: IEEE, United States
Subjects: action recognition; Aggregates; Convolution; deep learning; Feature extraction; Image coding; recurrent convolution networks; Recurrent neural networks; Task analysis; video captioning; Video representation; Visualization