A heterogeneous two-stream network for human action recognition

The most widely used two-stream architectures and building blocks for human action recognition in videos generally consist of 2D or 3D convolutional neural networks. 3D convolution can capture motion information between video frames, which is essential for video classification, and 3D convolutional neural networks usually outperform their 2D counterparts, but at a higher computational cost. In this paper, we propose a heterogeneous two-stream architecture that incorporates two convolutional networks: one uses a mixed convolution network (MCN), which inserts 3D convolutions in the middle of 2D convolutions, to train on RGB frames, while the other adopts a BN-Inception network to train on optical-flow frames. Given the redundancy of neighboring video frames, we adopt a sparse sampling strategy to reduce the computational cost. Our architecture is trained and evaluated on the standard video action benchmarks HMDB51 and UCF101. Experimental results show that our approach obtains state-of-the-art performance on HMDB51 (73.04%) and UCF101 (95.27%).
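
The abstract's core idea can be illustrated compactly. Below is a minimal, illustrative PyTorch sketch of a heterogeneous two-stream model with sparse frame sampling: the RGB stream places 3D convolutions between 2D convolutions (a mixed-convolution block), while a small 2D CNN stands in for the BN-Inception optical-flow stream. All layer sizes, the segment count, the stand-in flow network, and the score-level fusion are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sketch only; hyperparameters and layer shapes are assumed, not from the paper.
import torch
import torch.nn as nn


def sparse_sample(video, num_segments=3):
    """Pick one frame per equal-length segment (sparse sampling over time)."""
    # video: (batch, channels, time, height, width)
    t = video.shape[2]
    idx = torch.linspace(0, t - 1, num_segments).long()
    return video[:, :, idx]


class MixedConvStream(nn.Module):
    """RGB stream: 2D convs, 3D convs 'in the middle', then 2D convs."""

    def __init__(self, num_classes=51):
        super().__init__()
        self.head2d = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.mid3d = nn.Sequential(nn.Conv3d(16, 32, 3, padding=1), nn.ReLU())
        self.tail2d = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):                      # x: (B, 3, T, H, W)
        b, c, t, h, w = x.shape
        y = self.head2d(x.transpose(1, 2).reshape(b * t, c, h, w))  # 2D convs per frame
        y = y.reshape(b, t, 16, h, w).transpose(1, 2)               # back to (B, 16, T, H, W)
        y = self.mid3d(y)                      # 3D convs mix information across frames
        y = y.mean(dim=2)                      # average over time -> (B, 32, H, W)
        y = self.tail2d(y).mean(dim=(2, 3))    # global average pool -> (B, 64)
        return self.fc(y)


class FlowStream(nn.Module):
    """Optical-flow stream; a tiny 2D CNN stands in for BN-Inception here."""

    def __init__(self, num_classes=51, flow_channels=10):  # e.g. 5 stacked (dx, dy) fields
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(flow_channels, 32, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(32, num_classes)

    def forward(self, flow):                   # flow: (B, flow_channels, H, W)
        return self.fc(self.conv(flow).mean(dim=(2, 3)))


class TwoStreamModel(nn.Module):
    """Heterogeneous two-stream model with score-level (late) fusion."""

    def __init__(self, num_classes=51):
        super().__init__()
        self.rgb_stream = MixedConvStream(num_classes)
        self.flow_stream = FlowStream(num_classes)

    def forward(self, video, flow):
        rgb_scores = self.rgb_stream(sparse_sample(video))
        flow_scores = self.flow_stream(flow)
        return rgb_scores + flow_scores        # simple late fusion of class scores


if __name__ == "__main__":
    model = TwoStreamModel(num_classes=51)     # 51 classes, as in HMDB51
    video = torch.randn(2, 3, 30, 56, 56)      # (batch, RGB channels, frames, H, W)
    flow = torch.randn(2, 10, 56, 56)          # stacked optical-flow fields
    print(model(video, flow).shape)            # torch.Size([2, 51])
```

Summing the per-stream class scores is one common late-fusion choice; weighting the streams unequally, as many two-stream implementations do, would be an equally plausible variant.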

Bibliographic Details
Published in: AI Communications, 2023-08, Vol. 36 (3), p. 219-233
Main Authors: Liao, Shengbin; Wang, Xiaofeng; Yang, ZongKai
Format: Article
Language: English
Publisher: Amsterdam: IOS Press BV
DOI: 10.3233/AIC-220188
ISSN: 0921-7126
EISSN: 1875-8452
Subjects: Artificial neural networks; Computational efficiency; Computing costs; Frames (data processing); Human activity recognition; Neural networks; Optical flow (image analysis); Redundancy