A heterogeneous two-stream network for human action recognition

The most widely used two-stream architectures and building blocks for human action recognition in videos generally consist of 2D or 3D convolutional neural networks. 3D convolution can capture motion information between video frames, which is essential for video classification, and 3D convolutional neural networks usually outperform their 2D counterparts, but at a higher computational cost. In this paper, we propose a heterogeneous two-stream architecture that incorporates two convolutional networks: one uses a mixed convolution network (MCN), which inserts 3D convolutions in the middle of 2D convolutions, to train on RGB frames, while the other adopts a BN-Inception network to train on optical-flow frames. Given the redundancy of neighboring video frames, we adopt a sparse sampling strategy to reduce the computational cost. Our architecture is trained and evaluated on the standard video action benchmarks HMDB51 and UCF101. Experimental results show that our approach obtains state-of-the-art performance on HMDB51 (73.04%) and UCF101 (95.27%).
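
The abstract's core idea can be illustrated compactly. Below is a minimal, illustrative PyTorch sketch of a heterogeneous two-stream model with sparse frame sampling: the RGB stream places 3D convolutions between 2D convolutions (a mixed-convolution block), while a small 2D CNN stands in for the BN-Inception optical-flow stream. All layer sizes, the segment count, the stand-in flow network, and the score-level fusion are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sketch only; hyperparameters and layer shapes are assumed, not from the paper.
import torch
import torch.nn as nn


def sparse_sample(video, num_segments=3):
    """Pick one frame per equal-length segment (sparse sampling over time)."""
    # video: (batch, channels, time, height, width)
    t = video.shape[2]
    idx = torch.linspace(0, t - 1, num_segments).long()
    return video[:, :, idx]


class MixedConvStream(nn.Module):
    """RGB stream: 2D convs, 3D convs 'in the middle', then 2D convs."""

    def __init__(self, num_classes=51):
        super().__init__()
        self.head2d = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.mid3d = nn.Sequential(nn.Conv3d(16, 32, 3, padding=1), nn.ReLU())
        self.tail2d = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):                      # x: (B, 3, T, H, W)
        b, c, t, h, w = x.shape
        y = self.head2d(x.transpose(1, 2).reshape(b * t, c, h, w))  # 2D convs per frame
        y = y.reshape(b, t, 16, h, w).transpose(1, 2)               # back to (B, 16, T, H, W)
        y = self.mid3d(y)                      # 3D convs mix information across frames
        y = y.mean(dim=2)                      # average over time -> (B, 32, H, W)
        y = self.tail2d(y).mean(dim=(2, 3))    # global average pool -> (B, 64)
        return self.fc(y)


class FlowStream(nn.Module):
    """Optical-flow stream; a tiny 2D CNN stands in for BN-Inception here."""

    def __init__(self, num_classes=51, flow_channels=10):  # e.g. 5 stacked (dx, dy) fields
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(flow_channels, 32, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(32, num_classes)

    def forward(self, flow):                   # flow: (B, flow_channels, H, W)
        return self.fc(self.conv(flow).mean(dim=(2, 3)))


class TwoStreamModel(nn.Module):
    """Heterogeneous two-stream model with score-level (late) fusion."""

    def __init__(self, num_classes=51):
        super().__init__()
        self.rgb_stream = MixedConvStream(num_classes)
        self.flow_stream = FlowStream(num_classes)

    def forward(self, video, flow):
        rgb_scores = self.rgb_stream(sparse_sample(video))
        flow_scores = self.flow_stream(flow)
        return rgb_scores + flow_scores        # simple late fusion of class scores


if __name__ == "__main__":
    model = TwoStreamModel(num_classes=51)     # 51 classes, as in HMDB51
    video = torch.randn(2, 3, 30, 56, 56)      # (batch, RGB channels, frames, H, W)
    flow = torch.randn(2, 10, 56, 56)          # stacked optical-flow fields
    print(model(video, flow).shape)            # torch.Size([2, 51])
```

Summing the per-stream class scores is one common late-fusion choice; weighting the streams unequally, as many two-stream implementations do, would be an equally plausible variant.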

Bibliographic Details
Published in: AI Communications, 2023-08, Vol. 36 (3), p. 219-233
Main Authors: Liao, Shengbin; Wang, Xiaofeng; Yang, ZongKai
Format: Article
Language: English
Publisher: Amsterdam: IOS Press BV
DOI: 10.3233/AIC-220188
ISSN: 0921-7126
EISSN: 1875-8452
Subjects: Artificial neural networks; Computational efficiency; Computing costs; Frames (data processing); Human activity recognition; Neural networks; Optical flow (image analysis); Redundancy