A heterogeneous two-stream network for human action recognition
The most widely used two-stream architectures and building blocks for human action recognition in videos generally consist of 2D or 3D convolutional neural networks. 3D convolution can capture motion information between video frames, which is essential for video classification. 3D convolution neural net...
Published in: | Ai communications 2023-08, Vol.36 (3), p.219-233 |
---|---|
Main Authors: | Liao, Shengbin ; Wang, Xiaofeng ; Yang, ZongKai |
Format: | Article |
Language: | English |
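Subjects: | Artificial neural networks ; Computational efficiency ; Computing costs ; Frames (data processing) ; Human activity recognition ; Neural networks ; Optical flow (image analysis) ; Redundancy |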
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c218t-1a52a83a387f92e7b45235f2a8e9033377833b4acdbb14e51da27514b45f6e343 |
container_end_page | 233 |
container_issue | 3 |
container_start_page | 219 |
container_title | Ai communications |
container_volume | 36 |
creator | Liao, Shengbin ; Wang, Xiaofeng ; Yang, ZongKai |
description | The most widely used two-stream architectures and building blocks for human action recognition in videos generally consist of 2D or 3D convolutional neural networks. 3D convolution can capture motion information between video frames, which is essential for video classification. 3D convolutional neural networks usually outperform their 2D counterparts, but at a higher computational cost. In this paper, we propose a heterogeneous two-stream architecture that incorporates two convolutional networks. One is a mixed convolution network (MCN), which places some 3D convolutions in the middle of 2D convolutions and is trained on RGB frames; the other is a BN-Inception network trained on optical flow frames. Considering the redundancy of neighboring video frames, we adopt a sparse sampling strategy to decrease the computational cost. Our architecture is trained and evaluated on the standard video action benchmarks HMDB51 and UCF101. Experimental results show that our approach achieves state-of-the-art performance on HMDB51 (73.04%) and UCF101 (95.27%). |
doi_str_mv | 10.3233/AIC-220188 |
format | article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2854477011</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2854477011</sourcerecordid><originalsourceid>FETCH-LOGICAL-c218t-1a52a83a387f92e7b45235f2a8e9033377833b4acdbb14e51da27514b45f6e343</originalsourceid><addsrcrecordid>eNotkE1LAzEQhoMoWKsXf0HAmxBNZpImPUkpVgsFL3oO2e1sP7Sbmuwi_ntT1tPMvDzMDA9jt0o-ICA-zpZzASCVc2dspJw1wmkD52wkp6CEVTC5ZFc576WUAGhG7GnGt9RRihtqKfaZdz9R5C5ROPCWypA-eRMT3_aH0PJQd7vY8kR13LS7U3_NLprwlenmv47Zx-L5ff4qVm8vy_lsJWpQrhMqGAgOAzrbTIFsVb5C05SMphIRrXWIlQ71uqqUJqPWAaxRunDNhFDjmN0Ne48pfveUO7-PfWrLSQ_OaG2tVKpQ9wNVp5hzosYf0-4Q0q9X0p8E-SLID4LwD6vVVvc</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2854477011</pqid></control><display><type>article</type><title>A heterogeneous two-stream network for human action recognition</title><source>Business Source Ultimate【Trial: -2024/12/31】【Remote access available】</source><source>Library & Information Science Abstracts (LISA)</source><creator>Liao, Shengbin ; Wang, Xiaofeng ; Yang, ZongKai</creator><creatorcontrib>Liao, Shengbin ; Wang, Xiaofeng ; Yang, ZongKai</creatorcontrib><description>The most widely used two-stream architectures and building blocks for human action recognition in videos generally consist of 2D or 3D convolution neural networks. 3D convolution can abstract motion messages between video frames, which is essential for video classification. 3D convolution neural networks usually obtain good performance compared with 2D cases, however it also increases computational cost. In this paper, we propose a heterogeneous two-stream architecture which incorporates two convolutional networks. One uses a mixed convolution network (MCN), which combines some 3D convolutions in the middle of 2D convolutions to train RGB frames, another one adopts BN-Inception network to train Optical Flow frames. 
Considering the redundancy of neighborhood video frames, we adopt a sparse sampling strategy to decrease the computational cost. Our architecture is trained and evaluated on the standard video actions benchmarks of HMDB51 and UCF101. Experimental results show our approach obtains the state-of-the-art performance on the datasets of HMDB51 (73.04%) and UCF101 (95.27%).</description><identifier>ISSN: 0921-7126</identifier><identifier>EISSN: 1875-8452</identifier><identifier>DOI: 10.3233/AIC-220188</identifier><language>eng</language><publisher>Amsterdam: IOS Press BV</publisher><subject>Artificial neural networks ; Computational efficiency ; Computing costs ; Frames (data processing) ; Human activity recognition ; Neural networks ; Optical flow (image analysis) ; Redundancy</subject><ispartof>Ai communications, 2023-08, Vol.36 (3), p.219-233</ispartof><rights>Copyright IOS Press BV 2023</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c218t-1a52a83a387f92e7b45235f2a8e9033377833b4acdbb14e51da27514b45f6e343</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925,34135</link.rule.ids></links><search><creatorcontrib>Liao, Shengbin</creatorcontrib><creatorcontrib>Wang, Xiaofeng</creatorcontrib><creatorcontrib>Yang, ZongKai</creatorcontrib><title>A heterogeneous two-stream network for human action recognition</title><title>Ai communications</title><description>The most widely used two-stream architectures and building blocks for human action recognition in videos generally consist of 2D or 3D convolution neural networks. 3D convolution can abstract motion messages between video frames, which is essential for video classification. 
3D convolution neural networks usually obtain good performance compared with 2D cases, however it also increases computational cost. In this paper, we propose a heterogeneous two-stream architecture which incorporates two convolutional networks. One uses a mixed convolution network (MCN), which combines some 3D convolutions in the middle of 2D convolutions to train RGB frames, another one adopts BN-Inception network to train Optical Flow frames. Considering the redundancy of neighborhood video frames, we adopt a sparse sampling strategy to decrease the computational cost. Our architecture is trained and evaluated on the standard video actions benchmarks of HMDB51 and UCF101. Experimental results show our approach obtains the state-of-the-art performance on the datasets of HMDB51 (73.04%) and UCF101 (95.27%).</description><subject>Artificial neural networks</subject><subject>Computational efficiency</subject><subject>Computing costs</subject><subject>Frames (data processing)</subject><subject>Human activity recognition</subject><subject>Neural networks</subject><subject>Optical flow (image analysis)</subject><subject>Redundancy</subject><issn>0921-7126</issn><issn>1875-8452</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>F2A</sourceid><recordid>eNotkE1LAzEQhoMoWKsXf0HAmxBNZpImPUkpVgsFL3oO2e1sP7Sbmuwi_ntT1tPMvDzMDA9jt0o-ICA-zpZzASCVc2dspJw1wmkD52wkp6CEVTC5ZFc576WUAGhG7GnGt9RRihtqKfaZdz9R5C5ROPCWypA-eRMT3_aH0PJQd7vY8kR13LS7U3_NLprwlenmv47Zx-L5ff4qVm8vy_lsJWpQrhMqGAgOAzrbTIFsVb5C05SMphIRrXWIlQ71uqqUJqPWAaxRunDNhFDjmN0Ne48pfveUO7-PfWrLSQ_OaG2tVKpQ9wNVp5hzosYf0-4Q0q9X0p8E-SLID4LwD6vVVvc</recordid><startdate>20230821</startdate><enddate>20230821</enddate><creator>Liao, Shengbin</creator><creator>Wang, Xiaofeng</creator><creator>Yang, ZongKai</creator><general>IOS Press 
BV</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>E3H</scope><scope>F2A</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20230821</creationdate><title>A heterogeneous two-stream network for human action recognition</title><author>Liao, Shengbin ; Wang, Xiaofeng ; Yang, ZongKai</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c218t-1a52a83a387f92e7b45235f2a8e9033377833b4acdbb14e51da27514b45f6e343</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Artificial neural networks</topic><topic>Computational efficiency</topic><topic>Computing costs</topic><topic>Frames (data processing)</topic><topic>Human activity recognition</topic><topic>Neural networks</topic><topic>Optical flow (image analysis)</topic><topic>Redundancy</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Liao, Shengbin</creatorcontrib><creatorcontrib>Wang, Xiaofeng</creatorcontrib><creatorcontrib>Yang, ZongKai</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Ai communications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Liao, Shengbin</au><au>Wang, Xiaofeng</au><au>Yang, 
ZongKai</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A heterogeneous two-stream network for human action recognition</atitle><jtitle>Ai communications</jtitle><date>2023-08-21</date><risdate>2023</risdate><volume>36</volume><issue>3</issue><spage>219</spage><epage>233</epage><pages>219-233</pages><issn>0921-7126</issn><eissn>1875-8452</eissn><abstract>The most widely used two-stream architectures and building blocks for human action recognition in videos generally consist of 2D or 3D convolution neural networks. 3D convolution can abstract motion messages between video frames, which is essential for video classification. 3D convolution neural networks usually obtain good performance compared with 2D cases, however it also increases computational cost. In this paper, we propose a heterogeneous two-stream architecture which incorporates two convolutional networks. One uses a mixed convolution network (MCN), which combines some 3D convolutions in the middle of 2D convolutions to train RGB frames, another one adopts BN-Inception network to train Optical Flow frames. Considering the redundancy of neighborhood video frames, we adopt a sparse sampling strategy to decrease the computational cost. Our architecture is trained and evaluated on the standard video actions benchmarks of HMDB51 and UCF101. Experimental results show our approach obtains the state-of-the-art performance on the datasets of HMDB51 (73.04%) and UCF101 (95.27%).</abstract><cop>Amsterdam</cop><pub>IOS Press BV</pub><doi>10.3233/AIC-220188</doi><tpages>15</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0921-7126 |
ispartof | Ai communications, 2023-08, Vol.36 (3), p.219-233 |
issn | 0921-7126 ; 1875-8452 |
language | eng |
recordid | cdi_proquest_journals_2854477011 |
source | Business Source Ultimate【Trial: -2024/12/31】【Remote access available】; Library & Information Science Abstracts (LISA) |
subjects | Artificial neural networks ; Computational efficiency ; Computing costs ; Frames (data processing) ; Human activity recognition ; Neural networks ; Optical flow (image analysis) ; Redundancy |
title | A heterogeneous two-stream network for human action recognition |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T17%3A56%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20heterogeneous%20two-stream%20network%20for%20human%20action%20recognition&rft.jtitle=Ai%20communications&rft.au=Liao,%20Shengbin&rft.date=2023-08-21&rft.volume=36&rft.issue=3&rft.spage=219&rft.epage=233&rft.pages=219-233&rft.issn=0921-7126&rft.eissn=1875-8452&rft_id=info:doi/10.3233/AIC-220188&rft_dat=%3Cproquest_cross%3E2854477011%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c218t-1a52a83a387f92e7b45235f2a8e9033377833b4acdbb14e51da27514b45f6e343%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2854477011&rft_id=info:pmid/&rfr_iscdi=true |
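
The description field above mentions a sparse sampling strategy that exploits the redundancy of neighboring frames, but the record does not say how frames are chosen. The sketch below is one plausible reading, not the authors' implementation: it assumes segment-based sampling, in which the video is split into equal-length segments and one frame index is drawn from each. The function name `sparse_sample_indices` and its parameters are hypothetical and introduced only for illustration.

```python
import random


def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Return one frame index per segment (hypothetical helper, not from the paper).

    Assumption: the video is divided into `num_segments` contiguous,
    roughly equal segments and a single frame is drawn at random from each.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Integer segment boundaries covering [0, num_frames).
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments
        if end <= start:
            # Segment is empty when the video has fewer frames than segments;
            # fall back to the single frame at `start`.
            end = start + 1
        indices.append(rng.randrange(start, end))
    return indices


# Example: choose 4 frames from a 100-frame clip.
sampled = sparse_sample_indices(100, 4, random.Random(0))
```

Under this assumption, only `num_segments` frames per clip are passed to the network regardless of clip length, which is where the reduction in computational cost would come from.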