
Hierarchical dynamic depth projected difference images–based action recognition in videos with convolutional neural networks

Temporal information plays a significant role in video-based human action recognition. How to effectively extract the spatial–temporal characteristics of actions in videos has long been a challenging problem, and most existing methods acquire spatial and temporal cues in videos separately. In this article, we propose a new, effective representation for depth video sequences, called hierarchical dynamic depth projected difference images, that aggregates spatial and temporal action information simultaneously at different temporal scales. We first project depth video sequences onto three orthogonal Cartesian views to capture the 3D shape and motion information of human actions. Hierarchical dynamic depth projected difference images are then constructed with rank pooling in each projected view to hierarchically encode the spatial–temporal motion dynamics in depth videos. Convolutional neural networks can automatically learn discriminative features from images and have been extended to video classification because of their superior performance. To verify the effectiveness of the hierarchical dynamic depth projected difference images representation, we construct an action recognition framework in which the hierarchical dynamic depth projected difference images of the three views are fed independently into three identical pretrained convolutional neural networks for fine-tuning. We design three classification schemes in the framework; the schemes utilize different convolutional neural network layers, allowing their effects on action recognition to be compared. The three views are combined in each classification scheme to describe the actions more comprehensively. The proposed framework is evaluated on three challenging public human action data sets. Experiments indicate that our method performs better than existing approaches and provides discriminative spatial–temporal information for human action recognition in depth videos.
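The first step described in the abstract, projecting each depth frame onto three orthogonal Cartesian views (front, side, and top), can be sketched as follows. This is a minimal illustration in the spirit of common depth-map projection schemes; the bin count, the `max_depth` parameter, and the choice of what intensity each view records are assumptions, since the abstract does not specify them.

```python
import numpy as np

def project_depth_frame(depth, max_depth=4000.0, depth_bins=256):
    """Project one depth frame onto three orthogonal Cartesian views.

    depth: (H, W) array of raw depth values (0 = background), e.g. from a
    Kinect sensor. Returns (front, side, top) maps. The front view is the
    depth map itself; the side and top views place each foreground pixel
    into a quantized depth bin (one common convention -- the paper's exact
    convention is not given in the abstract).
    """
    H, W = depth.shape
    front = depth.astype(np.float32)

    # Quantize depth into bins so the side/top views have a fixed size.
    z = np.clip(depth / max_depth * (depth_bins - 1), 0, depth_bins - 1).astype(int)

    side = np.zeros((H, depth_bins), dtype=np.float32)  # rows x depth bins
    top = np.zeros((depth_bins, W), dtype=np.float32)   # depth bins x columns
    ys, xs = np.nonzero(depth > 0)                      # foreground pixels only
    side[ys, z[ys, xs]] = xs                            # seen from the side: x becomes intensity
    top[z[ys, xs], xs] = ys                             # seen from above: y becomes intensity
    return front, side, top
```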

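The core of the representation is rank pooling applied hierarchically to the frames of each projected view. The sketch below uses the linear approximate-rank-pooling weights of Bilen et al. (2016) and an assumed binary-tree temporal hierarchy (level l splits the sequence into 2**l segments); the paper's exact pooling formulation and windowing scheme are not detailed in the abstract.

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Collapse an ordered (T, H, W) stack into a single 'dynamic image'.

    Uses the linear approximate-rank-pooling weights alpha_t = 2t - T - 1,
    which weight later frames more heavily and so encode the temporal
    evolution of the view; the paper's rank pooling may use the exact
    optimization-based formulation instead of this approximation.
    """
    T = frames.shape[0]
    alpha = 2.0 * np.arange(1, T + 1) - T - 1           # one weight per frame
    return np.tensordot(alpha, frames.astype(np.float32), axes=1)

def hierarchical_dynamic_images(frames, levels=3):
    """Rank-pool one view's frames at several temporal scales.

    Level l splits the sequence into 2**l equal segments and pools each
    one, so coarse levels capture whole-action dynamics and fine levels
    capture short-term motion. The binary splitting is an assumption.
    """
    T = frames.shape[0]
    images = []
    for level in range(levels):
        bounds = np.linspace(0, T, 2 ** level + 1).astype(int)
        for a, b in zip(bounds[:-1], bounds[1:]):
            if b - a >= 2:                              # need >= 2 frames to encode motion
                images.append(approximate_rank_pooling(frames[a:b]))
    return images
```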

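For the recognition stage, the abstract describes feeding the per-view images into three identical pretrained convolutional neural networks and combining the views. Below is a minimal PyTorch sketch of such a three-stream design, assuming a ResNet-18 backbone and score averaging as the fusion scheme; the paper compares three classification schemes built on different CNN layers, whose details are not given in the abstract.

```python
import torch.nn as nn
from torchvision import models

class ThreeViewActionNet(nn.Module):
    """Three identical ImageNet-pretrained CNNs, one per projected view.

    Each stream is fine-tuned independently on the hierarchical dynamic
    images of its view; class scores are fused by averaging (an assumed
    late-fusion choice, standing in for the paper's fusion schemes).
    """
    def __init__(self, num_classes):
        super().__init__()
        def make_stream():
            # Backbone choice (ResNet-18) is an assumption for illustration.
            net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            net.fc = nn.Linear(net.fc.in_features, num_classes)
            return net
        self.front, self.side, self.top = make_stream(), make_stream(), make_stream()

    def forward(self, x_front, x_side, x_top):
        # Late fusion: average the per-view class scores.
        return (self.front(x_front) + self.side(x_side) + self.top(x_top)) / 3.0
```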
Bibliographic Details
Published in: International Journal of Advanced Robotic Systems, 2019-01, Vol. 16 (1), p. 1-8
Main Authors: Wu, Hanbo; Ma, Xin; Li, Yibin
Format: Article
Language: English
Publisher: SAGE Publications (London, England)
ISSN: 1729-8806
EISSN: 1729-8814
DOI: 10.1177/1729881418825093
Subjects: Artificial neural networks; Human activity recognition; Human motion; Image classification; Neural networks; Representations; Three dimensional motion; Video data
Rights: The Author(s) 2019