
A multi-resolution fusion approach for human activity recognition from video data in tiny edge devices

Bibliographic Details
Published in: Information Fusion, 2023-12, Vol. 100, p. 101953, Article 101953
Main Authors: Nooruddin, Sheikh, Islam, Md. Milon, Karray, Fakhri, Muhammad, Ghulam
Format: Article
Language:English
Description
Summary: Human Activity Recognition (HAR) is the process of automatically recognizing Activities of Daily Living (ADL) from human motion data captured in various modalities by wearable and ambient sensors. Advances in Deep Learning, especially Convolutional Neural Networks (CNNs) and sequential models, have revolutionized HAR from video data sources. Although these models are highly accurate and exploit both spatial and temporal information, they require large computation and memory resources, making them unsuitable for edge or wearable applications. Tiny Machine Learning (TinyML) aims to optimize such models in terms of compute and memory requirements so that they become suitable for always-on, resource-constrained devices, reducing communication latency and network traffic for HAR frameworks. In this paper, we propose a two-stream multi-resolution fusion architecture for HAR from the video data modality. The context stream takes a resized image as input, and the fovea stream takes the cropped center portion of the resized image, reducing the overall input dimensionality. We tested two quantization methods, Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), to optimize these models for deployment on edge devices, and evaluated their performance on two challenging video datasets: KTH and UCF11. We performed ablation studies to validate the two-stream model's performance. We deployed the proposed architecture on commercial resource-constrained devices and monitored their performance in terms of inference latency and power consumption. The results indicate that the proposed architecture clearly outperforms the other relevant single-stream models tested in this work in terms of accuracy, precision, recall, and F1-score while also reducing the overall model size.
• A novel two-stream multi-resolution fusion architecture for HAR from video data.
• The context stream inputs the lower-resolution images.
• The fovea stream takes center-cropped portions of the images as inputs.
• Quantizing the architecture to reduce its size to fit into low-power devices.
• Deployment of the proposed approach in commodity tiny edge devices.
ISSN:1566-2535
1872-6305
DOI:10.1016/j.inffus.2023.101953