Driver Gaze Zone Estimation Based on Three-Channel Convolution-Optimized Vision Transformer With Transfer Learning

Bibliographic Details
Published in: IEEE Sensors Journal, 2024-12, Vol. 24 (24), p. 42064-42078
Main Authors: Li, Zhao; Jiang, Siyang; Fu, Rui; Guo, Yingshi; Wang, Chang
Format: Article
Language: English
Description
Summary: Driver gaze zone estimation (DGZE) is essential for detecting the driver's state and for formulating takeover rules in intelligent driving systems. However, convolutional neural network (CNN)-based multichannel models lack global feature extraction capability and have a large number of parameters and high computational complexity. This article therefore proposes a novel method that uses a three-channel convolution-optimized vision transformer (3C-CoViT) to estimate the driver's gaze zone. The method replaces the linear projection in the pure ViT structure with a convolutional projection, converts the input images of the different channels into image sequences, and adds a convolutional feed-forward network to extract local features of the tokens, strengthen the correlation between spatially adjacent tokens, and improve the performance and efficiency of the model. Following a transfer learning approach, we pretrained the model on the GazeCapture dataset and then fine-tuned it on a dataset built from an actual on-road experiment. To enhance the interpretability of the model, we present a novel visualization method. Experimental results show that the proposed method accurately identifies driver gaze zones (98.04% average accuracy) and outperforms state-of-the-art methods in accuracy and reliability. Ablation studies demonstrate the advantage of the proposed method over the pure ViT and the benefits of transfer learning and three-channel input.
ISSN: 1530-437X
DOI: 10.1109/JSEN.2024.3486373
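
Illustrative note: the summary says the method replaces the ViT linear projection with a convolutional projection. The following is a minimal NumPy sketch of that general idea only, not the authors' 3C-CoViT implementation. The function name `conv_projection`, the kernel size, stride, padding, and embedding width are all assumptions chosen for illustration; the record does not state the actual hyperparameters, what the three input channels are, or how the convolutional feed-forward network is built, so the sketch handles a single grayscale input.

```python
import numpy as np

def conv_projection(image, kernel, stride, padding=0):
    """Turn an image (C, H, W) into a token sequence (N, D) with a strided convolution.

    kernel has shape (D, C, k, k). All sizes are illustrative assumptions.
    """
    c, h, w = image.shape
    d, kernel_c, k, _ = kernel.shape
    assert kernel_c == c, "kernel and image must have the same channel count"

    # Zero-pad the spatial dimensions only.
    padded = np.pad(image, ((0, 0), (padding, padding), (padding, padding)))
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1

    flat_kernel = kernel.reshape(d, -1)  # (D, C*k*k)
    tokens = np.empty((out_h * out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = padded[:, top:top + k, left:left + k]
            # One token per window; windows overlap whenever stride < k.
            tokens[i * out_w + j] = flat_kernel @ window.reshape(-1)
    return tokens, (out_h, out_w)

rng = np.random.default_rng(0)
image = rng.standard_normal((1, 32, 32))      # hypothetical grayscale input
kernel = rng.standard_normal((16, 1, 7, 7))   # hypothetical 16-dim embedding
tokens, grid = conv_projection(image, kernel, stride=4, padding=3)
print(tokens.shape, grid)
```

When the stride equals the kernel size and no padding is used, the windows do not overlap and the operation reduces to the ordinary ViT patch-wise linear projection. With a smaller stride, neighbouring tokens share pixels, which is one way a convolutional projection can tie adjacent tokens together spatially.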