Driver Gaze Zone Estimation Based on Three-Channel Convolution-Optimized Vision Transformer With Transfer Learning

Driver gaze zone estimation (DGZE) is essential for detecting the driver's state and taking over rule-making in intelligent driving systems. However, convolutional neural network (CNN)-based multichannel models lack global feature extraction capability, with a large number of parameters and hig...

Bibliographic Details
Published in: IEEE sensors journal 2024-12, Vol.24 (24), p.42064-42078
Main Authors: Li, Zhao; Jiang, Siyang; Fu, Rui; Guo, Yingshi; Wang, Chang
Format: Article
Language: English
container_end_page 42078
container_issue 24
container_start_page 42064
container_title IEEE sensors journal
container_volume 24
creator Li, Zhao
Jiang, Siyang
Fu, Rui
Guo, Yingshi
Wang, Chang
description Driver gaze zone estimation (DGZE) is essential for detecting the driver's state and taking over rule-making in intelligent driving systems. However, convolutional neural network (CNN)-based multichannel models lack global feature extraction capability, with a large number of parameters and high computational complexity. Therefore, this article proposes a novel method that uses a three-channel convolution-optimized vision transformer (3C-CoViT) to estimate the driver's gaze zone. The method replaces the linear projection in the pure ViT structure with convolutional projection, converts the input images of different channels into image sequences, and then adds a convolutional feed-forward network to extract the local features of the markers, enhance the correlation of adjacent tokens in spatial dimensions, and improve the performance and efficiency of the model. We then pretrained the model on the GazeCapture dataset based on transfer learning and then fine-tuned the model on the dataset built in the actual road experiment. To enhance the interpretability of the model, we presented a novel visualization method. Experimental results show that the proposed method can accurately identify driver gaze zones (98.04% average accuracy) and outperform state-of-the-art methods in terms of accuracy and reliability. Ablation studies proved the effectiveness of our proposed method over the pure ViT and the beneficial effects of transfer learning and three-channel information input.
doi_str_mv 10.1109/JSEN.2024.3486373
format article
coden ISJEAZ
creationdate 2024-12-15
orcidid 0000-0003-3531-1215
0000-0003-1530-8115
0000-0002-5732-3068
0000-0001-9384-7558
publisher IEEE
toplevel peer_reviewed
online_resources
fulltext fulltext
identifier ISSN: 1530-437X
ispartof IEEE sensors journal, 2024-12, Vol.24 (24), p.42064-42078
issn 1530-437X
language eng
recordid cdi_ieee_primary_10740606
source IEEE Xplore (Online service)
subjects Accuracy
Computational modeling
Convolutional neural network (CNN)
deep learning
driver gaze zone
Estimation
Face recognition
Feature extraction
Head
Magnetic heads
Mathematical models
transfer learning
Transformers
Vehicles
vision transformer (ViT)
visual interpretability
title Driver Gaze Zone Estimation Based on Three-Channel Convolution-Optimized Vision Transformer With Transfer Learning
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T23%3A09%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Driver%20Gaze%20Zone%20Estimation%20Based%20on%20Three-Channel%20Convolution-Optimized%20Vision%20Transformer%20With%20Transfer%20Learning&rft.jtitle=IEEE%20sensors%20journal&rft.au=Li,%20Zhao&rft.date=2024-12-15&rft.volume=24&rft.issue=24&rft.spage=42064&rft.epage=42078&rft.pages=42064-42078&rft.issn=1530-437X&rft.coden=ISJEAZ&rft_id=info:doi/10.1109/JSEN.2024.3486373&rft_dat=%3Cieee%3E10740606%3C/ieee%3E%3Cgrp_id%3Ecdi_FETCH-ieee_primary_107406063%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10740606&rfr_iscdi=true