Driver Gaze Zone Estimation Based on Three-Channel Convolution-Optimized Vision Transformer With Transfer Learning
Driver gaze zone estimation (DGZE) is essential for detecting the driver's state and taking over rule-making in intelligent driving systems. However, convolutional neural network (CNN)-based multichannel models lack global feature extraction capability, with a large number of parameters and hig...
Published in: | IEEE sensors journal 2024-12, Vol.24 (24), p.42064-42078 |
---|---|
Main Authors: | Li, Zhao; Jiang, Siyang; Fu, Rui; Guo, Yingshi; Wang, Chang |
Format: | Article |
Language: | English |
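Subjects: | Accuracy; Computational modeling; Convolutional neural network (CNN); deep learning; driver gaze zone; Estimation; Face recognition; Feature extraction; Head; Magnetic heads; Mathematical models; transfer learning; Transformers; Vehicles; vision transformer (ViT); visual interpretability |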
Online Access: | Get full text |
cited_by | |
---|---|
cites | |
container_end_page | 42078 |
container_issue | 24 |
container_start_page | 42064 |
container_title | IEEE sensors journal |
container_volume | 24 |
creator | Li, Zhao ; Jiang, Siyang ; Fu, Rui ; Guo, Yingshi ; Wang, Chang |
description | Driver gaze zone estimation (DGZE) is essential for detecting the driver's state and taking over rule-making in intelligent driving systems. However, convolutional neural network (CNN)-based multichannel models lack global feature extraction capability and have a large number of parameters and high computational complexity. Therefore, this article proposes a novel method that uses a three-channel convolution-optimized vision transformer (3C-CoViT) to estimate the driver's gaze zone. The method replaces the linear projection in the pure ViT structure with convolutional projection, converts the input images of different channels into image sequences, and then adds a convolutional feed-forward network to extract the local features of the markers, enhance the correlation of adjacent tokens in spatial dimensions, and improve the performance and efficiency of the model. Following a transfer learning approach, we pretrained the model on the GazeCapture dataset and then fine-tuned it on the dataset built in the actual road experiment. To enhance the interpretability of the model, we presented a novel visualization method. Experimental results show that the proposed method can accurately identify driver gaze zones (98.04% average accuracy) and outperform state-of-the-art methods in terms of accuracy and reliability. Ablation studies proved the effectiveness of our proposed method over the pure ViT and the beneficial effects of transfer learning and three-channel information input. |
doi_str_mv | 10.1109/JSEN.2024.3486373 |
format | article |
publisher | IEEE |
coden | ISJEAZ |
date | 2024-12-15 |
orcidid | 0000-0003-3531-1215 ; 0000-0003-1530-8115 ; 0000-0002-5732-3068 ; 0000-0001-9384-7558 |
toplevel | peer_reviewed ; online_resources |
fulltext | fulltext |
identifier | ISSN: 1530-437X |
ispartof | IEEE sensors journal, 2024-12, Vol.24 (24), p.42064-42078 |
issn | 1530-437X |
language | eng |
recordid | cdi_ieee_primary_10740606 |
source | IEEE Xplore (Online service) |
subjects | Accuracy ; Computational modeling ; Convolutional neural network (CNN) ; deep learning ; driver gaze zone ; Estimation ; Face recognition ; Feature extraction ; Head ; Magnetic heads ; Mathematical models ; transfer learning ; Transformers ; Vehicles ; vision transformer (ViT) ; visual interpretability |
title | Driver Gaze Zone Estimation Based on Three-Channel Convolution-Optimized Vision Transformer With Transfer Learning |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T23%3A09%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Driver%20Gaze%20Zone%20Estimation%20Based%20on%20Three-Channel%20Convolution-Optimized%20Vision%20Transformer%20With%20Transfer%20Learning&rft.jtitle=IEEE%20sensors%20journal&rft.au=Li,%20Zhao&rft.date=2024-12-15&rft.volume=24&rft.issue=24&rft.spage=42064&rft.epage=42078&rft.pages=42064-42078&rft.issn=1530-437X&rft.coden=ISJEAZ&rft_id=info:doi/10.1109/JSEN.2024.3486373&rft_dat=%3Cieee%3E10740606%3C/ieee%3E%3Cgrp_id%3Ecdi_FETCH-ieee_primary_107406063%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10740606&rfr_iscdi=true |