Vision Transformer-based Feature Extraction for Generalized Zero-Shot Learning

Generalized zero-shot learning (GZSL) is a technique to train a deep learning model to identify unseen classes using the image attribute. In this paper, we put forth a new GZSL approach exploiting Vision Transformer (ViT) to maximize the attribute-related information contained in the image feature. In ViT, the entire image region is processed without the degradation of the image resolution and the local image information is preserved in patch features. To fully enjoy these benefits of ViT, we exploit patch features as well as the CLS feature in extracting the attribute-related image feature. In particular, we propose a novel attention-based module, called attribute attention module (AAM), to aggregate the attribute-related information in patch features. In AAM, the correlation between each patch feature and the synthetic image attribute is used as the importance weight for each patch. From extensive experiments on benchmark datasets, we demonstrate that the proposed technique outperforms the state-of-the-art GZSL approaches by a large margin.
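As a reading aid, below is a minimal PyTorch sketch of the mechanism the abstract describes for the attribute attention module (AAM): the correlation between each ViT patch feature and the synthetic image attribute serves as a per-patch importance weight for aggregation. The class name, the linear projection of the attribute into the patch-feature space, the scaled dot product as the correlation measure, and all tensor shapes are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAttentionModule(nn.Module):
    """Sketch of AAM-style weighted pooling over ViT patch features.

    Hypothetical design: project the attribute vector into the
    patch-feature space, score each patch by a scaled dot product
    against it, and aggregate the patches with the softmax weights.
    """

    def __init__(self, feat_dim: int, attr_dim: int):
        super().__init__()
        # Assumed: a learned projection aligning the attribute space
        # with the patch-feature space so a correlation can be taken.
        self.attr_proj = nn.Linear(attr_dim, feat_dim)

    def forward(self, patch_feats: torch.Tensor, attr: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) ViT patch features; attr: (B, A) attribute vector.
        q = self.attr_proj(attr)                              # (B, D)
        scores = torch.einsum("bnd,bd->bn", patch_feats, q)   # correlation per patch
        weights = F.softmax(scores / q.size(-1) ** 0.5, dim=-1)
        # Importance-weighted aggregation of the patch features.
        return torch.einsum("bn,bnd->bd", weights, patch_feats)  # (B, D)

# Example shapes (assumed): ViT-B/16 on 224x224 images yields 196 patch
# features of dimension 768; CUB-style attribute vectors have 312 entries.
aam = AttributeAttentionModule(feat_dim=768, attr_dim=312)
pooled = aam(torch.randn(4, 196, 768), torch.randn(4, 312))  # -> (4, 768)

Per the abstract, such an aggregated patch feature would then be used alongside the CLS feature (e.g., by concatenation) to form the final attribute-related image feature.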

Bibliographic Details
Published in: arXiv.org, 2023-02
Main Authors: Kim, Jiseob; Shim, Kyuhong; Kim, Junhan; Shim, Byonghyo
Format: Article
Language: English
Subjects: Deep learning; Feature extraction; Image resolution; Machine learning; Modules; Zero-shot learning
Identifier: EISSN 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Rights: 2023; published under CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)