Vision Transformer-based Feature Extraction for Generalized Zero-Shot Learning
Generalized zero-shot learning (GZSL) is a technique for training a deep learning model to identify unseen classes using image attributes. In this paper, we put forth a new GZSL approach that exploits the Vision Transformer (ViT) to maximize the attribute-related information contained in the image feature. In ViT, the entire image region is processed without degrading the image resolution, and local image information is preserved in the patch features. To fully exploit these benefits of ViT, we use the patch features as well as the CLS feature when extracting the attribute-related image feature. In particular, we propose a novel attention-based module, called the attribute attention module (AAM), to aggregate the attribute-related information in the patch features. In AAM, the correlation between each patch feature and the synthetic image attribute is used as the importance weight for that patch. Extensive experiments on benchmark datasets demonstrate that the proposed technique outperforms state-of-the-art GZSL approaches by a large margin.
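The aggregation step described in the abstract admits a compact implementation. Below is a minimal PyTorch sketch of such an attribute attention module, assuming a dot-product correlation and a learned projection of the attribute vector into the patch-feature space; the class and variable names, the projection layer, and all dimensions are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of an attribute attention module (AAM) as described
# in the abstract. All names, the projection layer, and the dimensions are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAttentionModule(nn.Module):
    """Aggregates ViT patch features, weighting each patch by its
    correlation with a (synthetic) image attribute vector."""

    def __init__(self, feat_dim: int, attr_dim: int):
        super().__init__()
        # Project the attribute vector into the patch-feature space so a
        # dot-product correlation is well defined (assumed design choice).
        self.attr_proj = nn.Linear(attr_dim, feat_dim)

    def forward(self, patch_feats: torch.Tensor, attr: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) patch features from a ViT encoder
        # attr:        (B, A) synthetic image attribute vector
        query = self.attr_proj(attr)                       # (B, D)
        # Correlation between each patch feature and the attribute vector.
        scores = torch.einsum("bnd,bd->bn", patch_feats, query)
        weights = F.softmax(scores, dim=-1)                # importance per patch
        # Attribute-weighted aggregation of the patch features.
        return torch.einsum("bn,bnd->bd", weights, patch_feats)

# Usage: combine the aggregated patch feature with the ViT CLS feature.
aam = AttributeAttentionModule(feat_dim=768, attr_dim=312)
patches = torch.randn(4, 196, 768)   # e.g. ViT-B/16 patch tokens
attrs = torch.randn(4, 312)          # e.g. a CUB-sized attribute vector
cls_feat = torch.randn(4, 768)
image_feat = torch.cat([cls_feat, aam(patches, attrs)], dim=-1)
```

Under this sketch, the softmax normalizes the patch weights to sum to one, so patches whose features correlate strongly with the attribute vector dominate the aggregated image feature.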
Published in: arXiv.org, 2023-02
Main Authors: Kim, Jiseob; Shim, Kyuhong; Kim, Junhan; Shim, Byonghyo
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Deep learning; Feature extraction; Image resolution; Machine learning; Modules; Zero-shot learning