
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model

In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual quest...

Bibliographic Details
Published in: arXiv.org 2024-09
Main Authors: Sakurai, Kosuke; Ishii, Tatsuya; Shimizu, Ryotaro; Song, Linxin; Goto, Masayuki
Format: Article
Language: English
cited_by
cites
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Sakurai, Kosuke
Ishii, Tatsuya
Shimizu, Ryotaro
Song, Linxin
Goto, Masayuki
description In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual questions." Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. By sampling the augmented image embeddings from within this latent region, LARE enables data augmentation to various unseen domains, not just to specific unseen domains. LARE achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs. We demonstrate that LARE outperforms previous fine-tuning models in terms of image classification accuracy on three benchmarks. We also demonstrate that LARE is a more robust and general model that is valid under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.
format article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3107309950</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3107309950</sourcerecordid><originalsourceid>FETCH-proquest_journals_31073099503</originalsourceid><addsrcrecordid>eNqNjNEKgjAYRkcQJOU7_NC1MLfM7E7C6EJvIrqVhX9rolu5jV6_BT1AV4fvcPhmJGKcp8luw9iCxNb2lFK2zVmW8Yg0dXmu9lALh9pB6eUYKJwyGrxVWsIZZRhigGq8Ydd91Vu5B1yVDT6phZZeSITGdDisyPwuBovxj0uyPlaXwyl5Tubl0bq2N34Kb7blKc05LYqM8v-qDyLoPRo</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3107309950</pqid></control><display><type>article</type><title>LARE: Latent Augmentation using Regional Embedding with Vision-Language Model</title><source>Publicly Available Content Database</source><creator>Sakurai, Kosuke ; Ishii, Tatsuya ; Shimizu, Ryotaro ; Song, Linxin ; Goto, Masayuki</creator><creatorcontrib>Sakurai, Kosuke ; Ishii, Tatsuya ; Shimizu, Ryotaro ; Song, Linxin ; Goto, Masayuki</creatorcontrib><description>In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual questions." Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. 
By sampling the augmented image embeddings from within this latent region, LARE enables data augmentation to various unseen domains, not just to specific unseen domains. LARE achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs. We demonstrate that LARE outperforms previous fine-tuning models in terms of image classification accuracy on three benchmarks. We also demonstrate that LARE is a more robust and general model that is valid under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Accuracy ; Data augmentation ; Embedding ; Image classification ; Language ; Regional development ; Robustness ; Vision</subject><ispartof>arXiv.org, 2024-09</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/3107309950?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>780,784,25752,37011,44589</link.rule.ids></links><search><creatorcontrib>Sakurai, Kosuke</creatorcontrib><creatorcontrib>Ishii, Tatsuya</creatorcontrib><creatorcontrib>Shimizu, Ryotaro</creatorcontrib><creatorcontrib>Song, Linxin</creatorcontrib><creatorcontrib>Goto, Masayuki</creatorcontrib><title>LARE: Latent Augmentation using Regional Embedding with Vision-Language Model</title><title>arXiv.org</title><description>In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual questions." Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. By sampling the augmented image embeddings from within this latent region, LARE enables data augmentation to various unseen domains, not just to specific unseen domains. 
LARE achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs. We demonstrate that LARE outperforms previous fine-tuning models in terms of image classification accuracy on three benchmarks. We also demonstrate that LARE is a more robust and general model that is valid under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.</description><subject>Accuracy</subject><subject>Data augmentation</subject><subject>Embedding</subject><subject>Image classification</subject><subject>Language</subject><subject>Regional development</subject><subject>Robustness</subject><subject>Vision</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNjNEKgjAYRkcQJOU7_NC1MLfM7E7C6EJvIrqVhX9rolu5jV6_BT1AV4fvcPhmJGKcp8luw9iCxNb2lFK2zVmW8Yg0dXmu9lALh9pB6eUYKJwyGrxVWsIZZRhigGq8Ydd91Vu5B1yVDT6phZZeSITGdDisyPwuBovxj0uyPlaXwyl5Tubl0bq2N34Kb7blKc05LYqM8v-qDyLoPRo</recordid><startdate>20240919</startdate><enddate>20240919</enddate><creator>Sakurai, Kosuke</creator><creator>Ishii, Tatsuya</creator><creator>Shimizu, Ryotaro</creator><creator>Song, Linxin</creator><creator>Goto, Masayuki</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240919</creationdate><title>LARE: Latent Augmentation using Regional Embedding with Vision-Language Model</title><author>Sakurai, Kosuke ; Ishii, Tatsuya ; Shimizu, Ryotaro ; Song, Linxin ; Goto, 
Masayuki</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31073099503</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Accuracy</topic><topic>Data augmentation</topic><topic>Embedding</topic><topic>Image classification</topic><topic>Language</topic><topic>Regional development</topic><topic>Robustness</topic><topic>Vision</topic><toplevel>online_resources</toplevel><creatorcontrib>Sakurai, Kosuke</creatorcontrib><creatorcontrib>Ishii, Tatsuya</creatorcontrib><creatorcontrib>Shimizu, Ryotaro</creatorcontrib><creatorcontrib>Song, Linxin</creatorcontrib><creatorcontrib>Goto, Masayuki</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Sakurai, Kosuke</au><au>Ishii, Tatsuya</au><au>Shimizu, Ryotaro</au><au>Song, Linxin</au><au>Goto, 
Masayuki</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>LARE: Latent Augmentation using Regional Embedding with Vision-Language Model</atitle><jtitle>arXiv.org</jtitle><date>2024-09-19</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>In recent years, considerable research has been conducted on vision-language models that handle both image and text data; these models are being applied to diverse downstream tasks, such as "image-related chat," "image recognition by instruction," and "answering visual questions." Vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), are also high-performance image classifiers that are being developed into domain adaptation methods that can utilize language information to extend into unseen domains. However, because these VLMs embed images as a single point in a unified embedding space, there is room for improvement in the classification accuracy. Therefore, in this study, we proposed the Latent Augmentation using Regional Embedding (LARE), which embeds the image as a region in the unified embedding space learned by the VLM. By sampling the augmented image embeddings from within this latent region, LARE enables data augmentation to various unseen domains, not just to specific unseen domains. LARE achieves robust image classification for domains in and out using augmented image embeddings to fine-tune VLMs. We demonstrate that LARE outperforms previous fine-tuning models in terms of image classification accuracy on three benchmarks. We also demonstrate that LARE is a more robust and general model that is valid under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-09
issn 2331-8422
language eng
recordid cdi_proquest_journals_3107309950
source Publicly Available Content Database
subjects Accuracy
Data augmentation
Embedding
Image classification
Language
Regional development
Robustness
Vision
title LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-12T00%3A04%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=LARE:%20Latent%20Augmentation%20using%20Regional%20Embedding%20with%20Vision-Language%20Model&rft.jtitle=arXiv.org&rft.au=Sakurai,%20Kosuke&rft.date=2024-09-19&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3107309950%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_31073099503%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3107309950&rft_id=info:pmid/&rfr_iscdi=true