VLMine: Long-Tail Data Mining with Vision Language Models
Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach uses a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples compared to conventional methods based on model uncertainty, so we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10% and 50%) over baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
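The abstract sketches a two-part recipe: score each unlabeled image by how rare its VLM-generated keywords are across the corpus, then fuse that rarity signal with other mining signals such as model uncertainty. Below is a minimal Python sketch of that idea under stated assumptions: the function names, the max-rarity scoring rule, and the reciprocal-rank fusion are illustrative choices, not the paper's actual implementation, and the VLM keywording step is assumed to have already produced a keyword list per image.

```python
from collections import Counter
from typing import List, Sequence

def keyword_rarity_scores(keywords_per_image: Sequence[Sequence[str]]) -> List[float]:
    """Score each image by how rare its VLM-generated keywords are.

    An image whose keywords appear infrequently across the corpus is
    treated as a long-tail candidate. Illustrative rule: an image's
    score is the rarity of its rarest keyword.
    """
    # Count, for each keyword, how many images mention it at least once.
    counts: Counter = Counter(kw for kws in keywords_per_image for kw in set(kws))
    total = len(keywords_per_image)
    scores: List[float] = []
    for kws in keywords_per_image:
        if not kws:
            scores.append(0.0)
            continue
        # Rarity of a keyword = 1 - (fraction of images containing it).
        scores.append(max(1.0 - counts[kw] / total for kw in set(kws)))
    return scores

def combine_rankings(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Fuse several mining signals by summing reciprocal ranks, so that
    signals on different scales can be mixed without calibration."""
    n = len(score_lists[0])
    fused = [0.0] * n
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order, start=1):
            fused[idx] += 1.0 / rank
    return fused

# Hypothetical usage: fuse keyword rarity with a per-image uncertainty signal.
vlm_keywords = [["sedan", "rain"], ["sedan"], ["overturned truck", "rain"]]
model_uncertainty = [0.1, 0.05, 0.8]  # assumed to come from a separate miner
fused = combine_rankings([keyword_rarity_scores(vlm_keywords), model_uncertainty])
```

Reciprocal-rank fusion is used here only because it combines heterogeneous signals without calibrating their scales; the paper's own combination rule may differ.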
Published in: arXiv.org, 2024-09
Main Authors: Ye, Mao; Meyer, Gregory P.; Zhang, Zaiwei; Park, Dennis; Mustikovela, Siva Karthik; Chai, Yuning; Wolff, Eric M.
Format: Article
Language: English
Subjects: Algorithms; Data mining; Image classification; Machine learning; Object recognition; Signal classification
EISSN: 2331-8422
Source: Publicly Available Content Database