
VLMine: Long-Tail Data Mining with Vision Language Models

Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach utilizes a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty. Therefore, we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
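The abstract describes the mechanism only at a high level: a VLM summarizes each unlabeled image into keywords, rarity is scored from keyword frequency across the corpus, and that keyword signal is combined with other mining signals such as model uncertainty. The sketch below illustrates that idea in plain Python; it is not the authors' implementation, and the function names (rarity_scores, combine_signals), the inverse-frequency scoring, and the fixed blending weight are assumptions introduced here for illustration only.

    # Illustrative sketch of keyword-frequency-based long-tail mining.
    # Assumes a VLM has already produced a list of keywords per unlabeled image;
    # names and formulas are hypothetical, not taken from the paper.
    from collections import Counter
    from typing import Dict, List

    def rarity_scores(keywords_per_image: Dict[str, List[str]]) -> Dict[str, float]:
        """Score each image by how rare its VLM keywords are across the corpus."""
        counts = Counter(kw for kws in keywords_per_image.values() for kw in kws)
        total = sum(counts.values())
        scores: Dict[str, float] = {}
        for image_id, kws in keywords_per_image.items():
            if not kws:
                scores[image_id] = 0.0
                continue
            # Rare keywords (low corpus frequency) contribute scores close to 1.
            scores[image_id] = sum(1.0 - counts[kw] / total for kw in kws) / len(kws)
        return scores

    def combine_signals(keyword_signal: Dict[str, float],
                        uncertainty_signal: Dict[str, float],
                        weight: float = 0.5) -> Dict[str, float]:
        """Blend two mining signals (e.g., keyword rarity and model uncertainty)."""
        ids = set(keyword_signal) | set(uncertainty_signal)
        return {i: weight * keyword_signal.get(i, 0.0)
                   + (1.0 - weight) * uncertainty_signal.get(i, 0.0)
                for i in ids}

    if __name__ == "__main__":
        demo = {
            "img_001": ["car", "pedestrian"],
            "img_002": ["car", "pedestrian"],
            "img_003": ["car", "overturned truck"],  # rare keyword -> ranked first
        }
        ranked = sorted(rarity_scores(demo).items(), key=lambda kv: -kv[1])
        print(ranked)

A full mining loop would then rank the unlabeled images by the combined score and send the top-ranked ones to labeling; the blending weight between the signals would be tuned per task.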

Bibliographic Details
Published in: arXiv.org, 2024-09
Main Authors: Mao Ye; Meyer, Gregory P.; Zhang, Zaiwei; Park, Dennis; Mustikovela, Siva Karthik; Chai, Yuning; Wolff, Eric M.
Format: Article
Language: English
EISSN: 2331-8422
Source: Publicly Available Content Database
Subjects: Algorithms; Data mining; Image classification; Machine learning; Object recognition; Signal classification