
Tools Identification By On-Board Adaptation of Vision-and-Language Models

A robotic workshop assistant has been a long-standing grand challenge for robotics, speech, computer vision, and artificial intelligence (AI) research. We revisit the goal of visual identification of tools from human queries in the current era of Large Vision-and-Language models (like GPT-4). We find that current off-the-shelf models (that are trained on internet images) are unable to overcome the domain shift and unable to identify small, obscure tools in cluttered environments. Furthermore, these models are unable to match tools to their intended purpose or affordances. We present a novel system for online domain adaptation that can be run directly on a small on-board processor. The system uses Hyperdimensional Computing (HD), a fast and efficient neuromorphic method. We adapted CLIP to work with explicit ("I need the hammer") and implicit purpose-driven queries ("Drive these nails"), and even with depth images as input. This demo allows the user to try out various real tools and interact via free-form audio.
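
The abstract describes the approach only at a high level: frozen CLIP-style embeddings combined with a lightweight hyperdimensional (HD) classifier that can be updated on-board from a few in-domain examples. The sketch below illustrates that general idea and is not the authors' implementation; the embedding dimension, projection scheme, and tool names are assumptions, and real CLIP image features are mocked with random vectors so the script runs standalone.

    # Minimal sketch (not the authors' code) of HD adaptation on top of frozen
    # CLIP-style embeddings. Real image features are assumed as inputs; here they
    # are mocked with random vectors so the script runs without model downloads.
    import numpy as np

    rng = np.random.default_rng(0)

    EMB_DIM = 512      # assumed dimensionality of the CLIP embedding space
    HD_DIM = 10_000    # high-dimensional space; random codes become quasi-orthogonal

    # Fixed random projection from embedding space to bipolar HD space.
    projection = rng.standard_normal((EMB_DIM, HD_DIM))

    def encode_hd(embedding: np.ndarray) -> np.ndarray:
        """Project an embedding into HD space and binarize to a bipolar (+1/-1) hypervector."""
        return np.sign(embedding @ projection)

    def bundle(hypervectors: np.ndarray) -> np.ndarray:
        """Bundle (superpose) several hypervectors into one class prototype by majority sign."""
        return np.sign(hypervectors.sum(axis=0))

    # On-board adaptation: build a prototype per tool from a handful of in-domain examples.
    tools = ["hammer", "screwdriver", "wrench"]   # hypothetical tool classes
    prototypes = {}
    for tool in tools:
        # Placeholder for CLIP image embeddings of a few workshop photos of this tool.
        few_shot_embeddings = rng.standard_normal((5, EMB_DIM))
        prototypes[tool] = bundle(np.stack([encode_hd(e) for e in few_shot_embeddings]))

    # Query time: compare a new image crop's hypervector against the adapted prototypes.
    query_embedding = rng.standard_normal(EMB_DIM)  # placeholder for the camera crop's CLIP embedding
    query_hv = encode_hd(query_embedding)
    scores = {tool: float(np.dot(query_hv, proto)) / HD_DIM for tool, proto in prototypes.items()}
    print("best match:", max(scores, key=scores.get), scores)

Because the prototypes are just bundled hypervectors, adapting to a new tool or a new environment amounts to a few additions and sign operations, which is what makes this kind of update cheap enough for a small on-board processor.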


Bibliographic Details
Main Authors: Hu, Jun; Miller, Phil; Lomnitz, Michael; Farkya, Saurabh; Yilmaz, Emre; Raghavan, Aswin; Zhang, David; Piacentino, Michael
Format: Conference Proceeding
Language: English
DOI: 10.1609/aaai.v38i21.30569
Published in: Proceedings of the ... AAAI Conference on Artificial Intelligence, 2024, Vol. 38 (21), pp. 23799-23801
ISSN: 2159-5399
EISSN: 2374-3468
Source: Freely Accessible Journals