ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents
With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent outperforms other prompt-based autonomous agents (CogAgent, AppAgent) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.
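The abstract describes a two-model architecture: an MLLM plans the next action, and a separate UI location model grounds that plan to screen coordinates. A minimal sketch of one iteration of such a loop is below; all function names, signatures, and values are hypothetical stand-ins, not the paper's actual interface.

```python
# Minimal sketch of a ClickAgent-style plan-then-ground loop.
# Hypothetical helper names; the paper pairs an MLLM planner (e.g., GPT-4V)
# with a UI-grounding model (e.g., SeeClick), both stubbed out here.

def plan_action(screenshot: bytes, task: str) -> dict:
    """Stand-in for the MLLM planner: decides what to do next.

    A real implementation would prompt the MLLM with the screenshot
    and task description; here we return a fixed plan for illustration.
    """
    return {"action": "click", "target": "Search button"}

def locate_element(screenshot: bytes, description: str) -> tuple[int, int]:
    """Stand-in for the UI location model: maps a textual element
    description to (x, y) screen coordinates. Dummy value here."""
    return (540, 1200)

def step(screenshot: bytes, task: str) -> dict:
    """One agent iteration: plan with the MLLM, then ground click
    targets with the UI location model, then emit an executable action."""
    plan = plan_action(screenshot, task)
    if plan["action"] == "click":
        x, y = locate_element(screenshot, plan["target"])
        return {"action": "click", "x": x, "y": y}
    return plan

result = step(screenshot=b"", task="search for weather")
print(result)  # {'action': 'click', 'x': 540, 'y': 1200}
```

The division of labor is the point: the planner never emits raw coordinates, which is where current MLLMs are weakest; it emits a description, and the grounding model resolves it to a tap location.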
Published in: | arXiv.org 2024-10 |
---|---|
Main Authors: | Hoscilowicz, Jakub; Maj, Bartosz; Kozakiewicz, Bartosz; Tymoshchuk, Oleksii; Janicki, Artur |
Format: | Article |
Language: | English |
Subjects: | Digital computers; Effectiveness; Graphical user interface; Large language models; Smartphones |
Online Access: | Get full text |
cited_by | |
---|---|
cites | |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Hoscilowicz, Jakub; Maj, Bartosz; Kozakiewicz, Bartosz; Tymoshchuk, Oleksii; Janicki, Artur |
description | With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent outperforms other prompt-based autonomous agents (CogAgent, AppAgent) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-10 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3118117325 |
source | Publicly Available Content Database |
subjects | Digital computers; Effectiveness; Graphical user interface; Large language models; Smartphones |
title | ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T13%3A34%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=ClickAgent:%20Enhancing%20UI%20Location%20Capabilities%20of%20Autonomous%20Agents&rft.jtitle=arXiv.org&rft.au=Hoscilowicz,%20Jakub&rft.date=2024-10-17&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3118117325%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_31181173253%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3118117325&rft_id=info:pmid/&rfr_iscdi=true |