
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

Bibliographic Details
Published in: arXiv.org, 2024-10
Main Authors: Hoscilowicz, Jakub; Maj, Bartosz; Kozakiewicz, Bartosz; Tymoshchuk, Oleksii; Janicki, Artur
Format: Article
Language: English
Subjects: Digital computers; Effectiveness; Graphical user interface; Large language models; Smartphones
Online Access: Get full text
Description: With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent outperforms other prompt-based autonomous agents (CogAgent, AppAgent) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.
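The description above separates reasoning (the MLLM plans the next action) from grounding (a dedicated UI location model such as SeeClick maps a target description to screen coordinates). A minimal sketch of such a two-model control loop follows; all names here (`run_click_agent`, `plan_next_action`, `locate_element`, the `tap` callback) are illustrative placeholders, not the authors' actual API.

```python
# Sketch of the planner/locator split: the MLLM decides WHAT to do,
# a separate grounding model decides WHERE on the screen to do it.

def run_click_agent(task, screenshot_fn, planner, locator, tap, max_steps=20):
    """Drive one GUI task; return True if the planner declares it done."""
    for _ in range(max_steps):
        screen = screenshot_fn()
        # 1. Reasoning / action planning (MLLM): e.g.
        #    {"action": "click", "target": "Search button"} or {"action": "done"}.
        step = planner.plan_next_action(task, screen)
        if step["action"] == "done":
            return True
        if step["action"] == "click":
            # 2. Grounding (e.g. SeeClick): map the textual target
            #    description to (x, y) screen coordinates, then tap.
            x, y = locator.locate_element(screen, step["target"])
            tap(x, y)
    return False  # step budget exhausted without task completion
```

Keeping grounding in a separate model lets the loop report task success rate per episode, which is the metric the evaluation uses.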
EISSN: 2331-8422