
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

Bibliographic Details
Published in: arXiv.org, 2024-10
Main Authors: Hoscilowicz, Jakub; Maj, Bartosz; Kozakiewicz, Bartosz; Tymoshchuk, Oleksii; Janicki, Artur
Format: Article
Language: English
Subjects: Digital computers; Effectiveness; Graphical user interface; Large language models; Smartphones
Online Access: Get full text
Description: With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent outperforms other prompt-based autonomous agents (CogAgent, AppAgent) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.
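The description above separates reasoning (the MLLM plans the next action) from grounding (a dedicated UI location model such as SeeClick maps a target description to screen coordinates). A minimal sketch of such a two-model control loop follows; all names here (`run_click_agent`, `plan_next_action`, `locate_element`, the `tap` callback) are illustrative placeholders, not the authors' actual API.

```python
# Sketch of the planner/locator split: the MLLM decides WHAT to do,
# a separate grounding model decides WHERE on the screen to do it.

def run_click_agent(task, screenshot_fn, planner, locator, tap, max_steps=20):
    """Drive one GUI task; return True if the planner declares it done."""
    for _ in range(max_steps):
        screen = screenshot_fn()
        # 1. Reasoning / action planning (MLLM): e.g.
        #    {"action": "click", "target": "Search button"} or {"action": "done"}.
        step = planner.plan_next_action(task, screen)
        if step["action"] == "done":
            return True
        if step["action"] == "click":
            # 2. Grounding (e.g. SeeClick): map the textual target
            #    description to (x, y) screen coordinates, then tap.
            x, y = locator.locate_element(screen, step["target"])
            tap(x, y)
    return False  # step budget exhausted without task completion
```

Keeping grounding in a separate model lets the loop report task success rate per episode, which is the metric the evaluation uses.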
EISSN: 2331-8422