Awesome AI for GUI Agents

This list features awesome models, datasets, and research papers for building AI-based GUI agents.

(Image: a robot sitting in front of a screen, typing on the keyboard.)

Contents

  Models
  Datasets
  Research Papers

Models

CogAgent

CogAgent is an open-source visual language model built on CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters and supports image understanding at a resolution of 1120×1120. On top of CogVLM's capabilities, it adds GUI agent capabilities.

https://github.com/THUDM/CogVLM
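
A minimal inference sketch, assuming the Hugging Face checkpoint THUDM/cogagent-chat-hf (loaded with trust_remote_code), the lmsys/vicuna-7b-v1.5 tokenizer, and a CUDA GPU with bfloat16 support, following the usage pattern in the CogVLM repository; consult the repo's demo scripts for the authoritative version:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Assumed checkpoint/tokenizer ids from the CogVLM/CogAgent repository.
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open("screenshot.png").convert("RGB")  # any GUI screenshot
query = "What steps do I need to take to open the settings menu?"

# build_conversation_input_ids is provided by the model's remote code.
features = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image]
)
inputs = {
    "input_ids": features["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": features["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": features["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[features["images"][0].to("cuda").to(torch.bfloat16)]],
    # cross_images carries the high-resolution (1120x1120) branch used by CogAgent.
    "cross_images": [[features["cross_images"][0].to("cuda").to(torch.bfloat16)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]  # keep only new tokens
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```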

Fuyu-8B

Fuyu-8B is a multi-modal text-and-image transformer trained by Adept AI. Architecturally, Fuyu is a vanilla decoder-only transformer that supports arbitrary image resolutions.

https://www.adept.ai/blog/fuyu-8b
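
A minimal sketch of running the model through Hugging Face transformers, assuming a recent transformers release with Fuyu support, the adept/fuyu-8b checkpoint, a GPU, and a local screenshot.png:

```python
from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b", device_map="auto")

image = Image.open("screenshot.png")  # arbitrary resolution is fine
prompt = "Answer the following question about the image: Which button opens the settings?\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated tokens, not the prompt.
new_tokens = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```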

Pix2Struct

Pix2Struct is an image-to-text model pretrained by parsing masked web page screenshots into simplified HTML, with fine-tuned variants for UI tasks such as screen summarization and widget captioning.

https://github.com/google-research/pix2struct
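
A minimal sketch using the transformers integration, assuming the screen-summarization checkpoint google/pix2struct-screen2words-base and a local screenshot.png:

```python
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

# Screen2Words-finetuned checkpoint: generates a short summary of a UI screenshot.
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-screen2words-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-screen2words-base")

image = Image.open("screenshot.png")
inputs = processor(images=image, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```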

Pix2Act

Pix2Act builds on Pix2Struct to follow instructions in GUIs using only pixel-based screenshots as input and generic keyboard-and-mouse actions as output (from the paper "From Pixels to UI Actions" listed below).

https://github.com/google-deepmind/pix2act

UIED (UI Element Detection)

A GUI element detection toolkit that combines classic computer-vision techniques for non-text elements with OCR for text, working purely from screenshots.

https://github.com/MulongXie/UIED

https://github.com/chenjshnn/Object-Detection-for-Graphical-User-Interface

SeeClick

SeeClick is a visual GUI agent that operates on screenshots alone and is enhanced with GUI grounding pre-training; the same repository hosts the ScreenSpot benchmark.

https://github.com/njucckevin/SeeClick

ScreenAgent

CogAgent fine-tuned on the ScreenAgent dataset (see the Datasets section below).

https://github.com/niuzaisheng/ScreenAgent

Datasets

Rico: A Mobile App Dataset for Building Data-Driven Design Applications

http://www.interactionmining.org/rico.html

UI understanding datasets for UIBert

https://github.com/google-research-datasets/uibert

Two datasets extended from the public Rico dataset, which contains 72k mobile app UIs. They add two different types of annotations to these UIs:

  1. In AppSim, each datapoint contains two UIs of a similar category and an annotation of two semantically similar UI elements on them, such as “Menu” buttons that appear on both UIs.
  2. In RefExp, each datapoint contains a UI and a referring expression for a UI element on it, such as “Red button on the top”.

META-GUI

META-GUI is a dataset for training multi-modal conversational agents on mobile GUIs. It consists of 1,125 dialogues, 4,684 dialogue turns, and 18,337 data points in total. Each data point contains the screenshot history, action history, dialogue history, the items appearing on the current screen, and the actions to be performed.

https://x-lance.github.io/META-GUI-Leaderboard/

https://github.com/X-LANCE/META-GUI-baseline
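
An illustrative sketch of how one data point could be represented in Python, based only on the field list above; the names are hypothetical, not the dataset's actual schema (see the baseline repo for the real format):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical field names mirroring the description above,
# not the actual META-GUI schema.
@dataclass
class MetaGuiDataPoint:
    screenshot_history: List[str] = field(default_factory=list)  # paths to prior screenshots
    action_history: List[str] = field(default_factory=list)      # actions already executed
    dialogue_history: List[str] = field(default_factory=list)    # user/agent utterances so far
    screen_items: List[str] = field(default_factory=list)        # items on the current screen
    target_actions: List[str] = field(default_factory=list)      # actions to be performed next

example = MetaGuiDataPoint(
    screenshot_history=["turn1.png", "turn2.png"],
    action_history=["click(weather_app_icon)"],
    dialogue_history=["User: What's the weather in Berlin tomorrow?"],
    screen_items=["search_box", "city_list"],
    target_actions=["type(search_box, 'Berlin')"],
)
print(example)
```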

Android in the Wild (AitW)

Android in the Wild (AitW) is a large-scale dataset for mobile device control that contains human-collected demonstrations of natural language instructions, user interface (UI) screens, and actions for a variety of human tasks.

https://github.com/google-research/google-research/tree/master/android_in_the_wild
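
A minimal inspection sketch, assuming the episodes have been downloaded locally as TFRecord files of tf.train.Example protos as described in the repository; the shard filename is a placeholder:

```python
import tensorflow as tf

# Path to one locally downloaded AitW shard (placeholder filename).
shard = "android_in_the_wild/general_tfrecord-00000-of-00001"

# Adjust compression_type if the shards you downloaded are uncompressed.
dataset = tf.data.TFRecordDataset([shard], compression_type="GZIP")
for raw_record in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    # Print the feature names available in one episode step.
    print(sorted(example.features.feature.keys()))
```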

RICO Semantics

The RICO Semantics dataset consists of around 500k human annotations on the RICO dataset, identifying various icons based on their shapes and semantics, and associating selected general UI elements (such as icons, form fields, radio buttons, and text inputs) with their text labels. The annotations also include human-annotated bounding boxes that are more accurate and cover more UI elements than the bounding boxes from the view hierarchy.

https://github.com/google-research-datasets/rico_semantics

ScreenQA datasets

The dataset contains ~86K questions and answers for ~35K screenshots from the public Rico dataset.

https://github.com/google-research-datasets/screen_qa

UISketch dataset

A dataset of 19,000 hand-drawn sketches of 21 UI element categories.

https://www.kaggle.com/datasets/vinothpandian/uisketch/data

LightShot dataset

13,000 screen captures scraped from prnt.sc.

https://www.kaggle.com/datasets/datasnaek/lightshot

WaveUI-25k

This dataset contains 25k examples of labeled UI elements. It is a subset of a collection of ~80k preprocessed examples assembled from the following sources: WebUI, RoboFlow, GroundUI-18k.

GroundUI-1k & GroundUI-18k

GUI grounding datasets of 1k and 18k annotated examples, respectively, released as part of the AgentStudio benchmark suite.

ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1200 instructions from iOS, Android, macOS, Windows and Web environments, along with annotated element types (Text or Icon/Widget).

https://github.com/njucckevin/SeeClick
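
GUI grounding is typically scored by checking whether a predicted click point falls inside the annotated element's bounding box. An illustrative helper, with hypothetical record fields rather than the benchmark's actual schema:

```python
# Hypothetical example record; field names are illustrative only.
sample = {
    "instruction": "open the settings menu",
    "bbox": (0.82, 0.05, 0.95, 0.12),   # normalized (x1, y1, x2, y2) of the target element
    "element_type": "icon/widget",
}

def click_hits_target(click_xy, bbox):
    """Return True if a predicted (x, y) click lands inside the ground-truth box."""
    x, y = click_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

predicted_click = (0.88, 0.08)  # model output, normalized coordinates
print(click_hits_target(predicted_click, sample["bbox"]))  # True
```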

ScreenAgent

The dataset covers 39 sub-task categories across 6 themes on the Linux and Windows operating systems. It includes 273 complete task sessions, with 203 sessions (3,005 screenshots) for training and 70 sessions (898 screenshots) for testing.

https://github.com/niuzaisheng/ScreenAgent

https://arxiv.org/abs/2402.07945

GUI-World

A comprehensive GUI dataset comprising over 12,000 videos, designed to assess and improve the GUI understanding capabilities of MLLMs. It spans a range of categories and scenarios, including desktop, mobile, and extended reality (XR), and is the first GUI-oriented instruction-tuning dataset in the video domain.

https://gui-world.github.io

WebSRC

WebSRC consists of 400K question-answer pairs collected from 6.4K web pages. Along with the QA pairs, the dataset provides the corresponding HTML source code, screenshots, and metadata. Each question requires a certain structural understanding of the web page to answer, and the answer is either a text span on the page or yes/no.

https://arxiv.org/abs/2101.09465

https://github.com/X-LANCE/WebSRC-Baseline

Research Papers

ScreenAI: A Vision-Language Model for UI and Infographics Understanding (07/2024)

ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots (07/2024)

GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (06/2024)

MUD: Towards a Large-Scale and Noise-Filtered UI Dataset for Modern Style UI Modeling (05/2024)

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (04/2024)

UI Semantic Group Detection: Grouping UI Elements with Similar Semantics in Mobile Graphical User Interface (03/2024)

AgentStudio: A Toolkit for Building General Virtual Agents (03/2024)

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (02/2024)

ScreenAgent: A Vision Language Model-driven Computer Control Agent (02/2024)

CogAgent: A Visual Language Model for GUI Agents (12/2023)

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces (12/2023)

Android in the Wild: A Large-Scale Dataset for Android Device Control (10/2023)

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (06/2023)

Towards Better Semantic Understanding of Mobile Interfaces (10/2022)

META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI (05/2022)

VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling (12/2021)

WebSRC: A Dataset for Web-Based Structural Reading Comprehension (11/2021)

UIBert: Learning Generic Multimodal Representations for UI Understanding (07/2021)

ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces (01/2021)

Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels (01/2021)

Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination? (09/2020)