Code for the paper: Harnessing Webpage UIs for Text-Rich Visual Understanding

About MultiUI

MultiUI is a dataset of 7.3 million samples spanning various UI types and tasks, structured using enhanced accessibility trees and task taxonomies.
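The sketch below illustrates the shape of one such sample: a webpage screenshot paired with a task derived from the page's accessibility tree. The field names are assumptions for illustration, not the official schema.

# Hypothetical MultiUI-style sample (illustrative only; field names are assumptions)
sample = {
    "image": "screenshots/example_page.png",  # rendered webpage screenshot
    "task_type": "element_grounding",         # one entry in the task taxonomy
    "instruction": "Return the bounding box of the 'Sign in' button.",
    "answer": "[0.82, 0.05, 0.95, 0.10]",
}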

Repository Structure

This repository is divided into two parts:

  • Train: contains training code for LLaVA-OneVision, the base model we used.

  • Evaluation: contains evaluation code for all benchmarks we tested in the paper.

Dataset Download

  • MultiUI: Download our 7.3-million-sample training dataset from Hugging Face.
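A minimal download sketch using the huggingface_hub client; the repo id neulab/MultiUI and the local path are assumptions, so substitute the dataset id linked from this README:

# Sketch: fetch the MultiUI training data from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="neulab/MultiUI",   # assumed repo id; check the dataset page
    repo_type="dataset",
    local_dir="data/MultiUI",   # hypothetical local path
)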

Model Checkpoints

| Model Name    | LLM               | Vision Tower              | Checkpoint                |
|---------------|-------------------|---------------------------|---------------------------|
| UIX-Qwen2     | Qwen2-7B-Instruct | siglip-so400m-patch14-384 | neulab/UIX-Qwen2          |
| UIX-Qwen2-M2W | Qwen2-7B-Instruct | siglip-so400m-patch14-384 | neulab/UIX-Qwen2-Mind2Web |
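To fetch a checkpoint locally before evaluation, a minimal sketch with huggingface_hub (the local path is an arbitrary choice, not a path the evaluation scripts require):

# Sketch: download a released checkpoint for local evaluation.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="neulab/UIX-Qwen2",         # or "neulab/UIX-Qwen2-Mind2Web"
    local_dir="checkpoints/UIX-Qwen2",  # hypothetical local path
)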

Run Evaluation

VisualWebBench

To evaluate on VisualWebBench-related tasks:

cd eval/VisualWebBench
bash run.sh

lmms-eval-MultiUI

We evaluate GUI understanding and grounding benchmarks (WebSRC, ScreenQA-short, WidgetCap, ScreenSpot, RefExp), OCR-, document-, and chart-related QA benchmarks (DocVQA, ChartQA, TextVQA, InfoVQA, VisualMRC, OCRBench), and the general grounding benchmark RefCOCO+ with the lmms-eval framework.

To evaluate these datasets:

cd eval/lmms-eval-MultiUI
model=MODEL_NAME
model_type=MODEL_TYPE
task=TASK_NAME  # the lmms-eval task id for one of the benchmarks above
python3 -m accelerate.commands.launch \
         --num_processes=8 \
         -m lmms_eval \
         --model $model_type \
         --model_args pretrained=$model,conv_template=qwen_2 \
         --tasks ${task} \
         --batch_size 1 \
         --log_samples \
         --log_samples_suffix ${task} \
         --output_path eval_logs

Mind2Web Evaluation

Download our processed Mind2Web evaluation dataset from Hugging Face and place it under eval/Mind2Web-SeeAct/src/offline_experiments/screenshot_generation/data.
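A minimal sketch of that step with huggingface_hub; the repo id below is a placeholder, so substitute the dataset id linked from this README:

# Sketch: download the processed Mind2Web evaluation data into the expected folder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="MIND2WEB_EVAL_DATASET_ID",  # placeholder, not a real repo id
    repo_type="dataset",
    local_dir="eval/Mind2Web-SeeAct/src/offline_experiments/screenshot_generation/data",
)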

Run inference

cd eval/Mind2Web-SeeAct/src/offline_experiments/

python eval_m2w.py \
--model_name MODEL_NAME \
--model_path MODEL_PATH \
--task_types test_task  # one of: test_task, test_website, test_domain

Calculate metrics

python ./action_generation/metric.py

Dataset Disclaimer

The MultiUI dataset is released for open-source use by the research and developer community. The data is largely sourced from publicly available web content or generated by large language models (LLMs). We constructed this dataset using links from Hugging Face’s FineWeb dataset, which is based on a Common Crawl dump, representing publicly accessible data from the web.

This dataset is intended primarily for research purposes; it may contain inaccuracies, biases, or other unintended issues. We do not intentionally include any copyrighted material, and any resemblance to such content is unintentional.

If you have any concerns regarding specific data or believe that any content should be removed, please contact us, and we will review the request and take appropriate action.