Gemini 2.0 Computer Use

Setup

This project uses PDM to manage Python dependencies. Installation instructions for PDM are available at https://pdm-project.org/en/latest/

Clone the project repository and install dependencies with:

git clone https://github.com/maxi-w/gemini-computer-use.git
cd gemini-computer-use
pdm install

Set the GOOGLE_API_KEY environment variable to your API key to use Gemini:

export GOOGLE_API_KEY=YOUR_API_KEY
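
If the variable is missing, the failure can be hard to spot. As a minimal sketch, a script could check for the key up front. The helper name `load_api_key` is hypothetical and is not part of this project:

```python
import os


def load_api_key(env=os.environ):
    """Return the Gemini API key, or raise a clear error if it is unset.

    Hypothetical helper for illustration; this project may read the key differently.
    """
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; export it before running the agent."
        )
    return key
```

Passing the environment as a parameter keeps the check easy to test without touching the real shell environment.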

Run your computer agent with a goal:

pdm run start "search for cat images with google"

Simple implementation of screenshot understanding and computer tool use.
Improve the structure of actions, e.g. with JSON mode.
Check out tool use as described in the Gemini SDK docs.
Improve prompt to prevent some unwanted behaviours.
Explore different grounding info formats (2d box vs point, order of coordinates, scaling).
Make the agent decide when it's done with the task.
Explore the Multimodal Live API for screen input (see its docs).
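
To make the JSON-mode and grounding-format items above more concrete, here is a rough sketch of how a structured click action could be parsed and mapped to screen pixels. Everything in it is an assumption rather than this project's implementation: the action schema (`action`, `label`, `box_2d`), the names `ClickAction`, `box_to_pixel_center`, and `parse_click_action`, and the premise that the model returns boxes as [ymin, xmin, ymax, xmax] normalized to a 0-1000 range. Verify the coordinate order and scale against the model version you use.

```python
import json
from dataclasses import dataclass


@dataclass
class ClickAction:
    """Hypothetical action schema; not the format this project uses."""
    label: str
    x: int  # pixel column of the click target
    y: int  # pixel row of the click target


def box_to_pixel_center(box_2d, screen_width, screen_height, scale=1000):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to a pixel point.

    Assumes coordinates are normalized to 0..scale (assumed 1000 here).
    """
    ymin, xmin, ymax, xmax = box_2d
    x = round((xmin + xmax) / 2 / scale * screen_width)
    y = round((ymin + ymax) / 2 / scale * screen_height)
    # Clamp so a box touching the right or bottom edge maps to a valid pixel.
    return min(x, screen_width - 1), min(y, screen_height - 1)


def parse_click_action(raw_json, screen_width, screen_height):
    """Parse a JSON click action and resolve its box to pixel coordinates."""
    data = json.loads(raw_json)
    if data.get("action") != "click":
        raise ValueError(f"unsupported action: {data.get('action')!r}")
    x, y = box_to_pixel_center(data["box_2d"], screen_width, screen_height)
    return ClickAction(label=data.get("label", ""), x=x, y=y)
```

For example, `parse_click_action('{"action": "click", "label": "Search", "box_2d": [400, 250, 600, 750]}', 1920, 1080)` resolves to the center of a 1920x1080 screen. Using a point instead of a box would skip the averaging step, which is one of the trade-offs the grounding-format item suggests exploring.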

Note: Feel free to open an issue to discuss improvements.
