This project uses PDM to manage Python dependencies. Find installation instructions here: https://pdm-project.org/en/latest/.
Clone the project repository and install dependencies with:
git clone https://github.com/maxi-w/gemini-computer-use.git
cd gemini-computer-use
pdm install
Set the GOOGLE_API_KEY environment variable to your API key so the agent can use Gemini:
export GOOGLE_API_KEY=YOUR_API_KEY
Run your computer agent with a goal:
pdm run start "search for cat images with google"
- Simple implementation of screenshot understanding and computer tool use.
- Improve the structure of actions, e.g. with JSON mode.
- Check out tool use as described in the Gemini SDK Docs.
- Improve prompt to prevent some unwanted behaviours.
- Explore different grounding info formats (2d box vs point, order of coordinates, scaling).
- Make the agent decide when it's done with the task.
- Explore the Multimodal Live API for screen input (Docs).
Note: Feel free to open an issue to discuss improvements.
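As a starting point for the items on structured actions and grounding formats, here is a minimal sketch of how a JSON reply could be validated and turned into a pixel coordinate. It is not the project's current implementation. The action names, the `Action` class, and the `parse_action` and `box_center_px` helpers are hypothetical. It also assumes boxes arrive as `[ymin, xmin, ymax, xmax]` scaled to 0-1000, which should be checked against what the model actually returns.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical action vocabulary; the project may use different names.
ALLOWED_ACTIONS = {"click", "type", "done"}

Box = Tuple[float, float, float, float]  # assumed order: (ymin, xmin, ymax, xmax), 0-1000


@dataclass
class Action:
    name: str
    box_2d: Optional[Box] = None
    text: Optional[str] = None


def parse_action(raw: str) -> Action:
    """Parse a JSON reply into an Action, raising ValueError on malformed input."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    name = data.get("action")
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")

    box = data.get("box_2d")
    if box is not None:
        valid = (
            isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) and 0 <= v <= 1000 for v in box)
        )
        if not valid:
            raise ValueError(f"box_2d must be four numbers in 0-1000, got {box!r}")
        box = tuple(box)

    if name == "click" and box is None:
        raise ValueError("click action requires box_2d")

    return Action(name=name, box_2d=box, text=data.get("text"))


def box_center_px(box: Box, width: int, height: int) -> Tuple[int, int]:
    """Convert a 0-1000 normalized box to the pixel (x, y) of its center."""
    ymin, xmin, ymax, xmax = box
    # Center in normalized units is (min + max) / 2; dividing by 1000 rescales
    # it to a fraction of the screen, hence the combined divisor of 2000.
    x = int((xmin + xmax) * width / 2000)
    y = int((ymin + ymax) * height / 2000)
    # Clamp so the point always lies inside the screen.
    return min(max(x, 0), width - 1), min(max(y, 0), height - 1)


if __name__ == "__main__":
    reply = '{"action": "click", "box_2d": [100, 200, 300, 400]}'
    action = parse_action(reply)
    print(box_center_px(action.box_2d, width=1920, height=1080))  # (576, 216)
```

Validating the reply before acting means a malformed or unexpected answer surfaces as a `ValueError` rather than a misplaced click. Keeping the coordinate conversion in one function makes it easier to try other grounding formats (a point instead of a box, a different coordinate order, or a different scale).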