feat: implement anthropic-style computer tool (#225)
* feat: started working on anthropic-style computer tool

* Apply suggestions from code review

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* fix: progress on computer use

* fix: added Dockerfile.server

* fix: fixed vnc in computer use webui

* docs: fixed docs for computer use

* fix: rewrote computer_action function to not be a generator

* docs: fixed server docs for computer use

* docs: refactored computer use warning into separate file

* fix: optimized Dockerfile.computer for faster rebuilds

* fix: refactor and misc fixes to computer use

* fix: enable select tools in computer use context

* fix: multiple fixes to computer use and web ui

* fix: disable computer tool unless explicitly enabled

* fix: removed deleted file from .dockerignore

* docs: minor fix to computer use docs

---------

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
ErikBjare and ellipsis-dev[bot] authored Nov 1, 2024
1 parent a6b41aa commit 175167e
Showing 34 changed files with 965 additions and 47 deletions.
1 change: 1 addition & 0 deletions .dockerignore
@@ -10,6 +10,7 @@ gptme.toml

# Build scripts
scripts
!scripts/computer_home
.github

# Build/test/coverage/docs/prof directories
5 changes: 5 additions & 0 deletions Makefile
@@ -16,7 +16,12 @@ build:

build-docker:
	docker build . -t gptme:latest -f scripts/Dockerfile
	docker build . -t gptme-server:latest -f scripts/Dockerfile.server
	docker build . -t gptme-eval:latest -f scripts/Dockerfile.eval
	# docker build . -t gptme-eval:latest -f scripts/Dockerfile.eval --build-arg RUST=yes --build-arg BROWSER=yes

build-docker-computer:
	docker build . -t gptme-computer:latest -f scripts/Dockerfile.computer

build-docker-dev:
	docker build . -t gptme-dev:latest -f scripts/Dockerfile.dev
4 changes: 4 additions & 0 deletions docs/computer-use-warning.rst
@@ -0,0 +1,4 @@
.. warning::

    The computer use interface is experimental and has serious security implications.
    Please use with caution and see Anthropic's documentation on `computer use <https://docs.anthropic.com/en/docs/build-with-claude/computer-use>`_ for additional guidance.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -118,6 +118,7 @@ def setup(app):
("py:class", "flask.app.Flask"),
("py:class", "gptme.tools.python.T"),
("py:class", "threading.Thread"),
("py:class", "gptme.tools.computer.ScalingSource"),
]

# -- Options for HTML output -------------------------------------------------
4 changes: 2 additions & 2 deletions docs/contributing.rst
@@ -17,10 +17,10 @@ Install
# checkout the code and navigate to the root of the project
git clone https://github.com/ErikBjare/gptme.git
cd gptme
# install poetry (if not installed)
pipx install poetry
# activate the virtualenv
poetry shell
27 changes: 27 additions & 0 deletions docs/examples.rst
@@ -64,3 +64,30 @@ Generate docstrings for all functions in a file:
gptme --non-interactive "Patch these files to include concise docstrings for all functions, skip functions that already have docstrings. Include: brief description, parameters." $@
These examples demonstrate how gptme can be used to create simple yet powerful automation tools. Each script can be easily customized and expanded to fit specific project needs.

.. rubric:: Computer Use Examples

Using the computer tool for GUI automation and desktop interaction (requires running the server with computer use support):

.. code-block:: bash

    # Start server with computer use support
    docker run -p 5000:5000 -p 8080:8080 -p 6080:6080 ghcr.io/erikbjare/gptme:latest-server

    # Then in another terminal:

    # Open and interact with an application
    gptme 'open firefox and navigate to example.com'

    # GUI automation with visual feedback
    gptme 'create a simple drawing in xpaint'

    # Desktop automation with keyboard/mouse
    gptme 'open calculator and compute 15 * 23'

The computer use interface at http://localhost:8080 provides a split view with:

- Chat interface on the left
- Desktop view on the right
- Controls for toggling interaction mode

This enables complex GUI automation tasks with visual feedback and confirmation.
47 changes: 42 additions & 5 deletions docs/server.rst
@@ -16,12 +16,49 @@ It can be started by running the following command:
Web UI
------

.. code-block:: bash
The server provides two interfaces:

gptme-server
1. Basic Chat Interface

.. code-block:: bash
gptme-server
Access the basic chat interface at http://localhost:5000

For more usage, see :ref:`the CLI documentation <cli:gptme-server>`.

2. Computer Use Interface

Requires Docker.

.. code-block:: bash

    # Clone the repository
    git clone https://github.com/ErikBjare/gptme.git
    cd gptme

    # Build container
    make build-docker-computer

    # Run container
    docker run -v ~/.config/gptme:/home/computeruse/.config/gptme -p 6080:6080 -p 8080:8080 gptme-computer:latest

The computer use interface provides:

- Combined view at http://localhost:8080/computer
- Chat view at http://localhost:8080
- Desktop view at http://localhost:6080/vnc.html

Features:

- Split view with chat on the left, desktop on the right
- Toggle for view-only/interactive desktop mode
- Fullscreen support
- Automatic screen scaling for optimal LLM vision

This should let you view your chats in a web browser and make basic requests.
Requirements:

You can then access the web UI by visiting http://localhost:5000 in your browser.
- Docker for running the server with X11 support
- Browser with WebSocket support for VNC
- Network ports 6080 (VNC) and 8080 (web UI) available
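
The port requirement above can be checked before starting the container. The following is a minimal sketch, not part of gptme: ``port_is_free`` is a hypothetical helper, and it assumes the ports are published on ``127.0.0.1``.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Binding fails with OSError if another process holds the port.
            sock.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # 6080 (VNC) and 8080 (web UI) are the ports listed in the requirements.
    for port in (6080, 8080):
        status = "free" if port_is_free(port) else "in use"
        print(f"port {port}: {status}")
```

If either port is reported as in use, mapping it to a different host port with ``-p`` in the ``docker run`` command is one way around the conflict.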

For more usage, see :ref:`the CLI documentation <cli:gptme-server>`.
.. include:: computer-use-warning.rst
36 changes: 36 additions & 0 deletions docs/tools.rst
@@ -26,6 +26,7 @@ The tools can be grouped into the following categories:

- `Screenshot`_
- `Vision`_
- `Computer`_

- Chat management

@@ -107,3 +108,38 @@ Chats
.. automodule:: gptme.tools.chats
    :members:
    :noindex:

Computer
--------

.. automodule:: gptme.tools.computer
    :members:
    :noindex:

The computer tool provides direct interaction with the desktop environment through X11, allowing for:

- Keyboard input simulation
- Mouse control (movement, clicks, dragging)
- Screen capture with automatic scaling
- Cursor position tracking

To use the computer tool, see the instructions for :doc:`server`.

Example usage::

    # Type text
    computer(action="type", text="Hello, World!")

    # Move mouse and click
    computer(action="mouse_move", coordinate=(100, 100))
    computer(action="left_click")

    # Take screenshot
    computer(action="screenshot")

    # Send keyboard shortcuts
    computer(action="key", text="Control_L+c")

The tool automatically handles screen resolution scaling to ensure optimal performance with LLM vision capabilities.
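
To illustrate what resolution scaling involves, here is a minimal sketch of mapping a coordinate between two resolutions. It is not taken from ``gptme.tools.computer``: the function name ``scale_point`` and the example resolutions are assumptions, and the real tool may choose target sizes and rounding differently.

```python
from typing import Tuple

Size = Tuple[int, int]
Point = Tuple[int, int]


def scale_point(point: Point, src_size: Size, dst_size: Size) -> Point:
    """Map (x, y) from a src_size screen to the same relative spot on dst_size."""
    x, y = point
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    # Scale each axis independently and round to the nearest pixel.
    return round(x * dst_w / src_w), round(y * dst_h / src_h)


if __name__ == "__main__":
    # A click at the centre of a 1920x1080 screen, expressed in a
    # hypothetical 1280x720 screenshot that was sent to the model.
    print(scale_point((960, 540), (1920, 1080), (1280, 720)))  # (640, 360)
```

The same function covers both directions: screen to screenshot when capturing, and screenshot back to screen when the model returns a coordinate to click.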

.. include:: computer-use-warning.rst
12 changes: 12 additions & 0 deletions gptme/server/api.py
@@ -148,6 +148,18 @@ def root():
    return current_app.send_static_file("index.html")


# serve computer interface
@api.route("/computer")
def computer():
    return current_app.send_static_file("computer.html")


# serve chat interface (for embedding in computer view)
@api.route("/chat")
def chat():
    return current_app.send_static_file("index.html")


@api.route("/favicon.png")
def favicon():
    return flask.send_from_directory(media_path, "logo.png")
28 changes: 25 additions & 3 deletions gptme/server/cli.py
@@ -16,14 +16,36 @@
     default=None,
     help="Model to use by default, can be overridden in each request.",
 )
-def main(debug: bool, verbose: bool, model: str | None):  # pragma: no cover
+@click.option(
+    "--host",
+    default="127.0.0.1",
+    help="Host to bind the server to.",
+)
+@click.option(
+    "--port",
+    default="5000",
+    help="Port to run the server on.",
+)
+@click.option("--tools", default=None, help="Tools to enable, comma separated.")
+def main(
+    debug: bool,
+    verbose: bool,
+    model: str | None,
+    host: str,
+    port: str,
+    tools: str | None,
+):  # pragma: no cover
     """
     Starts a server and web UI for gptme.
     Note that this is very much a work in progress, and is not yet ready for normal use.
     """
     init_logging(verbose)
-    init(model, interactive=False, tool_allowlist=None)
+    init(
+        model,
+        interactive=False,
+        tool_allowlist=None if tools is None else tools.split(","),
+    )

     # if flask not installed, ask the user to install `server` extras
     try:
@@ -37,4 +59,4 @@ def main(debug: bool, verbose: bool, model: str | None): # pragma: no cover
     click.echo("Initialization complete, starting server")

     app = create_app()
-    app.run(debug=debug)
+    app.run(debug=debug, host=host, port=int(port))
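
The `--tools` handling in this file can be sketched in isolation. `parse_tool_allowlist` below is a hypothetical helper, not a function in gptme. Unlike the plain `tools.split(",")` in the diff, it also strips whitespace and drops empty entries, which is an assumption about what a caller would want rather than the committed behaviour.

```python
from typing import List, Optional


def parse_tool_allowlist(tools: Optional[str]) -> Optional[List[str]]:
    """Turn a comma-separated --tools value into a list, or None if unset."""
    if tools is None:
        # No flag given: None means "do not restrict tools".
        return None
    # Assumption: tolerate spaces and stray commas in the flag value.
    return [name.strip() for name in tools.split(",") if name.strip()]


if __name__ == "__main__":
    print(parse_tool_allowlist("shell, python,computer"))
    print(parse_tool_allowlist(None))
```

Keeping `None` distinct from an empty list matters here, since the diff passes `None` to `init` when the flag is absent.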
92 changes: 92 additions & 0 deletions gptme/server/static/computer.html
@@ -0,0 +1,92 @@
<!DOCTYPE html>
<html>
<head>
<title>gptme - Computer Use</title>
<meta name="permissions-policy" content="fullscreen=*" />
<style>
body {
margin: 0;
padding: 0;
overflow: hidden;
font-family: system-ui, -apple-system, sans-serif;
}
.container {
display: flex;
height: 100vh;
width: 100vw;
}
.chat {
flex: 1;
border: none;
height: 100vh;
background: #f5f5f5;
}
.desktop {
flex: 2;
border: none;
height: 100vh;
}
.controls {
position: absolute;
top: 10px;
right: 10px;
z-index: 1000;
display: flex;
gap: 10px;
}
button {
padding: 8px 16px;
border-radius: 4px;
border: 1px solid #ccc;
background: white;
cursor: pointer;
font-size: 14px;
}
button:hover {
background: #f0f0f0;
}
</style>
</head>
<body>
<div class="container">
<div class="chat">
<!-- Will be replaced with gptme chat interface -->
<iframe src="/chat" style="width: 100%; height: 100%; border: none;"></iframe>
</div>
<iframe
id="vnc"
class="desktop"
src="http://127.0.0.1:6080/vnc.html?&resize=scale&autoconnect=1&view_only=1&reconnect=1&reconnect_delay=2000"
allow="fullscreen"
></iframe>
</div>
<div class="controls">
<button id="toggleViewOnly">Toggle Screen Control (Off)</button>
<button id="toggleFullscreen">Toggle Fullscreen</button>
</div>
<script>
// Toggle view-only mode
document.getElementById("toggleViewOnly").addEventListener("click", function() {
var vncIframe = document.getElementById("vnc");
var button = document.getElementById("toggleViewOnly");
var currentSrc = vncIframe.src;
if (currentSrc.includes("view_only=1")) {
vncIframe.src = currentSrc.replace("view_only=1", "view_only=0");
button.innerText = "Toggle Screen Control (On)";
} else {
vncIframe.src = currentSrc.replace("view_only=0", "view_only=1");
button.innerText = "Toggle Screen Control (Off)";
}
});

// Toggle fullscreen
document.getElementById("toggleFullscreen").addEventListener("click", function() {
if (!document.fullscreenElement) {
document.documentElement.requestFullscreen();
} else {
document.exitFullscreen();
}
});
</script>
</body>
</html>
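
The view-only toggle in this page flips `view_only` by string replacement on the iframe URL. As a rough illustration of the same idea using query-string parsing, here is a sketch in Python rather than the page's JavaScript. `toggle_view_only` is a hypothetical name, and treating a missing parameter as `view_only=0` is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def toggle_view_only(url: str) -> str:
    """Flip the view_only query parameter between "1" and "0"."""
    parts = urlsplit(url)
    toggled = []
    seen = False
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "view_only":
            value = "0" if value == "1" else "1"
            seen = True
        toggled.append((key, value))
    if not seen:
        # Assumption: an absent parameter means interactive mode,
        # so toggling switches to view-only.
        toggled.append(("view_only", "1"))
    return urlunsplit(parts._replace(query=urlencode(toggled)))


if __name__ == "__main__":
    src = (
        "http://127.0.0.1:6080/vnc.html?&resize=scale&autoconnect=1"
        "&view_only=1&reconnect=1&reconnect_delay=2000"
    )
    print(toggle_view_only(src))
```

Parsing the query avoids accidentally matching `view_only=1` inside another parameter's value, at the cost of a few more lines than a string replace.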
3 changes: 2 additions & 1 deletion gptme/server/static/index.html
@@ -151,8 +151,9 @@ <h1 class="text-2xl font-bold text-gray-800 mb-2 md:mb-0">{{ selectedConversatio
 <textarea
   class="flex-grow border rounded-lg p-3 mb-2 md:mb-0 md:mr-2 focus:outline-none focus:ring-2 focus:ring-blue-500 resize-none"
   v-model="newMessage"
-  placeholder="Type your message"
+  placeholder="Type your message (Enter to send, Shift+Enter for newline)"
   rows="3"
+  @keydown="handleKeyDown"
 ></textarea>
 <button
   type="submit"