feat: implement anthropic-style computer tool #225

Merged
merged 16 commits into from
Nov 1, 2024
Changes from 14 commits
2 changes: 2 additions & 0 deletions .dockerignore
@@ -10,6 +10,8 @@ gptme.toml

# Build scripts
scripts
!scripts/start_x11.sh
!scripts/computer_home
.github

# Build/test/coverage/docs/prof directories
5 changes: 5 additions & 0 deletions Makefile
@@ -16,7 +16,12 @@ build:

build-docker:
	docker build . -t gptme:latest -f scripts/Dockerfile
	docker build . -t gptme-server:latest -f scripts/Dockerfile.server
	docker build . -t gptme-eval:latest -f scripts/Dockerfile.eval
	# docker build . -t gptme-eval:latest -f scripts/Dockerfile.eval --build-arg RUST=yes --build-arg BROWSER=yes

build-docker-computer:
	docker build . -t gptme-computer:latest -f scripts/Dockerfile.computer

build-docker-dev:
	docker build . -t gptme-dev:latest -f scripts/Dockerfile.dev
4 changes: 4 additions & 0 deletions docs/computer-use-warning.rst
@@ -0,0 +1,4 @@
.. warning::

    The computer use interface is experimental and has serious security implications.
    Please use with caution and see Anthropic's documentation on `computer use <https://docs.anthropic.com/en/docs/build-with-claude/computer-use>`_ for additional guidance.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -118,6 +118,7 @@ def setup(app):
    ("py:class", "flask.app.Flask"),
    ("py:class", "gptme.tools.python.T"),
    ("py:class", "threading.Thread"),
    ("py:class", "gptme.tools.computer.ScalingSource"),
]

# -- Options for HTML output -------------------------------------------------
27 changes: 27 additions & 0 deletions docs/examples.rst
@@ -64,3 +64,30 @@ Generate docstrings for all functions in a file:
gptme --non-interactive "Patch these files to include concise docstrings for all functions, skip functions that already have docstrings. Include: brief description, parameters." $@

These examples demonstrate how gptme can be used to create simple yet powerful automation tools. Each script can be easily customized and expanded to fit specific project needs.

.. rubric:: Computer Use Examples

Using the computer tool for GUI automation and desktop interaction (requires running the server with computer use support):

.. code-block:: bash

    # Start server with computer use support
    docker run -p 5000:5000 -p 8080:8080 -p 6080:6080 ghcr.io/erikbjare/gptme:latest-server

    # Then in another terminal:

    # Open and interact with an application
    gptme 'open firefox and navigate to example.com'

    # GUI automation with visual feedback
    gptme 'create a simple drawing in xpaint'

    # Desktop automation with keyboard/mouse
    gptme 'open calculator and compute 15 * 23'

The computer use interface at http://localhost:8080 provides a split view with:

- Chat interface on the left
- Desktop view on the right
- Controls for toggling interaction mode

This enables complex GUI automation tasks with visual feedback and confirmation.
47 changes: 42 additions & 5 deletions docs/server.rst
@@ -16,12 +16,49 @@ It can be started by running the following command:
 Web UI
 ------
 
-.. code-block:: bash
+The server provides two interfaces:
 
-    gptme-server
+1. Basic Chat Interface
 
+   .. code-block:: bash
+
+       gptme-server
+
+   Access the basic chat interface at http://localhost:5000
+
+   For more usage, see :ref:`the CLI documentation <cli:gptme-server>`.
+
+2. Computer Use Interface
+
+   Requires Docker.
+
+   .. code-block:: bash
+
+       # Clone the repository
+       git clone https://github.com/ErikBjare/gptme.git
+       cd gptme
+       # Build container
+       make build-docker-computer
+       # Run container
+       docker run -v ~/.config/gptme:/home/computeruse/.config/gptme -p 5000:5000 -p 6080:6080 -p 8080:8080 gptme-computer:latest
+
+   The computer use interface provides:
+
+   - Combined view at http://localhost:8080/computer
+   - Chat view at http://localhost:8080
+   - Desktop view at http://localhost:6080/vnc.html
+
+   Features:
+
+   - Split view with chat on the left, desktop on the right
+   - Toggle for view-only/interactive desktop mode
+   - Fullscreen support
+   - Automatic screen scaling for optimal LLM vision
+
-This should let you view your chats in a web browser and make basic requests.
+   Requirements:
 
-You can then access the web UI by visiting http://localhost:5000 in your browser.
+   - Docker for running the server with X11 support
+   - Browser with WebSocket support for VNC
+   - Network ports 6080 (VNC) and 8080 (web UI) available
 
-For more usage, see :ref:`the CLI documentation <cli:gptme-server>`.
+.. include:: computer-use-warning.rst
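
The requirements in this hunk ask for ports 6080 and 8080 to be available, and the ``docker run`` command also publishes port 5000. A minimal Python sketch of a pre-flight check follows. ``port_is_free`` and ``busy_ports`` are hypothetical helper names, not part of gptme, and the check only sees listeners reachable on the loopback address.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if no listener accepts a TCP connection on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, meaning the port is in use.
        return sock.connect_ex((host, port)) != 0


def busy_ports(ports=(5000, 6080, 8080)) -> list[int]:
    """Return the ports from `ports` that already have a listener (assumed defaults)."""
    return [p for p in ports if not port_is_free(p)]


if __name__ == "__main__":
    taken = busy_ports()
    if taken:
        print("Ports already in use:", taken)
    else:
        print("Ports 5000, 6080 and 8080 look free")
```

A connection attempt is used rather than a bind so the check does not briefly occupy the port itself. A firewalled or filtered port would time out and be reported as free, which is a limitation of this sketch.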
36 changes: 36 additions & 0 deletions docs/tools.rst
@@ -26,6 +26,7 @@ The tools can be grouped into the following categories:

- `Screenshot`_
- `Vision`_
- `Computer`_

- Chat management

@@ -107,3 +108,38 @@ Chats
.. automodule:: gptme.tools.chats
    :members:
    :noindex:

Computer
--------

.. automodule:: gptme.tools.computer
    :members:
    :noindex:

The computer tool provides direct interaction with the desktop environment through X11, allowing for:

- Keyboard input simulation
- Mouse control (movement, clicks, dragging)
- Screen capture with automatic scaling
- Cursor position tracking

To use the computer tool, see the instructions for :doc:`server`.

Example usage::

    # Type text
    computer(action="type", text="Hello, World!")

    # Move mouse and click
    computer(action="mouse_move", coordinate=(100, 100))
    computer(action="left_click")

    # Take screenshot
    computer(action="screenshot")

    # Send keyboard shortcuts
    computer(action="key", text="Control_L+c")

The tool automatically handles screen resolution scaling to ensure optimal performance with LLM vision capabilities.

.. include:: computer-use-warning.rst
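
The docs above mention automatic resolution scaling, and ``docs/conf.py`` references a ``ScalingSource`` class, but neither hunk shows how the scaling is done. The following is only a sketch of the general idea, assuming coordinates are scaled independently per axis between the physical screen size and a smaller target size sent to the model. ``Direction``, ``scale_point`` and the 1280x720 target are hypothetical and chosen for illustration; they are not taken from ``gptme.tools.computer``.

```python
from enum import Enum


class Direction(Enum):
    """Which way a coordinate is being converted (hypothetical names)."""

    MODEL_TO_SCREEN = "model_to_screen"  # model output -> real screen pixels
    SCREEN_TO_MODEL = "screen_to_model"  # real screen pixels -> scaled image


def scale_point(x, y, screen, target, direction):
    """Convert (x, y) between screen space and the scaled target space.

    `screen` and `target` are (width, height) tuples.
    """
    factor_x = target[0] / screen[0]
    factor_y = target[1] / screen[1]
    if direction is Direction.MODEL_TO_SCREEN:
        # The model saw the downscaled image, so its coordinates are scaled back up.
        return round(x / factor_x), round(y / factor_y)
    return round(x * factor_x), round(y * factor_y)


if __name__ == "__main__":
    screen, target = (2560, 1440), (1280, 720)
    print(scale_point(100, 100, screen, target, Direction.MODEL_TO_SCREEN))  # (200, 200)
    print(scale_point(200, 200, screen, target, Direction.SCREEN_TO_MODEL))  # (100, 100)
```

Under these assumptions, a click the model places at (100, 100) on the downscaled screenshot lands at (200, 200) on the real display.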
12 changes: 12 additions & 0 deletions gptme/server/api.py
@@ -148,6 +148,18 @@ def root():
    return current_app.send_static_file("index.html")


# serve computer interface
@api.route("/computer")
def computer():
    return current_app.send_static_file("computer.html")


# serve chat interface (for embedding in computer view)
@api.route("/chat")
def chat():
    return current_app.send_static_file("index.html")


@api.route("/favicon.png")
def favicon():
    return flask.send_from_directory(media_path, "logo.png")
28 changes: 25 additions & 3 deletions gptme/server/cli.py
@@ -16,14 +16,36 @@
     default=None,
     help="Model to use by default, can be overridden in each request.",
 )
-def main(debug: bool, verbose: bool, model: str | None):  # pragma: no cover
+@click.option(
+    "--host",
+    default="127.0.0.1",
+    help="Host to bind the server to.",
+)
+@click.option(
+    "--port",
+    default="5000",
+    help="Port to run the server on.",
+)
+@click.option("--tools", default=None, help="Tools to enable, comma separated.")
+def main(
+    debug: bool,
+    verbose: bool,
+    model: str | None,
+    host: str,
+    port: str,
+    tools: str | None,
+):  # pragma: no cover
     """
     Starts a server and web UI for gptme.
 
     Note that this is very much a work in progress, and is not yet ready for normal use.
     """
     init_logging(verbose)
-    init(model, interactive=False, tool_allowlist=None)
+    init(
+        model,
+        interactive=False,
+        tool_allowlist=None if tools is None else tools.split(","),
+    )
 
     # if flask not installed, ask the user to install `server` extras
     try:
@@ -37,4 +59,4 @@ def main(debug: bool, verbose: bool, model: str | None):  # pragma: no cover
     click.echo("Initialization complete, starting server")
 
     app = create_app()
-    app.run(debug=debug)
+    app.run(debug=debug, host=host, port=int(port))
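
The ``--tools`` option is turned into an allowlist with ``tools.split(",")``. The sketch below isolates that expression to show how it behaves on a few inputs, next to a more lenient variant. Both function names are hypothetical, the lenient variant is not part of this change, and the tool names in the example are only illustrative.

```python
from __future__ import annotations


def parse_tool_allowlist(tools: str | None) -> list[str] | None:
    """Same expression as in the diff: None stays None, otherwise split on commas."""
    return None if tools is None else tools.split(",")


def parse_tool_allowlist_lenient(tools: str | None) -> list[str] | None:
    """Hypothetical variant that strips whitespace and drops empty entries."""
    if tools is None:
        return None
    return [name.strip() for name in tools.split(",") if name.strip()]


if __name__ == "__main__":
    print(parse_tool_allowlist(None))                        # None
    print(parse_tool_allowlist("shell,computer"))            # ['shell', 'computer']
    print(parse_tool_allowlist("shell, computer"))           # ['shell', ' computer']
    print(parse_tool_allowlist_lenient("shell, computer,"))  # ['shell', 'computer']
```

With the plain split, a value such as ``--tools "shell, computer"`` yields an entry with a leading space, so the comma-separated list is best passed without spaces.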
92 changes: 92 additions & 0 deletions gptme/server/static/computer.html
@@ -0,0 +1,92 @@
<!DOCTYPE html>
<html>
<head>
    <title>gptme - Computer Use</title>
    <meta name="permissions-policy" content="fullscreen=*" />
    <style>
        body {
            margin: 0;
            padding: 0;
            overflow: hidden;
            font-family: system-ui, -apple-system, sans-serif;
        }
        .container {
            display: flex;
            height: 100vh;
            width: 100vw;
        }
        .chat {
            flex: 1;
            border: none;
            height: 100vh;
            background: #f5f5f5;
        }
        .desktop {
            flex: 2;
            border: none;
            height: 100vh;
        }
        .controls {
            position: absolute;
            top: 10px;
            right: 10px;
            z-index: 1000;
            display: flex;
            gap: 10px;
        }
        button {
            padding: 8px 16px;
            border-radius: 4px;
            border: 1px solid #ccc;
            background: white;
            cursor: pointer;
            font-size: 14px;
        }
        button:hover {
            background: #f0f0f0;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="chat">
            <!-- Will be replaced with gptme chat interface -->
            <iframe src="/chat" style="width: 100%; height: 100%; border: none;"></iframe>
        </div>
        <iframe
            id="vnc"
            class="desktop"
            src="http://127.0.0.1:6080/vnc.html?&resize=scale&autoconnect=1&view_only=1&reconnect=1&reconnect_delay=2000"
            allow="fullscreen"
        ></iframe>
    </div>
    <div class="controls">
        <button id="toggleViewOnly">Toggle Screen Control (Off)</button>
        <button id="toggleFullscreen">Toggle Fullscreen</button>
    </div>
    <script>
        // Toggle view-only mode
        document.getElementById("toggleViewOnly").addEventListener("click", function() {
            var vncIframe = document.getElementById("vnc");
            var button = document.getElementById("toggleViewOnly");
            var currentSrc = vncIframe.src;
            if (currentSrc.includes("view_only=1")) {
                vncIframe.src = currentSrc.replace("view_only=1", "view_only=0");
                button.innerText = "Toggle Screen Control (On)";
            } else {
                vncIframe.src = currentSrc.replace("view_only=0", "view_only=1");
                button.innerText = "Toggle Screen Control (Off)";
            }
        });

        // Toggle fullscreen
        document.getElementById("toggleFullscreen").addEventListener("click", function() {
            if (!document.fullscreenElement) {
                document.documentElement.requestFullscreen();
            } else {
                document.exitFullscreen();
            }
        });
    </script>
</body>
</html>
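
The view-only toggle in the script above flips the ``view_only`` query parameter by string replacement on the iframe URL. The same idea can be sketched in Python by parsing the query string instead, which also covers a URL where the parameter is absent. ``toggle_view_only`` is a hypothetical helper, and treating a missing ``view_only`` as interactive mode is an assumption rather than something the diff states.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def toggle_view_only(url: str) -> str:
    """Return `url` with its view_only query parameter flipped between 1 and 0."""
    parts = urlsplit(url)
    # parse_qsl skips the empty pair produced by the leading "&" in the query.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    # Assumption: an absent parameter means the desktop is interactive (view_only=0).
    current = params.get("view_only", "0")
    params["view_only"] = "0" if current == "1" else "1"
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    src = (
        "http://127.0.0.1:6080/vnc.html?&resize=scale&autoconnect=1"
        "&view_only=1&reconnect=1&reconnect_delay=2000"
    )
    print(toggle_view_only(src))
```

Parsing the query avoids accidental matches on other parameters that happen to contain the same substring, at the cost of normalising the query string (the stray leading ``&`` is dropped).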
3 changes: 2 additions & 1 deletion gptme/server/static/index.html
@@ -151,8 +151,9 @@ <h1 class="text-2xl font-bold text-gray-800 mb-2 md:mb-0">{{ selectedConversatio
 <textarea
   class="flex-grow border rounded-lg p-3 mb-2 md:mb-0 md:mr-2 focus:outline-none focus:ring-2 focus:ring-blue-500 resize-none"
   v-model="newMessage"
-  placeholder="Type your message"
+  placeholder="Type your message (Enter to send, Shift+Enter for newline)"
   rows="3"
+  @keydown="handleKeyDown"
 ></textarea>
 <button
   type="submit"