From 652eb43887e472b3120b24ff32b9c26be59b1928 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Thu, 24 Oct 2024 20:06:31 +0200 Subject: [PATCH 01/16] feat: started working on anthropic-style computer tool --- .dockerignore | 1 + Makefile | 4 + docs/examples.rst | 27 +++ docs/server.rst | 65 +++++- docs/tools.rst | 72 ++++++ gptme/server/api.py | 12 + gptme/server/static/computer.html | 92 ++++++++ gptme/tools/__init__.py | 2 + gptme/tools/computer.py | 205 ++++++++++++++++++ .../tint2/applications/firefox-custom.desktop | 8 + .../.config/tint2/applications/gedit.desktop | 8 + .../tint2/applications/terminal.desktop | 8 + image/.config/tint2/tint2rc | 100 +++++++++ poetry.lock | 19 +- pyproject.toml | 4 + scripts/Dockerfile.computer | 93 ++++++++ scripts/start_x11.sh | 97 +++++++++ 17 files changed, 811 insertions(+), 6 deletions(-) create mode 100644 gptme/server/static/computer.html create mode 100644 gptme/tools/computer.py create mode 100755 image/.config/tint2/applications/firefox-custom.desktop create mode 100755 image/.config/tint2/applications/gedit.desktop create mode 100644 image/.config/tint2/applications/terminal.desktop create mode 100644 image/.config/tint2/tint2rc create mode 100644 scripts/Dockerfile.computer create mode 100755 scripts/start_x11.sh diff --git a/.dockerignore b/.dockerignore index f3aff7dc..3393309e 100644 --- a/.dockerignore +++ b/.dockerignore @@ -10,6 +10,7 @@ gptme.toml # Build scripts scripts +!scripts/start_x11.sh .github # Build/test/coverage/docs/prof directories diff --git a/Makefile b/Makefile index 57464325..aed8d8a8 100644 --- a/Makefile +++ b/Makefile @@ -17,6 +17,10 @@ build: build-docker: docker build . -t gptme:latest -f scripts/Dockerfile docker build . -t gptme-eval:latest -f scripts/Dockerfile.eval + # TODO: add gptme-server image + +build-docker-computer: + docker build . -t gptme-computer:latest -f scripts/Dockerfile.computer build-docker-dev: docker build . 
-t gptme-dev:latest -f scripts/Dockerfile.dev diff --git a/docs/examples.rst b/docs/examples.rst index 34bb0ad6..72afdd71 100644 --- a/docs/examples.rst +++ b/docs/examples.rst @@ -64,3 +64,30 @@ Generate docstrings for all functions in a file: gptme --non-interactive "Patch these files to include concise docstrings for all functions, skip functions that already have docstrings. Include: brief description, parameters." $@ These examples demonstrate how gptme can be used to create simple yet powerful automation tools. Each script can be easily customized and expanded to fit specific project needs. + +.. rubric:: Computer Use Examples + +Using the computer tool for GUI automation and desktop interaction (requires running the server with computer use support): + +.. code-block:: bash + + # Start server with computer use support + docker run -p 5000:5000 -p 8080:8080 -p 6080:6080 ghcr.io/erikbjare/gptme:latest-server + + # Then in another terminal: + + # Open and interact with an application + gptme 'open firefox and navigate to example.com' + + # GUI automation with visual feedback + gptme 'create a simple drawing in xpaint' + + # Desktop automation with keyboard/mouse + gptme 'open calculator and compute 15 * 23' + +The computer use interface at http://localhost:8080 provides a split view with: +- Chat interface on the left +- Desktop view on the right +- Controls for toggling interaction mode + +This enables complex GUI automation tasks with visual feedback and confirmation. diff --git a/docs/server.rst b/docs/server.rst index 8e33665a..4da5d052 100644 --- a/docs/server.rst +++ b/docs/server.rst @@ -16,12 +16,69 @@ It can be started by running the following command: Web UI ------ -.. code-block:: bash +The server provides two interfaces: - gptme-server +1. Basic Chat Interface + + .. code-block:: bash + + gptme-server + + Access the basic chat interface at http://localhost:5000 + +2. Computer Use Interface + + .. 
code-block:: bash + + # Run with computer use support (requires Docker) + docker run -p 5000:5000 -p 8080:8080 -p 6080:6080 ghcr.io/erikbjare/gptme:latest-server + + The computer use interface provides: + + - Combined chat and desktop view at http://localhost:8080 + - Desktop-only view at http://localhost:6080/vnc.html + - Chat-only view at http://localhost:5000 + + Features: + + - Split view with chat on the left, desktop on the right + - Toggle for view-only/interactive desktop mode + - Fullscreen support + - Automatic screen scaling for optimal LLM vision + + Requirements: + + - Docker for running the server with X11 support + - Browser with WebSocket support for VNC + - Network ports 5000 (API), 8080 (combined view), and 6080 (VNC) available + +Security Considerations +----------------------- + +When using the computer use interface: + +1. Network Security + - The server exposes VNC and web interfaces + - Consider using SSH tunneling for remote access + - Restrict access to trusted networks/users + - Use firewall rules to limit port access + +2. Container Security + - Run with minimal privileges + - Mount only necessary directories + - Consider using a separate network namespace + - Regularly update base images and dependencies -This should let you view your chats in a web browser and make basic requests. +3. Usage Guidelines + - Start in view-only mode by default + - Require explicit user action to enable interaction + - Monitor and audit computer use sessions + - Implement timeouts for inactive sessions -You can then access the web UI by visiting http://localhost:5000 in your browser. +4. Data Protection + - Don't expose sensitive information in the desktop environment + - Be cautious with browser automation and credentials + - Consider using ephemeral containers + - Regular cleanup of session data For more usage, see :ref:`the CLI documentation `. 
diff --git a/docs/tools.rst b/docs/tools.rst index 36f55a9b..eacd917a 100644 --- a/docs/tools.rst +++ b/docs/tools.rst @@ -26,6 +26,7 @@ The tools can be grouped into the following categories: - `Screenshot`_ - `Vision`_ + - `Computer`_ - Chat management @@ -107,3 +108,74 @@ Chats .. automodule:: gptme.tools.chats :members: :noindex: + +Computer +-------- + +.. automodule:: gptme.tools.computer + :members: + :noindex: + +The computer tool provides direct interaction with the desktop environment through X11, allowing for: + +- Keyboard input simulation +- Mouse control (movement, clicks, dragging) +- Screen capture with automatic scaling +- Cursor position tracking + +To use the computer tool, you need to: + +1. Install gptme with computer support:: + + pip install "gptme[computer]" + +2. Run gptme server with X11 support:: + + docker run -p 5000:5000 -p 8080:8080 -p 6080:6080 ghcr.io/erikbjare/gptme:latest-server + +3. Access the combined interface at http://localhost:8080 + +Example usage:: + + # Type text + computer(action="type", text="Hello, World!") + + # Move mouse and click + computer(action="mouse_move", coordinate=(100, 100)) + computer(action="left_click") + + # Take screenshot + computer(action="screenshot") + + # Send keyboard shortcuts + computer(action="key", text="Control_L+c") + +The tool automatically handles screen resolution scaling to ensure optimal performance with LLM vision capabilities. + +Security Considerations +~~~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: + Computer use poses unique risks beyond standard LLM interactions. To minimize risks: + + 1. Run in an isolated environment: + - Use Docker container with minimal privileges + - Consider using a dedicated virtual machine + - Limit network access where possible + + 2. Protect sensitive data: + - Don't expose login credentials or sensitive information + - Be cautious with browser automation + - Consider using a separate profile/workspace + + 3. 
Implement safeguards: + - Require human confirmation for consequential actions + - Use view-only mode by default in the web interface + - Monitor and log computer use actions + + 4. Be aware of prompt injection risks: + - The model may follow commands found in viewed content + - Screen content could override user instructions + - Isolate the environment from sensitive operations + + Always inform users of these risks and obtain appropriate consent before enabling computer use features. diff --git a/gptme/server/api.py b/gptme/server/api.py index fda5538e..72bb9eef 100644 --- a/gptme/server/api.py +++ b/gptme/server/api.py @@ -148,6 +148,18 @@ def root(): return current_app.send_static_file("index.html") +# serve computer interface +@api.route("/computer") +def computer(): + return current_app.send_static_file("computer.html") + + +# serve chat interface (for embedding in computer view) +@api.route("/chat") +def chat(): + return current_app.send_static_file("index.html") + + @api.route("/favicon.png") def favicon(): return flask.send_from_directory(media_path, "logo.png") diff --git a/gptme/server/static/computer.html b/gptme/server/static/computer.html new file mode 100644 index 00000000..33994147 --- /dev/null +++ b/gptme/server/static/computer.html @@ -0,0 +1,92 @@ + + + + gptme - Computer Use + + + + +
+
+ + +
+ +
+
+ + +
+ + + diff --git a/gptme/tools/__init__.py b/gptme/tools/__init__.py index 8912ed6d..f6880799 100644 --- a/gptme/tools/__init__.py +++ b/gptme/tools/__init__.py @@ -6,6 +6,7 @@ from .base import ConfirmFunc, ToolSpec, ToolUse from .browser import tool as browser_tool from .chats import tool as chats_tool +from .computer import tool as computer_tool from .gh import tool as gh_tool from .patch import tool as patch_tool from .python import register_function @@ -43,6 +44,7 @@ youtube_tool, screenshot_tool, vision_tool, + computer_tool, # python tool is loaded last to ensure all functions are registered python_tool, ] diff --git a/gptme/tools/computer.py b/gptme/tools/computer.py new file mode 100644 index 00000000..8b5bb110 --- /dev/null +++ b/gptme/tools/computer.py @@ -0,0 +1,205 @@ +""" +Tool for computer interaction through X11, including screen capture, keyboard, and mouse control. +Similar to Anthropic's computer use demo, but integrated with gptme's architecture. +""" + +import asyncio +import os +import shlex +import shutil +import subprocess +from collections.abc import Generator +from enum import Enum +from pathlib import Path +from typing import Literal, TypedDict + +from ..message import Message +from .base import ToolSpec +from .screenshot import _screenshot + +# Constants from Anthropic's implementation +TYPING_DELAY_MS = 12 +TYPING_GROUP_SIZE = 50 +OUTPUT_DIR = "/tmp/outputs" + +Action = Literal[ + "key", + "type", + "mouse_move", + "left_click", + "left_click_drag", + "right_click", + "middle_click", + "double_click", + "screenshot", + "cursor_position", +] + + +class Resolution(TypedDict): + width: int + height: int + + +# Recommended maximum resolutions for LLM vision +MAX_SCALING_TARGETS: dict[str, Resolution] = { + "XGA": Resolution(width=1024, height=768), # 4:3 + "WXGA": Resolution(width=1280, height=800), # 16:10 + "FWXGA": Resolution(width=1366, height=768), # ~16:9 +} + + +class ScalingSource(Enum): + COMPUTER = "computer" + API = "api" + + 
+def chunks(s: str, chunk_size: int) -> list[str]: + """Split string into chunks for typing simulation.""" + return [s[i : i + chunk_size] for i in range(0, len(s), chunk_size)] + + +def scale_coordinates( + source: ScalingSource, x: int, y: int, current_width: int, current_height: int +) -> tuple[int, int]: + """Scale coordinates to/from recommended resolutions.""" + ratio = current_width / current_height + target_dimension = None + + for dimension in MAX_SCALING_TARGETS.values(): + if abs(dimension["width"] / dimension["height"] - ratio) < 0.02: + if dimension["width"] < current_width: + target_dimension = dimension + break + + if target_dimension is None: + return x, y + + x_scaling_factor = target_dimension["width"] / current_width + y_scaling_factor = target_dimension["height"] / current_height + + if source == ScalingSource.API: + if x > current_width or y > current_height: + raise ValueError(f"Coordinates {x}, {y} are out of bounds") + # Scale up + return round(x / x_scaling_factor), round(y / y_scaling_factor) + # Scale down + return round(x * x_scaling_factor), round(y * y_scaling_factor) + + +def run_xdotool(cmd: str, display: str | None = None) -> str: + """Run an xdotool command with optional display setting.""" + display_prefix = f"DISPLAY={display} " if display else "" + full_cmd = f"{display_prefix}xdotool {cmd}" + proc = subprocess.Popen( + full_cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE + ) + stdout, stderr = proc.communicate() + if proc.returncode != 0: + raise RuntimeError(f"xdotool command failed: {stderr.decode()}") + return stdout.decode() + + +def computer_action( + action: Action, text: str | None = None, coordinate: tuple[int, int] | None = None +) -> Generator[Message, None, None]: + """ + Perform computer interactions through X11. 
+ + Args: + action: The type of action to perform + text: Text to type or key sequence to send + coordinate: X,Y coordinates for mouse actions + """ + display = os.getenv("DISPLAY", ":1") + width = int(os.getenv("WIDTH", "1024")) + height = int(os.getenv("HEIGHT", "768")) + + try: + if action in ("mouse_move", "left_click_drag"): + if not coordinate: + raise ValueError(f"coordinate is required for {action}") + x, y = scale_coordinates( + ScalingSource.API, coordinate[0], coordinate[1], width, height + ) + + if action == "mouse_move": + run_xdotool(f"mousemove --sync {x} {y}", display) + else: # left_click_drag + run_xdotool(f"mousedown 1 mousemove --sync {x} {y} mouseup 1", display) + + yield Message("system", f"Moved mouse to {x},{y}") + + elif action in ("key", "type"): + if not text: + raise ValueError(f"text is required for {action}") + + if action == "key": + run_xdotool(f"key -- {text}", display) + yield Message("system", f"Sent key sequence: {text}") + else: # type + for chunk in chunks(text, TYPING_GROUP_SIZE): + run_xdotool( + f"type --delay {TYPING_DELAY_MS} -- {shlex.quote(chunk)}", + display, + ) + yield Message("system", f"Typed text: {text}") + + elif action in ("left_click", "right_click", "middle_click", "double_click"): + click_arg = { + "left_click": "1", + "right_click": "3", + "middle_click": "2", + "double_click": "--repeat 2 --delay 500 1", + }[action] + run_xdotool(f"click {click_arg}", display) + yield Message("system", f"Performed {action}") + + elif action == "screenshot": + # Use X11-specific screenshot if available, fall back to native + output_dir = Path(OUTPUT_DIR) + output_dir.mkdir(parents=True, exist_ok=True) + path = output_dir / "screenshot.png" + + if shutil.which("gnome-screenshot"): + run_xdotool(f"gnome-screenshot -f {path} -p", display) + elif os.name == "posix": + _screenshot(path) # Use existing screenshot function + else: + raise NotImplementedError("Screenshot not supported on this platform") + + # Scale if needed + if 
path.exists(): + x, y = scale_coordinates( + ScalingSource.COMPUTER, width, height, width, height + ) + os.system(f"convert {path} -resize {x}x{y}! {path}") + yield Message("system", f"Screenshot saved to {path}", files=[path]) + + elif action == "cursor_position": + output = run_xdotool("getmouselocation --shell", display) + x = int(output.split("X=")[1].split("\n")[0]) + y = int(output.split("Y=")[1].split("\n")[0]) + x, y = scale_coordinates(ScalingSource.COMPUTER, x, y, width, height) + yield Message("system", f"Cursor position: X={x},Y={y}") + + except Exception as e: + yield Message("system", f"Error: Computer action failed: {str(e)}") + + +tool = ToolSpec( + name="computer", + desc="Control the computer through X11 (keyboard, mouse, screen)", + instructions=""" + Use this tool to interact with the computer through X11. + Available actions: + - key: Send key sequence (e.g., "Return", "Control_L+c") + - type: Type text with realistic delays + - mouse_move: Move mouse to coordinates + - left_click, right_click, middle_click, double_click: Mouse clicks + - left_click_drag: Click and drag to coordinates + - screenshot: Take a screenshot + - cursor_position: Get current mouse position + """, + functions=[computer_action], +) diff --git a/image/.config/tint2/applications/firefox-custom.desktop b/image/.config/tint2/applications/firefox-custom.desktop new file mode 100755 index 00000000..94802126 --- /dev/null +++ b/image/.config/tint2/applications/firefox-custom.desktop @@ -0,0 +1,8 @@ +[Desktop Entry] +Name=Firefox Custom +Comment=Open Firefox with custom URL +Exec=firefox-esr -new-window +Icon=firefox-esr +Terminal=false +Type=Application +Categories=Network;WebBrowser; diff --git a/image/.config/tint2/applications/gedit.desktop b/image/.config/tint2/applications/gedit.desktop new file mode 100755 index 00000000..d5af03f4 --- /dev/null +++ b/image/.config/tint2/applications/gedit.desktop @@ -0,0 +1,8 @@ +[Desktop Entry] +Name=Gedit +Comment=Open gedit +Exec=gedit 
+Icon=text-editor-symbolic +Terminal=false +Type=Application +Categories=TextEditor; diff --git a/image/.config/tint2/applications/terminal.desktop b/image/.config/tint2/applications/terminal.desktop new file mode 100644 index 00000000..0c2d45d4 --- /dev/null +++ b/image/.config/tint2/applications/terminal.desktop @@ -0,0 +1,8 @@ +[Desktop Entry] +Name=Terminal +Comment=Open Terminal +Exec=xterm +Icon=utilities-terminal +Terminal=false +Type=Application +Categories=System;TerminalEmulator; diff --git a/image/.config/tint2/tint2rc b/image/.config/tint2/tint2rc new file mode 100644 index 00000000..5db6d312 --- /dev/null +++ b/image/.config/tint2/tint2rc @@ -0,0 +1,100 @@ +#------------------------------------- +# Panel +panel_items = TL +panel_size = 100% 60 +panel_margin = 0 0 +panel_padding = 2 0 2 +panel_background_id = 1 +wm_menu = 0 +panel_dock = 0 +panel_position = bottom center horizontal +panel_layer = top +panel_monitor = all +panel_shrink = 0 +autohide = 0 +autohide_show_timeout = 0 +autohide_hide_timeout = 0.5 +autohide_height = 2 +strut_policy = follow_size +panel_window_name = tint2 +disable_transparency = 1 +mouse_effects = 1 +font_shadow = 0 +mouse_hover_icon_asb = 100 0 10 +mouse_pressed_icon_asb = 100 0 0 +scale_relative_to_dpi = 0 +scale_relative_to_screen_height = 0 + +#------------------------------------- +# Taskbar +taskbar_mode = single_desktop +taskbar_hide_if_empty = 0 +taskbar_padding = 0 0 2 +taskbar_background_id = 0 +taskbar_active_background_id = 0 +taskbar_name = 1 +taskbar_hide_inactive_tasks = 0 +taskbar_hide_different_monitor = 0 +taskbar_hide_different_desktop = 0 +taskbar_always_show_all_desktop_tasks = 0 +taskbar_name_padding = 4 2 +taskbar_name_background_id = 0 +taskbar_name_active_background_id = 0 +taskbar_name_font_color = #e3e3e3 100 +taskbar_name_active_font_color = #ffffff 100 +taskbar_distribute_size = 0 +taskbar_sort_order = none +task_align = left + +#------------------------------------- +# Launcher +launcher_padding = 
4 8 4 +launcher_background_id = 0 +launcher_icon_background_id = 0 +launcher_icon_size = 48 +launcher_icon_asb = 100 0 0 +launcher_icon_theme_override = 0 +startup_notifications = 1 +launcher_tooltip = 1 + +#------------------------------------- +# Launcher icon +launcher_item_app = /usr/share/applications/libreoffice-calc.desktop +launcher_item_app = /home/computeruse/.config/tint2/applications/terminal.desktop +launcher_item_app = /home/computeruse/.config/tint2/applications/firefox-custom.desktop +launcher_item_app = /usr/share/applications/xpaint.desktop +launcher_item_app = /usr/share/applications/xpdf.desktop +launcher_item_app = /home/computeruse/.config/tint2/applications/gedit.desktop +launcher_item_app = /usr/share/applications/galculator.desktop + +#------------------------------------- +# Background definitions +# ID 1 +rounded = 0 +border_width = 0 +background_color = #000000 60 +border_color = #000000 30 + +# ID 2 +rounded = 4 +border_width = 1 +background_color = #777777 20 +border_color = #777777 30 + +# ID 3 +rounded = 4 +border_width = 1 +background_color = #777777 20 +border_color = #ffffff 40 + +# ID 4 +rounded = 4 +border_width = 1 +background_color = #aa4400 100 +border_color = #aa7733 100 + +# ID 5 +rounded = 4 +border_width = 1 +background_color = #aaaa00 100 +border_color = #aaaa00 100 diff --git a/poetry.lock b/poetry.lock index 96020bd6..5bcbc787 100644 --- a/poetry.lock +++ b/poetry.lock @@ -2498,6 +2498,20 @@ files = [ [package.extras] cli = ["click (>=5.0)"] +[[package]] +name = "python-xlib" +version = "0.33" +description = "Python X Library" +optional = true +python-versions = "*" +files = [ + {file = "python-xlib-0.33.tar.gz", hash = "sha256:55af7906a2c75ce6cb280a584776080602444f75815a7aff4d287bb2d7018b32"}, + {file = "python_xlib-0.33-py2.py3-none-any.whl", hash = "sha256:c3534038d42e0df2f1392a1b30a15a4ff5fdc2b86cfa94f072bf11b10a164398"}, +] + +[package.dependencies] +six = ">=1.10.0" + [[package]] name = "pytz" version = "2024.2" 
@@ -3411,12 +3425,13 @@ files = [ requests = "*" [extras] -all = ["flask", "matplotlib", "numpy", "pandas", "pillow", "playwright"] +all = ["flask", "matplotlib", "numpy", "pandas", "pillow", "playwright", "python-xlib"] browser = ["playwright"] +computer = ["pillow", "python-xlib"] datascience = ["matplotlib", "numpy", "pandas", "pillow"] server = ["flask"] [metadata] lock-version = "2.0" python-versions = "^3.10" -content-hash = "043894e5aad0f1d3bfb95134b3469d5eba53ce7f8e8f5e83bef97ff90ee483cb" +content-hash = "1a818642af5d1853c1f8bbc628218757ec088e756a1098e5caca4b4403268258" diff --git a/pyproject.toml b/pyproject.toml index fae3bbf2..7018a53e 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -36,6 +36,7 @@ ipython = "^8.17.2" bashlex = "^0.18" playwright = {version = "1.47.*", optional=true} # version constrained due to annoying to have to run `playwright install` on every update youtube_transcript_api = {version = "^0.6.1", optional = true} +python-xlib = {version = "^0.33", optional = true} # for X11 interaction # providers openai = "^1.0" @@ -85,6 +86,7 @@ types-lxml = "*" server = ["flask"] browser = ["playwright"] datascience = ["matplotlib", "pandas", "numpy", "pillow"] +computer = ["python-xlib", "pillow"] # pillow already in datascience but listed for clarity all = [ # server "flask", @@ -92,6 +94,8 @@ all = [ "playwright", # datascience "matplotlib", "pandas", "numpy", "pillow", + # computer + "python-xlib", ] [tool.ruff] diff --git a/scripts/Dockerfile.computer b/scripts/Dockerfile.computer new file mode 100644 index 00000000..d98cdb42 --- /dev/null +++ b/scripts/Dockerfile.computer @@ -0,0 +1,93 @@ +FROM ubuntu:22.04 + +ENV DEBIAN_FRONTEND=noninteractive +ENV DEBIAN_PRIORITY=high + +# Install system dependencies +RUN apt-get update && \ + apt-get -y upgrade && \ + apt-get -y install \ + build-essential \ + python3.10 \ + python3.10-dev \ + python3-pip \ + # UI Requirements + xvfb \ + xterm \ + xdotool \ + scrot \ + imagemagick \ + sudo \ + mutter \ 
+ x11vnc \ + tint2 \ + x11-apps \ + # Tools + make \ + git \ + tmux \ + curl \ + pandoc \ + netcat-openbsd \ + net-tools \ + && rm -rf /var/lib/apt/lists/* + +# Install noVNC for web access +RUN git clone --branch v1.5.0 https://github.com/novnc/noVNC.git /opt/noVNC && \ + git clone --branch v0.12.0 https://github.com/novnc/websockify /opt/noVNC/utils/websockify && \ + ln -s /opt/noVNC/vnc.html /opt/noVNC/index.html + +# Create user and setup environment +ENV USERNAME=computeruse +ENV HOME=/home/$USERNAME +RUN useradd -m -s /bin/bash -d $HOME $USERNAME && \ + echo "${USERNAME} ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers && \ + mkdir -p /workspace && \ + mkdir -p /tmp/.X11-unix && \ + chmod 1777 /tmp/.X11-unix && \ + chown -R $USERNAME:$USERNAME /workspace + +# Install Poetry and dependencies +RUN python3.10 -m pip install --upgrade pip && \ + python3.10 -m pip install poetry + +# Copy desktop environment configs (at the end for faster rebuilds) +RUN mkdir -p $HOME/.config +COPY --chown=$USERNAME:$USERNAME image/.config $HOME/.config +RUN ls -la $HOME/.config/tint2/ + +# Set up project +WORKDIR /app +COPY --chown=$USERNAME:$USERNAME . 
/app +RUN poetry config virtualenvs.create false && \ + poetry install --no-interaction --no-ansi -E server + +# Switch to non-root user +USER $USERNAME +WORKDIR $HOME + +# Configure git +RUN git config --global user.name "gptme" && \ + git config --global user.email "gptme@superuserlabs.org" && \ + git config --global init.defaultBranch main + +# Set up environment +ENV PYTHONPATH=/app +ENV PATH="/usr/local/bin:$PATH" +ENV DISPLAY_NUM=1 +ENV WIDTH=1024 +ENV HEIGHT=768 +ENV DISPLAY=:${DISPLAY_NUM} + +# Copy and setup startup script +COPY --chown=$USERNAME:$USERNAME scripts/start_x11.sh /app/ +RUN chmod +x /app/start_x11.sh + +# Set working directory +WORKDIR /workspace + +# Expose ports +EXPOSE 5000 6080 8080 + +# Start services +CMD ["/app/start_x11.sh"] diff --git a/scripts/start_x11.sh b/scripts/start_x11.sh new file mode 100755 index 00000000..6ba267bb --- /dev/null +++ b/scripts/start_x11.sh @@ -0,0 +1,97 @@ +#!/bin/bash +set -e # Exit on error + +# Environment setup +DPI=96 +RES_AND_DEPTH=${WIDTH}x${HEIGHT}x24 +export DISPLAY=:${DISPLAY_NUM} + +# Function to check if Xvfb is already running +check_xvfb_running() { + if [ -e /tmp/.X${DISPLAY_NUM}-lock ]; then + return 0 # Xvfb is already running + else + return 1 # Xvfb is not running + fi +} + +# Function to check if Xvfb is ready +wait_for_xvfb() { + local timeout=10 + local start_time=$(date +%s) + while ! xdpyinfo >/dev/null 2>&1; do + if [ $(($(date +%s) - start_time)) -gt $timeout ]; then + echo "Xvfb failed to start within $timeout seconds" >&2 + return 1 + fi + sleep 0.1 + done + return 0 +} + +# Start Xvfb if not already running +if ! check_xvfb_running; then + echo "Starting Xvfb..." + Xvfb $DISPLAY -ac -screen 0 $RES_AND_DEPTH -retro -dpi $DPI -nolisten tcp -nolisten unix & + XVFB_PID=$! + + if ! 
wait_for_xvfb; then + echo "Xvfb failed to start" + kill $XVFB_PID + exit 1 + fi + echo "Xvfb started successfully on display ${DISPLAY}" + echo "Xvfb PID: $XVFB_PID" +fi + +# Start tint2 +echo "Starting tint2..." +tint2 2>/tmp/tint2_stderr.log & +timeout=30 +while [ $timeout -gt 0 ]; do + if xdotool search --class "tint2" >/dev/null 2>&1; then + break + fi + sleep 1 + ((timeout--)) +done + +if [ $timeout -eq 0 ]; then + echo "tint2 stderr output:" >&2 + cat /tmp/tint2_stderr.log >&2 + exit 1 +fi +rm /tmp/tint2_stderr.log + +# Start window manager +echo "Starting mutter..." +XDG_SESSION_TYPE=x11 mutter --replace --sm-disable 2>/tmp/mutter_stderr.log & +timeout=30 +while [ $timeout -gt 0 ]; do + if xdotool search --class "mutter" >/dev/null 2>&1; then + break + fi + sleep 1 + ((timeout--)) +done + +if [ $timeout -eq 0 ]; then + echo "mutter stderr output:" >&2 + cat /tmp/mutter_stderr.log >&2 + exit 1 +fi +rm /tmp/mutter_stderr.log + +# Start VNC server +echo "Starting VNC server..." +x11vnc -display $DISPLAY -nopw -listen localhost -xkb -ncache 10 -ncache_cr -forever & +sleep 1 + +# Start noVNC +echo "Starting noVNC..." +/opt/noVNC/utils/novnc_proxy --vnc localhost:5900 --listen 8080 & +sleep 1 + +# Start gptme server +echo "Starting gptme server..." +python3 -m gptme.server From f3b491d19096802f050f1ff2fb5cdd0558fb9b4f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Thu, 24 Oct 2024 20:11:59 +0200 Subject: [PATCH 02/16] Apply suggestions from code review Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com> --- gptme/tools/computer.py | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/gptme/tools/computer.py b/gptme/tools/computer.py index 8b5bb110..03f3021e 100644 --- a/gptme/tools/computer.py +++ b/gptme/tools/computer.py @@ -3,7 +3,6 @@ Similar to Anthropic's computer use demo, but integrated with gptme's architecture. 
""" -import asyncio import os import shlex import shutil @@ -91,9 +90,7 @@ def run_xdotool(cmd: str, display: str | None = None) -> str: """Run an xdotool command with optional display setting.""" display_prefix = f"DISPLAY={display} " if display else "" full_cmd = f"{display_prefix}xdotool {cmd}" - proc = subprocess.Popen( - full_cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE - ) + proc = subprocess.Popen(full_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) stdout, stderr = proc.communicate() if proc.returncode != 0: raise RuntimeError(f"xdotool command failed: {stderr.decode()}") @@ -173,7 +170,9 @@ def computer_action( x, y = scale_coordinates( ScalingSource.COMPUTER, width, height, width, height ) - os.system(f"convert {path} -resize {x}x{y}! {path}") + subprocess.run( + f"convert {path} -resize {x}x{y}! {path}", shell=True, check=True + ) yield Message("system", f"Screenshot saved to {path}", files=[path]) elif action == "cursor_position": From 0d1a9f3ea7ac778e9870b79d790be57a508c4dee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Thu, 24 Oct 2024 21:00:20 +0200 Subject: [PATCH 03/16] fix: progress on computer use --- .dockerignore | 1 + gptme/server/cli.py | 16 ++++++++++++++-- scripts/Dockerfile.computer | 10 +++++----- .../tint2/applications/firefox-custom.desktop | 0 .../.config/tint2/applications/gedit.desktop | 0 .../.config/tint2/applications/terminal.desktop | 0 .../computer_home}/.config/tint2/tint2rc | 0 scripts/start_x11.sh | 2 +- 8 files changed, 21 insertions(+), 8 deletions(-) rename {image => scripts/computer_home}/.config/tint2/applications/firefox-custom.desktop (100%) rename {image => scripts/computer_home}/.config/tint2/applications/gedit.desktop (100%) rename {image => scripts/computer_home}/.config/tint2/applications/terminal.desktop (100%) rename {image => scripts/computer_home}/.config/tint2/tint2rc (100%) diff --git a/.dockerignore b/.dockerignore index 3393309e..5b0396a9 100644 --- 
a/.dockerignore +++ b/.dockerignore @@ -11,6 +11,7 @@ gptme.toml # Build scripts scripts !scripts/start_x11.sh +!scripts/computer_home .github # Build/test/coverage/docs/prof directories diff --git a/gptme/server/cli.py b/gptme/server/cli.py index 08e1474d..cdfc064d 100644 --- a/gptme/server/cli.py +++ b/gptme/server/cli.py @@ -16,7 +16,19 @@ default=None, help="Model to use by default, can be overridden in each request.", ) -def main(debug: bool, verbose: bool, model: str | None): # pragma: no cover +@click.option( + "--host", + default="127.0.0.1", + help="Host to bind the server to.", +) +@click.option( + "--port", + default="5000", + help="Port to run the server on.", +) +def main( + debug: bool, verbose: bool, model: str | None, host: str, port: str +): # pragma: no cover """ Starts a server and web UI for gptme. @@ -37,4 +49,4 @@ def main(debug: bool, verbose: bool, model: str | None): # pragma: no cover click.echo("Initialization complete, starting server") app = create_app() - app.run(debug=debug) + app.run(debug=debug, host=host, port=int(port)) diff --git a/scripts/Dockerfile.computer b/scripts/Dockerfile.computer index d98cdb42..8d166f42 100644 --- a/scripts/Dockerfile.computer +++ b/scripts/Dockerfile.computer @@ -51,17 +51,17 @@ RUN useradd -m -s /bin/bash -d $HOME $USERNAME && \ RUN python3.10 -m pip install --upgrade pip && \ python3.10 -m pip install poetry -# Copy desktop environment configs (at the end for faster rebuilds) -RUN mkdir -p $HOME/.config -COPY --chown=$USERNAME:$USERNAME image/.config $HOME/.config -RUN ls -la $HOME/.config/tint2/ - # Set up project WORKDIR /app COPY --chown=$USERNAME:$USERNAME . 
/app RUN poetry config virtualenvs.create false && \ poetry install --no-interaction --no-ansi -E server +# Copy desktop environment configs (at the end for faster rebuilds) +RUN mkdir -p $HOME && chmod 777 $HOME +COPY scripts/computer_home $HOME +RUN chown -R $USERNAME:$USERNAME $HOME + # Switch to non-root user USER $USERNAME WORKDIR $HOME diff --git a/image/.config/tint2/applications/firefox-custom.desktop b/scripts/computer_home/.config/tint2/applications/firefox-custom.desktop similarity index 100% rename from image/.config/tint2/applications/firefox-custom.desktop rename to scripts/computer_home/.config/tint2/applications/firefox-custom.desktop diff --git a/image/.config/tint2/applications/gedit.desktop b/scripts/computer_home/.config/tint2/applications/gedit.desktop similarity index 100% rename from image/.config/tint2/applications/gedit.desktop rename to scripts/computer_home/.config/tint2/applications/gedit.desktop diff --git a/image/.config/tint2/applications/terminal.desktop b/scripts/computer_home/.config/tint2/applications/terminal.desktop similarity index 100% rename from image/.config/tint2/applications/terminal.desktop rename to scripts/computer_home/.config/tint2/applications/terminal.desktop diff --git a/image/.config/tint2/tint2rc b/scripts/computer_home/.config/tint2/tint2rc similarity index 100% rename from image/.config/tint2/tint2rc rename to scripts/computer_home/.config/tint2/tint2rc diff --git a/scripts/start_x11.sh b/scripts/start_x11.sh index 6ba267bb..7268e056 100755 --- a/scripts/start_x11.sh +++ b/scripts/start_x11.sh @@ -94,4 +94,4 @@ sleep 1 # Start gptme server echo "Starting gptme server..." 
-python3 -m gptme.server +python3 -m gptme.server --host 0.0.0.0 --port 8081 From 2eff75da914ee3ab0895244413a5c349f915978c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Fri, 25 Oct 2024 10:37:07 +0200 Subject: [PATCH 04/16] fix: added Dockerfile.server --- Makefile | 3 ++- scripts/Dockerfile | 13 ------------- scripts/Dockerfile.server | 23 +++++++++++++++++++++++ 3 files changed, 25 insertions(+), 14 deletions(-) create mode 100644 scripts/Dockerfile.server diff --git a/Makefile b/Makefile index aed8d8a8..23a3ff5d 100644 --- a/Makefile +++ b/Makefile @@ -16,8 +16,9 @@ build: build-docker: docker build . -t gptme:latest -f scripts/Dockerfile + docker build . -t gptme-server:latest -f scripts/Dockerfile.server docker build . -t gptme-eval:latest -f scripts/Dockerfile.eval - # TODO: add gptme-server image + # docker build . -t gptme-eval:latest -f scripts/Dockerfile.eval --build-arg RUST=yes --build-arg BROWSER=yes build-docker-computer: docker build . -t gptme-computer:latest -f scripts/Dockerfile.computer diff --git a/scripts/Dockerfile b/scripts/Dockerfile index 59b5bbd6..82c52eac 100644 --- a/scripts/Dockerfile +++ b/scripts/Dockerfile @@ -52,18 +52,5 @@ ENV PYTHONPATH=/app # Set the working directory WORKDIR /workspace -# Expose port 5000 -EXPOSE 5000 - -# Healthcheck -# TODO: only relevant for server -#HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \ -# CMD curl -f http://localhost:5000/ || exit 1 - -# TODO: make into separate image -# CMD ["poetry", "run", "python", "-m", "gptme.server"] - -RUN poetry config virtualenvs.create false - # Entrypoint if prompt/args given, run the CLI ENTRYPOINT ["python", "-m", "gptme"] diff --git a/scripts/Dockerfile.server b/scripts/Dockerfile.server new file mode 100644 index 00000000..9ebf7f74 --- /dev/null +++ b/scripts/Dockerfile.server @@ -0,0 +1,23 @@ +# Use the main Dockerfile as the base image +FROM gptme:latest + +# Install server dependencies +USER root +WORKDIR /app 
+RUN poetry install -E server --without=dev + +# Switch back to non-root user +USER appuser + +# Set the working directory +WORKDIR /workspace + +# Expose the server port +EXPOSE 5000 + +# Add healthcheck +HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \ + CMD curl -f http://localhost:5000/ || exit 1 + +# Set the entrypoint to run the server +ENTRYPOINT ["python", "-m", "gptme.server", "--host", "0.0.0.0"] From cb6ffd55e2c45c62c867bf55060696003dc1e4e5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Sun, 27 Oct 2024 19:27:02 +0100 Subject: [PATCH 05/16] fix: fixed vnc in computer use webui --- scripts/start_x11.sh | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/scripts/start_x11.sh b/scripts/start_x11.sh index 7268e056..54d043a5 100755 --- a/scripts/start_x11.sh +++ b/scripts/start_x11.sh @@ -84,13 +84,30 @@ rm /tmp/mutter_stderr.log # Start VNC server echo "Starting VNC server..." -x11vnc -display $DISPLAY -nopw -listen localhost -xkb -ncache 10 -ncache_cr -forever & +x11vnc -display $DISPLAY -nopw -listen 0.0.0.0 -xkb -ncache 10 -ncache_cr -forever & sleep 1 # Start noVNC echo "Starting noVNC..." -/opt/noVNC/utils/novnc_proxy --vnc localhost:5900 --listen 8080 & -sleep 1 +/opt/noVNC/utils/novnc_proxy --vnc localhost:5900 --listen 0.0.0.0:6080 --web /opt/noVNC > /tmp/novnc.log 2>&1 & + +# Wait for noVNC to start +timeout=10 +while [ $timeout -gt 0 ]; do + if netstat -tuln | grep -q ":6080 "; then + break + fi + sleep 1 + ((timeout--)) +done + +if [ $timeout -eq 0 ]; then + echo "noVNC failed to start" + cat /tmp/novnc.log + exit 1 +fi + +echo "noVNC started successfully" # Start gptme server echo "Starting gptme server..." 
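Note on the noVNC readiness loop in the start_x11.sh change above: the same "poll until the port is up, otherwise fail" idea can be sketched in Python as below. This is only an illustration, not part of the patch. The helper name `wait_for_port` and its parameters are hypothetical, and it probes by opening a TCP connection instead of parsing `netstat` output as the script does.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.2) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: mirrors the shell loop's countdown, but checks
    readiness by connecting rather than by inspecting the listening sockets.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is accepting on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: wait a little and retry until the deadline.
            time.sleep(interval)
    return False
```

A caller would use it roughly as `wait_for_port("127.0.0.1", 6080, timeout=10)` (6080 being the noVNC port from the patch) and dump `/tmp/novnc.log` before exiting when it returns False.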
From 20fb541f3840fc51cec15380be60e837a0b2dcca Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Sun, 27 Oct 2024 20:44:57 +0100 Subject: [PATCH 06/16] docs: fixed docs for computer use --- docs/conf.py | 1 + docs/server.rst | 11 +++++++++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index 8dbed0c1..f8617bea 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -118,6 +118,7 @@ def setup(app): ("py:class", "flask.app.Flask"), ("py:class", "gptme.tools.python.T"), ("py:class", "threading.Thread"), + ("py:class", "gptme.tools.computer.ScalingSource"), ] # -- Options for HTML output ------------------------------------------------- diff --git a/docs/server.rst b/docs/server.rst index 4da5d052..22fcbbf5 100644 --- a/docs/server.rst +++ b/docs/server.rst @@ -26,6 +26,8 @@ The server provides two interfaces: Access the basic chat interface at http://localhost:5000 + For more usage, see :ref:`the CLI documentation `. + 2. Computer Use Interface .. code-block:: bash @@ -52,8 +54,13 @@ The server provides two interfaces: - Browser with WebSocket support for VNC - Network ports 5000 (API), 8080 (combined view), and 6080 (VNC) available +.. warning:: + + The computer use interface is experimental and has serious security implications. + Please use with caution and consider the security considerations below. + Security Considerations ---------------------- +----------------------- When using the computer use interface: @@ -81,4 +88,4 @@ When using the computer use interface: - Consider using ephemeral containers - Regular cleanup of session data -For more usage, see :ref:`the CLI documentation `. +Please also see Anthropic's documentation on `computer use `_ for additional guidance. 
From 067adbc45bd30703faccebcd2bd9005c30af0fc5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Sun, 27 Oct 2024 20:50:24 +0100 Subject: [PATCH 07/16] fix: rewrote computer_action function to not be a generator --- gptme/tools/computer.py | 47 ++++++++++++++++++++++------------------- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/gptme/tools/computer.py b/gptme/tools/computer.py index 03f3021e..00d7f0f1 100644 --- a/gptme/tools/computer.py +++ b/gptme/tools/computer.py @@ -7,12 +7,10 @@ import shlex import shutil import subprocess -from collections.abc import Generator from enum import Enum from pathlib import Path from typing import Literal, TypedDict -from ..message import Message from .base import ToolSpec from .screenshot import _screenshot @@ -99,7 +97,7 @@ def run_xdotool(cmd: str, display: str | None = None) -> str: def computer_action( action: Action, text: str | None = None, coordinate: tuple[int, int] | None = None -) -> Generator[Message, None, None]: +) -> None: """ Perform computer interactions through X11. 
@@ -125,7 +123,7 @@ def computer_action( else: # left_click_drag run_xdotool(f"mousedown 1 mousemove --sync {x} {y} mouseup 1", display) - yield Message("system", f"Moved mouse to {x},{y}") + print(f"Moved mouse to {x},{y}") elif action in ("key", "type"): if not text: @@ -133,14 +131,14 @@ def computer_action( if action == "key": run_xdotool(f"key -- {text}", display) - yield Message("system", f"Sent key sequence: {text}") + print(f"Sent key sequence: {text}") else: # type for chunk in chunks(text, TYPING_GROUP_SIZE): run_xdotool( f"type --delay {TYPING_DELAY_MS} -- {shlex.quote(chunk)}", display, ) - yield Message("system", f"Typed text: {text}") + print(f"Typed text: {text}") elif action in ("left_click", "right_click", "middle_click", "double_click"): click_arg = { @@ -150,7 +148,7 @@ def computer_action( "double_click": "--repeat 2 --delay 500 1", }[action] run_xdotool(f"click {click_arg}", display) - yield Message("system", f"Performed {action}") + print(f"Performed {action}") elif action == "screenshot": # Use X11-specific screenshot if available, fall back to native @@ -173,32 +171,37 @@ def computer_action( subprocess.run( f"convert {path} -resize {x}x{y}! 
{path}", shell=True, check=True ) - yield Message("system", f"Screenshot saved to {path}", files=[path]) + # TODO: yield a message with the image (same as vision tool) + print(f"Screenshot saved to {path}") # files=[path] + else: + print("Error: Screenshot failed") elif action == "cursor_position": output = run_xdotool("getmouselocation --shell", display) x = int(output.split("X=")[1].split("\n")[0]) y = int(output.split("Y=")[1].split("\n")[0]) x, y = scale_coordinates(ScalingSource.COMPUTER, x, y, width, height) - yield Message("system", f"Cursor position: X={x},Y={y}") + print(f"Cursor position: X={x},Y={y}") except Exception as e: - yield Message("system", f"Error: Computer action failed: {str(e)}") - + print(f"Error: Computer action failed: {str(e)}") + + +instructions = """ +Use this tool to interact with the computer through X11. +Available actions: +- key: Send key sequence (e.g., "Return", "Control_L+c") +- type: Type text with realistic delays +- mouse_move: Move mouse to coordinates +- left_click, right_click, middle_click, double_click: Mouse clicks +- left_click_drag: Click and drag to coordinates +- screenshot: Take a screenshot +- cursor_position: Get current mouse position +""" tool = ToolSpec( name="computer", desc="Control the computer through X11 (keyboard, mouse, screen)", - instructions=""" - Use this tool to interact with the computer through X11. 
- Available actions: - - key: Send key sequence (e.g., "Return", "Control_L+c") - - type: Type text with realistic delays - - mouse_move: Move mouse to coordinates - - left_click, right_click, middle_click, double_click: Mouse clicks - - left_click_drag: Click and drag to coordinates - - screenshot: Take a screenshot - - cursor_position: Get current mouse position - """, + instructions=instructions, functions=[computer_action], ) From 0ec7539a76f715eb3c5eb05de9d78b70e139bc02 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Sun, 27 Oct 2024 22:37:14 +0100 Subject: [PATCH 08/16] docs: fixed server docs for computer use --- docs/server.rst | 33 +-------------------------------- scripts/Dockerfile.computer | 6 +++++- 2 files changed, 6 insertions(+), 33 deletions(-) diff --git a/docs/server.rst b/docs/server.rst index 22fcbbf5..7243bcc0 100644 --- a/docs/server.rst +++ b/docs/server.rst @@ -57,35 +57,4 @@ The server provides two interfaces: .. warning:: The computer use interface is experimental and has serious security implications. - Please use with caution and consider the security considerations below. - -Security Considerations ------------------------ - -When using the computer use interface: - -1. Network Security - - The server exposes VNC and web interfaces - - Consider using SSH tunneling for remote access - - Restrict access to trusted networks/users - - Use firewall rules to limit port access - -2. Container Security - - Run with minimal privileges - - Mount only necessary directories - - Consider using a separate network namespace - - Regularly update base images and dependencies - -3. Usage Guidelines - - Start in view-only mode by default - - Require explicit user action to enable interaction - - Monitor and audit computer use sessions - - Implement timeouts for inactive sessions - -4. 
Data Protection - Don't expose sensitive information in the desktop environment - Be cautious with browser automation and credentials - Consider using ephemeral containers - Regular cleanup of session data -Please also see Anthropic's documentation on `computer use `_ for additional guidance. + Please use with caution and see Anthropic's documentation on `computer use `_ for additional guidance. diff --git a/scripts/Dockerfile.computer b/scripts/Dockerfile.computer index 8d166f42..1af53bef 100644 --- a/scripts/Dockerfile.computer +++ b/scripts/Dockerfile.computer @@ -87,7 +87,11 @@ RUN chmod +x /app/start_x11.sh WORKDIR /workspace # Expose ports -EXPOSE 5000 6080 8080 +# 5000: Chat view? +# 6080: noVNC +# 8080: Chat view? +# 8081: Chat view? +EXPOSE 5000 6080 8080 8081 # Start services CMD ["/app/start_x11.sh"] From 65f22500971144c6e2ec118f910a07869efa94d7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Sun, 27 Oct 2024 22:46:26 +0100 Subject: [PATCH 09/16] docs: refactored computer use warning into separate file --- docs/computer-use-warning.rst | 4 ++++ docs/server.rst | 5 +---- docs/tools.rst | 28 +--------------------------- 3 files changed, 6 insertions(+), 31 deletions(-) create mode 100644 docs/computer-use-warning.rst diff --git a/docs/computer-use-warning.rst b/docs/computer-use-warning.rst new file mode 100644 index 00000000..d90f162d --- /dev/null +++ b/docs/computer-use-warning.rst @@ -0,0 +1,4 @@ +.. warning:: + + The computer use interface is experimental and has serious security implications. + Please use with caution and see Anthropic's documentation on `computer use `_ for additional guidance. diff --git a/docs/server.rst b/docs/server.rst index 7243bcc0..ab745308 100644 --- a/docs/server.rst +++ b/docs/server.rst @@ -54,7 +54,4 @@ The server provides two interfaces: - Browser with WebSocket support for VNC - Network ports 5000 (API), 8080 (combined view), and 6080 (VNC) available -.. 
warning:: - - The computer use interface is experimental and has serious security implications. - Please use with caution and see Anthropic's documentation on `computer use `_ for additional guidance. +.. include:: computer-use-warning.rst diff --git a/docs/tools.rst b/docs/tools.rst index eacd917a..9e82927d 100644 --- a/docs/tools.rst +++ b/docs/tools.rst @@ -152,30 +152,4 @@ Example usage:: The tool automatically handles screen resolution scaling to ensure optimal performance with LLM vision capabilities. -Security Considerations -~~~~~~~~~~~~~~~~~~~~~ - -.. warning:: - Computer use poses unique risks beyond standard LLM interactions. To minimize risks: - - 1. Run in an isolated environment: - - Use Docker container with minimal privileges - - Consider using a dedicated virtual machine - - Limit network access where possible - - 2. Protect sensitive data: - - Don't expose login credentials or sensitive information - - Be cautious with browser automation - - Consider using a separate profile/workspace - - 3. Implement safeguards: - - Require human confirmation for consequential actions - - Use view-only mode by default in the web interface - - Monitor and log computer use actions - - 4. Be aware of prompt injection risks: - - The model may follow commands found in viewed content - - Screen content could override user instructions - - Isolate the environment from sensitive operations - - Always inform users of these risks and obtain appropriate consent before enabling computer use features. +.. 
include:: computer-use-warning.rst From ea240bb29ffc6f0c2afe4be47def9f7490f9d055 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Mon, 28 Oct 2024 11:33:37 +0100 Subject: [PATCH 10/16] fix: optimized Dockerfile.computer for faster rebuilds --- scripts/Dockerfile.computer | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/scripts/Dockerfile.computer b/scripts/Dockerfile.computer index 1af53bef..7adf6bf5 100644 --- a/scripts/Dockerfile.computer +++ b/scripts/Dockerfile.computer @@ -53,9 +53,12 @@ RUN python3.10 -m pip install --upgrade pip && \ # Set up project WORKDIR /app +COPY --chown=$USERNAME:$USERNAME pyproject.toml poetry.lock /app/ +RUN poetry config virtualenvs.create false +RUN poetry install --no-interaction --no-ansi -E server --no-root + COPY --chown=$USERNAME:$USERNAME . /app -RUN poetry config virtualenvs.create false && \ - poetry install --no-interaction --no-ansi -E server +RUN poetry install --no-interaction --no-ansi -E server # Copy desktop environment configs (at the end for faster rebuilds) RUN mkdir -p $HOME && chmod 777 $HOME From 6b44dd6a16a62ffd2867cd5274a2e87eeffdc451 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?= Date: Fri, 1 Nov 2024 11:44:45 +0100 Subject: [PATCH 11/16] fix: refactor and misc fixes to computer use --- docs/server.rst | 19 ++-- gptme/server/static/index.html | 3 +- gptme/server/static/main.js | 8 ++ gptme/tools/computer.py | 10 ++- scripts/Dockerfile.computer | 38 ++++---- scripts/computer_home/entrypoint.sh | 11 +++ scripts/computer_home/mutter_startup.sh | 22 +++++ scripts/computer_home/novnc_startup.sh | 29 ++++++ scripts/computer_home/start_all.sh | 8 ++ scripts/computer_home/tint2_startup.sh | 22 +++++ scripts/computer_home/x11vnc_startup.sh | 48 ++++++++++ scripts/computer_home/xvfb_startup.sh | 23 +++++ scripts/start_x11.sh | 114 ------------------------ 13 files changed, 213 insertions(+), 142 deletions(-) create mode 100755 
scripts/computer_home/entrypoint.sh create mode 100755 scripts/computer_home/mutter_startup.sh create mode 100755 scripts/computer_home/novnc_startup.sh create mode 100755 scripts/computer_home/start_all.sh create mode 100755 scripts/computer_home/tint2_startup.sh create mode 100755 scripts/computer_home/x11vnc_startup.sh create mode 100755 scripts/computer_home/xvfb_startup.sh delete mode 100755 scripts/start_x11.sh diff --git a/docs/server.rst b/docs/server.rst index ab745308..700b1d65 100644 --- a/docs/server.rst +++ b/docs/server.rst @@ -30,16 +30,23 @@ The server provides two interfaces: 2. Computer Use Interface + Requires Docker. + .. code-block:: bash - # Run with computer use support (requires Docker) - docker run -p 5000:5000 -p 8080:8080 -p 6080:6080 ghcr.io/erikbjare/gptme:latest-server + # Clone the repository + git clone https://github.com/ErikBjare/gptme.git + cd gptme + # Build container + make build-docker-computer + # Run container + docker run -v ~/.config/gptme:/home/computeruse/.config/gptme -p 5000:5000 -p 6080:6080 -p 8080:8080 gptme-computer:latest The computer use interface provides: - - Combined chat and desktop view at http://localhost:8080 - - Desktop-only view at http://localhost:6080/vnc.html - - Chat-only view at http://localhost:5000 + - Combined view at http://localhost:8080/computer + - Chat view at http://localhost:8080 + - Desktop view at http://localhost:6080/vnc.html Features: @@ -52,6 +59,6 @@ The server provides two interfaces: - Docker for running the server with X11 support - Browser with WebSocket support for VNC - - Network ports 5000 (API), 8080 (combined view), and 6080 (VNC) available + - Network ports 6080 (VNC) and 8080 (web UI) available .. include:: computer-use-warning.rst diff --git a/gptme/server/static/index.html b/gptme/server/static/index.html index 9e1263a3..8df9f719 100644 --- a/gptme/server/static/index.html +++ b/gptme/server/static/index.html @@ -151,8 +151,9 @@

{{ selectedConversatio