Improved Installation / Prepare Script #160

Merged: 13 commits, Apr 7, 2024

Conversation

larsll
Contributor

@larsll larsll commented Mar 31, 2024

Significantly simplified prepare script with the following improvements:

  • Works with Ubuntu 20.04 and Ubuntu 22.04
  • Works with WSL2
  • Shorter installation time as CUDA is not installed. (Only required inside of the containers.)
  • Upgraded Nvidia Drivers (525/535)
  • Uses docker.io version of docker.
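
A prepare script that targets these platforms has to tell them apart before it installs anything. The sketch below is one way to make that check and is not the code from this PR: the function names are hypothetical, it assumes the version string comes from VERSION_ID in /etc/os-release, and it assumes WSL kernels report "microsoft" in /proc/version.

```shell
# Hypothetical helpers; the names are illustrative and not taken from prepare.sh.

# Succeeds if the given Ubuntu VERSION_ID is one the PR lists as supported.
is_supported_ubuntu() {
  case "$1" in
    20.04|22.04) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if the given kernel version string looks like a WSL kernel.
# Assumption: WSL kernels include "microsoft" in /proc/version.
is_wsl_kernel() {
  printf '%s' "$1" | grep -qi 'microsoft'
}
```

On a real host the inputs would come from `$(. /etc/os-release; echo "$VERSION_ID")` and `$(cat /proc/version)`; passing them as arguments keeps the checks easy to try on any machine.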

@larsll larsll changed the title Slimmed down prepare Improved Installation / Prepare Script Apr 1, 2024
@larsll larsll requested a review from MarkRoss-Eviden April 1, 2024 13:06
@MarkRoss-Eviden
Contributor

Initial testing not successful so please don't merge @larsll : -

image

I need to do some investigation as to what's causing it. Ideas?

@larsll
Contributor Author

larsll commented Apr 1, 2024

What are you trying to do? (Platform, action etc.)

@MarkRoss-Eviden
Contributor

MarkRoss-Eviden commented Apr 2, 2024

Need to test several cases: -
1 - GPU sagemaker / CPU robomaker
2 - GPU sagemaker / GPU robomaker
3 - GPU sagemaker / CPU robomaker with OpenGL
4 - GPU sagemaker / GPU robomaker with OpenGL

@MarkRoss-Eviden
Contributor

Test case 1 works with following settings (defaults in the DOTS repo): -

image

image

@MarkRoss-Eviden
Contributor

Test case 2 works: -

image

image

@MarkRoss-Eviden
Contributor

MarkRoss-Eviden commented Apr 2, 2024

Test case 3 did not work: -
image

Trying to manually run from the created instance: -
image

I assume some of the config that currently works no longer works with your slimmed-down changes, @larsll. Relevant bits of the code that work with the current main branch: -

Creation of AMI occurs here - https://github.com/aws-deepracer-community/deepracer-on-the-spot/blob/main/scripts/image-builder.yaml which mainly deals with prepare and install. Perhaps I now need to add in some pre-reqs you've removed?

Then when the instance runs this is the bit of the code that runs when you're trying to use OpenGL (https://github.com/aws-deepracer-community/deepracer-on-the-spot/blob/main/spot-instance.yaml): -

# Setup required config if using OpenGL training
if [[ $DR_HOST_X == True ]];then
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed 's/\.//')
  sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/3bf863cc.pub
  sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/$distribution/x86_64/7fa2af80.pub
  echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
  echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda_learn.list
  sudo apt update && sudo apt install -y nvidia-driver-470-server cuda-minimal-build-11-4 --no-install-recommends -o Dpkg::Options::="--force-overwrite"
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime xserver-xorg-dev xutils-dev
  ./utils/setup-xorg.sh
  ./utils/start-xorg.sh
  sleep 15
  export DISPLAY=$DR_DISPLAY
  sudo nohup xinit /usr/bin/jwm -- /usr/lib/xorg/Xorg $DISPLAY -config $DR_DIR/tmp/xorg.conf > $DR_DIR/tmp/xorg.log 2>&1 & sleep 1
  nohup xrandr -s 1400x900
  nohup x11vnc -bg -forever -no6 -nopw -rfbport 5901 -rfbportv6 -1 -display WAIT$DISPLAY & sleep 1
  xauth generate $DISPLAY
  export XAUTHORITY=~/.Xauthority
  sudo DISPLAY=:$DISPLAY XAUTHORITY=$(ps aux | grep "X.*\-auth" | grep -v grep | sed -n 's/.*-auth \([^ ]\+\).*/\1/p') xhost +
fi
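
The snippet above sets `distribution` twice because the two repositories it adds are addressed differently: the CUDA repository path uses the version with the dot removed, while the nvidia-docker list URL uses it unchanged. A small sketch of that derivation, with hypothetical function names and with ID and VERSION_ID passed as arguments so it can be tried without reading /etc/os-release:

```shell
# Hypothetical helpers mirroring the two `distribution=` lines above.

# CUDA repo form: ID followed by VERSION_ID with the first dot removed.
cuda_repo_distribution() {
  printf '%s%s\n' "$1" "$2" | sed 's/\.//'
}

# nvidia-docker form: ID followed by VERSION_ID unchanged.
nvidia_docker_distribution() {
  printf '%s%s\n' "$1" "$2"
}
```

For example, `cuda_repo_distribution ubuntu 20.04` prints `ubuntu2004`, and `nvidia_docker_distribution ubuntu 20.04` prints `ubuntu20.04`.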

@larsll
Contributor Author

larsll commented Apr 2, 2024

Hmm, that piece of code is a bit of a mystery. In my test with GPU + OpenGL you only need to run setup-xorg.sh and start-xorg.sh; the rest seems to be pieces copied together from prepare.sh and those scripts. (There have also been changes to how Xorg starts; have a look at the updated scripts.)
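
If that holds, the OpenGL branch could shrink to something like the following. This is only a sketch of the suggestion, assuming the updated setup-xorg.sh and start-xorg.sh take care of package installation and of starting X themselves; the function name and its optional directory argument are hypothetical.

```shell
# Hypothetical wrapper: run the two Xorg helper scripts only when OpenGL
# training is enabled (DR_HOST_X=True), and stop at the first failure.
start_xorg_if_enabled() {
  utils_dir="${1:-./utils}"
  if [ "${DR_HOST_X:-False}" != "True" ]; then
    echo "DR_HOST_X is not True; skipping Xorg setup"
    return 0
  fi
  "$utils_dir/setup-xorg.sh" || { echo "setup-xorg.sh failed" >&2; return 1; }
  "$utils_dir/start-xorg.sh" || { echo "start-xorg.sh failed" >&2; return 1; }
}
```

Called as `start_xorg_if_enabled` from the deepracer-for-cloud directory, it would replace the apt and xinit lines of the existing block; whether that is sufficient on a DOTS instance still needs testing.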

@MarkRoss-Eviden
Contributor

> Hmm, that piece of code is a bit of a mystery. In my test with GPU + OpenGL you only need to run setup-xorg.sh and start-xorg.sh; the rest seems to be pieces copied together from prepare.sh and those scripts. (There have also been changes to how Xorg starts; have a look at the updated scripts.)

It's code that I added to get OpenGL to work; before I added it, OpenGL did not work. That may be because of the approach Tyler took in the original DOTS set-up: a number of steps run up front to 'bake' an AMI (this is where prepare and init are run) to speed up deployment, which means that when you deploy an instance a few things still have to run specifically on that new instance. It's part of this wider code that runs on the new instance created from the AMI: -

#!/bin/bash
source /etc/profile.d/my_custom_files.sh
aws sns publish --topic-arn $MY_SNS_TOPIC --message "Training has initiated on a new instance. The new url to monitor progress is http://$PUBLIC_IP:8100/menu.html" --region $DEEPRACER_REGION
cd ~/deepracer-for-cloud
git pull
sed -i "s/DR_UPLOAD_S3_BUCKET=not-defined/DR_UPLOAD_S3_BUCKET=$DEEPRACER_S3_URI/" ~/deepracer-for-cloud/system.env
sed -i "s/DR_LOCAL_S3_BUCKET=bucket/DR_LOCAL_S3_BUCKET=$DEEPRACER_S3_URI/" ~/deepracer-for-cloud/system.env
sed -i "s/DR_UPLOAD_S3_PREFIX=upload/DR_UPLOAD_S3_PREFIX=$DR_LOCAL_S3_MODEL_PREFIX-upload/" ~/deepracer-for-cloud/run.env
sed -i "s|DR_LOCAL_S3_CUSTOM_FILES_PREFIX=custom_files|DR_LOCAL_S3_CUSTOM_FILES_PREFIX=$CUSTOM_FILE_LOCATION|" ~/deepracer-for-cloud/run.env
source bin/activate.sh
dr-download-custom-files
cp custom_files/*.env .
dr-reload
# Setup required config if using OpenGL training
if [[ $DR_HOST_X == True ]];then
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed 's/\.//')
  sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/3bf863cc.pub
  sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/$distribution/x86_64/7fa2af80.pub
  echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
  echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda_learn.list
  sudo apt update && sudo apt install -y nvidia-driver-470-server cuda-minimal-build-11-4 --no-install-recommends -o Dpkg::Options::="--force-overwrite"
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime xserver-xorg-dev xutils-dev
  ./utils/setup-xorg.sh
  ./utils/start-xorg.sh
  sleep 15
  export DISPLAY=$DR_DISPLAY
  sudo nohup xinit /usr/bin/jwm -- /usr/lib/xorg/Xorg $DISPLAY -config $DR_DIR/tmp/xorg.conf > $DR_DIR/tmp/xorg.log 2>&1 & sleep 1
  nohup xrandr -s 1400x900
  nohup x11vnc -bg -forever -no6 -nopw -rfbport 5901 -rfbportv6 -1 -display WAIT$DISPLAY & sleep 1
  xauth generate $DISPLAY
  export XAUTHORITY=~/.Xauthority
  sudo DISPLAY=:$DISPLAY XAUTHORITY=$(ps aux | grep "X.*\-auth" | grep -v grep | sed -n 's/.*-auth \([^ ]\+\).*/\1/p') xhost +
fi
# There is a bug where at some times the training fails to start, so we start, stop and start it again to reduce the occurrences of this issue.
nohup /bin/bash -lc 'cd ~/deepracer-for-cloud/; dr-start-training -qw; sleep 120; dr-stop-training; sleep 60; echo y | docker container prune; dr-reload; dr-start-training -qwv' &
mkdir -p /tmp/logs/
# We want to be able to monitor our EC2 training without needing to connect to console, so we upload all needed info to Public_IP:8100/menu.html using this script
nohup /bin/bash -lc 'source /home/ubuntu/bin/web_monitoring.sh >/dev/null 2>&1' &
sleep 180 > /dev/null
while true; do
  # If the EC2 instance started its termination process after an interruption notification, this file exists, so we leave the final uploads to the termination process to avoid conflicts
  if [[ -f /home/ubuntu/bin/termination.started ]]; then
    break
  fi
  # Update variable references before every iteration in case the config files have changed
  source ~/deepracer-for-cloud/bin/activate.sh

  for name in $(docker ps -a --format "{{.Names}}"); do
    docker logs "${name}" > "/tmp/logs/${name}.log" 2>&1
  done
  # Only upload the best checkpoint if it has changed
  bestcheckpoint=$(echo n | dr-upload-model -b 2>&1 | grep "checkpoint:")
  aws s3 cp /tmp/logs/ s3://$DEEPRACER_S3_URI/$DR_LOCAL_S3_MODEL_PREFIX/logs/ --recursive
  rm -rf /tmp/logs/*.* > /dev/null 2>&1
  if [[ "$bestcheckpoint" != "$lastbestcheckpoint" && "$bestcheckpoint" != "" ]]; then
    # Update the file timestamp to avoid a conflict with the termination process
    touch /home/ubuntu/bin/uploading_best_model.timestamp 2>&1
    dr-upload-model -bfw > /dev/null 2>&1
    lastbestcheckpoint=$bestcheckpoint
  fi
  sleep 120
done

I'll have to do some further testing. Also, how long does the install take now, and does it still need reboots now that you've stripped it back? If it only adds a short time at startup, perhaps we could do away with the AMI approach and just run from a fresh Ubuntu image (the AMI approach was designed to reduce the time from creating an instance to training starting).
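
The upload decision in the monitoring loop above comes down to two string tests: upload only when a best checkpoint was reported and it differs from the last one uploaded. A minimal sketch of that check in isolation (the function name `should_upload_best` is hypothetical and does not exist in DOTS):

```shell
# Succeeds only when a best checkpoint is known ($1 is non-empty) and it
# differs from the previously uploaded one ($2).
should_upload_best() {
  [ -n "$1" ] && [ "$1" != "$2" ]
}
```

In the loop this would read `if should_upload_best "$bestcheckpoint" "$lastbestcheckpoint"; then ... fi`, which keeps the two conditions as separate `[ ]` tests joined by `&&`.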

@MarkRoss-Eviden
Contributor

MarkRoss-Eviden commented Apr 3, 2024

The error when using OpenGL relates to running the ./utils/setup-xorg.sh script; output below: -

Reading package lists... Done
Building dependency tree
Reading state information... Done
screen is already the newest version (4.8.0-1ubuntu0.1).
screen set to manually installed.
The following additional packages will be installed:
libmotif-common libtcl8.6 libtk8.6 libvncclient1 libvncserver1 libxcb-shape0 libxcomposite1 libxcursor1 libxdamage1 libxft2 libxi6 libxinerama1 libxm4 libxrandr2 libxss1 libxtst6 libxv1 libxxf86dga1 tcl tcl8.6 tk tk8.6 xbitmaps
Suggested packages:
menu-l10n gksu | kde-runtime | ktsuss tcl-tclreadline nickle cairo-5c xorg-docs-core xfonts-cyrillic
Recommended packages:
xserver-xorg | xserver
The following NEW packages will be installed:
libmotif-common libtcl8.6 libtk8.6 libvncclient1 libvncserver1 libxcb-shape0 libxcomposite1 libxcursor1 libxdamage1 libxft2 libxi6 libxinerama1 libxm4 libxrandr2 libxss1 libxtst6 libxv1 libxxf86dga1 menu mesa-utils mwm pkg-config tcl
tcl8.6 tk tk8.6 x11-utils x11-xserver-utils x11vnc xbitmaps xinit xserver-xorg-legacy xterm
0 upgraded, 33 newly installed, 0 to remove and 0 not upgraded.
Need to get 5833 kB of archives.
After this operation, 19.7 MB of additional disk space will be used.
Get:1 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 libmotif-common all 2.3.8-2build1 [10.8 kB]
Get:2 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxft2 amd64 2.3.3-0ubuntu1 [39.2 kB]
Get:3 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 libxm4 amd64 2.3.8-2build1 [993 kB]
Get:4 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 menu amd64 2.1.47ubuntu4 [354 kB]
Get:5 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 mwm amd64 2.3.8-2build1 [171 kB]
Get:6 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libtcl8.6 amd64 8.6.10+dfsg-1 [902 kB]
Get:7 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxss1 amd64 1:1.2.3-1 [8140 B]
Get:8 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libtk8.6 amd64 8.6.10-1 [714 kB]
Get:9 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal-updates/main amd64 libvncclient1 amd64 0.9.12+dfsg-9ubuntu0.3 [65.6 kB]
Get:10 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal-updates/main amd64 libvncserver1 amd64 0.9.12+dfsg-9ubuntu0.3 [119 kB]
Get:11 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxcb-shape0 amd64 1.14-2 [5928 B]
Get:12 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxcomposite1 amd64 1:0.4.5-1 [6976 B]
Get:13 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxcursor1 amd64 1:1.2.0-2 [20.1 kB]
Get:14 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxdamage1 amd64 1:1.1.5-2 [6996 B]
Get:15 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxi6 amd64 2:1.7.10-0ubuntu1 [29.9 kB]
Get:16 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxinerama1 amd64 2:1.1.4-2 [6904 B]
Get:17 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxrandr2 amd64 2:1.5.2-0ubuntu1 [18.5 kB]
Get:18 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxtst6 amd64 2:1.2.3-1 [12.8 kB]
Get:19 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxv1 amd64 2:1.0.11-1 [10.7 kB]
Get:20 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxxf86dga1 amd64 2:1.1.5-0ubuntu1 [12.0 kB]
Get:21 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 pkg-config amd64 0.29.1-0ubuntu4 [45.5 kB]
Get:22 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 tcl8.6 amd64 8.6.10+dfsg-1 [14.8 kB]
Get:23 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 tcl amd64 8.6.9+1 [5112 B]
Get:24 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 tk8.6 amd64 8.6.10-1 [12.5 kB]
Get:25 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 tk amd64 8.6.9+1 [3240 B]
Get:26 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 x11-utils amd64 7.7+5 [199 kB]
Get:27 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 x11-xserver-utils amd64 7.7+8 [162 kB]
Get:28 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 x11vnc amd64 0.9.16-3 [1006 kB]
Get:29 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 xbitmaps all 1.1.1-2 [28.1 kB]
Get:30 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 xinit amd64 1.4.1-0ubuntu2 [17.9 kB]
Get:31 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal-updates/universe amd64 xterm amd64 353-1ubuntu1.20.04.2 [765 kB]
Get:32 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 mesa-utils amd64 8.4.0-1build1 [34.2 kB]
Get:33 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal-updates/main amd64 xserver-xorg-legacy amd64 2:1.20.13-1ubuntu1~20.04.15 [33.5 kB]
Fetched 5833 kB in 1s (6745 kB/s)
Extracting templates from packages: 100%
Preconfiguring packages ...
Selecting previously unselected package libmotif-common.
(Reading database ... 82514 files and directories currently installed.)
Preparing to unpack .../00-libmotif-common_2.3.8-2build1_all.deb ...
Unpacking libmotif-common (2.3.8-2build1) ...
Selecting previously unselected package libxft2:amd64.
Preparing to unpack .../01-libxft2_2.3.3-0ubuntu1_amd64.deb ...
Unpacking libxft2:amd64 (2.3.3-0ubuntu1) ...
Selecting previously unselected package libxm4:amd64.
Preparing to unpack .../02-libxm4_2.3.8-2build1_amd64.deb ...
Unpacking libxm4:amd64 (2.3.8-2build1) ...
Selecting previously unselected package menu.
Preparing to unpack .../03-menu_2.1.47ubuntu4_amd64.deb ...
Unpacking menu (2.1.47ubuntu4) ...
Selecting previously unselected package mwm.
Preparing to unpack .../04-mwm_2.3.8-2build1_amd64.deb ...
Unpacking mwm (2.3.8-2build1) ...
Selecting previously unselected package libtcl8.6:amd64.
Preparing to unpack .../05-libtcl8.6_8.6.10+dfsg-1_amd64.deb ...
Unpacking libtcl8.6:amd64 (8.6.10+dfsg-1) ...
Selecting previously unselected package libxss1:amd64.
Preparing to unpack .../06-libxss1_1%3a1.2.3-1_amd64.deb ...
Unpacking libxss1:amd64 (1:1.2.3-1) ...
Selecting previously unselected package libtk8.6:amd64.
Preparing to unpack .../07-libtk8.6_8.6.10-1_amd64.deb ...
Unpacking libtk8.6:amd64 (8.6.10-1) ...
Selecting previously unselected package libvncclient1:amd64.
Preparing to unpack .../08-libvncclient1_0.9.12+dfsg-9ubuntu0.3_amd64.deb ...
Unpacking libvncclient1:amd64 (0.9.12+dfsg-9ubuntu0.3) ...
Selecting previously unselected package libvncserver1:amd64.
Preparing to unpack .../09-libvncserver1_0.9.12+dfsg-9ubuntu0.3_amd64.deb ...
Unpacking libvncserver1:amd64 (0.9.12+dfsg-9ubuntu0.3) ...
Selecting previously unselected package libxcb-shape0:amd64.
Preparing to unpack .../10-libxcb-shape0_1.14-2_amd64.deb ...
Unpacking libxcb-shape0:amd64 (1.14-2) ...
Selecting previously unselected package libxcomposite1:amd64.
Preparing to unpack .../11-libxcomposite1_1%3a0.4.5-1_amd64.deb ...
Unpacking libxcomposite1:amd64 (1:0.4.5-1) ...
Selecting previously unselected package libxcursor1:amd64.
Preparing to unpack .../12-libxcursor1_1%3a1.2.0-2_amd64.deb ...
Unpacking libxcursor1:amd64 (1:1.2.0-2) ...
Selecting previously unselected package libxdamage1:amd64.
Preparing to unpack .../13-libxdamage1_1%3a1.1.5-2_amd64.deb ...
Unpacking libxdamage1:amd64 (1:1.1.5-2) ...
Selecting previously unselected package libxi6:amd64.
Preparing to unpack .../14-libxi6_2%3a1.7.10-0ubuntu1_amd64.deb ...
Unpacking libxi6:amd64 (2:1.7.10-0ubuntu1) ...
Selecting previously unselected package libxinerama1:amd64.
Preparing to unpack .../15-libxinerama1_2%3a1.1.4-2_amd64.deb ...
Unpacking libxinerama1:amd64 (2:1.1.4-2) ...
Selecting previously unselected package libxrandr2:amd64.
Preparing to unpack .../16-libxrandr2_2%3a1.5.2-0ubuntu1_amd64.deb ...
Unpacking libxrandr2:amd64 (2:1.5.2-0ubuntu1) ...
Selecting previously unselected package libxtst6:amd64.
Preparing to unpack .../17-libxtst6_2%3a1.2.3-1_amd64.deb ...
Unpacking libxtst6:amd64 (2:1.2.3-1) ...
Selecting previously unselected package libxv1:amd64.
Preparing to unpack .../18-libxv1_2%3a1.0.11-1_amd64.deb ...
Unpacking libxv1:amd64 (2:1.0.11-1) ...
Selecting previously unselected package libxxf86dga1:amd64.
Preparing to unpack .../19-libxxf86dga1_2%3a1.1.5-0ubuntu1_amd64.deb ...
Unpacking libxxf86dga1:amd64 (2:1.1.5-0ubuntu1) ...
Selecting previously unselected package pkg-config.
Preparing to unpack .../20-pkg-config_0.29.1-0ubuntu4_amd64.deb ...
Unpacking pkg-config (0.29.1-0ubuntu4) ...
Selecting previously unselected package tcl8.6.
Preparing to unpack .../21-tcl8.6_8.6.10+dfsg-1_amd64.deb ...
Unpacking tcl8.6 (8.6.10+dfsg-1) ...
Selecting previously unselected package tcl.
Preparing to unpack .../22-tcl_8.6.9+1_amd64.deb ...
Unpacking tcl (8.6.9+1) ...
Selecting previously unselected package tk8.6.
Preparing to unpack .../23-tk8.6_8.6.10-1_amd64.deb ...
Unpacking tk8.6 (8.6.10-1) ...
Selecting previously unselected package tk.
Preparing to unpack .../24-tk_8.6.9+1_amd64.deb ...
Unpacking tk (8.6.9+1) ...
Selecting previously unselected package x11-utils.
Preparing to unpack .../25-x11-utils_7.7+5_amd64.deb ...
Unpacking x11-utils (7.7+5) ...
Selecting previously unselected package x11-xserver-utils.
Preparing to unpack .../26-x11-xserver-utils_7.7+8_amd64.deb ...
Unpacking x11-xserver-utils (7.7+8) ...
Selecting previously unselected package x11vnc.
Preparing to unpack .../27-x11vnc_0.9.16-3_amd64.deb ...
Unpacking x11vnc (0.9.16-3) ...
Selecting previously unselected package xbitmaps.
Preparing to unpack .../28-xbitmaps_1.1.1-2_all.deb ...
Unpacking xbitmaps (1.1.1-2) ...
Selecting previously unselected package xinit.
Preparing to unpack .../29-xinit_1.4.1-0ubuntu2_amd64.deb ...
Unpacking xinit (1.4.1-0ubuntu2) ...
Selecting previously unselected package xterm.
Preparing to unpack .../30-xterm_353-1ubuntu1.20.04.2_amd64.deb ...
Unpacking xterm (353-1ubuntu1.20.04.2) ...
Selecting previously unselected package mesa-utils.
Preparing to unpack .../31-mesa-utils_8.4.0-1build1_amd64.deb ...
Unpacking mesa-utils (8.4.0-1build1) ...
Selecting previously unselected package xserver-xorg-legacy.
Preparing to unpack .../32-xserver-xorg-legacy_2%3a1.20.13-1ubuntu1~20.04.15_amd64.deb ...
Unpacking xserver-xorg-legacy (2:1.20.13-1ubuntu1~20.04.15) ...
Setting up xinit (1.4.1-0ubuntu2) ...
Setting up libxft2:amd64 (2.3.3-0ubuntu1) ...
Setting up libxdamage1:amd64 (1:1.1.5-2) ...
Setting up libxi6:amd64 (2:1.7.10-0ubuntu1) ...
Setting up libxtst6:amd64 (2:1.2.3-1) ...
Setting up libxcursor1:amd64 (1:1.2.0-2) ...
Setting up libxcb-shape0:amd64 (1.14-2) ...
Setting up libxxf86dga1:amd64 (2:1.1.5-0ubuntu1) ...
Setting up libmotif-common (2.3.8-2build1) ...
Setting up libvncserver1:amd64 (0.9.12+dfsg-9ubuntu0.3) ...
Setting up libvncclient1:amd64 (0.9.12+dfsg-9ubuntu0.3) ...
Setting up mesa-utils (8.4.0-1build1) ...
Setting up libxinerama1:amd64 (2:1.1.4-2) ...
Setting up libxv1:amd64 (2:1.0.11-1) ...
Setting up libxrandr2:amd64 (2:1.5.2-0ubuntu1) ...
Setting up libtcl8.6:amd64 (8.6.10+dfsg-1) ...
Setting up pkg-config (0.29.1-0ubuntu4) ...
Setting up libxss1:amd64 (1:1.2.3-1) ...
Setting up menu (2.1.47ubuntu4) ...
Setting up libxcomposite1:amd64 (1:0.4.5-1) ...
Setting up xserver-xorg-legacy (2:1.20.13-1ubuntu1~20.04.15) ...
Setting up xbitmaps (1.1.1-2) ...
Setting up libxm4:amd64 (2.3.8-2build1) ...
Setting up tcl8.6 (8.6.10+dfsg-1) ...
Setting up libtk8.6:amd64 (8.6.10-1) ...
Setting up x11-xserver-utils (7.7+8) ...
Setting up mwm (2.3.8-2build1) ...
update-alternatives: using /usr/bin/mwm to provide /usr/bin/x-window-manager (x-window-manager) in auto mode
Setting up tcl (8.6.9+1) ...
Setting up x11-utils (7.7+5) ...
Setting up xterm (353-1ubuntu1.20.04.2) ...
update-alternatives: using /usr/bin/xterm to provide /usr/bin/x-terminal-emulator (x-terminal-emulator) in auto mode
update-alternatives: using /usr/bin/lxterm to provide /usr/bin/x-terminal-emulator (x-terminal-emulator) in auto mode
Setting up tk8.6 (8.6.10-1) ...
Setting up tk (8.6.9+1) ...
Setting up x11vnc (0.9.16-3) ...
Processing triggers for mime-support (3.64ubuntu1) ...
Processing triggers for hicolor-icon-theme (0.17-2) ...
Processing triggers for libc-bin (2.31-0ubuntu9.14) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for install-info (6.7.0.dfsg.2-5) ...
Processing triggers for menu (2.1.47ubuntu4) ...

ERROR: Unable to query GPU information

nvidia-xconfig: option "--busid" requires an argument.

Invalid commandline, please run nvidia-xconfig --help for usage information.

The BUS_ID variable in the script is not being set. Running the command that sets BUS_ID results in: -
image

The EC2 instance does have a GPU :-); I tested on a g4dn.2xlarge.

Looking through the PR I noticed this and thought it might be related to not being able to find the GPU info: -
sudo apt install -y nvidia-driver-525-server --no-install-recommends -o Dpkg::Options::="--force-overwrite"

After running that line I can now get the GPU info back: -
image

So it appears the problem is that the updated DRfC code isn't detecting the GPU on the EC2 instance correctly, and so isn't running the code that installs the NVIDIA drivers, @larsll?
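
The nvidia-xconfig failure above is a symptom of an empty BUS_ID being passed to `--busid`. One defensive option is to validate the bus ID before using it. The sketch below is not the actual setup-xorg.sh code: the function names are hypothetical, and it assumes that `nvidia-xconfig --query-gpu-info` prints a line of the form `PCI BusID : PCI:0:30:0` for each GPU.

```shell
# Read `nvidia-xconfig --query-gpu-info` output on stdin and print the first
# PCI bus ID found (for example PCI:0:30:0). Prints nothing if none is found.
extract_bus_id() {
  sed -n 's/^[[:space:]]*PCI BusID[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Fail early with a clear message instead of calling --busid with no value.
require_bus_id() {
  if [ -z "$1" ]; then
    echo "No GPU bus ID found; is the NVIDIA driver installed and loaded?" >&2
    return 1
  fi
  echo "Using GPU at $1"
}
```

A script could then run `BUS_ID=$(nvidia-xconfig --query-gpu-info | extract_bus_id)` followed by `require_bus_id "$BUS_ID" || exit 1`, so a missing driver shows up as a readable error rather than as an invalid command line.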

@MarkRoss-Eviden
Contributor

Existing DOTS setup output: -

NVidia SMI output when using GPU sagemaker and CPU robomaker (taken from nginx output as everything works) -
image

NVidia SMI output when using GPU sagemaker and CPU robomaker with OpenGL config (taken from terminal as nginx etc doesn't come up): -
image

I'll try a fresh build next

@MarkRoss-Eviden
Contributor

GPU for robomaker and sagemaker with OpenGL now works, having removed the old NVIDIA driver install and replaced it with the up-to-date one: -
image

GPU for sagemaker, CPU for robomaker with OpenGL also now works: -
image

Think we're good to go @larsll
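
The driver change described in this comment could also be made conditional, so the install only runs when the GPU is not already usable. A rough sketch under stated assumptions: the function name and the DRY_RUN switch are hypothetical, the apt line is the one quoted from the PR earlier in this thread, and a zero exit status from `nvidia-smi` is treated as "driver working".

```shell
# Hypothetical helper: install the driver only when nvidia-smi is missing or
# cannot talk to a GPU. Set DRY_RUN=1 to print the command instead of running it.
ensure_nvidia_driver() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo "NVIDIA driver already working"
    return 0
  fi
  # Install line as quoted from the PR; the package version is not verified here.
  cmd='sudo apt install -y nvidia-driver-525-server --no-install-recommends -o Dpkg::Options::="--force-overwrite"'
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"
  else
    eval "$cmd"
  fi
}
```

Running it first with `DRY_RUN=1` on a fresh instance shows whether the install would be triggered, without changing the machine.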

@larsll larsll merged commit 2daf796 into master Apr 7, 2024
@larsll larsll deleted the install-improve branch May 14, 2024 13:25