Evan Carlin edited this page May 4, 2022 · 35 revisions

Amazon Web Services (AWS)

Setup Instance

From console:

  • Launch Instance
  • AWS Marketplace
  • Search "centos"
  • CentOS 7 (x86_64) - with Updates HVM
  • Root partition: 10g, encrypted
  • Security group: public-ssh
  • Launch
  • Use existing key

Once booted, get public and private IPs:

  • Add IPs to named
    • Create a record in $net_conf_init with the internet and backnet IP addresses (the public and private IP addresses in AWS). Allocate an Elastic IP address through AWS; otherwise the public IP changes each time the instance is stopped and started.
    • Create a record in $host_conf that associates the names you created in the previous step with the desired hostnames. For example, if in the previous step you created aws_foo6i 'x.x.x.234/32', then here you'll create foo6i => ['aws_foo6i', 234],. Again, do this for both internet and backnet.
    • Finally, associate the names with a domain. Find the domain you want under the "zones" key in the NamedConf hash (e.g. bar.com). Under its $ipv4_map, add the name of the server (e.g. foo6i).
  • Set up host with rsconf_db.components: [ docker ]
  • bash /srv/rsconf/aws-init.sh <ip>
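
The public and private IPs needed for the named records can also be read from inside the instance. The following is a minimal sketch, assuming the EC2 instance metadata service is reachable with plain (IMDSv1-style) requests; the function name `aws_instance_ips` is hypothetical and not part of rsconf.

```shell
# Hypothetical helper (not part of rsconf): print the instance's public and
# private IPv4 addresses as reported by the EC2 instance metadata service.
# Assumes IMDSv1-style requests are allowed; IMDSv2-only instances need a token.
aws_instance_ips() {
    local base=${1:-http://169.254.169.254/latest/meta-data}
    local public private
    # -f makes curl fail on HTTP errors, e.g. when no public IP is assigned
    public=$(curl -s -f "$base/public-ipv4") || return 1
    private=$(curl -s -f "$base/local-ipv4") || return 1
    echo "public=$public private=$private"
}
```

Run `aws_instance_ips` on the new instance; the public value is the internet address and the private value is the backnet address.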

Proceed with post installation instructions.

GPU Driver install

# the driver build needs kernel-devel, which always matches the latest kernel, so update first
yum update -y
yum install -y kernel-devel
# if new kernel, then
reboot
yum remove $(rpm -qa | grep ^kernel-3 | grep -v $(uname -r))
yum install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.1.243-1.x86_64.rpm
# needs to be installed manually:
# http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/yum-plugin-nvidia-0.5-1.el7.noarch.rpm: [Errno -1] Package does not match intended download.
yum install -y http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/yum-plugin-nvidia-0.5-1.el7.noarch.rpm
yum install -y cuda-drivers
reboot
nvidia-smi
curl -s -L https://nvidia.github.io/nvidia-container-runtime/centos7/nvidia-container-runtime.repo \
    | install -m 444 /dev/stdin /etc/yum.repos.d/nvidia-container-runtime.repo
yum install -y nvidia-container-runtime
systemctl restart docker
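
The "if new kernel, then reboot" step above can be checked rather than eyeballed. This is a sketch only: the function name `kernel_needs_reboot` is hypothetical, and it assumes kernel packages are named `kernel-<release>` where `<release>` is what `uname -r` prints, as on CentOS 7.

```shell
# Hypothetical helper: succeed (exit 0) when the newest installed kernel
# package differs from the running kernel, i.e. a reboot is needed.
#   $1: running kernel release, as printed by `uname -r`
#   $2: newest installed kernel package name, e.g. from `rpm -q --last kernel`
kernel_needs_reboot() {
    local running=$1 newest_pkg=$2
    [ "kernel-$running" != "$newest_pkg" ]
}
```

One way to call it: `kernel_needs_reboot "$(uname -r)" "$(rpm -q --last kernel | head -n 1 | awk '{print $1}')" && reboot`. `rpm -q --last` lists the most recently installed package first.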

For CUDA 11.4 (the version IndeX currently requires), the installer comes from https://developer.nvidia.com/cuda-11-4-4-download-archive?target_os=Linux&target_arch=x86_64&Distribution=CentOS&target_version=7&target_type=rpm_local

# the driver build needs kernel-devel, which always matches the latest kernel, so update first
yum update -y
yum install -y kernel-devel
# if new kernel, then
reboot
yum remove $(rpm -qa | grep ^kernel-3 | grep -v $(uname -r))
wget https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda-repo-rhel7-11-4-local-11.4.4_470.82.01-1.x86_64.rpm
rpm -i cuda-repo-rhel7-11-4-local-11.4.4_470.82.01-1.x86_64.rpm
yum clean all
yum install -y nvidia-driver-latest-dkms cuda
yum -y install cuda-drivers
reboot
nvidia-smi # verify it says CUDA 11.4
# Now set up nvidia-container-toolkit. It is confusing which packages are needed:
# nvidia-docker2, nvidia-container-toolkit, and/or nvidia-container-runtime.
# This comment clears up the confusion: https://github.com/NVIDIA/nvidia-docker/issues/1268#issuecomment-632692949
# We just need nvidia-container-toolkit since we don't use Kubernetes and have Docker > 19.03
curl -s -L https://nvidia.github.io/libnvidia-container/centos7/libnvidia-container.repo \
    | install -m 444 /dev/stdin /etc/yum.repos.d/nvidia-container-toolkit.repo
yum clean expire-cache
yum install -y nvidia-container-toolkit
systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.4.0-base nvidia-smi
docker image rm nvidia/cuda:11.4.0-base
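
The "verify it says CUDA 11.4" check can be scripted. This is a minimal sketch, assuming the `nvidia-smi` banner contains a `CUDA Version: <x.y>` field; the function name `cuda_version_from_smi` is hypothetical.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the CUDA
# version from the banner line (the "CUDA Version: 11.4" field).
# Prints nothing if the field is absent.
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

For example, `[ "$(nvidia-smi | cuda_version_from_smi)" = 11.4 ] || echo "unexpected CUDA version"` flags a mismatch after the reboot.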

Verify

docker run -it --gpus=all --net=host --rm tensorflow/tensorflow:latest-gpu python -c 'import tensorflow; print(tensorflow.config.experimental.list_physical_devices("GPU"))'
<snip>
Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
<snip>

Verify ssh key fingerprint

The fingerprint is the MD5 digest of the DER-encoded public key. Export the public key from the ssh private key file in PEM format, convert it to DER, and compute the MD5:

ssh-keygen -e -m PEM -f ~/.ssh/id_rsa | openssl rsa -RSAPublicKey_in -outform DER | openssl md5 -c
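
To compare the result against a fingerprint you already have, the output of the pipeline above can be fed to a small check. This is a sketch; the function name `fingerprint_matches` is hypothetical, and it assumes `openssl md5 -c` prints the digest after a `(stdin)= ` or `MD5(stdin)= ` prefix, which varies by OpenSSL version.

```shell
# Hypothetical helper: read `openssl md5 -c` output on stdin and succeed
# (exit 0) if the colon-separated digest equals the expected fingerprint.
#   $1: expected fingerprint, lowercase, e.g. aa:bb:cc:...
fingerprint_matches() {
    local expected=$1 actual
    # Drop everything up to "= " (the "(stdin)= " or "MD5(stdin)= " prefix)
    # and normalize hex digits to lowercase.
    actual=$(sed 's/^.*= *//' | tr 'A-F' 'a-f')
    [ "$actual" = "$expected" ]
}
```

Append `| fingerprint_matches "<expected fingerprint>"` to the command above and check the exit status.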