From 9b4d02964099ce0daf35ccb522ec470a0ce133b7 Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Tue, 9 Jul 2024 20:40:52 +0200 Subject: [PATCH 01/20] Update csub.py @supportrcp --- csub.py | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/csub.py b/csub.py index dc6c746..1449ab4 100644 --- a/csub.py +++ b/csub.py @@ -169,8 +169,6 @@ apiVersion: run.ai/v2alpha1 kind: {workload_kind} metadata: - annotations: - runai-cli-version: 2.9.25 labels: PreviousJob: "true" name: {args.name} @@ -221,7 +219,7 @@ items: pvc--0: value: - claimName: runai-mlo-{user_cfg['user']}-scratch + claimName: mlo-scratch existingPvc: true path: /mloscratch readOnly: false From 9a503412d594592b91e6b85cc4713e14030ff2cb Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Tue, 9 Jul 2024 21:20:25 +0200 Subject: [PATCH 02/20] Update README.md @updated cluster --- README.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index e04504b..afea790 100644 --- a/README.md +++ b/README.md @@ -77,19 +77,23 @@ The following are just a bunch of commands you need to run to get started. If yo # Sketch for macOS with Apple Silicon. # Download a specific version (here 1.26.7 for Apple Silicon macOS) curl -LO "https://dl.k8s.io/release/v1.26.7/bin/darwin/arm64/kubectl" + # curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" #Linux + # Give it the right permissions and move it. chmod +x ./kubectl sudo mv ./kubectl /usr/local/bin/kubectl sudo chown root: /usr/local/bin/kubectl ``` -2. Setup the kube config file: Create a file in your home directory as ``~/.kube/config`` and copy the contents from the file [`kubeconfig.yaml`](kubeconfig.yaml) in this file. Note that the file on your machine has no suffix. +2. Setup the kube config file: Create a file in your home directory as ``~/.kube/config`` and copy the contents from the file [`kubeconfig.yaml`](kubeconfig.yaml) in this file. Note that the file on your machine has no suffix. For the updated cluster https://wiki.rcp.epfl.ch/home/CaaS/how-to-switch-between-rcp-caas-cluster-and-ic-caas-cluster 3. Install the run:ai CLI: ```bash # Sketch for macOS with Apple Silicon # Download the CLI from the link shown in the help section. wget --content-disposition https://rcp-caas-test.rcp.epfl.ch/cli/darwin + # wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/linux #Linux + # Give it the right permissions and move it. chmod +x ./runai sudo mv ./runai /usr/local/bin/runai @@ -98,6 +102,7 @@ The following are just a bunch of commands you need to run to get started. If yo ## 3: Login 4. Switch between contexts and login to both clusters. + Old ```bash # Switch to the IC cluster runai config cluster ic-context @@ -113,7 +118,8 @@ The following are just a bunch of commands you need to run to get started. If yo runai list projects runai config project mlo-$GASPAR_USERNAME ``` -5. Run a quick test to see that you can launch jobs: + For the updated cluster use `ic-caas` and `rcp-caas-prod` +6. Run a quick test to see that you can launch jobs: ```bash # Try to submit a job that mounts our shared storage and see its content. 
runai submit \ From 4aa02d65377db1ebef1f79e11018d385f7e4bccb Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Wed, 10 Jul 2024 13:17:57 +0200 Subject: [PATCH 03/20] Detailed Kube config link --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index afea790..c3632c3 100644 --- a/README.md +++ b/README.md @@ -85,7 +85,7 @@ The following are just a bunch of commands you need to run to get started. If yo sudo chown root: /usr/local/bin/kubectl ``` -2. Setup the kube config file: Create a file in your home directory as ``~/.kube/config`` and copy the contents from the file [`kubeconfig.yaml`](kubeconfig.yaml) in this file. Note that the file on your machine has no suffix. For the updated cluster https://wiki.rcp.epfl.ch/home/CaaS/how-to-switch-between-rcp-caas-cluster-and-ic-caas-cluster +2. Setup the kube config file: Create a file in your home directory as ``~/.kube/config`` and copy the contents from the file [`kubeconfig.yaml`](kubeconfig.yaml) in this file. Note that the file on your machine has no suffix. For the updated cluster use the config file at https://wiki.rcp.epfl.ch/home/CaaS/how-to-switch-between-rcp-caas-cluster-and-ic-caas-cluster 3. Install the run:ai CLI: ```bash From 72d7bed9ef8f742ac602855aa50e8bfb22db2e3e Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Thu, 11 Jul 2024 16:18:55 +0200 Subject: [PATCH 04/20] Update kubeconfig.yaml for ic-caas and rcp-caas-prod @ https://wiki.rcp.epfl.ch/home/CaaS/how-to-switch-between-rcp-caas-cluster-and-ic-caas-cluster https://inside.epfl.ch/ic-it-docs/ic-cluster/caas/connecting --- kubeconfig.yaml | 86 ++++++++++++++++++++++++++----------------------- 1 file changed, 45 insertions(+), 41 deletions(-) diff --git a/kubeconfig.yaml b/kubeconfig.yaml index 155b577..62487ce 100644 --- a/kubeconfig.yaml +++ b/kubeconfig.yaml @@ -1,45 +1,49 @@ apiVersion: v1 -clusters: - - cluster: - certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJRUQxZmRDUWZIdGt3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TXpBNU1ESXhNakl6TURoYUZ3MHpNekE0TXpBeE1qSXpNRGhhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUUQxSHBDVU90K0FtUXo0UUR2Z2JBcW9SNXJEVzBhMkdFTDQzT0hFZk9oSVg3NUxlcVoxbEdRclVWNHcKeGNYcno0dzdGbUt2UEp3c3F2VjVLSjY3a1pkbnRUa3dybVM4K0hBNmVRejJYMGJ4WURuL0lKbmFkUnBtcVkxawo4S2t4d3RSLzNKbmR4a29yL0NhdnJHQzR4a0Z2TUxLT0pkUHZtV0MxdG9NUEszOU5kRTF4OVo3K1lycVBSYjRmCnp6K2ErMmFQY2kyNGhKcmsySm8xV2NVN2Y3Z29mRGNKY0lwNGJUTUVGUHMxaS95WkQycDY4RVlTZzhhUjZvVzIKT011WE1mREtkMk9PRWVxdTR0MnhScUl3SlhuTFJNdkJKRkw5NmMwSDcyVGEwRXdnOHVudmhrZmFZR2NzYTV1TwpVTzRWZFdmYUtCVmVsYytpcC91MjVxZXNvTnp6QWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJUNmVqdEs5T01zR0pXclIvU1I1QnlOVHZOejhEQVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQkoxa044bCtuMwpnV1V0RXlMb2ZJWjVDZG1kOGJXNXpZMkRTeWZpYzhNcXlOQUhPd3N5anp1T21iNndvZlErS1pGMWR1SnpUVnF1CmxXSFBQdE1hc2l5U3JSbEJRSEtXV2IzdVJNeG8zeG1SWExvRW1kSmQ1S1F0SDYvUnNSS0pId29KVVNOWmVITFYKS1R4QTcwWEtRakVUYkROenQ4c0JUUkxVM1lMV0JnTE96RVRzSm5DeVV0ZGZsVm82U3Qwemc3NmJGV3pxMXhqMQora3o0Mzhhd2paV1Z2THNBU0dRNHFkT0hvM1NscUgrUnlJb3U5bGFvOVdKOTEzYi90QWxyQ3lOdzVVTURZWHFWCllCQk9PLzdkeS9BRDdxNDFyTk9rUFFHclQ4MkswUjg3ZWgzeXJXSG9FcSt0N2pvdktrdlFvL2IyT3VhZTV3YnEKZmtRczZpbG92bHptCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K - server: 
https://icadmin011.iccluster.epfl.ch:6443 - name: ic-cluster - - cluster: - certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJek1EUXlOakE0TVRRME5sb1hEVE16TURReU16QTRNVFEwTmxvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTFAxCmtTZ2E4NWRWU0p0VUxGQ1g5VWo1K1lTT2dCbG9MZGVxZVgrM1ByVGtQZkptWFBxeXlsVVBLN0tJUWlvSUplNm8KRTBaS2JZbU03SnEvL0lPaHF4R0VraUNrTHJCamJrYXF5M3NibkNhWGFMa1pQYkhNWjgwdmlMMGNFZHNJTWN4WgozdHpMTzFNTldwZW9mZlJ6L1NvbXpqSTVDQldJbUptTmhvZXpJQUVNOGJuaDJKeFBFNzRwWThTS1BTRk5YVzN0CjgxNmM5cXRvc1lJQjVrTnh1UjRGWVh5bGloZHZ3UmVqVW9wajA2ME1rSkl3QmpXM01YTFUrdkVyandKeFc5Q1cKZ2plUndzOG5kdW5VVHREcy9CVjhGbW5JZy81VVNhZTBzUE5FQWxvZC9TbGhrMnNuWTJvUXZlTHpFNkhrMnluRgpHNXd1VGVXRDZGY2Erd1pNMjM4Q0F3RUFBYU5aTUZjd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZNVVhkVWVnK2xMdTlHWElMQ2VlOVJzOENmUXpNQlVHQTFVZEVRUU8KTUF5Q0NtdDFZbVZ5Ym1WMFpYTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBR051a2ZUR3E0RTlrckkreVZQbApaem1reSszaUNTMnYvTU9OU3h0S01idWZ2V0ROZFM3QzZaK1RDQTJSd0c1Y2gzZUh5UW9oTSs0K2wrSTJxMTFwCjNJVGRxYVI4RDhpQkFCbXV6Yzl2a3BKanZTTzZ4VVpnTFJZMHRDTUxXZ3g2b2tBcWhxZDV3YTZIYmN6Z1QrSUcKQlVGbERtR0R4K0MxTnFIYVFKUVN1bENqL1ZyS1RROVFlY1NoZGZqVDgvS1NVUjQ4VTlEdlA3dnU0YkRnWW5DKwpoOXEwUlFpUGR4TEtlL2Q5aGd0UnM5TjFQdGRYZXAxdHB3NCs3Y3N4TE1DSXNmYTBwaW8yb3lEems0bTNjSWRNCi9iNElHUEZaM2hYZktOVGtybnUrWmdCUms5Yjk3emNKZVdhendxTXUyd1dkV2JiQjdpaU5ZK2xtWkl1S0dUeFQKWWpRPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg== - server: https://caas-test.rcp.epfl.ch:443 - name: rcp-cluster -contexts: - - context: - cluster: ic-cluster - user: ic-runai-authenticated-user - name: ic-context - - context: - cluster: rcp-cluster - user: rcp-runai-authenticated-user - name: rcp-context -current-context: ic-context kind: Config preferences: {} +clusters: +# Cluster RCP Prod +- name: caas-prod.rcp.epfl.ch + cluster: + server: https://caas-prod.rcp.epfl.ch:443 + certificate-authority-data: 
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUR0ekNDQXArZ0F3SUJBZ0lJRWZYM1lVZWJUd1V3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRBMk1UUXdPVFU1TWpkYUZ3MHlOVEEyTVRReE1EQTFNVFJhTUJreApGekFWQmdOVkJBTVREbXQxWW1VdFlYQnBjMlZ5ZG1WeU1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBCk1JSUJDZ0tDQVFFQTY1VnYyWnRsR0J3YW1hOHdRSzFTTlRSUXhRY2JRUXlWMEQrU2hOdDIzWHN5NTVRR2VaOGIKVVNjL1lBaDMrWm5tVnRjeFd4M1VHZGxzWjlKaXJ0NFE5VnF5ek0razVGM3pGenhzY2xST1pra0F6dHpCNE1MTwpHUmNvZnlESDlJMzZkK096MlBXamdTNXJLODdheEkrUmpBb1pZOXNoNGM5a1VDUEFVYkFFRitydlI1cDRZRDAvClBtQnhWNVFqcDAxaGlaV0dMSndnMnhKeTdTak5hbzdkTUova2pUWjQyaG1TcDAzWW9JQ0VwaFR4Tk9RUmJ3bi8KVTByTHQwT1BnL25zTHRrTVFWK3VRNW9kdHJoMGJPWWFJa3YzdUJzU1laMjY3bnVnUHVDeDhJclF1ZHVKTGFYKwpEV2ZVb3pCc1Q5QXJyUzRxc1JpZFV1YVFUcWZVdklzWjVRSURBUUFCbzRJQkJUQ0NBUUV3RGdZRFZSMFBBUUgvCkJBUURBZ1dnTUJNR0ExVWRKUVFNTUFvR0NDc0dBUVVGQndNQk1Bd0dBMVVkRXdFQi93UUNNQUF3SHdZRFZSMGoKQkJnd0ZvQVVaT3dkcVBKTU5iUHJlLzJJWlgxbDd0cmN6Umd3Z2FvR0ExVWRFUVNCb2pDQm40SVVZV1J0YVc0dwpNRGN1Y21Od0xtVndabXd1WTJpQ0ZXTmhZWE10Y0hKdlpDNXlZM0F1WlhCbWJDNWphSUlLYTNWaVpYSnVaWFJsCmM0SVNhM1ZpWlhKdVpYUmxjeTVrWldaaGRXeDBnaFpyZFdKbGNtNWxkR1Z6TG1SbFptRjFiSFF1YzNaamdpeHIKZFdKbGNtNWxkR1Z6TG1SbFptRjFiSFF1YzNaakxtTmhZWE10Y0hKdlpDNXlZM0F1WlhCbWJDNWphSWNFckJJQQpBWWNFQ2x4RUREQU5CZ2txaGtpRzl3MEJBUXNGQUFPQ0FRRUF4c09Td2d1SVRlVEtkdEpwNXZWR2VHUmVzOEY5Cnc4T2Y3aTBNNDBnNDF6Y0VnV3pyaktwZzhIZC9TNG5Wb0cxL2d4dDAvWW52OG1vbDkrajFBa3NqdSt4NVZtRnIKYmN0UDlDMW95YTV4d3lDaWhrVk9DMk1CRElRcnNCeHlnQ0dBanh4R1Bod1V0SE8yODhTNkViSUppcWx5T0o0MAo0UEpVWVMxaTZrNUR6K01GK0NxK1NOSkQwZ3hSNUFjcXRUVy9HT3RBZE5tRERmS21adm1zOWtMR0FKcHhaRGF5CndKckpabWhEMG9KSWVNQm9wbHhYb0ZTZm0yVEZOMWVlSTVkTWJFZzArY0VHM0JENDV2dWpkbFBJRGl4aEkycVoKQktsdDl5am1SWEtwbjlrN3J2ZHpSdkYyS2N3Q2l4Ryt5ZDBiS2F1OVFlWkE3Q09xcFo4cVFwcFY0UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K +# Cluster IC +- name: ic-caas + cluster: + server: https://ic-caas.epfl.ch:6443 + certificate-authority-data: 
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUdLekNDQlJPZ0F3SUJBZ0lRQWVhbjJ4T1BQcXdyQ3RUcWZ0VUFiREFOQmdrcWhraUc5dzBCQVFzRkFEQloKTVFzd0NRWURWUVFHRXdKVlV6RVZNQk1HQTFVRUNoTU1SR2xuYVVObGNuUWdTVzVqTVRNd01RWURWUVFERXlwRQphV2RwUTJWeWRDQkhiRzlpWVd3Z1J6SWdWRXhUSUZKVFFTQlRTRUV5TlRZZ01qQXlNQ0JEUVRFd0hoY05NalF3Ck56QTBNREF3TURBd1doY05NalF4TWpNd01qTTFPVFU1V2pCd01Rc3dDUVlEVlFRR0V3SkRTREVSTUE4R0ExVUUKQnhNSVRHRjFjMkZ1Ym1VeE5EQXlCZ05WQkFvTUs4T0pZMjlzWlNCd2IyeDVkR1ZqYUc1cGNYVmxJR2JEcVdURApxWEpoYkdVZ1pHVWdUR0YxYzJGdWJtVXhHREFXQmdOVkJBTVREMmxqTFdOaFlYTXVaWEJtYkM1amFEQlpNQk1HCkJ5cUdTTTQ5QWdFR0NDcUdTTTQ5QXdFSEEwSUFCSDdGbmd3cEdBRjJkbkh4d0JCNkRmNUpkZ2J0MERKanJlWDkKYml4d0s4emRaZWIwMzgwZkRrMVFUSDRJMVJjNmRZZ1F0Q2FZNXhLdkxLWlZJRUNQdmoramdnT2hNSUlEblRBZgpCZ05WSFNNRUdEQVdnQlIwaFlEQVpzZmZOOTdQdlNrM3FnTWR2dTNORnpBZEJnTlZIUTRFRmdRVXZkdGg5NGpwCmd2VTJnaGN0OTR1UXBjdTNIUFl3TVFZRFZSMFJCQ293S0lJUGFXTXRZMkZoY3k1bGNHWnNMbU5vZ2hWcFkzWnQKTURFNE9TNTRZV0Z6TG1Wd1ptd3VZMmd3UGdZRFZSMGdCRGN3TlRBekJnWm5nUXdCQWdJd0tUQW5CZ2dyQmdFRgpCUWNDQVJZYmFIUjBjRG92TDNkM2R5NWthV2RwWTJWeWRDNWpiMjB2UTFCVE1BNEdBMVVkRHdFQi93UUVBd0lECmlEQWRCZ05WSFNVRUZqQVVCZ2dyQmdFRkJRY0RBUVlJS3dZQkJRVUhBd0l3Z1o4R0ExVWRId1NCbHpDQmxEQkkKb0VhZ1JJWkNhSFIwY0RvdkwyTnliRE11WkdsbmFXTmxjblF1WTI5dEwwUnBaMmxEWlhKMFIyeHZZbUZzUnpKVQpURk5TVTBGVFNFRXlOVFl5TURJd1EwRXhMVEV1WTNKc01FaWdScUJFaGtKb2RIUndPaTh2WTNKc05DNWthV2RwClkyVnlkQzVqYjIwdlJHbG5hVU5sY25SSGJHOWlZV3hITWxSTVUxSlRRVk5JUVRJMU5qSXdNakJEUVRFdE1TNWoKY213d2dZY0dDQ3NHQVFVRkJ3RUJCSHN3ZVRBa0JnZ3JCZ0VGQlFjd0FZWVlhSFIwY0RvdkwyOWpjM0F1WkdsbgphV05sY25RdVkyOXRNRkVHQ0NzR0FRVUZCekFDaGtWb2RIUndPaTh2WTJGalpYSjBjeTVrYVdkcFkyVnlkQzVqCmIyMHZSR2xuYVVObGNuUkhiRzlpWVd4SE1sUk1VMUpUUVZOSVFUSTFOakl3TWpCRFFURXRNUzVqY25Rd0RBWUQKVlIwVEFRSC9CQUl3QURDQ0FYMEdDaXNHQVFRQjFua0NCQUlFZ2dGdEJJSUJhUUZuQUhZQWR2K0lQd3EyKzVWUgp3bUhNOVllNk5MU2t6YnNwM0doQ0NwL21aMHhhT25RQUFBR1FmUHozcGdBQUJBTUFSekJGQWlFQTBuakJGbEtvCkJQNGxZSDFFTTJYYTE2czUxdW9GL3UxQloyaG9WQ1creUpvQ0lHMzB6NkpRUWlDTG5SZTZRQ1N2cklGc0J2YXEKTzJJUHNKMmhMRmlXU1VXOUFIWUEycmEvYXorMXRpS2ZtOEs3WEd2b2NKRnhiTHRSaElVMHZhUTlNRWpYKzZzQQpBQUdRZlB6M2FnQUFCQU1BUnpCRkFpRUEzVDI1bVlWZ3NNcXNMN1hVK0hBeWV3VWVkV1l3cnJSZEM1YmNsWjR0CjVFTUNJQW9uN1VKSlFlOEFJZk1ndWhTUDNHdm5GOWFMUjFWNWI3bE5vYnhEcURqWEFIVUFQeGRMVDljaVIxaVUKSFdVY2hMNE5FdTJRTjM4ZmhXcnJ3YjhvaGV6NFpHNEFBQUdRZlB6NEJnQUFCQU1BUmpCRUFpQTN5WXZUMGNWYQpyMDZZQndsRXZ2cFZ5bmhWSVlCaE9RNU1OT05tMjkzZ2NBSWdBU1JzSTFybEY4VWNoMXI5b2hIZzBLT1VEL1liCjlBUTZvNWFKZWl5dTYzd3dEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBSjVIMVFaSUxRdDZxeUlNb2xhNkgrZ0gKN0JTYU9OeGlySFBETEZ3b0RtN2tIRU1NTnViTW8rdFlSYThZSmJiWVhOUjErUFFpRGNNOXVZMnFjWXhkdHZLNApMMDhEdjlWUC8vY2VWeEkrNmRPRUlQNXRRS01pZE51L09KTU9KWW1yZnEwY1Uzb1RPQlZCYVBBWWN3empzTHdGCkh3QUgyR0RDM1NzdXdUZjVwRWdTS3FGbG9vdVhrWlF0UkhIVytrM3YzcGhXeE1zZEY5R3JMUGJkcFhSWnFyTXUKdXJpS2FNRUc3TGVWcGQ2czNKT1JCTmRYcEwxZXA4eFdYKzVQdEZxUDRzR0trc3VuTjh6SU5tR2oyRVd0V3ZzeQpYQm5iVjhnZTZPeStRRDM3Zk5sZUJFdTByUnVyZjBHNEFGTFlsVTVqNk85cnVLL0NZYlJSM0NncGZlZEM0bE09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K users: - - name: ic-runai-authenticated-user - user: - auth-provider: - config: - airgapped: "true" - auth-flow: remote-browser - client-id: runai-cli-sso - idp-issuer-url: https://app.run.ai/auth/realms/EPFL - realm: EPFL - redirect-uri: https://epfl.run.ai/oauth-code - name: oidc - - name: rcp-runai-authenticated-user - user: - auth-provider: - config: - airgapped: "true" - auth-flow: remote-browser - client-id: runai-cli - idp-issuer-url: https://app.run.ai/auth/realms/rcpepfl - realm: rcpepfl - redirect-uri: https://rcpepfl.run.ai/oauth-code - name: oidc +# Authenticated user on RunAI RCP +- name: 
runai-rcp-authenticated-user + user: + auth-provider: + name: oidc + config: + airgapped: "true" + auth-flow: remote-browser + client-id: runai-cli + idp-issuer-url: https://app.run.ai/auth/realms/rcpepfl + realm: rcpepfl + redirect-uri: https://rcpepfl.run.ai/oauth-code +# Authenticated user on RunAI IC +- name: runai-authenticated-user + user: + auth-provider: + config: + airgapped: "true" + auth-flow: remote-browser + client-id: runai-cli + idp-issuer-url: https://app.run.ai/auth/realms/epfl + realm: epfl + redirect-uri: https://epfl.run.ai/oauth-code + name: oidc +contexts: +# Contexts (a context a cluster associated with a user) +- name: rcp-caas-prod + context: + cluster: caas-prod.rcp.epfl.ch + user: runai-rcp-authenticated-user +- name: ic-caas + context: + cluster: ic-caas + user: runai-authenticated-user From d96be69299f0a6b6d0fde8f1337e71050ec8afb0 Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Tue, 23 Jul 2024 18:16:24 +0200 Subject: [PATCH 05/20] Update kubeconfig.yaml --- kubeconfig.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kubeconfig.yaml b/kubeconfig.yaml index 62487ce..7c97bbf 100644 --- a/kubeconfig.yaml +++ b/kubeconfig.yaml @@ -1,4 +1,5 @@ apiVersion: v1 +current-context: rcp-caas-prod kind: Config preferences: {} clusters: @@ -6,7 +7,7 @@ clusters: - name: caas-prod.rcp.epfl.ch cluster: server: https://caas-prod.rcp.epfl.ch:443 - certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUR0ekNDQXArZ0F3SUJBZ0lJRWZYM1lVZWJUd1V3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRBMk1UUXdPVFU1TWpkYUZ3MHlOVEEyTVRReE1EQTFNVFJhTUJreApGekFWQmdOVkJBTVREbXQxWW1VdFlYQnBjMlZ5ZG1WeU1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBCk1JSUJDZ0tDQVFFQTY1VnYyWnRsR0J3YW1hOHdRSzFTTlRSUXhRY2JRUXlWMEQrU2hOdDIzWHN5NTVRR2VaOGIKVVNjL1lBaDMrWm5tVnRjeFd4M1VHZGxzWjlKaXJ0NFE5VnF5ek0razVGM3pGenhzY2xST1pra0F6dHpCNE1MTwpHUmNvZnlESDlJMzZkK096MlBXamdTNXJLODdheEkrUmpBb1pZOXNoNGM5a1VDUEFVYkFFRitydlI1cDRZRDAvClBtQnhWNVFqcDAxaGlaV0dMSndnMnhKeTdTak5hbzdkTUova2pUWjQyaG1TcDAzWW9JQ0VwaFR4Tk9RUmJ3bi8KVTByTHQwT1BnL25zTHRrTVFWK3VRNW9kdHJoMGJPWWFJa3YzdUJzU1laMjY3bnVnUHVDeDhJclF1ZHVKTGFYKwpEV2ZVb3pCc1Q5QXJyUzRxc1JpZFV1YVFUcWZVdklzWjVRSURBUUFCbzRJQkJUQ0NBUUV3RGdZRFZSMFBBUUgvCkJBUURBZ1dnTUJNR0ExVWRKUVFNTUFvR0NDc0dBUVVGQndNQk1Bd0dBMVVkRXdFQi93UUNNQUF3SHdZRFZSMGoKQkJnd0ZvQVVaT3dkcVBKTU5iUHJlLzJJWlgxbDd0cmN6Umd3Z2FvR0ExVWRFUVNCb2pDQm40SVVZV1J0YVc0dwpNRGN1Y21Od0xtVndabXd1WTJpQ0ZXTmhZWE10Y0hKdlpDNXlZM0F1WlhCbWJDNWphSUlLYTNWaVpYSnVaWFJsCmM0SVNhM1ZpWlhKdVpYUmxjeTVrWldaaGRXeDBnaFpyZFdKbGNtNWxkR1Z6TG1SbFptRjFiSFF1YzNaamdpeHIKZFdKbGNtNWxkR1Z6TG1SbFptRjFiSFF1YzNaakxtTmhZWE10Y0hKdlpDNXlZM0F1WlhCbWJDNWphSWNFckJJQQpBWWNFQ2x4RUREQU5CZ2txaGtpRzl3MEJBUXNGQUFPQ0FRRUF4c09Td2d1SVRlVEtkdEpwNXZWR2VHUmVzOEY5Cnc4T2Y3aTBNNDBnNDF6Y0VnV3pyaktwZzhIZC9TNG5Wb0cxL2d4dDAvWW52OG1vbDkrajFBa3NqdSt4NVZtRnIKYmN0UDlDMW95YTV4d3lDaWhrVk9DMk1CRElRcnNCeHlnQ0dBanh4R1Bod1V0SE8yODhTNkViSUppcWx5T0o0MAo0UEpVWVMxaTZrNUR6K01GK0NxK1NOSkQwZ3hSNUFjcXRUVy9HT3RBZE5tRERmS21adm1zOWtMR0FKcHhaRGF5CndKckpabWhEMG9KSWVNQm9wbHhYb0ZTZm0yVEZOMWVlSTVkTWJFZzArY0VHM0JENDV2dWpkbFBJRGl4aEkycVoKQktsdDl5am1SWEtwbjlrN3J2ZHpSdkYyS2N3Q2l4Ryt5ZDBiS2F1OVFlWkE3Q09xcFo4cVFwcFY0UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K + certificate-authority-data: 
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJRHdwSElpTmQrVUV3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRBMk1UUXdPVFU1TWpkYUZ3MHpOREEyTVRJeE1EQTBNamRhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUURvOHJDRjNjeXdRRTlxTVpEOHNGTXo2K0FzSEpnWi81WVNwMGNhWHNKd0JWUERneGdwRGZKY0hnYXYKS2tOdVhTNGpBN1VrZkg1amZXQitvdytpamN3OUR4cjV6STB2TUNReWtzYk9kMVFFMis0Q0J1U0JXU01Gc1pYZQp2T01SanltN056SytxWkVldHpxR0M0bU5LdU9qbC92cGd4ZDNuM2Y2L3loRHhockp2bkVWKzZlUE5icWpDZURZCld1VWFZdUYxRmM4QnZHN0hma3FYRlRWWVdlNkpNa3JSbDQxOVo5a2diNnIvUFNZVzZqdDhhNThTSGNHSVhnTFcKOTBta3BFb1JCMENOSG0wQllEQjdjNFJxMmdyaWtZTUlldGM0eXk2L3NSdFp6NzFiTUQrM2ZDNk92NDdvOXUzWgpld0VWeEJ4dG11ZkVvVGduVEVyNXFYMlhxWFZMQWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJSazdCMm84a3cxcyt0Ny9ZaGxmV1h1MnR6TkdEQVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQXFOdnQrR01lTwp6QnZZZEQ2SExCakFVeWc1czd0TDgzOVltd0RhRXBseG45ZlBRdUV6UW14cnEwUEoxcnVZNnRvRks1SEN4RFVzCmJDN3R3WlMzaVdNNXQ5NEJveHJGVC92c3QrQmtzbWdvTGM2T0N1MitYcngyMUg3UnFLTnNVR01LN2tFdGN6cHgKeXUrYTB6T0tISEUxNWFSVENPbklzQ1pXaTRhVFhIZ00zQ2U4VEhBMXRxaW9pREFHMVFUQXNhNXhTeVM3RWlUSQpDYi9xbktPRlVvM3V3bkRocWljRTU3dE1LTjliRE8rV3hNMzVxT2lBZXVXOUVnc2JlOFA5aDY2NG1tK1QzbjY0ClJNL1l1NHhmcDZwMHMvdGZyZTVjaUFvT0dGekYyRmVKek5PYm1vRkVseUtKc0RwbEorcWFTVXlaL2NtNWRIYUUKQVUxOVMrUWpFc1cvCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K # Cluster IC - name: ic-caas cluster: From a98a512e17cbffc036034fdb3336581ef9c076ea Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Tue, 23 Jul 2024 19:08:59 +0200 Subject: [PATCH 06/20] Update README.md --- README.md | 122 +++++++++++++++++++++++++----------------------------- 1 file changed, 57 insertions(+), 65 deletions(-) diff --git a/README.md b/README.md index c3632c3..349f8fd 100644 --- a/README.md +++ b/README.md @@ -19,6 +19,7 @@ Content overview: - [1: Pre-setup (access, repository)](#1-pre-setup-access-repository) - [2: Setup the tools on your own machine](#2-setup-the-tools-on-your-own-machine) - [3: Login](#3-login) + - [4: Use this repo to start a job](#4-use-this-repo-to-start-a-job) - [5: Cloning and running your code](#5-cloning-and-running-your-code) - [Managing Workflows and Advanced Topics](#managing-workflows-and-advanced-topics) - [Using VSCODE](#using-vscode) @@ -69,83 +70,74 @@ The following are just a bunch of commands you need to run to get started. If yo ## 2: Setup the tools on your own machine > [!IMPORTANT] -> The setup below was tested on macOS with Apple Silicon. If you are using a different system, you may need to adapt the commands. +> The setup below was tested on Linux. If you are using a different system, you may need to adapt the commands. > For Windows, we have no experience with the setup and thereby recommend WSL (Windows Subsystem for Linux) to run the commands. -1. Install kubectl. To make sure the version matches with the clusters (status: 15.12.2023), on macOS with Apple Silicon, run the following commands. For other systems, you will need to change the URL in the command above (check https://kubernetes.io/docs/tasks/tools/install-kubectl/). Make sure that the version matches with the version of the cluster! +1. Install kubectl. Make sure that the version matches with the version of the cluster! ```bash - # Sketch for macOS with Apple Silicon. 
- # Download a specific version (here 1.26.7 for Apple Silicon macOS) - curl -LO "https://dl.k8s.io/release/v1.26.7/bin/darwin/arm64/kubectl" - # curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" #Linux - - # Give it the right permissions and move it. - chmod +x ./kubectl - sudo mv ./kubectl /usr/local/bin/kubectl - sudo chown root: /usr/local/bin/kubectl +curl -sL "https://dl.k8s.io/release/$(curl -s https://api.github.com/repos/kubernetes/kubernetes/releases | grep -oP '"tag_name": "\K(v1\.29\.[0-9]+)' | sort -V | tail -n 1)/bin/linux/amd64/kubectl" | sudo install /dev/stdin /usr/local/bin/kubectl ``` -2. Setup the kube config file: Create a file in your home directory as ``~/.kube/config`` and copy the contents from the file [`kubeconfig.yaml`](kubeconfig.yaml) in this file. Note that the file on your machine has no suffix. For the updated cluster use the config file at https://wiki.rcp.epfl.ch/home/CaaS/how-to-switch-between-rcp-caas-cluster-and-ic-caas-cluster +2. Setup the kube config file: Take our template file [`kubeconfig.yaml`](kubeconfig.yaml) as your config in the home folder `~/.kube/config`. Note that the file on your machine has no suffix. +```bash +curl -o ~/.kube/config https://raw.githubusercontent.com/EduardDurech/getting-started/main/kubeconfig.yaml +``` 3. Install the run:ai CLI: - ```bash - # Sketch for macOS with Apple Silicon - # Download the CLI from the link shown in the help section. - wget --content-disposition https://rcp-caas-test.rcp.epfl.ch/cli/darwin - # wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/linux #Linux - - # Give it the right permissions and move it. - chmod +x ./runai - sudo mv ./runai /usr/local/bin/runai - sudo chown root: /usr/local/bin/runai - ``` +```bash +# Sketch for Linux +curl -sL https://rcp-caas-prod.rcp.epfl.ch/cli/linux | sudo install /dev/stdin /usr/local/bin/runai +``` ## 3: Login -4. Switch between contexts and login to both clusters. - Old - ```bash - # Switch to the IC cluster - runai config cluster ic-context - # Login to the cluster - runai login - # Check that things worked fine - runai list projects - # put your default project - runai config project mlo-$GASPAR_USERNAME - # Repeat for the RCP cluster - runai config cluster rcp-context - runai login - runai list projects - runai config project mlo-$GASPAR_USERNAME - ``` - For the updated cluster use `ic-caas` and `rcp-caas-prod` -6. Run a quick test to see that you can launch jobs: - ```bash - # Try to submit a job that mounts our shared storage and see its content. - runai submit \ - --name setup-test-storage \ - --image ubuntu \ - --pvc runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch \ - -- ls -la /mloscratch/homes - # Check the status of the job - runai describe job setup-test-storage - - # Check its logs to see that it ran. - runai logs setup-test-storage - - # Delete the successful jobs - runai delete jobs setup-test-storage - ``` +1. Switch between contexts and login to both clusters. +```bash +# Switch to the IC cluster +runai config cluster ic-caas +# Login to the cluster +runai login +# Check that things worked fine +runai list projects +# Put default project +runai config project mlo-$GASPAR_USERNAME + +# Repeat for the RCP cluster +runai config cluster rcp-caas-prod +runai login +runai list projects +runai config project mlo-$GASPAR_USERNAME +``` + +2. 
Run a quick test to see that you can launch jobs: +```bash +# Let's use the normal RCP cluster +runai config cluster rcp-caas-prod +runai login +# Try to submit a job that mounts our shared storage and see its content. +# (side note: on ic-caas, the pvc is called runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch, so the arg below has to be changed) +runai submit \ + --name setup-test-storage \ + --image ubuntu \ + --pvc mlo-scratch:/mloscratch \ + -- ls -la /mloscratch/homes +# Check the status of the job +runai describe job setup-test-storage + +# Check its logs to see that it ran. +runai logs setup-test-storage + +# Delete the successful jobs +runai delete jobs setup-test-storage +``` The `runai submit` command already suffices to run jobs. If that is fine for you, you can jump to the section on using provided images and the run:ai CLI [here](#alternative-workflow-using-the-runai-cli-and-base-docker-images-with-pre-installed-packages). However, we provide a few scripts in this repository to make your life easier to get started. ## 4: Use this repo to start a job - 1. Clone this repository and create a `user.yaml` file in the root folder of the repo using the template in `templates/user_template.yaml`. ```bash -git clone https://github.com/epfml/getting-started.git +git clone https://github.com/EduardDurech/getting-started.git cd getting-started touch user.yaml # then copy the content from templates/user_template.yaml inside here and update ``` @@ -171,7 +163,7 @@ runai exec sandbox -it -- zsh 6. If everything worked correctly, you should be inside a terminal on the cluster! ## 5: Cloning and running your code -1. Clone your fork of your GitHub repository into the pod **inside your home folder**. +1. Clone your fork of your GitHub repository (where you have your experiment code) into the pod **inside your home folder**. ```bash # Inside the pod cd /mloscratch/homes/ @@ -201,12 +193,12 @@ For remote development (changing code, debugging, etc.), we recommend using VSCo > > Note that your pods **can be killed anytime**. This means you might need to restart an experiment (with the `python csub.py` command we give above). You can see the status of your jobs with `runai list`. If a job has status "Failed", you have to delete it via `runai delete job sandbox` before being able to start the same job again. > -> **Keep your files inside your home folder**: Importantly, when a job is restarted or killed, everything inside the container folders of `~/` are lost. This is why you need to work inside `/mloscratch/homes/`. For conda and other things (e.g. `~/.zshrc`, we have set up automatic symlinks to files that are persistent on scratch. +> **Keep your files inside your home folder**: Importantly, when a job is restarted or killed, everything inside the container folders of `~/` are lost. This is why you need to work inside `/mloscratch/homes/`. For conda and other things (e.g. `~/.zshrc`), we have set up automatic symlinks to files that are persistent on scratch. > > To have a job that can run in the background, do `python csub.py -n sandbox --train --command "cd /mloscratch/homes//; python main.py "` You're good to go :) It's up to you to customize your environment and install the packages you need. Read up on the rest of this README to learn more about the cluster and the scripts. 
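Both `runai` and `kubectl` act on whichever cluster context is currently active, so it is worth verifying the context before submitting anything. A minimal sketch, reusing only commands from the setup steps above (the context names are the ones defined in our `kubeconfig.yaml`):

```bash
# Print the context that the next runai/kubectl command will talk to,
# e.g. rcp-caas-prod or ic-caas.
kubectl config current-context

# Switch explicitly if it is not the cluster you expect.
runai config cluster rcp-caas-prod
```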
-Remember that you can switch between the two contexts of the IC cluster and RCP cluster with the command `runai config cluster ` as shown above -- for example, if you need a 80GB A100 GPU, use the RCP cluster. +Remember that you can switch between the two contexts of the IC cluster and RCP cluster with the command `runai config cluster ` as shown above -- for example, if you need a 80GB A100 GPU, use `runai config cluster rcp-caas-prod`. >[!CAUTION] > Using the cluster creates costs. Please do not forget to stop your jobs when not used! @@ -231,8 +223,8 @@ runai delete job pod_name # kills the job and removes it from the list of jobs runai describe job pod_name # shows information on the status/execution of the job runai list jobs # list all jobs and their status runai logs pod_name # shows the output/logs for the job -runai config cluster ic-context # switch to IC cluster context -runai config cluster rcp-context # switch to RCP cluster context +runai config cluster ic-caas # switch to IC cluster context +runai config cluster rcp-caas-prod # switch to RCP cluster context ``` Some commands that might come in handy (credits to Thijs): ```bash From 6f60f2f779e5ce33de9a575ec6cd9e58c5dc9063 Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Tue, 23 Jul 2024 19:14:11 +0200 Subject: [PATCH 07/20] Update csub.py --- csub.py | 113 ++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 81 insertions(+), 32 deletions(-) diff --git a/csub.py b/csub.py index 1449ab4..cf2f321 100644 --- a/csub.py +++ b/csub.py @@ -2,6 +2,7 @@ import argparse from datetime import datetime, timedelta +from pprint import pprint import re import subprocess import tempfile @@ -33,7 +34,7 @@ parser.add_argument( "-g", "--gpus", - type=int, + type=float, default=1, required=False, help="The number of GPUs requested (default 1)", @@ -41,9 +42,9 @@ parser.add_argument( "--cpus", type=int, - default=4, + default=1, required=False, - help="The number of CPUs requested (default 4)", + help="The number of CPUs requested (default 1)", ) parser.add_argument( "--memory", @@ -96,9 +97,9 @@ "--node_type", type=str, default="", - choices=["", "G9", "G10"], + choices=["", "g9", "g10"], help="node type to run on (default is empty, which means any node). \ - only exists for IC cluster: G9 for V100, G10 for A100. \ + only exists for IC cluster: g9 for V100, g10 for A100. 
\ leave empty for RCP", ) parser.add_argument( @@ -124,6 +125,19 @@ with open(args.user, "r") as file: user_cfg = yaml.safe_load(file) + # get current cluster and make sure argument matches + current_cluster = subprocess.run( + ["kubectl", "config", "current-context"], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True, + ).stdout.strip() + + if current_cluster == "rcp-caas-prod": + scratch_name = "mlo-scratch" + else: + scratch_name = f"runai-mlo-{user_cfg['user']}-scratch" + if args.name is None: args.name = f"{user_cfg['user']}-{datetime.now().strftime('%Y%m%d-%H%M%S')}" @@ -140,13 +154,8 @@ if args.train: workload_kind = "TrainingWorkload" - backofflimit = f""" - backoffLimit: - value: {args.backofflimit} -""" else: workload_kind = "InteractiveWorkload" - backofflimit = "" working_dir = user_cfg["working_dir"] if not args.no_symlinks: @@ -165,6 +174,8 @@ symlink_targets = "" symlink_paths = "" symlink_types = "" + + # this is the yaml file that will be submitted to the cluster cfg = f""" apiVersion: run.ai/v2alpha1 kind: {workload_kind} @@ -176,8 +187,8 @@ spec: name: value: {args.name} - arguments: - value: "/bin/zsh -c 'source ~/.zshrc && {args.command}'" # zshrc is just loaded to have some env variables ready + arguments: + value: "/bin/zsh -c 'source ~/.zshrc && {args.command}'" # zshrc is just loaded to have some env variables ready environment: items: HOME: @@ -214,30 +225,42 @@ value: {args.image} imagePullPolicy: value: Always - {backofflimit} pvcs: items: pvc--0: value: - claimName: mlo-scratch + claimName: {scratch_name} existingPvc: true path: /mloscratch readOnly: false + ## these two lines are necessary on RCP, not on the new IC runAsGid: value: {user_cfg['gid']} runAsUid: value: {user_cfg['uid']} - runAsUser: - value: true + ## + runAsUser: + value: true serviceType: value: ClusterIP username: value: {user_cfg['user']} + allowPrivilegeEscalation: # allow sudo + value: true """ + + #### some additional flags that can be added at the end of the config if args.node_type: cfg += f""" - nodeType: - value: {args.node_type} # G10 for A100, G9 for V100 (on IC cluster) + nodePools: + value: {args.node_type} # g10 for A100, g9 for V100 (only on IC cluster) +""" + if args.node_type == "g10" and not args.train: + # for interactive jobs on A100s (g10 nodes), we need to set the jobs preemptible + # see table "Types of Workloads" https://inside.epfl.ch/ic-it-docs/ic-cluster/caas/submit-jobs/ + cfg += f""" + preemptible: + value: true """ if args.host_ipc: cfg += f""" @@ -245,28 +268,54 @@ value: true """ + if args.train: + cfg += f""" + backoffLimit: + value: {args.backofflimit} +""" + with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml") as f: f.write(cfg) f.flush() if args.dry: print(cfg) else: + # Run the subprocess and capture stdout and stderr result = subprocess.run( ["kubectl", "apply", "-f", f.name], # check=True, - capture_output=True, - # text=True, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True, ) - print(result.stdout) - print(result.stderr) - print("\nThe following commands may come in handy:") - print(f"runai exec {args.name} -it zsh # opens an interactive shell on the pod") - print( - f"runai delete job {args.name} # kills the job and removes it from the list of jobs" - ) - print( - f"runai describe job {args.name} # shows information on the status/execution of the job" - ) - print("runai list jobs # list all jobs and their status") - print(f"runai logs {args.name} # shows the output/logs for the job") + # Check if there was an error + 
if result.returncode != 0: + print("Error encountered:") + # Prettify and print the stderr + pprint(result.stderr) + exit(1) + else: + print("Output:") + # Prettify and print the stdout + print(result.stdout) + + print("If the above says 'created', the job has been submitted.") + + print( + f"If the above says 'job unchanged', the job with name {args.name} " + f"already exists (and you might need to delete it)." + ) + + print("\nThe following commands may come in handy:") + print( + f"runai exec {args.name} -it zsh # opens an interactive shell on the pod" + ) + print( + f"runai delete job {args.name} # kills the job and removes it from the list of jobs" + ) + print( + f"runai describe job {args.name} # shows information on the status/execution of the job" + ) + print("runai list jobs # list all jobs and their status") + print(f"runai logs {args.name} # shows the output/logs for the job") From c2ca749e50a996be1cd410647d7ab38c804efcca Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Tue, 23 Jul 2024 20:31:26 +0200 Subject: [PATCH 08/20] Set HOME to working_dir --- csub.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/csub.py b/csub.py index 26100eb..ffa297e 100644 --- a/csub.py +++ b/csub.py @@ -199,7 +199,7 @@ environment: items: HOME: - value: "/home/{user_cfg['user']}" + value: "{working_dir}" NB_USER: value: {user_cfg['user']} NB_UID: From 82dfef2da7ae2b7063b54c758f3bcabcd5f50cca Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Tue, 23 Jul 2024 22:08:20 +0200 Subject: [PATCH 09/20] Update $HOME (#2) * Set HOME to working_dir * Update Kube YAML workingDir --- csub.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/csub.py b/csub.py index ffa297e..cc701f5 100644 --- a/csub.py +++ b/csub.py @@ -196,6 +196,8 @@ value: {args.name} arguments: value: "/bin/zsh -c 'source ~/.zshrc && {args.command}'" # zshrc is just loaded to have some env variables ready + workingDir: + value: {working_dir} environment: items: HOME: From f409984d225b249585d1db11ea0df79366a795cb Mon Sep 17 00:00:00 2001 From: EduardDurech <39579228+EduardDurech@users.noreply.github.com> Date: Tue, 23 Jul 2024 23:00:12 +0200 Subject: [PATCH 10/20] Update Git references --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 58611cc..56fb93d 100644 --- a/README.md +++ b/README.md @@ -81,7 +81,7 @@ curl -sL "https://dl.k8s.io/release/$(curl -s https://api.github.com/repos/kuber 2. Setup the kube config file: Take our template file [`kubeconfig.yaml`](kubeconfig.yaml) as your config in the home folder `~/.kube/config`. Note that the file on your machine has no suffix. ```bash -curl -o ~/.kube/config https://raw.githubusercontent.com/EduardDurech/getting-started/main/kubeconfig.yaml +curl -o tst01.txt https://raw.githubusercontent.com/EduardDurech/getting-started/IC-RCP_08-24/kubeconfig.yaml ``` 3. Install the run:ai CLI: @@ -138,7 +138,7 @@ However, we provide a few scripts in this repository to make your life easier to ## 4: Use this repo to start a job 1. Clone this repository and create a `user.yaml` file in the root folder of the repo using the template in `templates/user_template.yaml`. 
 ```bash
-git clone https://github.com/EduardDurech/getting-started.git
+git clone -b IC-RCP_08-24 https://github.com/EduardDurech/getting-started.git
 cd getting-started
 touch user.yaml # then copy the content from templates/user_template.yaml inside here and update
 ```

From 946534694410a22fd1c09c4b931ccc40944914d4 Mon Sep 17 00:00:00 2001
From: EduardDurech <39579228+EduardDurech@users.noreply.github.com>
Date: Tue, 23 Jul 2024 23:02:14 +0200
Subject: [PATCH 11/20] Update Git references

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 56fb93d..54ed9d7 100644
--- a/README.md
+++ b/README.md
@@ -81,7 +81,7 @@ curl -sL "https://dl.k8s.io/release/$(curl -s https://api.github.com/repos/kuber
 2. Setup the kube config file: Take our template file [`kubeconfig.yaml`](kubeconfig.yaml) as your config in the home folder `~/.kube/config`. Note that the file on your machine has no suffix.
 ```bash
-curl -o tst01.txt https://raw.githubusercontent.com/EduardDurech/getting-started/IC-RCP_08-24/kubeconfig.yaml
+curl -o ~/.kube/config https://raw.githubusercontent.com/EduardDurech/getting-started/IC-RCP_08-24/kubeconfig.yaml
 ```

 3. Install the run:ai CLI:

From 491c3221df7a14862178f0457f1706e3c80d9a2c Mon Sep 17 00:00:00 2001
From: EduardDurech <39579228+EduardDurech@users.noreply.github.com>
Date: Wed, 24 Jul 2024 04:21:04 +0200
Subject: [PATCH 12/20] Remove references to [ic, rcp, rcp-prod]-cluster

---
 README.md | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 54ed9d7..19545ee 100644
--- a/README.md
+++ b/README.md
@@ -43,7 +43,6 @@ The step-by-step instructions for first time users to quickly get a job running.
 > [!TIP]
 > After completing the setup, the **TL;DR** of the interaction with the cluster (using the scripts in this repo) is:
-> * Choose a cluster and just run the command to set it up: `ic-cluster`, `rcp-cluster`, or `rcp-cluster-prod`
 >
 > * Get a running job with one GPU that is reserved for you: `python csub.py -n sandbox`
 >
@@ -147,7 +146,6 @@ touch user.yaml # then copy the content from templates/user_template.yaml inside
 3. Create a pod with 1 GPU (you may need to install pyyaml with `pip install pyyaml` first).
 ```bash
-rcp-cluster # switch to RCP cluster context
 python csub.py -n sandbox
 ```
@@ -198,8 +196,6 @@ For remote development (changing code, debugging, etc.), we recommend using VSCo
 > **Keep your files inside your home folder**: Importantly, when a job is restarted or killed, everything inside the container folders of `~/` are lost. This is why you need to work inside `/mloscratch/homes/`. For conda and other things (e.g. `~/.zshrc`), we have set up automatic symlinks to files that are persistent on scratch.
 >
 > To have a job that can run in the background, do `python csub.py -n sandbox --train --command "cd /mloscratch/homes//; python main.py "`
->
-> There are differences between the clusters of IC and RCP, which require different tool versions (`runai-ic`, `runai-rcp`, ...). Since this is a bit of a hassle, we made it easy to switch between the clusters via the commands `ic-cluster`, `rcp-cluster` and `rcp-cluster-prod`. To make sure you're aware of the cluster you're using, the `csub` script asks you to set the cluster to use before submitting a job: `python csub.py -n sandbox --cluster ic-caas` (choosing between `["rcp-caas-test", "ic-caas", "rcp-caas-prod"]`). It only works when the cluster argument matches your currently chosen cluster.
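For reference, the `csub.py` from PATCH 07 above detects the active cluster itself: it reads the kubectl context and derives the scratch PVC name from it. A minimal shell sketch of the same check, assuming `$GASPAR_USERNAME` is set as in the login steps:

```bash
# The same detection csub.py performs via subprocess: read the active
# context and pick the matching scratch PVC name.
if [ "$(kubectl config current-context)" = "rcp-caas-prod" ]; then
  echo "pvc: mlo-scratch"                           # shared project PVC on RCP prod
else
  echo "pvc: runai-mlo-${GASPAR_USERNAME}-scratch"  # per-user PVC name, e.g. on ic-caas
fi
```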
You're good to go :) It's up to you to customize your environment and install the packages you need. Read up on the rest of this README to learn more about the cluster and the scripts. Remember that you can switch between the two contexts of the IC cluster and RCP cluster with the command `runai config cluster ` as shown above -- for example, if you need a 80GB A100 GPU, use `runai config cluster rcp-caas-prod`. @@ -304,7 +300,7 @@ The python script `csub.py` is a wrapper around the run:ai CLI that makes it eas General usage: ```bash -python csub.py --n -g -t