---
title: "Docker for researchers"
author: "Jørgen Aarmo Lund"
date: today
format: revealjs
---
## Introduction
* Jørgen Aarmo Lund, industry PhD student at the UiT Machine Learning group for DIPS AS
* "Data-driven pathways": inferring usage patterns in patient record systems from auditing logs
* Researching explainability, natural language processing
* DIPS develops e-health systems: patient records, laboratory services, hospital kiosks, and more
* Gradually moving applications over to containers
## Agenda {.smaller}
* Part 1: Introduction to containers
    * What are containers?
    * Why are they useful for ML research?
    * Getting started with Docker
* Part 2: Putting together our own container images
    * Basic Dockerfile syntax
    * Where does the model go?
    * Debugging tips
* Part 3: Deploying containers for ML research
    * GPU and device access
    * Deploying to UiT's GPU cluster
    * Deploying to NRIS HPC clusters
## Follow along
Files available on
https://github.com/jaalu/vigs-docker-workshop
## Motivation for software developers
* IT around 2008[^1]: developers handing applications to sysadmins maintaining long-lived servers
    * Downtime for manual installation
    * Server and application maintenance intertwined
    * Conflicts between dependencies
* Containers allow _isolating_ applications and running them with their own set of dependencies
[^1]: https://www.atlassian.com/devops/what-is-devops/history-of-devops
## Motivation for ML researchers
* Replicability: making experimental conditions visible
* Flexibility: easing transition from laptop tests to HPC training
* Reusability: showing findings work in other settings too!
## Docker {.smaller}
:::: {.columns}
::: {.column width="70%"}
* Docker allows isolating your script into a _container_, which:
    * Runs isolated from other processes while sharing the OS kernel
    * Packages its own set of dependencies
    * Can be packaged and started on other servers, including HPC clusters
* Maintained by Docker Inc.; the runtime is open source
* _Docker Desktop_ packages the software with a GUI, free for researchers
:::
::: {.column width="30%"}
![](assets/moby.png)
:::
::::
## Docker - structure
![](assets/docker-structure.png){fig-align="center"}
## Key concepts
We distinguish between *containers* and *images*:
* A *container* is a standalone environment with your script and the dependencies it needs
* An *image* is the template for making your container
* Images can be saved to a *registry*, like Docker Hub
Containers are meant to be _disposable_: changes you want to keep - like your trained model - should be outside of the container!
## Installing Docker - options
* Docker Desktop: https://www.docker.com/
* Play With Docker: https://labs.play-with-docker.com
    * Free online lab with provisioned VMs
* Docker also provides an `apt` repository
## Checking that Docker works
* When Docker is running, we can get a list of running containers with
```
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
```
<!--
e593fff04794 postgres "docker-entrypoint.s…" 10 seconds ago Up 9 seconds 5432/tcp stupefied_elion -->
* We can then retrieve an image with `docker pull`:
```
$ docker pull hello-world
```
* We can then create and run a container from the image with `docker run`:
```
$ docker run hello-world
```
## Running containers - custom commands
<!-- Custom commands, interactive, detached, rm, env variables -->
* Images specify a default command, but we can specify one ourselves in `docker run`:
```
$ docker run ubuntu echo Hello!
Hello!
```
## Running containers - detached
<!-- Custom commands, interactive, detached, rm, env variables -->
* Default: containers do not accept any input, but write to the terminal
* More likely you want a container which runs _detached_ in the background, with `--detach` or `-d`:
```
$ docker run -d hello-world
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d3a5ee04babd hello-world "/hello" About a minute ago Exited (0) About a minute ago elated_feistel
$ docker logs elated_feistel
```
* NOTE: Docker options must be placed _before_ the container image and the command
## Running containers - interactive
* Alternatively, we can specify that the container should set up a shell and accept input with `--interactive --tty`, or `-it` for short:
```
$ docker run -it python:3.9
Python 3.9.16 (main, May 4 2023, 06:16:43)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
```
## Running containers - cleaning up
* Containers will stick around after they finish running
* Nice for checking logs, restarting, but list easily clogged
* Passing `--rm` will delete the container after it exits:
```
$ docker run --rm hello-world
$ docker ps -a
```
## Running containers - configuration
* We can set environment variables in the container with `--env` or `-e`:
```
$ docker run -e MODEL_ARCH=resnet ubuntu
```
* If we want to expose network ports (e.g. for dashboards) we can map ports from the container to the host with `-p`:
```
$ docker run -p 8080:80 httpd
```
* NOTE: the order is host-container, so `-p 8080:80` will connect port `80` _on the container_ to port `8080` _on the host_
## Where do we keep the model?
"Containers are meant to be _disposable_: changes you want to keep should be outside of the container"
So where do we store the trained models?
## Where could we keep the model? {.smaller}
* Embed it as part of the image
    * Not an option for training, gives us large images
* Copy it to/from the container after start
    * `docker cp` can copy files (see the example below)
    * Can bump into runtime storage limits
* Upload to/download from an online server
    * Extra warmup time, network traffic
    * Weights & Biases and Hugging Face libraries provide functionality for this
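For example, `docker cp` can pull a trained model out of a container (the container and file names here are illustrative):
```
$ docker cp my-container:/experiment/model.pt ./model.pt
```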
## Where should we keep the model? {.smaller}
* Bind mounts
    * Create a temporary link between a directory on the host PC and a directory in the container
    * Pros: you can see the directory, pull files quickly
    * Cons: assumes the storage is on your PC, not as flexible as volumes
* Volumes
    * Docker creates and manages a persistent directory
    * Pros: more flexible, plugins can mount cloud storage as volumes
    * Cons: requires a (temporary) container to copy files to the host, as shown below
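For example, one way to copy results out of a volume is through a short-lived container (a sketch; the names are illustrative):
```
$ docker container create --name tmp --mount type=volume,source=my-volume,target=/results ubuntu
$ docker cp tmp:/results ./results
$ docker rm tmp
```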
## Mapping bind mounts
To mount a directory with a bind mount we can use `--mount`:
```
$ docker run --mount type=bind,source=$(pwd)/assets/,target=/pictures/ ubuntu
```
`source` points to the folder on the host (`assets` in the working directory), and `target` is the folder it will appear as in the container (`/pictures/`)
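To check that the mount works, we can list the directory from inside the container:
```
$ docker run --rm --mount type=bind,source=$(pwd)/assets/,target=/pictures/ ubuntu ls /pictures
```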
## Setting up a volume
To set up a volume we run `docker volume create`:
```
$ docker volume create my-volume
$ docker volume inspect my-volume
```
We can then mount it in the same way, but with `type=volume`:
```
$ docker run --mount type=volume,source=my-volume,target=/results/ ubuntu
```
## Break
<!-- Building containers -->
## Creating an image - FROM
* Template file conventionally named `Dockerfile` (no extension)
* Dockerfiles start with a `FROM` statement, which specifies which image (`ubuntu`) and tag (`22.04`) to build on:
```
FROM ubuntu:22.04
```
* If you omit the tag, Docker will automatically use `latest` - but specify one if you can!
## Building an image
* This turns out to be a complete Dockerfile:
```
FROM ubuntu:22.04
```
* Save this as `Dockerfile` (no extension!) in a project directory, and run
```
$ docker build . -t my-image
```
* `docker build` expects a _build context_ - the directory to pull project files from when building the image - like the current directory `.`
* `-t` ties the new image to a name and a _tag_
## Creating an image - RUN
* Once we have a base, we specify which commands to run to build the environment
* `RUN` statements are followed by commands to run - for instance, we can install packages:
```
RUN apt-get update && apt-get install -y python3
```
## Creating an image - COPY
<!--
* docker build takes a path to the _build context_ with the files COPY should retrieve
* Usually we just make this the current directory . where the Dockerfile is,
so `docker build .`
-->
* Earlier, we specified the _build context_ (usually the project directory)
* `COPY` copies files from the build context to the container:
```
COPY train.py /experiment/train.py
```
* We can set the _working directory_ in the container with `WORKDIR`:
```
WORKDIR /experiment/
```
* The `.dockerignore` file specifies which files not to copy from the context
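For example, a `.dockerignore` for an ML project might look like this (the entries are illustrative):
```
.git
.venv
data/
*.ckpt
```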
## Creating an image - ENV
* We can also specify which settings the container expects/respects when it runs - conventionally done through environment variables
* `ENV` sets the default value for an environment variable:
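```
ENV BATCH_SIZE=128
```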
* Default values can be overridden with `docker run -e`
## Creating an image - CMD
* Finally, we specify what should happen when the container _starts_ with `CMD`:
```
CMD python train.py
```
## Full image example
Using all of the commands:
```
FROM ubuntu:22.04
ENV BATCH_SIZE=128
RUN apt-get update && apt-get install -y python3
COPY train.py /experiment/train.py
CMD python3 /experiment/train.py
```
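We can build and run this image like any other (the image name is just an example):
```
$ docker build . -t mnist-demo
$ docker run --rm -e BATCH_SIZE=64 mnist-demo
```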
## Debugging tip 1
* What if: your image fails to build on step 14?
* You can temporarily turn off BuildKit to get intermediate images for each layer:
On Linux:
```
$ DOCKER_BUILDKIT=0 docker build .
```
On Windows:
```
set DOCKER_BUILDKIT=0& docker build .
```
* You can create a container from the intermediate image and retry the step:
```
$ docker run -it <intermediate-image-id> bash
# pip install ...
```
## Debugging tip 2
* What if: the training suddenly stops and nothing happens?
* Find the running container
```
$ docker ps -a
e593fff04794 postgres "docker-entrypoint.s…" 10 seconds ago Up 9 seconds 5432/tcp stupefied_elion
```
* and start an interactive shell inside it:
```
$ docker exec -it stupefied_elion bash
```
## Archiving images
You can also save images as archives with `docker save`
```
$ docker save mnist-demo > mnist-demo.tar
```
and load them with `docker load`
```
$ docker load < mnist-demo.tar
```
## Making cache-friendly images
* Images are composed of multiple *layers*: each set of changes made by RUN and COPY makes up a layer. Try:
`$ docker image inspect python`
* To avoid new images for every build, Docker saves extra info for each layer:
    * For `RUN`, the command is saved
    * For `COPY`, a checksum of the added files is saved
* If the commands run/the files added by the layer are the same, and all preceding layers are unchanged, the layer is _reused_
## Making cache-friendly images pt. 2
* For this reason, the most frequent and smallest changes should come _last_ in your image, e.g.
    1. Installing system packages
    2. Installing Python packages
    3. Adding your script
    4. Running your script
* The commands should, as far as possible, have the same results each time you run them
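For instance, a training image following this order might look like this (a minimal sketch; the file names are illustrative):
```
FROM ubuntu:22.04
# 1. System packages change rarely - install them first, in a cacheable layer
RUN apt-get update && apt-get install -y python3 python3-pip
# 2. Python packages change occasionally
COPY requirements.txt /experiment/requirements.txt
RUN pip install -r /experiment/requirements.txt
# 3. The training script changes often - add it last
COPY train.py /experiment/train.py
# 4. Run the script
CMD python3 /experiment/train.py
```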
## System package installation
* When using `apt`, group together updating and package installation:
```
apt-get update && apt-get install -y python3
```
* **NOTE**: This _doesn't_ guarantee that you get the latest packages _every_ time you build, but makes sure the package index makes sense when you install the packages
* Possible to lock `apt` packages to specific versions, but specifying the distro usually sufficient
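If you do need to pin, `apt-get` accepts an exact version with `package=version` (the version string below is illustrative):
```
RUN apt-get update && apt-get install -y python3=3.10.6-1~22.04
```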
## Language package installation {.smaller}
* Generally we want a _lock file_ with exact package versions, which we can then restore as part of building the image
* In Python, `pip freeze` produces a list of all packages installed
```
$ pip freeze > requirements.txt
```
* Good practice to set up a virtual environment with `venv` to isolate exactly which packages the project needs
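For example, on a Unix-like shell:
```
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install torch
$ pip freeze > requirements.txt
```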
* In R, `renv` (the successor to `packrat`) can create similar virtual environments with lock files:
```
renv::snapshot() # saves project deps to renv.lock
renv::restore() # restores deps from renv.lock
```
## M1 Mac: installing amd64 packages
* Common problem on M1 Macs: older packages/libraries without ARM binaries
* We can ask Docker to emulate an x86-64 machine in the container with `--platform=linux/amd64`:
```
docker run --platform=linux/amd64 ubuntu uname -a
```
## Building images - conclusion
* Order changes from least to most frequent
* Commands should be deterministic
* Virtual environments and lock files useful to have reproducible containers
## GPU/device access
* Nvidia Container Runtime lets you run CUDA code in containers:
```
$ apt-get install nvidia-container-runtime
```
* Can use the `--gpus` switch to grant access to GPU:
```
$ docker run --gpus all ubuntu
```
* Windows: requires WSL 2, newer CUDA driver
* Premade CUDA Docker images: https://github.com/NVIDIA/nvidia-docker/wiki/CUDA
* Also possible to grant access to USB/hardware devices with `--device`
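To check that the container actually sees the GPU, we can run `nvidia-smi` in one of the premade CUDA images (the tag below is an example; pick one matching your driver):
```
$ docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```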
## Packaging for cloud services
* You can set up your own container registry with the cloud provider:
    * Amazon ECR
    * Azure Container Registry
    * Google Cloud Container Registry
* To push the image to your cloud registry, tag your image with the URL of the registry and push:
```
$ docker tag mnist-demo jludemo.azurecr.io/mnist-demo
$ docker push jludemo.azurecr.io/mnist-demo
```
## Packaging for Springfield (UiT) {.smaller}
* UiT's Springfield cluster uses Kubernetes for _orchestration_ across multiple nodes
* Docker Desktop allows setting up your PC as a single-node Kubernetes cluster
* Kubernetes lets us define a **Job** which runs one or more containers until completion:
```
kind: Job
apiVersion: batch/v1
metadata:
  name: your-training-job
spec:
  template:
    spec:
      containers:
        - name: your-training
          image: "your-training-image"
          workingDir: /storage
          command: ["sh", "train.sh"]
          volumeMounts:
            - name: storage
              mountPath: /storage
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: storage
      restartPolicy: OnFailure
  backoffLimit: 0
```
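Assuming the manifest is saved as `job.yaml`, we can submit it and follow the logs with `kubectl`:
```
$ kubectl apply -f job.yaml
$ kubectl logs -f job/your-training-job
```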
## Packaging for NRIS HPC (Betzy, LUMI) {.smaller}
* As long as the architecture is the same, we can now push images and run the containers on HPC clusters!
* However: container libraries are most likely _not_ tuned for the HPC cluster
    * If libraries are available as Lmod modules, the modules will be faster
* But containers are still useful for
    * Portability
    * Specifying package/library versions
## Packaging for NRIS HPC - Singularity (Betzy, LUMI) {.smaller}
* Singularity, the container runtime installed on NRIS HPC computers, supports converting Docker images to Singularity .sif images. Running
```
$ singularity pull --name train.sif docker://jlu015/train:latest
```
will retrieve the image `jlu015/train` from Docker Hub and save it as `train.sif`
* To run the default command specified by `CMD`, we can call `singularity run` with the image:
```
$ singularity run train.sif
```
* Alternatively, `singularity exec` will run a specific command:
```
$ singularity exec train.sif echo Hello world!
```
## Packaging for NRIS HPC - Writing a SLURM job (Betzy, LUMI) {.smaller}
See https://documentation.sigma2.no/code_development/guides/containers.html#singularity-in-job-scripts
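As a starting point, a job script running the container might look roughly like this (a sketch; the account and resource requests are placeholders - see the documentation above for cluster-specific details):
```
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --account=nnXXXXk   # placeholder: your project account
#SBATCH --time=02:00:00
#SBATCH --mem-per-cpu=4G

# Run the image's default command (CMD) under Singularity
singularity run train.sif
```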
## Resources
* Docker in Y Minutes:
    * https://learnxinyminutes.com/docs/docker/
* The Play with Docker exercises:
    * https://training.play-with-docker.com/
* NRIS' documentation on containers:
    * https://documentation.sigma2.no/