Problems pulling a docker image on a Perlmutter Compute node #80

Closed
asnaylor opened this issue Jul 7, 2023 · 2 comments

asnaylor commented Jul 7, 2023

I was trying to pull nvcr.io/nvidia/tritonserver:22.02-py3 on a Perlmutter compute node, but the squash step failed. I can pull this image fine on a login node.

$ podman-hpc pull nvcr.io/nvidia/tritonserver:22.02-py3
WARN[0000] "/" is not a shared mount, this could cause issues or missing mounts with rootless containers
ERRO[0000] Image nvcr.io/nvidia/tritonserver:22.02-py3 exists in local storage but may be corrupted (remove the image to resolve the issue): layer not known
Trying to pull nvcr.io/nvidia/tritonserver:22.02-py3...
Getting image source signatures
Copying blob 0627d99a5c7e done
Copying blob 08c01a0ec47e done
Copying blob 2c5671971cd1 done
Copying blob 0bd41856d1bc done
Copying blob 5eedfe82c26b done
Copying blob 2514f64065e8 done
Copying blob ec1508216c18 done
Copying blob 749074f382f7 done
Copying blob a7e6794ed569 done
Copying blob 4f1ad9e2a154 done
Copying blob 09cd2bfa6cba done
Copying blob 9da140dedb4f done
Copying blob 2935286a91f6 done
Copying blob 2b0599a695e2 skipped: already exists
Copying blob 628e96fc9140 skipped: already exists
Copying blob d2eb54715c6f skipped: already exists
Copying blob a6d158916ac9 skipped: already exists
Copying blob 5caa897e63ad skipped: already exists
Copying blob 164c096fb0d9 skipped: already exists
Copying blob ff1be69fcc70 skipped: already exists
Copying blob 38972466c8c5 skipped: already exists
Copying blob d260d4926e86 skipped: already exists
Copying blob 665a51e67490 skipped: already exists
Copying blob 4df3be64118e skipped: already exists
Copying blob 6072fed44aff skipped: already exists
Copying blob 878d8a1d0950 skipped: already exists
Copying blob aaaa03356373 skipped: already exists
Copying blob ec889413fc1a skipped: already exists
Copying blob 1d0881aa8a74 skipped: already exists
Copying blob 30efd5ebf026 skipped: already exists
Copying blob d3ae827fe332 skipped: already exists
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
Copying config d52ac03519 done
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
Writing manifest to image destination
Storing signatures
d52ac03519ab5da3e240477c27defe10687fbcc2484cf6e62d38006638a0bae7
WARN[0068] Failed to add pause process to systemd sandbox cgroup: dial unix /run/user/75235/bus: connect: no such file or directory
INFO: Migrating image to /pscratch/sd/a/asnaylor/storage
ERROR:root:Squash Failed
ERROR:root:
ERROR:root:time="2023-07-07T14:37:05-07:00" level=warning msg="The cgroupv2 manager is set to systemd but there is no systemd user session available"
time="2023-07-07T14:37:05-07:00" level=warning msg="For using systemd, you may need to login using an user session"
time="2023-07-07T14:37:05-07:00" level=warning msg="Alternatively, you can enable lingering with: `loginctl enable-linger 75235` (possibly as root)"
time="2023-07-07T14:37:05-07:00" level=warning msg="Falling back to --cgroup-manager=cgroupfs"
time="2023-07-07T14:37:05-07:00" level=warning msg="The cgroupv2 manager is set to systemd but there is no systemd user session available"
time="2023-07-07T14:37:05-07:00" level=warning msg="For using systemd, you may need to login using an user session"
time="2023-07-07T14:37:05-07:00" level=warning msg="Alternatively, you can enable lingering with: `loginctl enable-linger 75235` (possibly as root)"
time="2023-07-07T14:37:05-07:00" level=warning msg="Falling back to --cgroup-manager=cgroupfs"
time="2023-07-07T14:37:05-07:00" level=warning msg="Can't read link \"/tmp/75235_hpc/storage/overlay/l/JEZ2HQREEFDRMUHHFJ3NCKBF27\" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption."
Error: readlink /tmp/75235_hpc/storage/overlay/l/JEZ2HQREEFDRMUHHFJ3NCKBF27: no such file or directory

As a temporary fix, @lastephey suggested:

$ podman-hpc pull --storage-opt mount_program=/usr/bin/fuse-overlayfs-wrap nvcr.io/nvidia/tritonserver:22.02-py3
lastephey self-assigned this Jul 7, 2023

lastephey (Collaborator) commented:

Thanks @asnaylor. I think the easiest fix is to add this to default_pull_args.
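
A minimal sketch of that change, assuming podman-hpc keeps its default pull flags in a Python list named default_pull_args as the comment suggests (the variable name is taken from the comment; its exact location in the codebase is not shown here):

# hypothetical illustration of the proposed default, not the actual podman-hpc source
default_pull_args = [
    "--storage-opt",
    "mount_program=/usr/bin/fuse-overlayfs-wrap",
]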

The only question is whether using fuse-overlayfs rather than the native overlay backend adds a lot of slowdown on login nodes. If that's the case, we could add some logic for podman-hpc to determine whether it's running on a compute or a login node, although that would make things a bit messy. I'll do some testing.
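
If the login-node slowdown does matter, here is a rough sketch of that detection logic (a sketch only; the function names and the SLURM_JOB_ID heuristic are assumptions for illustration, not podman-hpc's actual code):

import os

def on_compute_node() -> bool:
    # Perlmutter compute nodes run inside a Slurm allocation, so the presence
    # of SLURM_JOB_ID is used here as a rough proxy for "not a login node"
    return "SLURM_JOB_ID" in os.environ

def extra_pull_args() -> list:
    # Force the fuse-overlayfs wrapper only on compute nodes, where the native
    # overlay backend failed in this issue; keep the native path on logins.
    if on_compute_node():
        return ["--storage-opt", "mount_program=/usr/bin/fuse-overlayfs-wrap"]
    return []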

lastephey (Collaborator) commented:

Should be addressed by #82
