
cudaPackages: add jetson support #242050

Closed
SomeoneSerge wants to merge 3 commits from the cudaPackages-jetson branch.

Conversation

@SomeoneSerge (Contributor) commented Jul 7, 2023

Description of changes

It's a fairly small patch recovering Jetson support, as originally attempted in #194791; hopefully it can be merged without any extra effort once ofborg goes through. I successfully built cuda_nvcc on an NVIDIA Jetson host. One prerequisite for getting more complex things (like pytorch) to work would be #233581.

CC #158350 @NixOS/cuda-maintainers

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
❯ nix eval .#cudaPackages.cuda_nvcc.manifestAttribute
"linux-x86_64"
❯ nix eval .#cudaPackages.cuda_nvcc.meta.platforms
[ "aarch64-linux" "powerpc64le-linux" "x86_64-linux" ]
❯ nix eval .#pkgsCross.aarch64-multiplatform.cudaPackages.cuda_nvcc.manifestAttribute
"linux-aarch64"
❯ nix eval .#pkgsCross.aarch64-multiplatform.cudaPackages.cuda_nvcc.meta.platforms
[ "aarch64-linux" "powerpc64le-linux" "x86_64-linux" ]
What doesn't work

I don't know how to use cross-compilation properly, but I tried this on an x86_64 host and it failed:

❯ NIXPKGS_ALLOW_UNFREE=1 nix build --impure .#pkgsCross.aarch64-multiplatform.cudaPackages.cuda_nvcc
...
       > error: auto-patchelf could not satisfy dependency libstdc++.so.6 wanted by /nix/store/9kv83k8nnnaxhv7wy2h27k15ww61xc75-cuda_nvcc-aarch64-unknown-linux-gnu-11.8.89/bin/__nvcc_device_query
       > error: auto-patchelf could not satisfy dependency libgcc_s.so.1 wanted by /nix/store/9kv83k8nnnaxhv7wy2h27k15ww61xc75-cuda_nvcc-aarch64-unknown-linux-gnu-11.8.89/bin/__nvcc_device_query
...
       > auto-patchelf failed to find all the required dependencies.
...
       For full logs, run 'nix log /nix/store/rmb71904ksdm8sbkn8wscnrzihvp5nkb-cuda_nvcc-aarch64-unknown-linux-gnu-11.8.89.drv'.

It's a shot in the dark, but I think this might be related to #226165

@SomeoneSerge (Contributor, Author) commented:

Result of nixpkgs-review pr 242050 --extra-nixpkgs-config '{ cudaCapabilities = [ "8.6" ]; cudaSupport = true; }' run on x86_64-linux

@SomeoneSerge force-pushed the cudaPackages-jetson branch from 1c84b91 to 3f5805d on July 7, 2023, 13:11
ofborg bot added the label 8.has: package (new) (This PR adds a new package) on Jul 7, 2023
ofborg bot requested review from ConnorBaker and samuela on July 7, 2023, 13:37
ofborg bot added the labels 11.by: package-maintainer (This PR was created by the maintainer of the package it changes), 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin), and 10.rebuild-linux: 0 (This PR does not cause any packages to rebuild on Linux) on Jul 7, 2023
@SomeoneSerge added the labels backport release-23.05 and 6.topic: cuda (Parallel computing platform and API) on Jul 7, 2023
@ConnorBaker (Contributor) commented:

@SomeoneSerge I ended up adding support for multiple architectures in a PR I have brewing: #240498

As you’ve noticed, it’s currently a bit tricky to figure out whether to choose the Linux4Tegra or SBSA packages for aarch64. Is there a different double we have for Jetson specifically?

@SomeoneSerge (Contributor, Author) commented:

@ConnorBaker, I'm still unsure what exactly the SBSA builds are for, but I do know that the builds marked linux-aarch64 are meant for Jetsons, which is why this PR gives them priority.

RE: #240498

Great! I guess the question is: how long will it take you to merge that PR, versus merging this one and rebasing yours on top? I tried to limit mine to enabling Jetson support specifically so as to avoid collisions, but I guess it's not that simple 🙃

@ConnorBaker (Contributor) commented:

@SomeoneSerge I definitely need to take out the multi-arch stuff from that PR; it's a can of worms. Here's a summary of what I've learned:

Both the Linux 4 Tegra builds (for Jetson devices; NVIDIA redist manifests refer to them as linux-aarch64) and the SBSA builds (for server-grade ARM setups, referred to as linux-sbsa) are effectively aarch64-linux. However, as I understand it, the packages are NOT interchangeable: they're built with different configurations and target different hardware.

That means if we want to support both, we need to:

  1. Ensure cudaFlags verifies that capabilities for Jetson and non-Jetson devices are not mixed
  2. Choose the redist package to use depending on the capabilities present

For the first point: unlike other GPUs, which can be slotted into either x86_64 or SBSA (ARM) servers, Jetson capabilities are tied to aarch64-linux. If Jetson capabilities are present in config.cudaCapabilities, our hostPlatform.system must be aarch64-linux. (That is, we must either be building on an aarch64-linux device, in which case our buildPlatform and hostPlatform are the same, or we are cross-compiling to aarch64-linux from a different platform.) Effectively: the presence of any Jetson capability in config.cudaCapabilities requires both that we are building for aarch64-linux and that all capabilities in config.cudaCapabilities are Jetson capabilities.

For the second point: we must know whether we are building for Jetson so we can correctly decide whether to use the linux-aarch64 or linux-sbsa redist package when our hostPlatform is aarch64-linux.
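
To make the second point concrete, the selection could end up looking something like this (a rough sketch with assumed names and a hard-coded capability list, not actual nixpkgs code; it also assumes the point-1 check has already rejected mixed capability sets):

# Sketch only: illustrative names and a hard-coded capability list,
# not the code in this PR.
{ lib, config, hostPlatform }:
let
  # Hypothetical set of Jetson compute capabilities (sm_62, sm_72, sm_87).
  jetsonCapabilities = [ "6.2" "7.2" "8.7" ];
  hasJetson = lib.any
    (cap: lib.elem cap jetsonCapabilities)
    (config.cudaCapabilities or [ ]);
in
if hostPlatform.system == "aarch64-linux"
then (if hasJetson then "linux-aarch64" else "linux-sbsa")
else "linux-x86_64" # other host platforms elided for brevity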

ofborg bot requested a review from samuela on July 8, 2023, 18:57
@SomeoneSerge (Contributor, Author) commented:

jetson capabilities are tied to aarch64-linux

Oh, I see. So we might want to

  • introduce a separate package set, e.g. cudaPackages_jetson, where cudaFlags only contain sm_62, sm_72, and sm_87,
  • and have the cudaPackages attribute default to linux-sbsa for aarch64 devices

we need to ... Ensure cudaFlags verifies that capabilities for Jetson and non-Jetson devices are not mixed

Is it that we need this, or is it something "nice to have"?

Unlike other GPUs which can be slotted into both x86_64 or SBSA (ARM) servers

I think I finally get it, thanks! SBSA binaries are for when we wire a PCIe GPU to a generic aarch64 host?

@ConnorBaker (Contributor) commented Jul 8, 2023

Is it that we need this, or is it something "nice to have"?

If we continue to use only a single cudaPackages package set and choose the redistributable based on selected capabilities, I'm of the opinion that it is something we need to have; a sketch of such a check follows the list below. Consider the case where we have mixed capabilities and

  • the hostPlatform.system is aarch64-linux:
    • We'll be using the Jetson redistributables, which do not support other capabilities
  • the hostPlatform.system is not aarch64-linux:
    • We'll be using non-Jetson redistributables, which do not support Jetson capabilities
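
Roughly the kind of guard I have in mind (a sketch only; the hard-coded capability list and the names are assumptions, and the real list would come from gpus.nix):

# Sketch of a guard cudaFlags could enforce; not existing nixpkgs code.
{ lib, config }:
let
  requested = config.cudaCapabilities or [ ];
  # Hypothetical Jetson capability list; ideally derived from gpus.nix.
  jetsonCapabilities = [ "6.2" "7.2" "8.7" ];
  jetsonRequested = lib.filter (cap: lib.elem cap jetsonCapabilities) requested;
  # Mixed means some, but not all, of the requested capabilities are Jetson.
  mixed = jetsonRequested != [ ] && jetsonRequested != requested;
in
assert lib.assertMsg (!mixed)
  "config.cudaCapabilities must not mix Jetson and non-Jetson capabilities";
{ inherit requested jetsonRequested; }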

If we introduce a cudaPackages_jetson package, what would you envision happening with cudaFlags? Would it be the same as cudaPackages.cudaFlags, but only allow config.cudaCapabilities to contain capabilities for Jetson devices? If so, my understanding is that using packages from cudaPackages_jetson would trigger a check to make sure only Jetson capabilities are requested by config.cudaCapabilities.

I think I finally get it, thanks! SBSA binaries are for when we wire a pci-e gpu to a generic aarch64 host?

Yes; apparently SBSA is the name of a specification for ARM-based servers: https://en.wikipedia.org/wiki/Server_Base_System_Architecture

@SomeoneSerge (Contributor, Author) commented Jul 8, 2023

If we introduce a cudaPackages_jetson package, what would you envision happening with cudaFlags

I pushed an example in the last commit: the idea would be just to override cudaFlags for the package set:

❯ nix eval .#pkgsCross.aarch64-multiplatform.cudaPackages.cuda_nvcc.manifestAttribute
"linux-sbsa"
❯ nix eval .#pkgsCross.aarch64-multiplatform.cudaPackages_jetson.cuda_nvcc.manifestAttribute
"linux-aarch64"

Instead of a hard-coded list, we could form one from gpus.nix. I looked into this; we might want to replace dontDefaultAfter with something like jetsonCompatible and jetsonOnly?

Jetson device owners may overlay their nixpkgs with cudaPackages = final.cudaPackages_jetson and get their opencv and pytorch running

What's maybe embarrassing is that cudaPackages_jetson would ignore user-specified config.cudaCapabilities

Alternatively,

we could introduce a config.jetson :: bool option and keep, as you point out, a single cudaPackages set

@ConnorBaker (Contributor) commented:

Instead of a hard-coded list, we could form one from gpus.nix. I looked into this; we might want to replace dontDefaultAfter with something like jetsonCompatible and jetsonOnly?

Take a look at the changes I made to

Overall, those changes allow us to build for the user-requested Jetson capabilities. (They must be requested by the user through config.cudaCapabilities though, as Jetson capabilities are excluded by the isDefault predicate in flags.nix.)
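
For illustration, deriving the lists from gpus.nix could look roughly like this (a sketch assuming gpus.nix evaluates to a list of attribute sets with a computeCapability string and a hypothetical jetsonOnly flag; the real attribute names may differ):

# Sketch: derive the capability lists from gpus.nix instead of hard-coding
# them. Assumes gpus.nix evaluates to a list of attribute sets carrying a
# computeCapability string and a hypothetical jetsonOnly boolean.
{ lib }:
let
  gpus = import ./gpus.nix;
  isJetson = gpu: gpu.jetsonOnly or false;
  jetsonCapabilities = map (gpu: gpu.computeCapability) (lib.filter isJetson gpus);
  # Jetson entries are excluded from the default set (alongside whatever
  # other criteria the real isDefault predicate applies), so they must be
  # requested explicitly via config.cudaCapabilities.
  isDefault = gpu: !(isJetson gpu);
  defaultCapabilities = map (gpu: gpu.computeCapability) (lib.filter isDefault gpus);
in
{ inherit jetsonCapabilities defaultCapabilities; }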

"8.7"
];
};
manifestAttribute = "linux-aarch64";
Copy link
Contributor Author

@SomeoneSerge SomeoneSerge Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would hinder future attempts at cross-compilation:

❯ nix eval -f cross-jetson.nix cudaPackages.cuda_nvcc.manifestAttribute
"linux-aarch64"
❯ nix eval -f cross-jetson.nix buildPackages.cudaPackages.cuda_nvcc.manifestAttribute
"linux-aarch64"

Expected: "linux-aarch64" and "linux-x86_64", respectively

Consequences (watch out, I could be wrong about everything):

  1. We should always choose a tag (linux-x86_64, linux-aarch64, linux-sbsa) that is compatible with the current hostPlatform.system (see the sketch after the cross-jetson.nix snippet below).
  2. For the CUDA libraries that come with PTX (e.g. libcublas) we should choose, among the host-compatible tags, one that ships all of the requested CUDA capabilities. If there isn't one, we should mark the package broken. We should not mark nvcc itself as broken.
# cross-jetson.nix
(import ./. {
  config.allowUnfree = true;
  config.cudaSupport = true;
  config.cudaCapabilities = [ "7.2" ];
  # config.cudaCapabilities = [ "8.6" ];
  overlays = [ (final: prev: { cudaPackages = prev.cudaPackages_jetson; }) ];
}).pkgsCross.aarch64-multiplatform
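
To sketch what point 1 could look like (illustrative names and an assumed system-to-tag table, not part of this PR):

# Sketch of point 1: only ever consider redist tags the hostPlatform can run.
# The table and names are illustrative, not part of this PR.
{ lib, hostPlatform }:
let
  tagsFor = {
    "x86_64-linux" = [ "linux-x86_64" ];
    "aarch64-linux" = [ "linux-aarch64" "linux-sbsa" ]; # order encodes preference
    "powerpc64le-linux" = [ "linux-ppc64le" ];
  };
  hostCompatible = tagsFor.${hostPlatform.system} or [ ];
in
{
  # Point 2 would additionally filter hostCompatible down to tags whose
  # manifests ship all requested capabilities, marking the package broken
  # (nvcc excepted) when nothing remains.
  manifestAttribute =
    if hostCompatible == [ ]
    then throw "no redist tag compatible with ${hostPlatform.system}"
    else lib.head hostCompatible;
}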

PoC: SomeoneSerge#4

@SomeoneSerge force-pushed the cudaPackages-jetson branch from b7685fd to 679ad3d on July 10, 2023, 17:37
@SomeoneSerge (Contributor, Author) commented:

Superseded by #256324

Labels
• 6.topic: cuda (Parallel computing platform and API)
• 8.has: package (new) (This PR adds a new package)
• 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin)
• 10.rebuild-linux: 0 (This PR does not cause any packages to rebuild on Linux)
• 11.by: package-maintainer (This PR was created by the maintainer of the package it changes)
Projects
Status: Done

3 participants