Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix device list of loaded executable in PJRT plugin for multiple GPUs #19369

Merged
merged 7 commits into from
Dec 6, 2024

Conversation

PragmaTwice
Copy link
Member

@PragmaTwice PragmaTwice commented Dec 4, 2024

It closes #19366, and blocks #19279.

After this PR, ClientOptions::Compile will first check the device assignment in the compile options, and then return the corresponding device list with the loaded executable.

To achieve this, we introduce protobuf via FetchContent in this PR, which is scoped to the PJRT plugin. Compile options will be passed by the PJRT client encoded in protobuf, and in this plugin we decode it first and then retrieve some interesting fields.

ci-exactly: build_packages, test_pjrt

@PragmaTwice PragmaTwice force-pushed the pjrt-device-list branch 2 times, most recently from 6a2d394 to 3ae5fd7 Compare December 5, 2024 13:07
@PragmaTwice PragmaTwice marked this pull request as ready for review December 5, 2024 13:29
@PragmaTwice
Copy link
Member Author

PragmaTwice commented Dec 5, 2024

It works now and is ready for review : )

@PragmaTwice PragmaTwice requested a review from ScottTodd December 5, 2024 13:29
Copy link
Member

@ScottTodd ScottTodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, the FetchContent for protobuf isn't as bad as I was expecting.

integrations/pjrt/cmake/protobuf_cc_library.cmake Outdated Show resolved Hide resolved
integrations/pjrt/cmake/protobuf_cc_library.cmake Outdated Show resolved Hide resolved
integrations/pjrt/CMakeLists.txt Outdated Show resolved Hide resolved
Comment on lines +139 to +141
// LINT.ThenChange(
// https://www.tensorflow.org/code/tensorflow/compiler/xla/tools/driver.cc
// )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(no action needed for this third_party code)

classic, 404 error :P

(looks like the new link is https://github.com/openxla/xla/blob/main/xla/tools/driver.cc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah there's many broken links in XLA since it's separated out from TensorFlow. The path compiler/xla is now moved to third_party/xla and is syncing with the openxla/xla repo.

)
target_link_libraries(${_NAME}
PUBLIC
protobuf::libprotobuf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For a separate PR that could land before or after this, not needed in this PR)

Fun fun, this increases the build time from 2 minutes (logs here) to 8 minutes (logs here). From 451 build targets to 1035. That's still fast enough to not be too concerned, but we can do better if it starts to impact development iteration time.

We could enable ccache to help a bit there. I'd probably add https://github.com/hendrikmuhs/ccache-action to the workflow, like here:

- name: ccache
uses: hendrikmuhs/ccache-action@ed74d11c0b343532753ecead8a951bb09bb34bc9 # v1.2.14
with:
key: ${{ github.job }}-${{ matrix.name }}
save: ${{ needs.setup.outputs.write-caches == 1 }}
- name: CMake - configure
run: |
cmake \
-G Ninja \
-B ${BUILD_DIR} \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DIREE_BUILD_COMPILER=OFF \
-DIREE_BUILD_PYTHON_BINDINGS=ON \
-DIREE_BUILD_SAMPLES=ON \
-DIREE_ENABLE_LLD=ON \
-DIREE_ENABLE_ASSERTIONS=ON \
${{matrix.driver-options}}
- name: CMake - build
run: cmake --build ${BUILD_DIR} -- -k 0
and set these environment variables:

    export CMAKE_C_COMPILER_LAUNCHER=ccache
    export CMAKE_CXX_COMPILER_LAUNCHER=ccache

that might just work...

Copy link
Member Author

@PragmaTwice PragmaTwice Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I also noticed that. It includes about 100 targets to build the protoc compiler for C++ codegen and 400+ targets to build protobuf library and its deps (absl, protobuf-lite..). More than the original total number of jobs. I'll try ccache in the future : )

@ScottTodd ScottTodd self-requested a review December 6, 2024 21:48
@ScottTodd ScottTodd merged commit d88d0a7 into iree-org:main Dec 6, 2024
26 checks passed
ScottTodd added a commit that referenced this pull request Dec 12, 2024
As discussed in
#19418 (comment),
#19418 (review)
and #19418 (comment),
here we support to read `env_option_overrides` as IREE compile flags
from `compile_options` passed by frontends like JAX in a per-compilation
basis.

Most of these code already exists but has been commented due to some
problems: `compile_options` was not yet available in that time, but it's
now introduced by #19369.

A simple use case is shown below, also as a test case:

https://github.com/iree-org/iree/blob/c37a80212dd4a541762fc9fdaaa615b6d0a62829/integrations/pjrt/test/test_compile_options.py#L9-L15

ci-exactly: build_packages, test_pjrt

---------

Signed-off-by: PragmaTwice <[email protected]>
Co-authored-by: Scott Todd <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PJRT plugin fails to execute in JAX while multiple GPUs are provided
2 participants