[CI] Try to fix test_model.sh #5361

Closed

larryliu0820

Trying to fix failures such as https://github.com/pytorch/executorch/actions/runs/10855583772/job/30128512970

When I reproduced the missing-operator error on my Mac, I saw that cmake-out/ had not been cleaned up and that merged.yaml was missing linear.out because it was very old.

Without a deep understanding of how CI job workers cache previous runs, I'm refactoring build_cmake_executor_runner in test_model.sh to make sure it always builds clean.
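
To illustrate the kind of staleness check described above, here is a minimal sketch. The helper name `yaml_lists_op` is hypothetical and is not part of test_model.sh; the location of merged.yaml inside the build directory is not stated in this thread, so the path is passed in as an argument. The check is a coarse fixed-string match and makes no assumption about the file's layout.

```shell
# Hypothetical helper (not from test_model.sh): succeeds only if the file
# exists and mentions the operator name somewhere as a literal string.
yaml_lists_op() {
  local yaml_file="$1" op_name="$2"
  [ -f "${yaml_file}" ] && grep -qF -- "${op_name}" "${yaml_file}"
}

# Example use (path is a placeholder, left commented out):
#   yaml_lists_op "path/to/merged.yaml" "linear.out" \
#     || echo "merged.yaml looks stale: linear.out is missing" >&2
```

Because the match is a plain substring search, it can only flag an operator as absent; it does not prove the entry is well formed.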

pytorch-bot bot commented Sep 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5361

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job

As of commit e5509b0 with merge base 034e098 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Sep 13, 2024
@facebook-github-bot

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

}

run_portable_executor_runner() {
  # Run test model
  if [[ "${BUILD_TOOL}" == "buck2" ]]; then
    buck2 run //examples/portable/executor_runner:executor_runner -- --model_path "./${MODEL_NAME}.pte"
  elif [[ "${BUILD_TOOL}" == "cmake" ]]; then
    if [[ ! -f ${CMAKE_OUTPUT_DIR}/executor_runner ]]; then

not sure that this would've broken anything, but it does look suspicious. nice find!

If this does fix it, it feels like there's a higher-level issue that this is working around. For a given commit/PR, calling build_cmake_executor_runner once or five times should create the same result, so caching should be safe. But if skipping the cache fixes things, then it implies that there's possibly an older version of CMAKE_OUTPUT_DIR sitting around. And if it's left over from a previous job run, then we have some pretty serious hermeticity issues.

But if it's not left over from a previous run, then are we calling run_portable_executor_runner multiple times in a single job? And why would one call produce a different executor_runner binary than another call? The code in the repo hasn't changed, and it's always built with the same cmake flags.

Unless calls to this are interleaved with some other "cmake" call that itself overwrites CMAKE_OUTPUT_DIR with a different build configuration.
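
To make the caching concern above concrete, here is a minimal sketch of a guard that treats an existing binary as reusable only when it was built from the same revision and flags. All names here (`config_stamp`, `build_is_fresh`, `mark_build`, and the `.build_stamp` file) are hypothetical and do not come from test_model.sh; this is one possible approach, not what the PR implements.

```shell
# Hypothetical sketch: fingerprint the inputs that should invalidate a cached
# build. Arguments: a revision identifier followed by the configure flags.
config_stamp() {
  printf '%s\n' "$@" | cksum | cut -d ' ' -f 1
}

# Succeeds only if the build directory holds the binary AND a stamp that
# matches the expected fingerprint. A leftover directory from another job,
# or one produced by a different cmake configuration, fails this check.
build_is_fresh() {
  local build_dir="$1" stamp="$2"
  [ -f "${build_dir}/executor_runner" ] \
    && [ -f "${build_dir}/.build_stamp" ] \
    && [ "$(cat "${build_dir}/.build_stamp")" = "${stamp}" ]
}

# Record the fingerprint after a successful build.
mark_build() {
  local build_dir="$1" stamp="$2"
  printf '%s\n' "${stamp}" > "${build_dir}/.build_stamp"
}
```

Under these assumptions, a bare `-f ${CMAKE_OUTPUT_DIR}/executor_runner` test would be replaced by `build_is_fresh`, so a binary left behind by an interleaved build with different flags would trigger a rebuild instead of being reused.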

@swolchok swolchok left a comment

if it fixes the broken builds, ship it!

-    && cd ${CMAKE_OUTPUT_DIR} \
-    && retry cmake -DCMAKE_BUILD_TYPE=Release \
+    rm -rf ${CMAKE_OUTPUT_DIR}
+    cmake -DCMAKE_BUILD_TYPE=Debug \

Should this use retry cmake ... to keep the logic from before? Or are you removing it intentionally?
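
For reference, a clean build and the retry wrapper are not mutually exclusive. The sketch below assumes a simple `retry` helper; the CI scripts' real `retry` is not shown in this thread, so the definition here is a hypothetical stand-in, as are `clean_configure` and the `RETRY_DELAY_SECONDS` variable.

```shell
# Hypothetical stand-in for the CI `retry` helper: run a command up to three
# times, pausing between attempts, and fail if every attempt fails.
retry() {
  local max_attempts=3 attempt=1
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "retry: '$*' failed after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY_SECONDS:-1}"
  done
}

# Wipe the output directory once, then retry only the configure command.
# Arguments: the build directory followed by the command to run.
clean_configure() {
  local build_dir="$1"
  shift
  rm -rf "${build_dir}"
  mkdir -p "${build_dir}"
  retry "$@"
}

# Example use (left commented out so the sketch runs without cmake installed):
#   clean_configure cmake-out cmake -DCMAKE_BUILD_TYPE=Debug -S . -B cmake-out
```

The directory is removed outside the retried command, so a transient configure failure is retried against the same fresh directory rather than against leftovers from an earlier job.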

-    && cd ${CMAKE_OUTPUT_DIR} \
-    && retry cmake -DCMAKE_BUILD_TYPE=Release \
+    rm -rf ${CMAKE_OUTPUT_DIR}
+    cmake -DCMAKE_BUILD_TYPE=Debug \

Please mention the move from Release to Debug in the PR summary. Consider adding a comment here explaining why we use this build mode here.

But besides that and removing retry, I don't see a behavior change here: it still removes the directory and generates the cmake system.

@facebook-github-bot

@larryliu0820 merged this pull request in bfce743.
