[CI] Try to fix test_model.sh #5361

larryliu0820 · 2024-09-13T20:44:39Z

Trying to fix failures such as https://github.com/pytorch/executorch/actions/runs/10855583772/job/30128512970

When I reproduced the missing-operator error on my Mac, I saw that cmake-out/ had not been cleaned up and that merged.yaml was missing linear.out because the file was stale.

Without a deep understanding of how CI job workers cache previous runs, I'm refactoring build_cmake_executor_runner in test_model.sh so that it always builds clean.
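As a rough illustration of what "always builds clean" could look like, here is a minimal sketch, not the actual change in this PR. The `rm -rf` of the output directory and the Debug build type come from the diff discussed below; the helper names `reset_build_dir` and `build_executor_runner_clean`, the `cmake-out` default, the source directory `.`, and the `-j 4` value are assumptions made for the example.

```shell
set -eu

# Hypothetical default; the real script sets its own CMAKE_OUTPUT_DIR.
CMAKE_OUTPUT_DIR="${CMAKE_OUTPUT_DIR:-cmake-out}"

# Delete any leftover build tree, then recreate it empty, so stale
# generated files from an earlier run cannot be picked up.
reset_build_dir() {
  # Refuse an empty path so a missing variable never reaches rm -rf.
  [ -n "$1" ] || return 1
  rm -rf "$1"
  mkdir -p "$1"
}

# Configure and build from scratch (hypothetical helper name).
# Debug matches the build type shown in the diff; "." and -j 4 are assumptions.
build_executor_runner_clean() {
  reset_build_dir "${CMAKE_OUTPUT_DIR}"
  cmake -DCMAKE_BUILD_TYPE=Debug -S . -B "${CMAKE_OUTPUT_DIR}"
  cmake --build "${CMAKE_OUTPUT_DIR}" -j 4
}
```

Because the directory is wiped unconditionally, the build never depends on what an earlier run left behind, at the cost of a full rebuild each time.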

pytorch-bot · 2024-09-13T20:44:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5361

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job

As of commit e5509b0 with merge base 034e098:

NEW FAILURE - The following job has failed:

trunk / test-models-macos (cmake, linear, xnnpack-quantization-delegation, macos-m1-stable, 90) / macos-job (gh)
Library not loaded: @rpath/liblz4.1.dylib

CANCELLED JOB - The following job was cancelled. Please retry:

trunk / test-models-macos (cmake, mv3, xnnpack-quantization-delegation, macos-m1-stable, 90) / macos-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-09-13T21:37:53Z

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

swolchok · 2024-09-13T21:47:03Z

.ci/scripts/test_model.sh

 }

 run_portable_executor_runner() {
  # Run test model
  if [[ "${BUILD_TOOL}" == "buck2" ]]; then
    buck2 run //examples/portable/executor_runner:executor_runner -- --model_path "./${MODEL_NAME}.pte"
  elif [[ "${BUILD_TOOL}" == "cmake" ]]; then
-    if [[ ! -f ${CMAKE_OUTPUT_DIR}/executor_runner ]]; then


not sure that this would've broken anything, but it does look suspicious. nice find!

swolchok

if it fixes the broken builds, ship it!

dbort · 2024-09-13T22:18:30Z

.ci/scripts/test_model.sh

-    && cd ${CMAKE_OUTPUT_DIR} \
-    && retry cmake -DCMAKE_BUILD_TYPE=Release \
+  rm -rf ${CMAKE_OUTPUT_DIR}
+  cmake -DCMAKE_BUILD_TYPE=Debug \


Should this use retry cmake ... to keep the logic from before? Or are you removing it intentionally?
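For readers unfamiliar with the pattern, a retry wrapper of the kind referred to here might look roughly like the sketch below. This is not the repository's `retry` helper; `retry_n` and `RETRY_DELAY_SECONDS` are hypothetical names introduced for illustration.

```shell
set -eu

# Seconds to wait between attempts (hypothetical knob, defaults to 1).
RETRY_DELAY_SECONDS="${RETRY_DELAY_SECONDS:-1}"

# Usage: retry_n MAX_ATTEMPTS command [args...]
# Runs the command until it succeeds or MAX_ATTEMPTS runs have failed.
retry_n() {
  local max_attempts="$1"
  shift
  local attempt=1
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY_SECONDS}"
  done
}
```

A call such as `retry_n 3 cmake -S . -B "${CMAKE_OUTPUT_DIR}"` would then rerun the configure step on transient failures. Retrying helps with flaky network or toolchain steps, but it does not help when the failure comes from stale state in the build directory.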

dbort · 2024-09-13T22:25:27Z

.ci/scripts/test_model.sh

-    && cd ${CMAKE_OUTPUT_DIR} \
-    && retry cmake -DCMAKE_BUILD_TYPE=Release \
+  rm -rf ${CMAKE_OUTPUT_DIR}
+  cmake -DCMAKE_BUILD_TYPE=Debug \


Please mention the move from Release to Debug in the PR summary. Consider adding a comment here explaining why we use this build mode here.

But besides that and the removal of retry, I don't see a behavior change here: it still removes the directory and generates the CMake build system.

dbort · 2024-09-13T22:37:47Z

.ci/scripts/test_model.sh

 }

 run_portable_executor_runner() {
  # Run test model
  if [[ "${BUILD_TOOL}" == "buck2" ]]; then
    buck2 run //examples/portable/executor_runner:executor_runner -- --model_path "./${MODEL_NAME}.pte"
  elif [[ "${BUILD_TOOL}" == "cmake" ]]; then
-    if [[ ! -f ${CMAKE_OUTPUT_DIR}/executor_runner ]]; then


If this does fix it, it feels like there's a higher-level issue that this is working around. For a given commit/PR, calling build_cmake_executor_runner once or five times should create the same result, so caching should be safe. But if skipping the cache fixes things, then it implies that there's possibly an older version of CMAKE_OUTPUT_DIR sitting around. And if it's left over from a previous job run, then we have some pretty serious hermeticity issues.

But if it's not left over from a previous run, then are we calling run_portable_executor_runner multiple times in a single job? And why would one call produce a different executor_runner binary than another call? The code in the repo hasn't changed, and it's always built with the same cmake flags.

Unless calls to this are interleaved with some other "cmake" call that itself overwrites CMAKE_OUTPUT_DIR with a different build configuration.
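One way to probe the "different build configuration" hypothesis is to compare the build type recorded in the output directory's CMakeCache.txt against the one the script expects. The sketch below assumes that check is a useful proxy; `cached_build_type` and `build_dir_is_stale` are hypothetical helpers, and this only detects a build-type mismatch, not other stale state such as an old merged.yaml.

```shell
set -eu

# Print the CMAKE_BUILD_TYPE recorded in <dir>/CMakeCache.txt.
# Returns non-zero if the directory has no cache file.
cached_build_type() {
  [ -f "$1/CMakeCache.txt" ] || return 1
  # Cache entries have the form NAME:TYPE=VALUE; keep only the value.
  sed -n 's/^CMAKE_BUILD_TYPE:[A-Z]*=//p' "$1/CMakeCache.txt"
}

# Usage: build_dir_is_stale DIR EXPECTED_BUILD_TYPE
# Succeeds (exit 0) when DIR has no cache or was configured with a
# different build type than expected; fails (exit 1) when they match.
build_dir_is_stale() {
  local dir="$1" want="$2" have
  have="$(cached_build_type "${dir}")" || return 0
  [ "${have}" != "${want}" ]
}
```

For example, `if build_dir_is_stale "${CMAKE_OUTPUT_DIR}" Debug; then echo "reconfigure needed"; fi` would flag a directory that some other step had configured as Release.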

facebook-github-bot · 2024-09-13T22:47:51Z

@larryliu0820 merged this pull request in bfce743.

facebook-github-bot added the CLA Signed label Sep 13, 2024

larryliu0820 force-pushed the cleanup_test_model branch from 5fd7bc2 to 2d3c2e7 on September 13, 2024 21:12

larryliu0820 force-pushed the cleanup_test_model branch from 2d3c2e7 to e5509b0 on September 13, 2024 21:30

larryliu0820 added the ciflow/trunk label Sep 13, 2024

swolchok reviewed Sep 13, 2024

swolchok approved these changes Sep 13, 2024

dbort reviewed Sep 13, 2024

facebook-github-bot closed this in bfce743 Sep 13, 2024

facebook-github-bot added the Merged label Sep 13, 2024
