Build failure masked as a RUN_ERROR #713

Open
tygoetsch opened this issue Nov 1, 2023 · 6 comments

@tygoetsch
Collaborator

tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/job/info
{"id": "7647116", "sys_name": "chicoma"}
tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/status
1698862272.972506 STATUS_CREATED Created status file.
1698862272.984902 CREATED Test directory and status file created.
1698862272.990294 BUILD_CREATED Builder created.
1698862272.995562 CREATED Test directory setup complete.
1698862278.854854 BUILD_WAIT Waiting on lock for build 4804c9b55cc8e944.
1698862278.859590 BUILDING Starting build 4804c9b55cc8e944.
1698862278.930399 BUILDING Extracting tarfile /usr/projects/hpctest/test_src/ior.tgz for build /usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944
1698862279.017952 BUILD_ERROR Error setting up build directory '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944': Error extracting file '/usr/projects/hpctest/test_src/ior.tgz'\n Could not extract tarfile '/usr/projects/hpctest/test_src/ior.tgz' into '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944': [Errno 2] No such file or directory: '/usr/projects/hpctest/pavilion/2.4/working_dir/builds/4804c9b55cc8e944/./doc/sphinx/userDoc/tutorial.rst'
1698862369.929424 SCHEDULED Test kicked off (individually) under slurm scheduler with 500 nodes.
1698862386.513605 PREPPING_RUN Converting run template into run script.
1698862386.514956 RUNNING Starting the run script.
1698862386.518336 RUN_ERROR Unknown error while running test. Refer to the kickoff log.
tgoetsch@ch-fe1:/usr/projects/hpctools/tgoetsch/repos/pav2-lanl2-> cat /usr/projects/hpctest/pavilion/2.4/working_dir/test_runs/914/job/kickoff.log
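
The masking in the status file above can be spotted mechanically: any state logged after BUILD_ERROR means the run carried on past a failed build. The following is a minimal sketch, assuming every status line has the form `<timestamp> <STATE> <note>` as in the dump above. `states_after_build_error` is a hypothetical helper written for illustration, not part of Pavilion, and the sample notes are abbreviated.

```python
def states_after_build_error(status_text):
    """Return the states logged after the first BUILD_ERROR line.

    Assumes each line looks like '<timestamp> <STATE> <note>'.
    Hypothetical helper for illustration, not a Pavilion API.
    """
    states = []
    for line in status_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        states.append(parts[1])

    if "BUILD_ERROR" not in states:
        return []
    return states[states.index("BUILD_ERROR") + 1:]


# Abbreviated copy of the status lines from test run 914.
sample = """\
1698862278.859590 BUILDING Starting build 4804c9b55cc8e944.
1698862279.017952 BUILD_ERROR Error setting up build directory
1698862369.929424 SCHEDULED Test kicked off (individually) under slurm scheduler with 500 nodes.
1698862386.513605 PREPPING_RUN Converting run template into run script.
1698862386.514956 RUNNING Starting the run script.
1698862386.518336 RUN_ERROR Unknown error while running test. Refer to the kickoff log.
"""

leaked = states_after_build_error(sample)
print(leaked)
```

A non-empty result here is the symptom this issue is about: the test was scheduled and run even though its build had already failed.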

@dmageeLANL self-assigned this Nov 9, 2023
@dmageeLANL
Collaborator

Yeah, that's annoying. I'll look into it.

@tygoetsch
Collaborator Author

Hey Dan, I think the cause is the lack of atomic file creation on Chicoma. So presumably the only issue on Pavilion's part is that the error was misreported.

@dmageeLANL
Collaborator

This comes from Line 427 in the build method in lib/pavilion/builder.py:

if not self._build(self.path, cancel_event, test_id, tracker):

In the _build method from line 516 in builder.py:

        try:
            self._setup_build_dir(build_dir, tracker)
        except TestBuilderError as err:
            tracker.error(
                note=("Error setting up build directory '{}': {}"
                      .format(build_dir, err)))
            return False

This fails and returns False, and the note written to the status file is assembled from the error messages at extract.py:134, builder.py:713, and builder.py:520. Returning False triggers the fail-path logic but does not trigger a cancel event, so the show continues. I don't understand why it doesn't trigger a cancel event, but there seems to be some underlying logic that keeps the test run from being cancelled on a build error. You built these structures, @pflarr, so what was your thought process, and what types of BUILD_ERROR will cause the runs to cancel?
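
The failure mode suspected here can be shown with a toy model. This is a sketch of the hypothesis only, not Pavilion code: `build_without_cancel` and `kickoff_if_not_cancelled` are hypothetical stand-ins, and the assumption is that whatever runs next consults only the cancel event, not the build's return value.

```python
import threading


def build_without_cancel(setup_ok, cancel_event):
    """Toy build step: reports failure only through its return value."""
    if not setup_ok:
        return False  # failure is reported, but cancel_event is never set
    return True


def kickoff_if_not_cancelled(cancel_event):
    """Toy next stage: it only looks at the event, never at the build result."""
    return "CANCELLED" if cancel_event.is_set() else "SCHEDULED"


cancel_event = threading.Event()
built = build_without_cancel(setup_ok=False, cancel_event=cancel_event)
next_state = kickoff_if_not_cancelled(cancel_event)
print(built, next_state)
```

Under those assumptions the build fails but the next stage still schedules the test, which would match a status file that shows BUILD_ERROR followed by SCHEDULED.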

@Paul-Ferrell
Collaborator

That it doesn't trigger a cancel is almost certainly a bug.
This has only been popping up with the atomic write issue on the Shasta filesystems though, so it went undetected for quite a while.

@dmageeLANL
Collaborator

Right. That's because Cray Shasta systems are the only systems where builder._setup_build_dir fails. So you never see this BUILD_ERROR in other contexts. Easy fix.

@dmageeLANL
Collaborator

OK, actually, looking closer at it, I think it's fixed already. See the passage below from builder.py:TestBuilder.build.

                    with lockfile.LockFilePoker(lock):
                        # Attempt to perform the actual build, this shouldn't
                        # raise an exception unless something goes terribly
                        # wrong.
                        # This will also set the test status for
                        # non-catastrophic cases.
                        if not self._build(self.path, cancel_event, test_id, tracker):

                            try:
                                self.path.rename(self.fail_path)
                            except FileNotFoundError as err:
                                tracker.error(
                                    "Failed to move build {} from {} to "
                                    "failure path {}"
                                    .format(self.name, self.path,
                                            self.fail_path), err)
                                try:
                                    self.fail_path.mkdir()
                                except OSError as err2:
                                    tracker.error(
                                        "Could not create fail directory for "
                                        "build {} at {}"
                                        .format(self.name, self.fail_path, err2))
                            if cancel_event is not None:
                                cancel_event.set()

                            return False

If self._build returns False, which it does in the original case (where the status file shows 'Error setting up build directory'), and cancel_event is not None (it's a threading.Event), then cancel_event should get set. Perhaps the version you were using was missing something there, but it should work as far as I can tell. If you can recreate it with the current master, let me know and I'll poke at it.
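
The control flow in that excerpt can be checked in isolation. The sketch below reduces it to the two lines that matter; `build` and `do_build` are hypothetical names standing in for TestBuilder.build and self._build, and the directory rename and tracker calls are left out.

```python
import threading


def build(do_build, cancel_event=None):
    """Reduced model of the excerpt: a failed build sets cancel_event
    (when one was passed in) and returns False."""
    if not do_build():
        if cancel_event is not None:
            cancel_event.set()
        return False
    return True


# Failed build with an event: the event ends up set.
event = threading.Event()
ok = build(lambda: False, event)
print(ok, event.is_set())

# Failed build with no event: still returns False, but nothing is signalled.
ok_no_event = build(lambda: False, None)
print(ok_no_event)
```

The second call shows the one branch in this pattern that leaves nothing signalled, namely when the caller passes cancel_event=None. Whether any Pavilion code path does that is not established in this thread.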
