Fix tests/integration_tests/cli/test_integration_cli.py::test_failing_job_cli_error_message #6863

Closed
xjules opened this issue Dec 29, 2023 · 11 comments · Fixed by #6922

@xjules
Contributor

xjules commented Dec 29, 2023

Describe the bug
The following test fails, but only when it is run as a GitHub Actions workflow:

def test_failing_job_cli_error_message():
        # modify poly_eval.py
        with open("poly_eval.py", mode="a", encoding="utf-8") as poly_script:
            poly_script.writelines(["    raise RuntimeError('Argh')"])
    
        args = Mock()
        args.config = "poly_high_min_reals.ert"
        parser = ArgumentParser(prog="test_main")
    
        parser = ArgumentParser(prog="test_main")
        parsed = ert_parser(
            parser,
            [TEST_RUN_MODE, "poly.ert"],
        )
        expected_substrings = [
            "Realization: 0 failed after reaching max submit (2)",
            "job poly_eval failed",
            "Process exited with status code 1",
            "Traceback",
            "raise RuntimeError('Argh')",
            "RuntimeError: Argh",
        ]
        try:
            run_cli(parsed)
        except ErtCliError as error:
            for substring in expected_substrings:
>               assert substring in f"{error}"
E               AssertionError: assert 'Realization: 0 failed after reaching max submit (2)' in 'Experiment failed!\n'
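
In the portion of the test shown above, the assertions only run inside the `except` branch, so a run that raises nothing would not trip them. A minimal sketch of a stricter check, using only the standard library: `ErtCliError` and `run_cli_stub` below are hypothetical stand-ins (not ert's real objects), and `collect_missing` is an illustrative helper name.

```python
class ErtCliError(Exception):
    """Hypothetical stand-in for ert's ErtCliError, for illustration only."""


# Substrings copied from the test above.
EXPECTED_SUBSTRINGS = [
    "Realization: 0 failed after reaching max submit (2)",
    "job poly_eval failed",
    "Process exited with status code 1",
    "Traceback",
    "raise RuntimeError('Argh')",
    "RuntimeError: Argh",
]


def run_cli_stub():
    # Hypothetical stand-in for run_cli(parsed); raises the truncated
    # message seen in the failing CI run.
    raise ErtCliError("Experiment failed!\n")


def collect_missing(run, expected):
    """Call run() and return the expected substrings absent from the error.

    Raises AssertionError if run() does not raise ErtCliError at all.
    """
    try:
        run()
    except ErtCliError as error:
        return [s for s in expected if s not in str(error)]
    raise AssertionError("expected ErtCliError, but nothing was raised")


missing = collect_missing(run_cli_stub, EXPECTED_SUBSTRINGS)
print(f"{len(missing)} of {len(EXPECTED_SUBSTRINGS)} substrings missing")
```

Reporting all missing substrings at once, rather than stopping at the first failed assertion, may make it easier to see how much of the message was lost.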

To reproduce
It only fails in GitHub Actions workflows.
For more information, see for example: https://github.com/equinor/komodo-releases/actions/runs/7352338393/job/20017761397

Expected behaviour
The test should pass

Environment

  • OS: [e.g. RHEL7]
  • ERT/Komodo release: bleeding
  • Python version
  • Remote/HPC execution involved: no

Additional context
This test fails when running with job_queue while scheduler execution is skipped. Apparently the error message is not composed correctly.

xjules added the bug label Dec 29, 2023
jonathan-eq self-assigned this Dec 29, 2023
jonathan-eq moved this to In Progress in SCOUT Dec 29, 2023
@jonathan-eq
Contributor

As of today, my bisect tests have stopped working completely. The test does not complete; it fails after two hours with ert.services._base_service.ServerBootFail. Marking this as blocked until this no longer occurs.

@jonathan-eq
Contributor

It is still working on nightly builds, so it is my GitHub Actions workflow setup that is not working correctly.

@jonathan-eq jonathan-eq moved this from In Progress to Todo in SCOUT Jan 4, 2024
@jonathan-eq
Contributor

Depends on #6888 for better error logging

@jonathan-eq jonathan-eq moved this from Todo to In Progress in SCOUT Jan 4, 2024
@jonathan-eq
Contributor

jonathan-eq commented Jan 4, 2024

NB: This error only occurs for job queue, not scheduler
It occurs for both modes: constantly for job queue, and sometimes for scheduler.

@berland
Contributor

berland commented Jan 4, 2024

Possible lead: The XML part of this was changed in the legacy code while implementing support for it in scheduler.

@xjules
Contributor Author

xjules commented Jan 5, 2024

NB: This error only occurs for job queue, not scheduler

For the last 3 days it has failed on both.

@berland berland self-assigned this Jan 9, 2024
@berland
Contributor

berland commented Jan 10, 2024

The error is reproducible by logging in to a linappnode, su-ing to f_scout_ci, and replicating the commands in run_tests_one_project.yml.

@berland
Contributor

berland commented Jan 10, 2024

A hypothesis is that the runpath poly_example is being reused:

[f_scout_ci@st-linapp1192 iter-0]$ pwd
/private/f_scout_ci/ert/pytest_tmp_dir/pytest-of-f_scout_ci/pytest-3/popen-gw3/poly_example0/test_data/poly_out/realization-1/iter-0
[f_scout_ci@st-linapp1192 iter-0]$ ls -l
total 344
-rw-rw-r-- 1 f_scout_ci f_scout_ci  272 Jan 10 09:25 JOB_LOG
-rw-rw-r-- 1 f_scout_ci f_scout_ci 1691 Jan 10 09:25 jobs.json
-rw-rw-r-- 1 f_scout_ci f_scout_ci 1690 Jan 10 09:24 jobs.json_backup_2024-01-10_09-25-17Z
drwxrwxr-x 2 f_scout_ci f_scout_ci  104 Jan 10 09:25 logs
-rw-rw-r-- 1 f_scout_ci f_scout_ci   28 Jan 10 09:25 OK
-rw-rw-r-- 1 f_scout_ci f_scout_ci   97 Jan 10 09:25 parameters.json
-rw-rw-r-- 1 f_scout_ci f_scout_ci   97 Jan 10 09:24 parameters.json_backup_2024-01-10_09-25-17Z
-rw-rw-r-- 1 f_scout_ci f_scout_ci   52 Jan 10 09:25 parameters.txt
-rw-rw-r-- 1 f_scout_ci f_scout_ci   52 Jan 10 09:24 parameters.txt_backup_2024-01-10_09-25-17Z
-rw-rw-r-- 1 f_scout_ci f_scout_ci    0 Jan 10 09:25 poly_eval.stderr.0
-rw-rw-r-- 1 f_scout_ci f_scout_ci    0 Jan 10 09:25 poly_eval.stdout.0
-rw-rw-r-- 1 f_scout_ci f_scout_ci  184 Jan 10 09:25 poly.out
-rw-rw-r-- 1 f_scout_ci f_scout_ci  128 Jan 10 09:25 STATUS
-rw-rw-r-- 1 f_scout_ci f_scout_ci  501 Jan 10 09:25 status.json
[f_scout_ci@st-linapp1192 iter-0]$ cat JOB_LOG
09:24:48  Calling: /private/f_scout_ci/ert/pytest_tmp_dir/pytest-of-f_scout_ci/pytest-3/popen-gw3/poly_example0/test_data/poly_eval.py
09:25:19  Calling: /private/f_scout_ci/ert/pytest_tmp_dir/pytest-of-f_scout_ci/pytest-3/popen-gw3/poly_example0/test_data/poly_eval.py
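
A quick way to check a runpath for repeated invocations is to count the `Calling:` entries in its JOB_LOG. The sketch below assumes the format is exactly what the two lines above show (a timestamp, then `Calling:`, then the executable path); `count_calls` is an illustrative helper, not part of ert.

```python
from collections import Counter

# The two JOB_LOG lines quoted above, copied verbatim.
JOB_LOG = """\
09:24:48  Calling: /private/f_scout_ci/ert/pytest_tmp_dir/pytest-of-f_scout_ci/pytest-3/popen-gw3/poly_example0/test_data/poly_eval.py
09:25:19  Calling: /private/f_scout_ci/ert/pytest_tmp_dir/pytest-of-f_scout_ci/pytest-3/popen-gw3/poly_example0/test_data/poly_eval.py
"""


def count_calls(job_log_text):
    """Count 'Calling:' entries per executable in JOB_LOG-style text."""
    calls = Counter()
    for line in job_log_text.splitlines():
        # partition() yields an empty separator when 'Calling:' is absent.
        _timestamp, sep, executable = line.partition("Calling:")
        if sep:
            calls[executable.strip()] += 1
    return calls


for executable, count in count_calls(JOB_LOG).items():
    print(f"{count}x {executable}")
```

A count above one only shows that the executable was started more than once in this directory; on its own it does not say why.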

@berland
Contributor

berland commented Jan 10, 2024

A hypothesis is that the runpath poly_example is being reused:

Proven to be a wrong lead.

@berland
Contributor

berland commented Jan 10, 2024

A new lead is that something is wrong with the logger setup.

This command will give a failing test within a few minutes:

$ python -m pytest -n 4 --benchmark-disable --eclipse-simulator --durations=0 -v --dist load -k "integration_cli" tests

while this will pass:

$ python -m pytest -n 4 --benchmark-disable --eclipse-simulator --durations=0 -v --dist load -k "integration_cli" tests --log-cli-level=DEBUG
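
This is not a diagnosis, but one generic way a log level can change what code observes is that a logger drops records below its effective level before any handler sees them. The sketch below illustrates that with the standard library only; the logger name `demo_logger_setup`, the `ListHandler` class and the logged messages are made up for illustration and do not reflect how ert composes its error message.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def captured_at(level):
    """Return the messages a handler receives when the logger is at level."""
    logger = logging.getLogger("demo_logger_setup")  # hypothetical name
    logger.propagate = False  # keep the demo isolated from root handlers
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        logger.debug("detail: job poly_eval failed")  # dropped above DEBUG
        logger.error("Experiment failed!")
    finally:
        logger.removeHandler(handler)
    return handler.messages


print(captured_at(logging.ERROR))
print(captured_at(logging.DEBUG))
```

If any part of an error report were assembled from records collected this way, running with a lower log level could plausibly change the result, which would be consistent with the two commands above behaving differently.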

@berland
Contributor

berland commented Jan 10, 2024

Bisecting the komodo nightly build logs shows that the last good version is from December 18. After that, the good/bad state of the nightly builds is masked by pydantic issues until December 29, when we have the first failure.
