Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on 8.13 branch #39613

Closed
dliappis opened this issue May 17, 2024 · 5 comments · Fixed by #39860
Assignees
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@dliappis
Copy link
Contributor

Flaky Test

Stack Trace

tests/system/test_registrar.py

Buildkite failure link: https://buildkite.com/elastic/filebeat/builds/6343#018f825c-39a5-4588-8257-9bac31116e69/1804-1935

=================================== FAILURES ===================================
________________________ Test.test_restart_state_reset _________________________
self = <test_registrar.Test testMethod=test_restart_state_reset>
    def test_restart_state_reset(self):
        """
        Test that ttl is set to -1 after restart and no inputs covering it
        """
        self.render_config_template(
            path=os.path.abspath(self.working_dir) + "/log/*",
            clean_inactive="10s",
            ignore_older="5s"
        )
        os.mkdir(self.working_dir + "/log/")
        testfile_path = self.working_dir + "/log/test.log"
        with open(testfile_path, 'w') as testfile:
            testfile.write("Hello World\n")
        filebeat = self.start_beat()
        # Wait until state written
>       self.wait_until(
            lambda: self.output_has(lines=1),
            max_timeout=30)
tests/system/test_registrar.py:991:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test_registrar.Test testMethod=test_restart_state_reset>
cond = <function Test.test_restart_state_reset.<locals>.<lambda> at 0x10eb79870>
max_timeout = 30, poll_interval = 0.1, name = 'cond', err_msg = ''
    def wait_until(self, cond, max_timeout=20, poll_interval=0.1, name="cond", err_msg=""):
        """
        TODO: this can probably be a "wait_until_output_count", among other things, since that could actually use `self`, and this can become an internal function
        Waits until the cond function returns true,
        or until the max_timeout is reached. Calls the cond
        function every poll_interval seconds.
        If the max_timeout is reached before cond() returns
        true, an exception is raised.
        """
        start = datetime.now()
        while not cond():
            if datetime.now() - start > timedelta(seconds=max_timeout):
                print("Test has failed, here are the Beat logs")
                for l in self.get_log_lines():
                    print(l)
>               raise WaitTimeoutError(
                    f"Timeout waiting for condition '{name}'. Waited {max_timeout} seconds: {err_msg}")
E               beat.beat.WaitTimeoutError: Timeout waiting for condition 'cond'. Waited 30 seconds:
../libbeat/tests/system/beat/beat.py:449: WaitTimeoutError

tests/system/test_shutdown.py

Buildkite failure link: https://buildkite.com/elastic/filebeat/builds/6343#018f858e-f9e1-46d0-ac2a-806e13454195/1889-1939

______________________________ Test.test_shutdown ______________________________
self = <test_shutdown.Test testMethod=test_shutdown>
    @unittest.skipIf(platform.platform().startswith("Windows-7"),
                     "Flaky test: https://github.com/elastic/beats/issues/22795")
    def test_shutdown(self):
        """
        Test starting and stopping Filebeat under load.
        """
        self.nasa_logs()
        self.render_config_template(
            path=os.path.abspath(self.working_dir) + "/log/*",
            ignore_older="1h"
        )
        for i in range(1, 5):
            proc = self.start_beat(logging_args=["-e", "-v"])
            time.sleep(.5)
>           proc.check_kill_and_wait()
tests/system/test_shutdown.py:31:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../libbeat/tests/system/beat/beat.py:153: in check_kill_and_wait
    return self.check_wait(exit_code=exit_code)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <beat.beat.Proc object at 0x105d7a9e0>, exit_code = 0
    def check_wait(self, exit_code=0):
        """
        check_wait waits for the process to exit, and checks the return code of the process
        """
        actual_exit_code = self.wait()
>       assert actual_exit_code == exit_code, f"Expected exit code to be {exit_code}, but it was {actual_exit_code}"
E       AssertionError: Expected exit code to be 0, but it was -15
../libbeat/tests/system/beat/beat.py:133: AssertionError
- generated xml file: /Users/admin/builds/bk-agent-prod-orka-1715932705515043799/elastic/filebeat/filebeat/build/TEST-python-unit.xml -
============================= slowest 20 durations =============================
27.73s call     filebeat/tests/system/test_crawler.py::Test::test_encodings
15.23s call     filebeat/tests/system/test_harvester.py::Test::test_close_timeout
10.66s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_same_input
9.70s call     filebeat/tests/system/test_stdin.py::Test::test_stdin_eof
9.20s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl_with_space
8.49s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state
8.15s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl_no_clean_inactive
8.05s call     filebeat/tests/system/test_registrar.py::Test::test_registry_file_default_permissions
7.30s call     filebeat/tests/system/test_registrar.py::Test::test_rotating_file_with_restart
7.28s call     filebeat/tests/system/test_registrar.py::Test::test_rotating_file_inode
7.03s call     filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl
6.95s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_add
6.91s call     filebeat/tests/system/test_reload_inputs.py::Test::test_start_stop
6.87s call     filebeat/tests/system/test_registrar.py::Test::test_state_after_rotation_ignore_older
6.82s call     filebeat/tests/system/test_reload_inputs.py::Test::test_reload_same_config
6.54s call     filebeat/tests/system/test_harvester.py::Test::test_decode_error
6.48s call     filebeat/tests/system/test_crawler.py::Test::test_utf8
6.36s call     filebeat/tests/system/test_input.py::Test::test_rotating_close_inactive_low_write_rate
6.27s call     filebeat/tests/system/test_input.py::Test::test_harvester_limit
6.13s call     filebeat/tests/system/test_input.py::Test::test_input_filter_dropfields
=========================== short test summary info ============================
FAILED tests/system/test_shutdown.py::Test::test_shutdown - AssertionError: Expected exit code to be 0, but it was -15
============ 1 failed, 189 passed, 175 skipped in 608.83s (0:10:08) ============
@dliappis dliappis added the flaky-test Unstable or unreliable test cases. label May 17, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 17, 2024
@dliappis dliappis added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 17, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 17, 2024
@dliappis
Copy link
Contributor Author

Also seen in CI runs for merge commits against main e.g. in https://buildkite.com/elastic/filebeat/builds/6466

@alexsapran
Copy link
Contributor

@pierrehilbert this is something that is affecting your CI metrics.

@VihasMakwana VihasMakwana self-assigned this Jun 10, 2024
@VihasMakwana
Copy link
Contributor

This test is related to this https://elastic.slack.com/archives/C047NDNGUMU/p1717091056290869.
On MacOS, the FQDN lookup takes some time (~5 seconds) before signal handlers are even set up.

We can fix this behaviour in either of the following ways:

  1. Wait for a bit longer (~10 seconds) for MacOS for the filebeat to start up.
  2. Skip this test, as the underlying FQDN is itself flaky and throws warning.

cc: @pierrehilbert

@pierrehilbert
Copy link
Collaborator

Let's go with the option 1 and if this is not solving the issue, let's skip the test only on MacOS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants