
improve SteadyTimer tests #1129

Closed
wants to merge 1 commit

Conversation

dirk-thomas
Member

Rebased and squashed #1120 in the hope that it further reduces the flakiness of the SteadyTimer test.

@dirk-thomas
Member Author

I triggered five builds of the PR job (266 - 270). Four of the five passed, but one still failed the test. That already makes the test less flaky, but it would be nice to make it even less so...

Instead of checking when the callback was actually called, check when it was added to the callback queue.

This *should* make the test more reliable.
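
A minimal sketch of the idea (not the actual patch from #1120), assuming a custom queue installed via ros::NodeHandle::setCallbackQueue; all class and member names here are made up:

```cpp
// Sketch only: record the time at which each callback is *queued*, so that
// scheduler/spinner delays in actually running it do not count against the timer.
#include <ros/callback_queue.h>
#include <ros/time.h>
#include <boost/thread/mutex.hpp>
#include <vector>

class EnqueueTimeRecordingQueue : public ros::CallbackQueue
{
public:
  virtual void addCallback(const ros::CallbackInterfacePtr& callback, uint64_t owner_id = 0)
  {
    {
      boost::mutex::scoped_lock lock(mutex_);
      enqueue_times_.push_back(ros::SteadyTime::now());
    }
    ros::CallbackQueue::addCallback(callback, owner_id);
  }

  std::vector<ros::SteadyTime> enqueue_times_;

private:
  boost::mutex mutex_;
};
```

The test would then compare enqueue_times_ against the expected firing times instead of the times at which the callbacks eventually ran.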
@dirk-thomas force-pushed the improve_steady_timer_tests branch from 8b13dbb to a371c9f on August 10, 2017 17:44
@dirk-thomas
Member Author

I increased the late-threshold from 0.2s to 0.5s. All newly triggered builds (271 - 278) passed (one was unstable but due to a different flaky test). Calling it a day. Thanks @flixr for the original patch.

@dirk-thomas
Member Author

And the ninth and final build failed again 😞 @flixr Please look into this test and propose a patch which passes reliably. This is rather urgent since Lunar is waiting for ros_comm to be re-released and I only want to do that with a green CI.

@kmactavish
Contributor

Is there any point to the late check now? It has been relaxed so much it might as well be removed. I vote we just get rid of it. The build farm is simply not real-time enough.

The whole test is by nature stochastic, which is not ideal. An improved (but extreme) test would be to make sure that SteadyTimer does not trigger early when the system time is intentionally jumped back.
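
For reference, a rough sketch of the clock-level property such an "extreme" test would rely on; jumping CLOCK_REALTIME needs root/CAP_SYS_TIME, which is exactly why it is impractical on a shared buildfarm. A real test would wrap this around a SteadyTimer and assert that the callback still fires on schedule:

```cpp
// Sketch only: CLOCK_MONOTONIC (the basis of SteadyTime) must be unaffected by
// a backwards jump of CLOCK_REALTIME (the basis of WallTime).
#include <assert.h>
#include <stdio.h>
#include <time.h>

int main()
{
  timespec mono_before, mono_after, wall;
  clock_gettime(CLOCK_MONOTONIC, &mono_before);

  // Jump the wall clock back by one hour (requires privileges).
  clock_gettime(CLOCK_REALTIME, &wall);
  wall.tv_sec -= 3600;
  if (clock_settime(CLOCK_REALTIME, &wall) != 0)
  {
    perror("clock_settime");
    return 77;  // bail out if we lack permission to change the clock
  }

  clock_gettime(CLOCK_MONOTONIC, &mono_after);
  // The monotonic clock must not have gone backwards.
  assert(mono_after.tv_sec >= mono_before.tv_sec);
  return 0;
}
```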

@dirk-thomas
Member Author

With the introduction of SteadyTime the TimerManager has been switched to use that instead of WallTime. I would expect the test coverage for the new SteadyTimer to be at least as good as the existing test coverage for WallTimer, so just removing the test is imo not an option. In fact, the tests for these two timers should be identical (and pass) in order to show that this change likely doesn't introduce any regressions.

@kmactavish
Contributor

kmactavish commented Aug 10, 2017

Right, I meant just the late check, not the whole test.

I just saw that there is the WallTimerHelper, which is essentially the same. Ideally there would be less code duplication between those two test helpers so that they would be nearly identical.
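
A hypothetical sketch of how that deduplication could look, templating the helper on the time primitives and instantiating it for both clocks (the *Sketch names are made up and the logic is only illustrative):

```cpp
#include <ros/ros.h>
#include <vector>

// Sketch only: one templated helper shared by the wall-clock and steady-clock
// tests, so both timer types are exercised by identical logic.
template <typename TimeT, typename EventT>
struct TimerHelperSketch
{
  void callback(const EventT& /*event*/)
  {
    call_times_.push_back(TimeT::now());  // the shared timing checks would live here
  }

  std::vector<TimeT> call_times_;
};

typedef TimerHelperSketch<ros::WallTime, ros::WallTimerEvent> WallTimerHelperSketch;
typedef TimerHelperSketch<ros::SteadyTime, ros::SteadyTimerEvent> SteadyTimerHelperSketch;
```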

That raises the question: why is SteadyTimer failing and WallTimer not? The WallTimer test has been there for 7 years. Are the builds using Docker, or virtual machines? (edit: I see it's Docker, there goes that theory)

@flixr
Contributor

flixr commented Aug 11, 2017

@dirk-thomas I could never reproduce a test failure on my machine, so this is a bit hard to investigate.
Is there a way to get more detailed test failure output on Jenkins?
Or where can I find the exact Dockerfile and the steps to run it in the same configuration as on the buildfarm?

@dirk-thomas
Member Author

Is there a way to get more detailed test failure output on Jenkins?

I am not sure what detailed information you are referring to besides the console output and the JUnit result file. Maybe you can clarify what you are looking for.

Or where can I find the exact Dockerfile and the steps to run it in the same configuration as on the buildfarm?

You can run the devel job locally following these instructions: https://github.com/ros-infrastructure/ros_buildfarm/blob/master/doc/jobs/devel_jobs.rst#run-the-devel-job-locally

@flixr
Contributor

flixr commented Aug 11, 2017

I am not sure what detailed information you are referring to besides the console output and the JUnit result file. Maybe you can clarify what you are looking for.

I only found the standard output. The test contains lines like
ROS_ERROR("Call came at wrong time (%f vs. %f)", expected_next_call_.toSec(), start.toSec());
that would help, but I couldn't find this output anywhere...
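
For context, roughly the shape of the check around that ROS_ERROR, reconstructed from the quoted line; apart from expected_next_call_, the names and the 0.5s late threshold (the value this PR settles on) are assumptions, not the exact test source:

```cpp
#include <ros/console.h>
#include <ros/time.h>

// Reconstructed sketch of the helper's timing check, not the actual test code.
struct SteadyTimerCheckSketch
{
  ros::SteadyTime expected_next_call_;
  ros::WallDuration period_;
  bool failed_;

  void callback()
  {
    ros::SteadyTime start = ros::SteadyTime::now();
    // Too early, or later than the relaxed 0.5s threshold -> failure.
    if (start < expected_next_call_ ||
        start > expected_next_call_ + ros::WallDuration(0.5))
    {
      ROS_ERROR("Call came at wrong time (%f vs. %f)", expected_next_call_.toSec(), start.toSec());
      failed_ = true;
    }
    expected_next_call_ = expected_next_call_ + period_;
  }
};
```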

You can run the devel job locally following these instructions: https://github.com/ros-infrastructure/ros_buildfarm/blob/master/doc/jobs/devel_jobs.rst#run-the-devel-job-locally

Thanks, will check this out...

@mikepurvis
Member

I don't have much to contribute here, except to say that we run our tests in a heavily virtualized environment and experience a ton of this kind of thing, particularly as the tests grow in complexity:
when your thread sleeps and you're suddenly at the mercy of both the Linux scheduler and an external hypervisor, you really have no idea when you're going to wake up again.

And yet, these kinds of tests still do have value. Basically, I think this reality has to be accepted, and should be fixed at the tooling layer: specifically, rostest should learn to execute N retries on a failing test to truly prove the failure, rather than giving up the first time.

@dirk-thomas
Member Author

@flixr I only found the standard output. The test contains lines like ROS_ERROR("Call came at wrong time (%f vs. %f)", expected_next_call_.toSec(), start.toSec()); that would help, but I couldn't find this output anywhere...

I think this is a problem with rostest, which doesn't capture stderr correctly in all cases.

@mikepurvis rostest should learn to execute N retries on a failing test to truly prove the failure, rather than giving up the first time.

While that might be valuable in some cases, I disagree that it should be applied generally. A test which fails only sometimes might still indicate a real bug and shouldn't be swept under the carpet. Imo the tests should be written in a way that makes them resilient to these kinds of timing differences.

@kmactavish
Contributor

kmactavish commented Aug 11, 2017

@mikepurvis Hypervisor scheduling plus clock adjustment was my thought if the slaves were VMs, but Docker is pretty close to bare metal... What is strange is that there has never been a problem with CLOCK_REALTIME (WallTime), only with CLOCK_MONOTONIC (SteadyTime), in nearly identical tests. You would think that WallTime would be just as susceptible if the failures were caused by slow wakeup from the sleep. It does seem likely that it has to do with virtualization/containerization, since the tests seem to pass fine on PCs.

@flixr If you're going to try to debug locally, you could try testing under high load (to simulate parallel builds) with stress, and tighten the thresholds to be more sensitive.

@mikepurvis
Member

@kmactavish Yeah, but the dockerized buildfarm slaves are themselves running on Amazon's infrastructure: http://build.ros.org/computer/

So yeah, tests that are perfectly reliable running on someone's laptop or a robot computer can absolutely fall to pieces when attempted by the buildfarm machines. Having at least a per-test option in CMake for "please retry N times" would be great, and would help avoid what's happened here, where a test's timing constraints are relaxed to the point where it is no longer all that meaningful.

@flixr
Contributor

flixr commented Aug 11, 2017

I think this is a problem with rostest, which doesn't capture stderr correctly in all cases.

@dirk-thomas Is there an issue for this? Any idea how to fix that?
It is impossible to tell what is going wrong, or where, without that info.
If we can get #1130 to fail with this info, it should at least give us some more insight.

In general I think it is also impossible to make these tests 100% reliable, as @mikepurvis pointed out.
But it is indeed curious that there didn't seem to be a problem with the multipleWallTimeCallbacks test so far...

@dirk-thomas
Member Author

Is there an issue for this?

No.

Any idea how to fix that?

No.

It is impossible to tell what is going wrong, or where, without that info.
If we can get #1130 to fail with this info, it should at least give us some more insight.

I have looked at it in the past and it didn't seem like an easy problem to address.

In general I think it is also impossible to make these tests 100% reliable, as @mikepurvis pointed out.
But it is indeed curious that there didn't seem to be a problem with the multipleWallTimeCallbacks test so far...

The fact that this did work and pass its test before and doesn't now is alarming enough for me not to release the current state. Maybe it is not that important and only a "test issue", but I don't want to take a risk with this.

@kmactavish
Contributor

kmactavish commented Aug 11, 2017

Update: I can get the tests to fail for both multipleSteadyTimeCallbacks and multipleWallTimeCallbacks on my i7 laptop:

  • by reducing the threshold to ~0.012s late, under normal load,
  • by reducing the threshold to ~0.07s late (only 30% tighter than now), and running:
    stress --cpu 8 --io 4 --vm 4 --vm-bytes 128M.

In short, these tests have always been flaky; SteadyTimer just doubled the number of them, increasing the chance of a failure. It also runs first of the two, so it's the first failure you see.

Proposed fix: keep the tight check for the timer being called early, and relax the late check for both helpers to 1s or even 10s, since it is unreliable in high-load, containerized test environments.

P.S. @flixr I added a cout above the ROS_ERROR to see this. It looks like ROS_ERROR is only captured for the first test, and not subsequent ones.

@dirk-thomas
Member Author

Replaced by #1132.

@dirk-thomas deleted the improve_steady_timer_tests branch on August 11, 2017 21:30