Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround random test_suite_platform fail in time test #7419

Merged

Conversation

yuhaoth
Copy link
Contributor

@yuhaoth yuhaoth commented Apr 10, 2023

Description

We got random test_suite_platform fails in CI tests
The CI reports with this PR

Time: delay seconds ............................................... FAILED
  elapsed_secs >= delay_secs
  at line 87, /var/lib/build/tests/suites/test_suite_platform.function

I think it is due to time(time_t*) returns Epoch time which is discontinuous.

Gatekeeper checklist

  • changelog Not required, this is testing
  • backport .Not required, test not present in LTS
  • tests This is a tests change

Notes for the submitter

Please refer to the contributing guidelines, especially the
checklist for PR contributors.

@yuhaoth yuhaoth marked this pull request as draft April 10, 2023 09:08
@yuhaoth yuhaoth force-pushed the test/random-time-test-fail branch from dfb63fa to c9c3e62 Compare April 11, 2023 06:18
@yuhaoth yuhaoth changed the title WIP: test random test fails Workaround random test_suite_platform fail Apr 11, 2023
@yuhaoth yuhaoth added bug needs-review Every commit must be reviewed by at least two team members, needs-ci Needs to pass CI tests needs-reviewer This PR needs someone to pick it up for review priority-high High priority - will be reviewed soon and removed DO-NOT-MERGE labels Apr 11, 2023
@yuhaoth yuhaoth marked this pull request as ready for review April 11, 2023 07:50
@yuhaoth yuhaoth added component-test Test framework and CI scripts and removed needs-ci Needs to pass CI tests labels Apr 13, 2023
tests/suites/test_suite_platform.function Outdated Show resolved Hide resolved
tests/suites/test_suite_platform.function Outdated Show resolved Hide resolved
tests/suites/test_suite_platform.function Outdated Show resolved Hide resolved
Signed-off-by: Jerry Yu <[email protected]>
@yuhaoth yuhaoth force-pushed the test/random-time-test-fail branch from 398386d to 2f1e85f Compare April 17, 2023 08:53
xkqian
xkqian previously approved these changes Apr 17, 2023
Copy link
Contributor

@xkqian xkqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -84,7 +84,16 @@ void time_delay_seconds(int delay_secs)
sleep_ms(delay_secs * 1000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need these tests?

Historically, we had somewhat similar tests in the timing module. They often failed on the CI and so we ended up removing them. I fear that we've reintroduced the problem, and this pull request is just one of many.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Should I remove these tests here? Or another PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delay test were removed.

I think we should keep *get

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely removing the tests may be too much. How about keeping the calls to the functions (so at least we know they don't e.g. crash), but not testing delays?

Possibly even check that t1 = ms_time(); sleep(small); t2 = ms_time(); ASSERT(t2 > t1). @yuhaoth Can you clarify

Built-in mbedtls_time function returns the number of seconds since the
Epoch. That is affected by discontinuous jumps and cause test fail.
Workaround it with 1 seconds tollerance.
Epoch. That is affected by discontinuous jumps. And nanosleep use
CLOCK_MONOTONIC(monotonically-increasing time source), That will cause
negative elapsed time difference.

What can cause a negative elapsed time difference? E.g. Can this happen from automatic drift adjustment? I would have expected that t2 - t1 might be less than small, but not negative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this happen from automatic drift adjustment?

It happens only in time_delay_seconds because time source are different. Built-in mbedtls_time was defined as standard time function, the time source is CLOCK_REALTIME . And nanosleep take CLOCK_MONOTONIC as time source.

If CLOCK_MONOTONIC is faster than CLOCK_REALTIME and CLOCK_REALTIME was adjusted during sleep, sometime t2 - t1 < small happens.

And I think time_delay_milliseconds should not be removed now :) . It use same time source.

Possibly even check that t1 = ms_time(); sleep(small); t2 = ms_time(); ASSERT(t2 > t1).

This check can not resolve the issue.

Copy link
Contributor Author

@yuhaoth yuhaoth Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just update the comments. But I did not mention automatic drift adjustment. I think that's enough.

And I revert last commit without Time: delay seconds test

See Mbed-TLS#1517. They often failed on the CI.

Signed-off-by: Jerry Yu <[email protected]>
@yuhaoth yuhaoth force-pushed the test/random-time-test-fail branch from 1d7ddfb to 4852bb8 Compare April 18, 2023 07:02
usleep(milliseconds * 1000);
#endif
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI says this endif needs to stay

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fixed

yuhaoth added 2 commits April 18, 2023 17:01
The test has some issues we can not avoid. Put
it in code to avoid it is re-inroduced again

Signed-off-by: Jerry Yu <[email protected]>
* CLOCK_REALTIME and returns the number of seconds since the Epoch. And
* `nanosleep` uses CLOCK_MONOTONIC. The time sources are out of sync.
*
* If CLOCK_MONOTONIC is faster than CLOCK_REALTIME and `nanosleep` exits at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely it's not really about whether one clock is "faster" than the other (I would expect them to tick at the same rate all other things being equal) but if one is adjusted and the other not - which is what happens when one is a "wallclock" timer and the other is a monotonic timer

Copy link
Contributor Author

@yuhaoth yuhaoth Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's more easily understand. I will change that.

But I think this problem can be abstracted into two different rate clock problems. The wall clock is come from remote and another one come from local. Due to implementation issues,wall clock shows discontinue jumps problem. If wall clock is updated very frequently,discontinue jumps will disappear. And user will get two different rate clocks.

I would expect them to tick at the same rate all other things being equal

It should not be expected, CPU monotonic clock source come from crystal oscillator with PLL. They are not high precision. And due to many reason,it might run faster or slower than standard time. That's why we need NTP service to adjust the time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wall clock is come from remote

I don't understand. "Wall clock" in this context means that this particular timer from the kernel should match as closely as possible to the time that someone looking at their watch would see. So time may be stepped forwards or backwards as daylight savings happens (for example). A monotonic clock must by definition only ever increase.

The ticks that advance these clocks come from local hardware, which may not be precise. So there is frequency adjustment, which should affect all clocks, so that when each clock says "one second has passed" as close to one second has possible has actually passed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Wall clock" in this context means that this particular timer from the kernel should match as closely as possible to the time that someone looking at their watch would see.

I mean it appears to be from a remote server if updated fast enough.

So time may be stepped forwards or backwards as daylight savings happens (for example). A monotonic clock must by definition only ever increase.

I do not think daylight savings will affect the value of time() , for it is the number of seconds since the Epoch .

which should affect all clocks

No. CLOCK_BOOTTIME and CLOCK_*_CPUTIME_ID will not be affected by time adjustment. CLOCK_MONOTONIC is affected by the incremental adjustments performed, that's different with CLOCK_REALTIME.
If decreasing adjustment peformed, CLOCK_REALTIME will change and CLOCK_MONOTONIC will not change.

Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I'm not sure I like time_delay_milliseconds as it is, but I don't want to increase the scope of this pull request so I am not requesting any changes there. The priority is to avoid random failures.

@@ -76,6 +74,13 @@ void time_delay_milliseconds(int delay_ms)
/* END_CASE */

/* BEGIN_CASE depends_on:MBEDTLS_HAVE_TIME */

/*
* WARNING: DONOT ENABLE THIS TEST. RESERVE IT HERE TO KEEP THE REASON.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: English:

Suggested change
* WARNING: DONOT ENABLE THIS TEST. RESERVE IT HERE TO KEEP THE REASON.
* WARNING: DO NOT ENABLE THIS TEST. We keep the code here to document the reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, could this change be made, then I will approve

@tom-cosgrove-arm tom-cosgrove-arm removed the needs-reviewer This PR needs someone to pick it up for review label Apr 19, 2023
Signed-off-by: Jerry Yu <[email protected]>
@gilles-peskine-arm gilles-peskine-arm changed the title Workaround random test_suite_platform fail Workaround random test_suite_platform fail in time test Apr 28, 2023
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving this because I'd like the random failues to stop. I'm not fully happy with keeping dead code, but we can remove it later.

Copy link
Contributor

@xkqian xkqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tom-cosgrove-arm
Copy link
Contributor

tom-cosgrove-arm commented Apr 28, 2023

@xkqian If you're the second person approving this PR, could you set the approved label and remove needs-review?

@gilles-peskine-arm gilles-peskine-arm removed the needs-review Every commit must be reviewed by at least two team members, label Apr 28, 2023
@gilles-peskine-arm gilles-peskine-arm merged commit 14d6b11 into Mbed-TLS:development Apr 28, 2023
@xkqian
Copy link
Contributor

xkqian commented May 4, 2023

@xkqian If you're the second person approving this PR, could you set the approved label and remove needs-review?

Sorry, I thought maybe also need your approval and forgot to ping you @tom-cosgrove-arm . I will take care next time. Thanks.

@yuhaoth yuhaoth deleted the test/random-time-test-fail branch December 6, 2023 05:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug component-test Test framework and CI scripts priority-high High priority - will be reviewed soon
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants