Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set stored kcal to healthy kcal for testing #46941

Merged
merged 7 commits into from
Feb 6, 2021

Conversation

actual-nh
Copy link
Contributor

@actual-nh actual-nh commented Jan 22, 2021

Summary

Infrastructure "Set stored kcal to healthy kcal for testing"

Purpose of change

Randomizing test orders (#46473) reveals that one type of problematic interference is due to variable stored calories from prior activity.

Describe the solution

In clear_character() (in tests/player_helpers.cpp), set stored kcal to healthy kcal. (Note: Setting to any less causes errors; setting to any more puts BMI above 25.)

Testing

See #46473. (From that, some additional setting of vitamins may also be needed, but this is a start.)

Additional context

See #46934 for another case where it appears likely this will help. Never mind on that one; a more frequent symptom is the "test subject" starving to death in the middle of a test...

Randomizing test orders reveals that one type of problematic
interference is due to variable stored calories from prior activity.

(cherry picked from commit 306bfb4cbd817d4db60e92807025549050a576ff)
@BrettDong BrettDong added the Code: Tests Measurement, self-control, statistics, balancing. label Jan 22, 2021
@BrettDong
Copy link
Member

BrettDong commented Jan 28, 2021

Probably you need to update the numbers in unit tests:

../tests/weary_test.cpp:170: FAILED:
1371   CHECK( info.transition_minutes( 0, 1, 130_minutes ) == Approx( 130 ).margin( 5 ) )
1372 with expansion:
1373   120 == Approx( 130.0 )
1374 with messages:
1375   Weariness: 3 Max Full Exert: EXTRA_EXERCISE Mult: 1
1376   BMR: 1741 Intake: 932 Tracker: 7 Thresh: 846 At: 0
1377   Cal: 55393/55000 Fatigue: 289 Morale: 0 Wgt: 76825 (BMI 25.1)
1378   Transition: Weariness lvl from 0 to 1 at 120 min (W 1024 Th 1020)

@actual-nh
Copy link
Contributor Author

actual-nh commented Jan 28, 2021

No longer applicable: I am not sure why the Github build tests are still showing up as "in progress" - the g++-7 one had test failures, and the others were accordingly cancelled.

Probably you need to update the numbers in unit tests:

Looks like it. The debug_weary_info output at the start indicates the prior test had formerly left the calories at over 55000, influencing the test; now, it isn't (at least via calories). That the results differed between g++-7 and the Travis+Appveyor is interesting, and may imply it would be preferable to set those differing by 10 between actual and expected to the mid-point.

../tests/weary_test.cpp:170: FAILED:
1371   CHECK( info.transition_minutes( 0, 1, 130_minutes ) == Approx( 130 ).margin( 5 ) )
1372 with expansion:
1373   120 == Approx( 130.0 )

1377 Cal: 55393/55000 Fatigue: 289 Morale: 0 Wgt: 76825 (BMI 25.1)
1378 Transition: Weariness lvl from 0 to 1 at 120 min (W 1024 Th 1020)

Previous weary testing was distored by variable incoming kcal
stored, with the current default test order giving more kcal
than the expected 55,000.
@actual-nh
Copy link
Contributor Author

Interesting. The earlier build had given differing results between the General (g++-7) and Appveyor (and Travis - don't know yet on this one for Travis; seems to be delayed) for the 24-hour weary digging task - different enough that I couldn't set the expected test results to cover both of them:

Previous Travis:

  Transition: Weariness lvl from 0 to 1 at 120 min (W 1024 Th 1020)
  Transition: Weariness lvl from 1 to 2 at 250 min (W 2020 Th 1000)
  Transition: Weariness lvl from 2 to 3 at 360 min (W 2735 Th 983)
  Transition: Weariness lvl from 3 to 4 at 470 min (W 3228 Th 968)
  Transition: Weariness lvl from 4 to 5 at 590 min (W 3558 Th 951)
  Transition: Weariness lvl from 5 to 6 at 715 min (W 3805 Th 932)
  Transition: Weariness lvl from 6 to 7 at 810 min (W 3945 Th 920)
  Transition: Weariness lvl from 7 to 8 at 885 min (W 4059 Th 910)

Previous Appveyor:

  Transition: Weariness lvl from 0 to 1 at 120 min (W 1027 Th 1020)
  Transition: Weariness lvl from 1 to 2 at 250 min (W 2023 Th 1000)
  Transition: Weariness lvl from 2 to 3 at 360 min (W 2739 Th 983)
  Transition: Weariness lvl from 3 to 4 at 465 min (W 3207 Th 968)
  Transition: Weariness lvl from 4 to 5 at 590 min (W 3549 Th 951)
  Transition: Weariness lvl from 5 to 6 at 715 min (W 3795 Th 933)
  Transition: Weariness lvl from 6 to 7 at 815 min (W 3941 Th 919)
  Transition: Weariness lvl from 7 to 8 at 890 min (W 4056 Th 909)

Previous General (g++7):

  Transition: Weariness lvl from 0 to 1 at 125 min (W 1078 Th 1019)
  Transition: Weariness lvl from 1 to 2 at 250 min (W 2023 Th 1000)
  Transition: Weariness lvl from 2 to 3 at 360 min (W 2737 Th 983)
  Transition: Weariness lvl from 3 to 4 at 470 min (W 3208 Th 968)
  Transition: Weariness lvl from 4 to 5 at 600 min (W 3574 Th 949)
  Transition: Weariness lvl from 5 to 6 at 735 min (W 3765 Th 930)
  Transition: Weariness lvl from 6 to 7 at 840 min (W 3928 Th 916)
  Transition: Weariness lvl from 7 to 8 at 925 min (W 4037 Th 905)

Now, Appveyor (and General g++-7) match(es) the General (g++-7) one above. I think I'll try using sufficiently close to the General/g++-7 numbers above.

@BrettDong
Copy link
Member

What is the reason do you think that the results differ? Is it because some unit test elsewhere corrupted internal state, or is it related to compiler / platform?

At least currently, two different sources (Github General build g++-7
and Appveyor) are giving the same numbers for the 24-hour digging
weary test. This updates the target test numbers to match these,
with a bit of wiggle room in the direction of the earlier results
(partially since Travis hasn't given any results yet).
@actual-nh
Copy link
Contributor Author

actual-nh commented Jan 30, 2021

What is the reason do you think that the results differ? Is it because some unit test elsewhere corrupted internal state, or is it related to compiler / platform?

Appveyor and the others differ in whether tests are run in parallel, so it's difficult to tell there. The Github and Travis runs should not be differing in test order, so there is at least an intermittent compiler/platform issue.

@BrettDong
Copy link
Member

Does it give the exact same result if run unit tests with the same RNG seed and on the exact same environment, say locally, multiple times?

@actual-nh
Copy link
Contributor Author

actual-nh commented Jan 30, 2021

Does it give the exact same result if run unit tests with the same RNG seed and on the exact same environment, say locally, multiple times?

Actually, running just the weary tests 3 times locally, with different orderings (not much reordering for 3 tests, admittedly...) and different RNG seeds, gives consistent results - different from any of the above. I'll try doing a run of the full set of tests locally, to see if those match the earlier local results or something else, but it will take a while.

EDIT: The one g++-7 is complaining about gives the same test results on 2 runs locally; I had not run it the first couple of times I was checking just the 24-hour tests, which I turn out to have actually done 4 times with the same results (scrolled back far enough to see that!); will try another couple of times running just the weary tests as well as a full test run. That the Github g++-7 one is now giving different results, in contrast to those run locally with just the weary tests, says that it's at least partially something from another test - one that uses the differing RNG seed - corrupting internal state.

@actual-nh
Copy link
Contributor Author

actual-nh commented Jan 30, 2021

And now the Github g++-7 one is giving differing results on one of the two weary_recovery tasks... and I didn't touch the targets for those in the most recent commit! (See edited comment above for some thoughts on that.)

Note that one limitation of running all tests is that there doesn't seem a way to do --success for a subset of tests in Catch - so I'm either only going to see where tests fail, or will need to collect lots and lots of output... I suppose it won't be enough for my disk capacity to have problems.

Oh. It would help to just do the subset of tests that's currently in the same process in build.sh; will try that.

@actual-nh
Copy link
Contributor Author

This one has all non-slow tests:
all_test1_log.txt.gz
Extract:
all_test1_weary_log_extract.txt

This one has just the weary tests, run with randomized order 4 times (2 before any changing of test targets, 2 after most recent update of test targets):
all_weary_tests_log.txt

I'm going to try doing just the weary tests 3 times, in the default order (or the ordering in my test list file - need to check which), to see if the RNG seed is what's causing different results in the second 2 of the above.

@actual-nh
Copy link
Contributor Author

Hrm. I'm going to need to put in a clear_avatar() before the initial debug_weary_info. I had meant this to help tell what was happening just before, but it's getting affected by differing heights (and probably ages). This will be temporary, since it is redundant with do_activity having clear_avatar (unless do_activity has a parameter added for whether to clear the avatar; not sure how acceptable this would be).

Variable initial heights and weights is making having debug_weary_info
before do_activity not useful. This adds a clear_avatar before them;
unless do_activity is modified to take a flag as to whether to clear
the avatar itself, this is likely temporary.
@actual-nh
Copy link
Contributor Author

actual-nh commented Jan 30, 2021

OK, with that done, there are no differences from different seeds in the default order (which is not dictated by the filter order in a file, BTW):

No differences between seeds, so long as they're in the same order, is also confirmed by various of the below with randomized order, with one slight exception so far.

Let's see if changing the order, but still only running the weary tests, will do things... yes; the problem with my earlier thinking that it wasn't was inadequate sample size, at least for the ones other than the 24 hour tests, looks like:

  • Different order 1 (assorted/24h/recovery): weary1_orderR_log.txt
  • Default order (assorted/recovery/24h): weary2_orderR_log.txt; duplicates weary2_log.txt
  • Different order 2 (24h/recovery/assorted): weary3_orderR_log.txt
  • Different order 1, new RNG seed: weary4_orderR_log.txt
  • Different order 3 (recovery/assorted/24h), new RNG seed: weary5_orderR_log.txt
  • Different order 2, new RNG seed: weary6_orderR_log.txt; duplicates weary3_orderR_log.txt
  • (Skipped uploading weary7 - duplicate of weary1_orderR_log.txt; it would be very nice if there was some way of specifying the order...)
  • Different order 4 (24h/assorted/recovery), new RNG seed: weary8_orderR_log.txt
  • (Skipped uploading weary9 - duplicate of weary3_orderR_log.txt)
  • (Skipped uploading wearyA - duplicate of weary1_log.txt)
  • (Skipped uploading wearyB - duplicate of weary3_orderR_log.txt)
  • (Skipped uploading wearyC - duplicate of weary5_orderR_log.txt)
  • (Skipped uploading wearyD - duplicate of weary8_orderR_log.txt)
  • Different order 5 (recovery/24h/assorted): wearyE_orderR_log.txt

My current thinking with regard to test targets, BTW, is that the ones with that set of weary tests first is probably the most reliable for what the test should be doing - provided that there isn't other evidence such as weariness level fluctuations indicating otherwise, which is unfortunately true (at least in my local runs) for both the 24 hour digging task and the assorted 8 and 12 hour digging tasks (and, for that matter, with digging 8 hours then waiting/resting 8 hours). The vehicle work one, with combinations of working and resting, is hard to tell, but until I see a run with fewer fluctuations, 995 instead of 980 seems the way to go.

Weary1_orderR_log.txt and weary4_orderR_log.txt, while not differing in order, do differ slightly - the weary thresholds fluctuate a bit in the vehicle work one:

<   Transition: Weariness lvl from 1 to 0 at 980 min (W 938 Th 940)
<   Transition: Weariness lvl from 0 to 1 at 985 min (W 944 Th 941)
<   Transition: Weariness lvl from 1 to 0 at 995 min (W 888 Th 942)
---
>   Transition: Weariness lvl from 1 to 0 at 980 min (W 938 Th 941)
>   Transition: Weariness lvl from 0 to 1 at 985 min (W 944 Th 942)
>   Transition: Weariness lvl from 1 to 0 at 995 min (W 888 Th 943)

OTOH, this is not true for any of the other orderings.

See comments in CleverRaven#46941 - most specifically issuecomment-770297143.
@actual-nh
Copy link
Contributor Author

actual-nh commented Jan 31, 2021

The Mac build (in the General build matrix) failed when the crafting_skill_gain test process aborted:
Assertion failed: (test_world != nullptr), function init_global_game_state, file ../tests/test_main.cpp, line 130.
I somehow doubt this is due to my modifications, particularly since they work fine locally. @BrettDong, if you or someone else on a 10.15+ machine could check out the saved cata_test "artifact"? (BTW, I have to wonder whether there is some unexpected interaction - file access/writing most likely - between the different test processes.)

@BrettDong
Copy link
Member

Your guess is correct. I managed to reproduce this error if I run tests in parallel like build-scripts/build.sh does. I'm looking into this.

@BrettDong
Copy link
Member

Should be fixed in #47134.

@actual-nh
Copy link
Contributor Author

actual-nh commented Jan 31, 2021

I am now somewhat debating whether to remove the added clear_avatar() calls (and the debug_weary_info() calls they make useful); they are otherwise redundant with the same call done in do_activity. But they can give useful information, and there are still problems with weariness according to various of the weary tests (both interaction with other tests and the problematic fluctuation in weary levels during various tests - I may actually add tests to check for those). Any thoughts? @BrettDong? @anothersimulacrum?

@anothersimulacrum
Copy link
Member

If it gives more information that may help diagnose a test failure, I don't see reason to strip it out.

@actual-nh
Copy link
Contributor Author

If it gives more information that may help diagnose a test failure, I don't see reason to strip it out.

Well, the Travis tests finally completed with no errors; I think this is ready.

Comment on lines 72 to 74
// This sets HP to max, clears addictions and morale,
// and sets hunger, thirst, fatigue and such to zero
dummy.environmental_revert_effect();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyone have any idea whether this originally intended to reset calories along with hunger "and such"?

@kevingranade kevingranade merged commit 2f8f32f into CleverRaven:master Feb 6, 2021
@actual-nh actual-nh deleted the set_test_kcal branch February 6, 2021 02:30
actual-nh added a commit to actual-nh/Cataclysm-DDA that referenced this pull request Feb 6, 2021
Start on adding tests for unrealistic fluctuations in weary level;
see CleverRaven#46384 (and some cases in CleverRaven#46941) for example problems. The initial
tests look for problems with the weary_recovery task of digging for 8
hours then waiting for 8 hours; weary level should not go down in the
first 8 hours, and should not go up in the second 8 hours.
feinorgh pushed a commit to feinorgh/Cataclysm-DDA that referenced this pull request Feb 8, 2021
* Set stored kcal to healthy kcal for testing

Randomizing test orders reveals that one type of problematic
interference is due to variable stored calories from prior activity.

(cherry picked from commit 306bfb4cbd817d4db60e92807025549050a576ff)

* Match (some) weary testing to kcal resetting

Previous weary testing was distored by variable incoming kcal
stored, with the current default test order giving more kcal
than the expected 55,000.

* Update numbers to match consensus test results

At least currently, two different sources (Github General build g++-7
and Appveyor) are giving the same numbers for the 24-hour digging
weary test. This updates the target test numbers to match these,
with a bit of wiggle room in the direction of the earlier results
(partially since Travis hasn't given any results yet).

* Clear_avatar before initial debug_weary_info

Variable initial heights and weights is making having debug_weary_info
before do_activity not useful. This adds a clear_avatar before them;
unless do_activity is modified to take a flag as to whether to clear
the avatar itself, this is likely temporary.

* Change the last target for the vehicle weary test

See comments in CleverRaven#46941 - most specifically issuecomment-770297143.

Co-authored-by: actual-nh <[email protected]>
actual-nh added a commit to actual-nh/Cataclysm-DDA that referenced this pull request Feb 21, 2021
Start on adding tests for unrealistic fluctuations in weary level;
see CleverRaven#46384 (and some cases in CleverRaven#46941) for example problems. The initial
tests look for problems with the weary_recovery task of digging for 8
hours then waiting for 8 hours; weary level should not go down in the
first 8 hours, and should not go up in the second 8 hours.

(cherry picked from commit bdd942b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Code: Tests Measurement, self-control, statistics, balancing.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants