Set stored kcal to healthy kcal for testing #46941
Randomizing test orders reveals that one type of problematic interference is due to variable stored calories from prior activity. (cherry picked from commit 306bfb4cbd817d4db60e92807025549050a576ff)
Probably you need to update the numbers in unit tests:
No longer applicable:
Looks like it. The debug_weary_info output at the start indicates that the prior test formerly left the calories at over 55000, which influenced the test; now it no longer does (at least via calories). That the results differed between g++-7 and Travis+Appveyor is interesting, and may imply it would be preferable to set the targets that differ by 10 between actual and expected to the mid-point.
…anch 'master' of https://github.com/CleverRaven/Cataclysm-DDA into set_test_kcal
Previous weary testing was distorted by variable incoming stored kcal, with the current default test order giving more kcal than the expected 55,000.
Interesting. The earlier build had given differing results between the General (g++-7) and Appveyor (and Travis - don't know yet on this one for Travis; seems to be delayed) for the 24-hour weary digging task - different enough that I couldn't set the expected test results to cover both of them: Previous Travis:
Previous Appveyor:
Previous General (g++7):
Now, Appveyor (and General g++-7) match(es) the General (g++-7) one above. I think I'll try using targets sufficiently close to the General/g++-7 numbers above.
What do you think is the reason that the results differ? Is it because some unit test elsewhere corrupted internal state, or is it related to compiler / platform?
At least currently, two different sources (Github General build g++-7 and Appveyor) are giving the same numbers for the 24-hour digging weary test. This updates the target test numbers to match these, with a bit of wiggle room in the direction of the earlier results (partially since Travis hasn't given any results yet).
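The idea of a target with "a bit of wiggle room" can be sketched as a midpoint plus a margin. This is only an illustration: the helper names and the numbers (two results that differ by 10) are hypothetical and are not taken from the actual test logs. If the suite uses Catch2, the same comparison is usually written with `Approx( target ).margin( m )`.

```cpp
#include <cassert>
#include <cmath>

// Midpoint of two observed results, used as a compromise target.
double midpoint_target( double first_result, double second_result )
{
    return ( first_result + second_result ) / 2.0;
}

// True when the actual value is no further than `margin` from the target.
bool within_margin( double actual, double target, double margin )
{
    return std::fabs( actual - target ) <= margin;
}

// Illustrative check with made-up numbers: two platforms report 100 and 110.
void check_midpoint_covers_both()
{
    const double target = midpoint_target( 100.0, 110.0 );
    assert( within_margin( 100.0, target, 5.0 ) );
    assert( within_margin( 110.0, target, 5.0 ) );
    assert( !within_margin( 111.0, target, 5.0 ) );
}
```

A margin equal to half the gap accepts both observed results and nothing further out; any extra wiggle room is then a deliberate addition rather than an accident of which platform ran first.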
Appveyor and the others differ in whether tests are run in parallel, so it's difficult to tell there. The Github and Travis runs should not be differing in test order, so there is at least an intermittent compiler/platform issue.
Does it give the exact same result if you run the unit tests with the same RNG seed in the exact same environment, say locally, multiple times?
Actually, running just the weary tests 3 times locally, with different orderings (not much reordering for 3 tests, admittedly...) and different RNG seeds, gives consistent results - different from any of the above. I'll try doing a run of the full set of tests locally, to see if those match the earlier local results or something else, but it will take a while.
EDIT: The one g++-7 is complaining about gives the same test results on 2 runs locally; I had not run it the first couple of times, when I was checking just the 24-hour tests (which it turns out I actually ran 4 times with the same results, once I scrolled back far enough to see that!). I will try another couple of runs of just the weary tests, as well as a full test run. That the Github g++-7 one is now giving different results, in contrast to those run locally with just the weary tests, says that it's at least partially something from another test - one that uses the differing RNG seed - corrupting internal state.
And now the Github g++-7 one is giving differing results on one of the two weary_recovery tasks... and I didn't touch the targets for those in the most recent commit! (See edited comment above for some thoughts on that.) Note that one limitation of running all tests is that there doesn't seem to be a way to do Oh. It would help to just do the subset of tests that's currently in the same process in build.sh; will try that.
This one has all non-slow tests: This one has just the weary tests, run with randomized order 4 times (2 before any changing of test targets, 2 after most recent update of test targets): I'm going to try doing just the weary tests 3 times, in the default order (or the ordering in my test list file - need to check which), to see if the RNG seed is what's causing different results in the second 2 of the above.
Hrm. I'm going to need to put in a clear_avatar() before the initial debug_weary_info. I had meant this to help tell what was happening just before, but it's getting affected by differing heights (and probably ages). This will be temporary, since it is redundant with do_activity having clear_avatar (unless do_activity has a parameter added for whether to clear the avatar; not sure how acceptable this would be).
Variable initial heights and weights are making it not useful to have debug_weary_info before do_activity. This adds a clear_avatar before them; unless do_activity is modified to take a flag as to whether to clear the avatar itself, this is likely temporary.
OK, with that done, there are no differences from different seeds in the default order (which is not dictated by the filter order in a file, BTW): No differences between seeds, so long as they're in the same order, is also confirmed by various of the below with randomized order, with one slight exception so far. Let's see if changing the order, but still only running the weary tests, will do things... yes; the problem with my earlier thinking that it wasn't was inadequate sample size, at least for the ones other than the 24 hour tests, looks like:
My current thinking with regard to test targets, BTW, is that the ones with that set of weary tests first is probably the most reliable for what the test should be doing - provided that there isn't other evidence such as weariness level fluctuations indicating otherwise, which is unfortunately true (at least in my local runs) for both the 24 hour digging task and the assorted 8 and 12 hour digging tasks (and, for that matter, with digging 8 hours then waiting/resting 8 hours). The vehicle work one, with combinations of working and resting, is hard to tell, but until I see a run with fewer fluctuations, 995 instead of 980 seems the way to go. Weary1_orderR_log.txt and weary4_orderR_log.txt, while not differing in order, do differ slightly - the weary thresholds fluctuate a bit in the vehicle work one:
OTOH, this is not true for any of the other orderings.
See comments in CleverRaven#46941 - most specifically issuecomment-770297143.
The Mac build (in the General build matrix) failed when the crafting_skill_gain test process aborted:
Your guess is correct. I managed to reproduce this error if I run tests in parallel like
Should be fixed in #47134.
I am now somewhat debating whether to remove the added
If it gives more information that may help diagnose a test failure, I don't see a reason to strip it out.
Well, the Travis tests finally completed with no errors; I think this is ready. |
// This sets HP to max, clears addictions and morale,
// and sets hunger, thirst, fatigue and such to zero
dummy.environmental_revert_effect();
Anyone have any idea whether this originally intended to reset calories along with hunger "and such"?
Start on adding tests for unrealistic fluctuations in weary level; see CleverRaven#46384 (and some cases in CleverRaven#46941) for example problems. The initial tests look for problems with the weary_recovery task of digging for 8 hours then waiting for 8 hours; weary level should not go down in the first 8 hours, and should not go up in the second 8 hours.
* Set stored kcal to healthy kcal for testing
  Randomizing test orders reveals that one type of problematic interference is due to variable stored calories from prior activity. (cherry picked from commit 306bfb4cbd817d4db60e92807025549050a576ff)
* Match (some) weary testing to kcal resetting
  Previous weary testing was distorted by variable incoming stored kcal, with the current default test order giving more kcal than the expected 55,000.
* Update numbers to match consensus test results
  At least currently, two different sources (Github General build g++-7 and Appveyor) are giving the same numbers for the 24-hour digging weary test. This updates the target test numbers to match these, with a bit of wiggle room in the direction of the earlier results (partially since Travis hasn't given any results yet).
* Clear_avatar before initial debug_weary_info
  Variable initial heights and weights are making it not useful to have debug_weary_info before do_activity. This adds a clear_avatar before them; unless do_activity is modified to take a flag as to whether to clear the avatar itself, this is likely temporary.
* Change the last target for the vehicle weary test
  See comments in CleverRaven#46941 - most specifically issuecomment-770297143.
Co-authored-by: actual-nh <[email protected]>
Start on adding tests for unrealistic fluctuations in weary level; see CleverRaven#46384 (and some cases in CleverRaven#46941) for example problems. The initial tests look for problems with the weary_recovery task of digging for 8 hours then waiting for 8 hours; weary level should not go down in the first 8 hours, and should not go up in the second 8 hours. (cherry picked from commit bdd942b)
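The fluctuation check described in that commit message (weary level should not go down while digging, and should not go up while waiting) can be sketched as below. This is a minimal illustration and not the code from the referenced commit: it assumes weary levels have already been sampled into plain integer vectors, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the first sample lower than its predecessor, or -1 if there is none.
int first_drop( const std::vector<int> &levels )
{
    for( std::size_t i = 1; i < levels.size(); ++i ) {
        if( levels[i] < levels[i - 1] ) {
            return static_cast<int>( i );
        }
    }
    return -1;
}

// Index of the first sample higher than its predecessor, or -1 if there is none.
int first_rise( const std::vector<int> &levels )
{
    for( std::size_t i = 1; i < levels.size(); ++i ) {
        if( levels[i] > levels[i - 1] ) {
            return static_cast<int>( i );
        }
    }
    return -1;
}

// Digging phase must never get less weary; waiting phase must never get more weary.
void check_dig_then_wait( const std::vector<int> &digging,
                          const std::vector<int> &waiting )
{
    assert( first_drop( digging ) == -1 );
    assert( first_rise( waiting ) == -1 );
}
```

Returning the index of the first offending sample, rather than a bare true/false, makes it easier to see where in the 8-hour window a fluctuation occurred when a check fails.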
Summary
Infrastructure "Set stored kcal to healthy kcal for testing"
Purpose of change
Randomizing test orders (#46473) reveals that one type of problematic interference is due to variable stored calories from prior activity.
Describe the solution
In clear_character() (in tests/player_helpers.cpp), set stored kcal to healthy kcal. (Note: Setting to any less causes errors; setting to any more puts BMI above 25.)
Testing
See #46473. (From that, some additional setting of vitamins may also be needed, but this is a start.)
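A minimal sketch of why the reset removes the order dependence, assuming a simplified stand-in for the test avatar. The struct, the function names, and the kcal deltas are all hypothetical and are not the game's real classes or values; only the 55,000 figure comes from the discussion in this PR.

```cpp
#include <cassert>

// Hypothetical stand-in for the test avatar; not the game's real character class.
struct test_character {
    int stored_kcal = 55000;
    // Assumed fixed baseline for illustration only.
    int healthy_kcal() const {
        return 55000;
    }
};

// Simulates an earlier test leaving the avatar with more or fewer calories.
void run_prior_activity( test_character &c, int kcal_delta )
{
    c.stored_kcal += kcal_delta;
}

// Sketch of the reset step: stored kcal returns to the healthy baseline.
void clear_character_sketch( test_character &c )
{
    c.stored_kcal = c.healthy_kcal();
}

// Two different "earlier tests" should leave no trace after the reset.
void check_order_independence()
{
    test_character after_digging;
    run_prior_activity( after_digging, -3000 );
    test_character after_eating;
    run_prior_activity( after_eating, 4000 );
    assert( after_digging.stored_kcal != after_eating.stored_kcal );

    clear_character_sketch( after_digging );
    clear_character_sketch( after_eating );
    assert( after_digging.stored_kcal == after_eating.stored_kcal );
}
```

Without the reset, whichever test ran first determines the starting calories of the next one, which is the kind of interference that randomized ordering exposed.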
Additional context
See #46934 for another case where it appears likely this will help. Never mind on that one; a more frequent symptom is the "test subject" starving to death in the middle of a test...