Rewrite random forest gtests #4038
Can you clarify what this means? From the quick look of it, this PR rewrites the gtests to be more like pytest, with parametrized tests and
Here is an article from the author of Hypothesis: In this case I'm defining the input space of parameters and data, then using a sampling algorithm to generate (potentially many) test cases. Each time we run a test, we test the outputs for consistency.
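The approach described above can be sketched in plain C++ as follows. This is a minimal illustration, not the code in this PR: `Params`, `SampleParams`, `FitDepth`, and `CountFailures` are hypothetical names, and `FitDepth` is a toy stand-in for the model under test. The idea is to sample many parameter sets from a seeded generator and check that an invariant holds for each one.

```cpp
#include <cassert>
#include <random>

// Hypothetical parameter set; field names are illustrative, not cuML's API.
struct Params {
  int n_rows;
  int max_depth;
};

// Draw one test case from the input space using a seeded generator,
// so that any failing case can be reproduced from the seed.
Params SampleParams(std::mt19937& rng)
{
  std::uniform_int_distribution<int> rows(1, 100);
  std::uniform_int_distribution<int> depth(1, 8);
  Params p;
  p.n_rows    = rows(rng);
  p.max_depth = depth(rng);
  return p;
}

// Toy stand-in for the system under test: the depth of a balanced binary
// tree that separates n_rows samples, capped at max_depth.
int FitDepth(const Params& p)
{
  assert(p.n_rows >= 1 && p.max_depth >= 1);
  int depth = 0;
  while ((1 << depth) < p.n_rows && depth < p.max_depth) { ++depth; }
  return depth;
}

// Property under test: the fitted depth never exceeds max_depth.
bool DepthProperty(const Params& p)
{
  int depth = FitDepth(p);
  return depth >= 0 && depth <= p.max_depth;
}

// Sample n_cases parameter sets and count how many violate the property.
int CountFailures(unsigned seed, int n_cases)
{
  std::mt19937 rng(seed);
  int failures = 0;
  for (int i = 0; i < n_cases; ++i) {
    if (!DepthProperty(SampleParams(rng))) { ++failures; }
  }
  return failures;
}
```

A test body would then call something like `CountFailures(42, 1000)` and expect zero failures. Fixing the seed keeps the run reproducible while still covering a much wider range of inputs than a handful of hand-written cases.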
LGTM, awesome PR 🙏🏻
will dig into the bugs, thank you for exposing them rory 👍🏻
CMake changes LGTM
@RAMitchell
I have limited the Google tests to around 6s to not monopolise CI time.
Apart from the comments I made, everything looks good to go.
bool operator==(const SparseTreeNode<DataT, LabelT, IdxT>& lhs,
                const SparseTreeNode<DataT, LabelT, IdxT>& rhs)
{
  return (lhs.prediction == rhs.prediction) && (lhs.colid == rhs.colid) &&
Do we expect floating-point values such as prediction, etc., to match exactly?
I've disabled any checks like this for now, but in the future I don't think that's an unreasonable goal for classification.
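For cases where exact matching is too strict, one alternative is to compare integer fields exactly and floating-point fields within a tolerance. The sketch below is only an illustration of that idea: `Node`, `AlmostEqual`, `NodesMatch`, and the tolerance values are assumptions made here, not the `SparseTreeNode` comparison from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical simplified node; the real SparseTreeNode has other fields.
struct Node {
  float prediction;
  int colid;
  float threshold;
};

// Compare two floats with a combined relative and absolute tolerance.
// The default tolerances are illustrative and would need tuning.
bool AlmostEqual(float a, float b, float rel_tol = 1e-5f, float abs_tol = 1e-8f)
{
  assert(rel_tol >= 0.0f && abs_tol >= 0.0f);
  if (a == b) { return true; }                          // exact match, including equal infinities
  if (std::isinf(a) || std::isinf(b)) { return false; } // unequal infinities never match
  float diff  = std::fabs(a - b);
  float scale = std::max(std::fabs(a), std::fabs(b));
  // NaN inputs make this comparison false, so NaN never matches anything.
  return diff <= std::max(abs_tol, rel_tol * scale);
}

// Integer fields must match exactly; floating-point fields within tolerance.
bool NodesMatch(const Node& lhs, const Node& rhs)
{
  return lhs.colid == rhs.colid && AlmostEqual(lhs.prediction, rhs.prediction) &&
         AlmostEqual(lhs.threshold, rhs.threshold);
}
```

Exact equality remains the stronger check and is the one to aim for where bitwise reproducibility is expected; the tolerant version is a fallback for results that legitimately differ in the last few bits.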
@@ -289,11 +290,6 @@ class DecisionTree {
(std::numeric_limits<L>::is_integer) ? CRITERION::ENTROPY : CRITERION::MSE;

validity_check(tree_params);
if (tree_params.n_bins > n_sampled_rows) {
Why is this check removed? Has it been moved somewhere else?
The check is not correct, because the quantiles have already been computed at this stage (I checked with @venkywonka on this). I don't necessarily see any reason to enforce nbins < nrows.
{
  TestAccuracyImprovement();
  // Bugs
  // TestDeterminism();
Is classification reproducibility still a problem?
Yes. I think the node queue, which uses atomics, needs to be reworked in a later PR.
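The shape of a determinism check like the disabled one can be sketched as below. This is an assumption-laden illustration: `TrainToy` is a hypothetical single-threaded stand-in for training, so it does not reproduce the atomic node-queue behaviour discussed here. It only shows the property being tested: repeated runs with the same seed and inputs should give identical results.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical stand-in for training: returns the split column chosen for
// each node, driven entirely by the seed.
std::vector<int> TrainToy(unsigned seed, int n_nodes)
{
  assert(n_nodes >= 0);
  std::mt19937 rng(seed);
  std::vector<int> split_columns;
  split_columns.reserve(n_nodes);
  for (int i = 0; i < n_nodes; ++i) {
    split_columns.push_back(static_cast<int>(rng() % 16));
  }
  return split_columns;
}

// Property: repeated runs with the same seed and size produce identical output.
bool IsDeterministic(unsigned seed, int n_nodes, int n_repeats)
{
  const std::vector<int> reference = TrainToy(seed, n_nodes);
  for (int i = 0; i < n_repeats; ++i) {
    if (TrainToy(seed, n_nodes) != reference) { return false; }
  }
  return true;
}
```

In a real test the stand-in would be replaced by the actual training call, and the comparison would cover the full tree structure rather than a list of column indices.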
Codecov Report
@@ Coverage Diff @@
## branch-21.08 #4038 +/- ##
===============================================
Coverage ? 85.72%
===============================================
Files ? 230
Lines ? 18191
Branches ? 0
===============================================
Hits ? 15595
Misses ? 2596
Partials ? 0
@gpucibot merge
Use a property-based testing methodology to consolidate most of the tests and extend them to a broader range of inputs. Coverage of input parameters is significantly increased and code size is way down. The majority of tests are now generated in rf_test.cu. Some dead code is also removed.

Testing has exposed a few bugs that should be resolved in later PRs:
- The max_leaves parameter is not obeyed in some cases
- Hard CUDA crash for n_bins > 128, presumably due to shared memory requirements
- Classification algorithms are sometimes not deterministic (not 100% sure if this is expected or not)

Authors:
- Rory Mitchell (https://github.com/RAMitchell)

Approvers:
- Venkat (https://github.com/venkywonka)
- Robert Maynard (https://github.com/robertmaynard)
- Vinay Deshpande (https://github.com/vinaydes)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4038
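A property such as "the tree never has more leaves than max_leaves" can be checked with very little code once the tree structure is accessible. The sketch below is illustrative only: `ToyNode`, `GrowToy`, `CountLeaves`, and `ObeysMaxLeaves` are hypothetical names, the flat-array layout is an assumption, and treating a non-positive max_leaves as "unlimited" is also an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat tree layout: child indices, with -1 meaning "no child".
struct ToyNode {
  int left  = -1;
  int right = -1;
};

// Count the nodes that have no children.
int CountLeaves(const std::vector<ToyNode>& tree)
{
  int leaves = 0;
  for (const ToyNode& node : tree) {
    if (node.left == -1 && node.right == -1) { ++leaves; }
  }
  return leaves;
}

// Property: the leaf count respects max_leaves.
// Assumption: max_leaves <= 0 means "unlimited".
bool ObeysMaxLeaves(const std::vector<ToyNode>& tree, int max_leaves)
{
  return max_leaves <= 0 || CountLeaves(tree) <= max_leaves;
}

// Toy builder: split leaves breadth-first, stopping after n_splits splits
// or when one more split would exceed max_leaves.
std::vector<ToyNode> GrowToy(int n_splits, int max_leaves)
{
  assert(n_splits >= 0);
  std::vector<ToyNode> tree(1);  // start with a single root leaf
  int n_leaves     = 1;
  std::size_t next = 0;          // index of the next leaf to split
  for (int s = 0; s < n_splits; ++s) {
    if (max_leaves > 0 && n_leaves + 1 > max_leaves) { break; }
    tree[next].left = static_cast<int>(tree.size());
    tree.push_back(ToyNode{});
    tree[next].right = static_cast<int>(tree.size());
    tree.push_back(ToyNode{});
    ++next;
    ++n_leaves;  // each split replaces one leaf with two
  }
  return tree;
}
```

Combined with randomly sampled values of max_leaves and tree size, a check like `ObeysMaxLeaves` is the kind of invariant that can surface a parameter not being respected in some cases.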