refactor + impute #282
Are all these print statements (here and elsewhere in the PR) intended to be in the final code, or were they used for debugging?
Here (in the examples package) the print statements are for demonstration: they show how the different fields are used and serve as a basis for explanation. They also double as a debugging aid: if we change the algorithm and get a bad value for the "expected/likely value", the output lets one inspect exactly what happened.
Could you explain why the number of updates can be more than the number of input tuples seen by the overall program? Could the following be a cause?
Say I have points 1-10 and point 15, internal shingling is enabled, and my shingle size is 6. To get the ball rolling, I need to impute points 11-14. Then I need to update RCF with points 11-14, so the total number of updates is 4 more than the number of input tuples seen by the overall program.
Also, can this (more updates than input tuples seen by the overall program) happen with external shingling?
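To make the arithmetic in the scenario above concrete, here is a minimal sketch that counts updates when every gap is filled by imputation and every imputed point is fed back into the model. The function `count_updates` and its `step` parameter are hypothetical and are not part of the RCF API. The sketch also ignores shingle warm-up and any admission control, so it only illustrates the counting argument, not the PR's actual logic.

```python
def count_updates(timestamps, step=1):
    """Return (inputs_seen, model_updates) for a sorted list of timestamps.

    Assumption: each missing timestamp between two observed points is
    imputed and then used as one extra model update.
    """
    inputs_seen = 0
    model_updates = 0
    previous = None
    for t in timestamps:
        inputs_seen += 1
        if previous is not None:
            # Number of timestamps skipped between the previous point and this one.
            missing = (t - previous) // step - 1
            model_updates += max(missing, 0)
        # The observed point itself is always one update.
        model_updates += 1
        previous = t
    return inputs_seen, model_updates


# Points 1-10, then point 15: points 11-14 are missing.
observed = list(range(1, 11)) + [15]
inputs_seen, model_updates = count_updates(observed)
print(inputs_seen, model_updates)
```

With these inputs the program sees 11 tuples but performs 15 updates, which matches the "4 more" in the question.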
One concern with any model is whether we should update the model with values that were just imputed from that same model. In some scenarios (especially when the number of imputations is low) that makes sense. But if there are more errors or missing entries, it may make sense to control it. This is done by useImputedFraction: we only admit points where the ratio of imputed to total is below useImputedFraction.
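As a rough illustration of that admission rule, the sketch below decides per point whether the model would be updated. It assumes that "total" means the shingle size and that the ratio is taken over the most recent shingle; the comment does not spell this out, so treat both as assumptions. The function name `admitted_updates` and its parameters are hypothetical and do not come from the library.

```python
from collections import deque


def admitted_updates(imputed_flags, shingle_size, use_imputed_fraction):
    """Return one boolean per point: True if the point would update the model.

    Assumption: a point is admitted only once a full shingle is available and
    the fraction of imputed entries in that shingle is strictly below
    use_imputed_fraction.
    """
    window = deque(maxlen=shingle_size)  # flags for the current shingle
    admitted = []
    for is_imputed in imputed_flags:
        window.append(is_imputed)
        if len(window) < shingle_size:
            # Shingle not yet full, so there is nothing to update with.
            admitted.append(False)
            continue
        ratio = sum(window) / shingle_size
        admitted.append(ratio < use_imputed_fraction)
    return admitted


# Points 1-10 observed, 11-14 imputed, 15 observed; shingle size 6.
flags = [False] * 10 + [True] * 4 + [False]
result = admitted_updates(flags, shingle_size=6, use_imputed_fraction=0.5)
print(sum(result), "of", len(result), "points admitted")
```

Under these assumptions, the first two imputed points are still admitted (1/6 and 2/6 imputed), and admission stops once half of the shingle is imputed. A lower useImputedFraction would cut off the imputed updates sooner.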