-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing the transformer behaviour on sparse features #122
Conversation
…iner and Dataset.createTransformers to expose this behaviour.
…p more immutable as it should be.
…a provenance serialization issue in IDFTransformation.
Core/src/main/java/org/tribuo/transform/transformations/MeanStdDevTransformation.java
Show resolved
Hide resolved
Co-authored-by: Jack Sullivan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has the changes that we discussed on Friday, but looking over it now, the implicit zeroes vs. densify behavior interactions aren't really explicitly characterized anywhere. I think it would be a good idea to have one place that clearly explains how all four possible settings work that can referred to for more detail in all the various constructors/methods where the interaction is important.
Ok, sure. I'll add something to the package-info in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nitpicky word-choice changes, but otherwise it looks good to me.
Co-authored-by: Jack Sullivan <[email protected]>
Co-authored-by: Jack Sullivan <[email protected]>
Co-authored-by: Jack Sullivan <[email protected]>
Co-authored-by: Jack Sullivan <[email protected]>
Thanks for the updates. |
Description
This PR adds implementations of
observeSparse
to all the transformers (apart fromIDFTransformation
which already had it). As a consequence to preserve the 4.0 behaviour the transformation fitting methods grow a new argument which turns on or off the use of theobserveSparse
method. It also switchesDoubleFieldProcessor
over so that it always emits values if they are parsable doubles. Before it would elide zero values, but this makes it much harder to implement transformations which should touch every value because they are numerical (rather than categorical encoded as a double).One downside of this implementation is that it's not possible to change the behaviour of
observeSparse
on a per feature basis (due to it being a tricky breaking API change), which makes it difficult to apply an IDFTransformation to text features while simultaneously transforming features which should ignore the implicit zeros. If this behaviour is required then twoTransformationMap
s can be applied in sequence to aDataset
. We'll note this in the release notes for 4.1.Motivation
The observeSparse change is so the transformers actually have the documented behaviour, and then that rippled into adding new overloads for the transformation fitting methods. The
DoubleFieldProcessor
change is because it's tricky to correctly transform numerical data without it.