Pipeline proposal #1332
base: master
Conversation
Hi,
It is a bit difficult to review such a massive PR that contains multiple key points. So here I am focusing on the Imputer class.
In general I like it and I think we should have this. But there are various things that need to be improved.
Here are some general comments to start with:
- `uuid4` is not defined.
- `TransformerBase` is not defined.
- In the methods, `ar` typically stands for "array". For vaex, it is probably better and more consistent to use `expr`, which stands for "expression", instead. This will help with maintaining this going forward.
- What you call "mode" is in fact the median, and what you call "common" is in fact the mode. This makes me quite confused reviewing the code further, so better to address it immediately.
- Do we really need separate methods for things like mode, median etc., since they are standard things in vaex and one-liners? Maybe for the mode/common it would make sense, if this is reusable.
- We have a dedicated docstring for the "prefix" at the top of the transformations.py file. See how the other classes use it.
- All the methods are public, but none have docstrings. I think any public method (especially those that are newly introduced) should have at least a basic docstring explaining the intent of the method and its params/outputs (a rough sketch of what I mean follows below).
I'd say let's address this first before continuing. I am also fine with splitting this PR into 2-3 PRs, but that is up to you.
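For illustration, a minimal sketch of the kind of documented public method meant above, using `expr` instead of `ar` and one of vaex's existing one-liners; the method name `fit_median` is an assumption, not the PR's actual code:

```python
# Illustrative sketch only, not the PR's code; `fit_median` is an assumed name.
class Imputer:
    def fit_median(self, df, expr):
        """Learn the (approximate) median of an expression, to use later as the fill value.

        :param df: vaex DataFrame to learn the statistic from.
        :param expr: name of the expression/column that will be imputed.
        :return: the approximate median of the expression.
        """
        # the median is already a one-liner in vaex, no extra helper needed
        return df.median_approx(expr)
```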
Ok, I've taken over the development here (temporarily?) since I like this and want to push it forward! So I have refactored the
There is one problem: this is currently not possible when doing state_transfer. I wonder if a fix here is possible; it would be quite nice to have this feature (somehow specifically tied to the
Also, inspired by @xdssio, I moved the
* pipeline - to wrap state handling
* imputer - fillna automatically
* countna - small feature to count missing value on the entire dataset
* to_records - used for more standard json input/output
This is an implementation of a new Pipeline which wraps a few standard solution needs and the vaex state.
General idea:
Any transformation you do on the dataframe, as long as you start from the "raw" data you will use in production, is saved, so that you can use the same infrastructure to solve all these problems.
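For context, this relies on the existing vaex state mechanism; a minimal sketch using the public `state_get`/`state_set` API (the example dataset and column names are just for illustration):

```python
import vaex

df = vaex.example()
df['r'] = (df.x**2 + df.y**2)**0.5   # a transformation, recorded as a virtual column

state = df.state_get()               # capture everything done since the "raw" data

df_prod = vaex.example()             # the same "raw" data, e.g. in production
df_prod.state_set(state)             # replay the transformations
print(df_prod.r.mean())              # 'r' is now available here as well
```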
- `fit`, `transform`, `fit_transform` for the sklearn API.
- An `inference` function that outputs what you would need in a server.
- An Imputer transformer which is used by default in the pipeline but can also be used in many ways by providing a strategy (rough sketch below).
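A rough sketch of the strategy idea; the import location, keyword name and accepted values here are assumptions rather than the PR's exact API:

```python
# Assumed API for illustration only; the actual constructor arguments in the PR may differ.
# from vaex.ml.transformations import Imputer   # assumed import location
import numpy as np
import vaex

df = vaex.from_arrays(x=np.array([1.0, np.nan, 3.0]))

imputer = Imputer(strategy='mean')   # assumed keyword; e.g. mean / most frequent / constant
df = imputer.fit_transform(df)       # learn the fill value(s) and apply them
```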
Added to the Dataframe:
- `to_records`: outputs `[{key: value, key: value, ...}, ...]` - used for more standard JSON input/output (see the sketch below).
- `countna`: counts all missing values in the dataframe.
Missing:
- `predict` to complete the sklearn API.
- `partial_fit` if possible.
- `show`: a way to view all the steps.

Examples:
With fit:
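A hypothetical usage sketch; only the `fit`/`transform`/`inference` method names come from the description above, while the `Pipeline.from_dataframe` constructor and the example data are assumptions:

```python
import vaex

df = vaex.example()
df['r'] = (df.x**2 + df.y**2)**0.5        # any transformation on the "raw" data

pipeline = Pipeline.from_dataframe(df)    # assumed constructor capturing the state
pipeline.fit(df)                          # sklearn-style API from the PR

df_prod = vaex.example()                  # new "raw" data in production
df_prod = pipeline.transform(df_prod)     # the same transformations replayed
output = pipeline.inference(df_prod)      # what you would need in a server
```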