simplify NaN filling in FlashLoader and make it lazy #162
Conversation
Update workflow manager with later changes
Pull Request Test Coverage Report for Build 6456380408
💛 - Coveralls
I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. In that case, ffill cannot fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.
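A minimal sketch of that "zeroth event value" idea, assuming a per-file pandas DataFrame and a dict carrying the last valid value of each slow channel from the previous file (`seed_slow_channels` and `previous_last_valid` are illustrative names, not from this PR):

```python
import pandas as pd

def seed_slow_channels(df: pd.DataFrame, previous_last_valid: dict) -> pd.DataFrame:
    """Seed the first row of each slow channel so that a later ffill always works."""
    df = df.copy()
    for channel, value in previous_last_valid.items():
        if pd.isna(df[channel].iloc[0]):
            # inject the value carried over from the previous file as the
            # "zeroth event" value of this file
            df.iloc[0, df.columns.get_loc(channel)] = value
    return df
```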
You are right, that was not the complete solution. Actually, I wrote the fill_nan method long before.
The idea is good, but I am not sure your implementation does the right thing yet. Looking at the docs of map_overlap, before and after refer to single rows, i.e. entries. It might pick a filled one of a slow column, or it might not. Anyway, you want to use before, no? Do we have test data that actually allows testing this problem?
You're right; whether it picks one or not depends on whether there is an entry at the end. I have been taking a deeper look at it.
I changed the parameters of the map_overlap function to load n lines from the previous partition. I also merged this change into my current work-in-progress branch WorkflowManager2.
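A rough sketch of how map_overlap could be used for this, assuming `overlap_rows` and `slow_channels` are illustrative parameters (before pulls rows from the previous partition; after=0 since we only fill forward):

```python
import dask.dataframe as dd

def ffill_slow_channels(df: dd.DataFrame, slow_channels: list, overlap_rows: int) -> dd.DataFrame:
    def _ffill(block):
        # block includes `overlap_rows` rows copied from the previous partition;
        # dask trims them off again after the function has been applied
        block = block.copy()
        block[slow_channels] = block[slow_channels].ffill()
        return block

    return df.map_overlap(_ffill, before=overlap_rows, after=0)
```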
With the correction I implemented, I think this is good to go.
I haven't checked the tests though; maybe we could implement some proper tests for this too.
No, you merged your changes into this PR. I don't think this is a good idea...
# calculate the number of rows in each partition
with ProgressBar():
    print("Computing dataframe shape...")
    nrows = dataframe.map_partitions(len).compute()
These values should generally be available in the parquet metadata as num_rows. The statistics, which dask also loads, carry metadata like null_count etc.:
https://github.com/dask/dask/blob/928a95aa56f60da33a4e724ea2ca97797c612968/dask/dataframe/io/parquet/core.py#L569
And hence that's what I am trying to figure out: how to access those null_counts. That would make it very easy to know whether we have a NaN-only column or not.
Your idea is also very good. If we can directly use the metadata num_rows, there would be no need for a compute.
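A hedged sketch of reading those numbers straight from the parquet footer with pyarrow, without any dask compute (`files` is an illustrative list of parquet paths; the statistics are only present if they were written into the files):

```python
import pyarrow.parquet as pq

for path in files:
    meta = pq.ParquetFile(path).metadata
    print(path, "num_rows:", meta.num_rows)
    # per-column null counts, if column statistics were written
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            stats = meta.row_group(rg).column(col).statistics
            if stats is not None:
                name = meta.schema.column(col).name
                print(f"  {name}: {stats.null_count} nulls in row group {rg}")
```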
If you find access to this information, that would be great. I always found it weird that it needs to compute this... but I guess there are lazy operations which could delete rows, making it not a truly fixed value.
For now, all my tests returned 0 NaNs, and computing the length takes a few seconds, but if that goes away, even better!
Noticed the changes! I think the merged ones can be rolled back, but your other change works well! Now that we have test data access, it would make sense to write a testing framework around it and use that to test this feature.
Closing, as a new PR was created: #165
Oops, I did not mean to merge the workflow-related changes here. My bad! Thanks for taking care of this @zain-sohail
After a few attempts with different options, I found that apparently (perhaps only recently) multiple files can be loaded directly using read_parquet from dask. So if I do that and just perform an ffill on the necessary channels, it works. Dask seems to handle the chunking etc., at least in the tests I did.
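A minimal sketch of that lazy approach, assuming `parquet_files` and `slow_channels` are illustrative names:

```python
import dask.dataframe as dd

# read_parquet accepts a list of files (or a glob) and handles the chunking
df = dd.read_parquet(parquet_files)

# forward-fill only the slow channels; this stays lazy until compute()
for channel in slow_channels:
    df[channel] = df[channel].ffill()
```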