
simplify NaN filling in FlashLoader and make it lazy #162

Closed · 16 commits

Conversation

@zain-sohail (Member) commented Oct 7, 2023

After a few attempts with different options, I found that (perhaps only recently) multiple files can be loaded directly using read_parquet from dask. If I do that and just perform an ffill on the necessary channels, it works; dask seems to handle the chunking etc., at least in the tests I did. A minimal sketch follows.
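For illustration, a rough sketch of this approach; the file paths and channel names here are placeholders for this example, not the loader's actual ones:

```python
import dask.dataframe as dd

# Hypothetical file list and slow (per-train) channel names.
files = ["run_001.parquet", "run_002.parquet"]
slow_channels = ["delayStage", "sampleBias"]

# dask reads multiple parquet files into one dataframe and handles the
# partitioning itself.
df = dd.read_parquet(files)

# Forward-fill only the slow channels; fast (per-event) channels are
# left untouched.
for ch in slow_channels:
    df[ch] = df[ch].ffill()
```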

@coveralls (Collaborator) commented Oct 7, 2023

Pull Request Test Coverage Report for Build 6456380408

  • 18 of 18 (100.0%) changed or added relevant lines in 2 files are covered.
  • 67 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.1%) to 90.456%

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| sed/core/processor.py | 33 | 91.35% |
| sed/loader/flash/loader.py | 34 | 77.73% |

Totals:

  • Change from base Build 6421960344: 0.1%
  • Covered Lines: 4227
  • Relevant Lines: 4673

💛 - Coveralls

@rettigl (Member) commented Oct 7, 2023

I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. Then ffill will not be able to fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. This mechanism does not work with the suggested solution, as far as I can tell.

One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.
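A rough pandas sketch of that option; the names `slow_channels` and `last_values` are hypothetical helpers for illustration, not the loader's actual code:

```python
import pandas as pd

def seed_first_event(df: pd.DataFrame, slow_channels: list[str],
                     last_values: dict[str, float]) -> pd.DataFrame:
    """Seed the first row of each slow channel with the last valid value
    carried over from the previously processed file, so that a later
    ffill always has something to propagate."""
    df = df.copy()
    for ch in slow_channels:
        # If the channel starts with NaN, write the carried-over value
        # into the zeroth event.
        if pd.isna(df[ch].iloc[0]) and ch in last_values:
            df.iloc[0, df.columns.get_loc(ch)] = last_values[ch]
        # Remember the newest valid value for the next file, if any exists.
        valid = df[ch].dropna()
        if not valid.empty:
            last_values[ch] = valid.iloc[-1]
    return df
```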

@zain-sohail (Member, Author) commented

> I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. Then ffill will not be able to fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. This mechanism does not work with the suggested solution, as far as I can tell.
>
> One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.

You are right, that was not the complete solution. Actually, I wrote the fill_nan method long before. Anyway, I came up with another idea (using map_overlap on the dataframe) to do it lazily, but just missed committing it.

@rettigl (Member) commented Oct 7, 2023

> > I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. Then ffill will not be able to fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. This mechanism does not work with the suggested solution, as far as I can tell.
> >
> > One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.
>
> You are right, that was not the complete solution. Actually, I wrote the fill_nan method long before. Anyway, I came up with another idea (using map_overlap on the dataframe) to do it lazily, but just missed committing it.

The idea is good, but I am not sure your implementation does the right thing yet. Looking at the docs of map_overlap, before and after refer to single rows, i.e. entries. It might pick up a filled value of a slow column, or it might not. Anyway, you want to use before, no? Do we have test data that actually allows testing this problem?

@zain-sohail (Member, Author) commented

> > > I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. Then ffill will not be able to fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. This mechanism does not work with the suggested solution, as far as I can tell.
> > >
> > > One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.
> >
> > You are right, that was not the complete solution. Actually, I wrote the fill_nan method long before. Anyway, I came up with another idea (using map_overlap on the dataframe) to do it lazily, but just missed committing it.
>
> The idea is good, but I am not sure your implementation does the right thing yet. Looking at the docs of map_overlap, before and after refer to single rows, i.e. entries. It might pick up a filled value of a slow column, or it might not. Anyway, you want to use before, no? Do we have test data that actually allows testing this problem?

You're right; whether it picks one up depends on whether there is an entry at the end or not. I have been having a deeper look at it.
I think the only way to support an empty chunk, or a series of empty chunks, is to do this operation in a serial manner, unless we have a priori information that there won't be an empty chunk within a certain amount of data (asking Dima is a good start). A sketch of such a serial approach follows.
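For concreteness, a sketch of that serial fallback, assuming a dask dataframe `ddf` and a hypothetical `slow_channels` list; this is an illustration under those assumptions, not the loader's code:

```python
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

def serial_seeded_ffill(ddf: dd.DataFrame, slow_channels: list[str]) -> dd.DataFrame:
    def last_row(part: pd.DataFrame) -> pd.DataFrame:
        # Emit exactly one row per partition: its last valid values, or an
        # all-NaN row if the partition is empty / has no valid entries.
        filled = part.ffill()
        if len(filled) == 0:
            return pd.DataFrame([{c: np.nan for c in part.columns}])
        return filled.tail(1)

    # Small eager result: one row per partition.
    tails = ddf[slow_channels].map_partitions(last_row).compute()
    tails = tails.reset_index(drop=True)

    # Carry values forward across partitions serially, then shift so that
    # partition i receives the seed from partition i-1; empty partitions
    # simply pass the previous seed along.
    seeds = tails.ffill().shift(1)

    def fill(part: pd.DataFrame, seed: pd.Series) -> pd.DataFrame:
        part = part.copy()
        part[slow_channels] = (
            pd.concat([seed.to_frame().T, part[slow_channels]])
            .ffill()
            .iloc[1:]  # drop the seed row again
        )
        return part

    parts = [
        dask.delayed(fill)(part, seeds.iloc[i])
        for i, part in enumerate(ddf.to_delayed())
    ]
    # ddf._meta is dask's empty prototype frame for the result schema.
    return dd.from_delayed(parts, meta=ddf._meta)
```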

@steinnymir (Member) commented
I changed the parameters of the map_overlap function to load n lines from the previous partition, with n = min(partition sizes); see the sketch below.
This seems to work correctly now and adds just a few seconds to the loading time (compared to 10 minutes with the previous fill_nan).
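A minimal sketch of that approach, again assuming a dask dataframe `ddf` and a hypothetical `slow_channels` list (the actual names in the loader may differ):

```python
import dask.dataframe as dd

def overlap_ffill(ddf: dd.DataFrame, slow_channels: list[str]) -> dd.DataFrame:
    # Use the smallest partition as the overlap depth: prepending that many
    # rows from the previous partition gives ffill a chance to find a valid
    # value across the partition boundary.
    n = int(ddf.map_partitions(len).compute().min())

    filled = ddf[slow_channels].map_overlap(
        lambda part: part.ffill(),
        before=n,   # rows taken from the previous partition
        after=0,    # nothing is needed from the following partition
    )
    for ch in slow_channels:
        ddf[ch] = filled[ch]
    return ddf
```

Note the limitation discussed above still applies: if a slow channel has no valid entry within the previous n rows, NaNs remain.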

I also merged this change into my current work-in-progress branch WorkflowManager2.

@steinnymir (Member) left a comment

With the correction I implemented, I think this is good to go.
Haven't checked tests though; maybe we could implement some proper tests for this too.

@rettigl (Member) commented Oct 9, 2023

> I also merged this change into my current work-in-progress branch WorkflowManager2.

No, you merged your changes into this PR. I don't think this is a good idea...

```python
from dask.diagnostics import ProgressBar

# calculate the number of rows in each partition
with ProgressBar():
    print("Computing dataframe shape...")
    nrows = dataframe.map_partitions(len).compute()
```
@zain-sohail (Member, Author) commented Oct 9, 2023

These values should generally be available in the parquet metadata as num_rows. The statistics, which dask also loads, include metadata like null_count etc.:
https://github.com/dask/dask/blob/928a95aa56f60da33a4e724ea2ca97797c612968/dask/dataframe/io/parquet/core.py#L569
That is what I am trying to figure out: how to access those null_counts, which would make it very easy to know whether we have a NaN-only column or not.
Your idea is also very good. If we could directly use the metadata num_rows, there would be no need for a compute.
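For what it's worth, pyarrow exposes both pieces of information straight from the parquet footer, without reading any data; a small sketch (the file name is a placeholder):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("run_001.parquet").metadata

# Total row count, read from the footer -- no compute needed.
print(meta.num_rows)

# Per-column null counts live in the row-group statistics.
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        if chunk.statistics is not None:
            print(chunk.path_in_schema, chunk.statistics.null_count)
```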

(Member) replied:

If you find access to this information, that would be great. I always found it weird that it needs to compute this... but I guess there are lazy operations which could delete rows, so it is not really a fixed value.

For now, all my tests returned 0 NaNs, and computing the length takes a few seconds, but if that goes away, even better!

@zain-sohail (Member, Author) commented

> With the correction I implemented, I think this is good to go. Haven't checked tests though; maybe we could implement some proper tests for this too.

Noticed the changes! I think the merged ones can be rolled back, but your other change works well.
Please read my comment on the file.

Now that we have test data access, it would make sense to write a testing framework around it and use that to test this feature.

@zain-sohail (Member, Author) commented

Closing, as a new PR was created: #165.

@zain-sohail closed this on Oct 9, 2023.
@steinnymir (Member) commented
Oops, I did not mean to merge the workflow-related changes here. My bad! Thanks for taking care of this, @zain-sohail.

@zain-sohail deleted the flash-lazy branch on November 7, 2023.