
simplify NaN filling in FlashLoader and make it lazy #162

Closed · 16 commits

Conversation

@zain-sohail (Member) commented Oct 7, 2023

After a few attempts with different options, I found that (perhaps only recently) multiple files can be loaded directly using read_parquet from dask. If I do that and just perform an ffill on the necessary channels, it works; dask seems to handle the chunking etc., at least in the tests I did. A minimal sketch follows.
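For illustration, a rough sketch of this approach; the file paths and channel names here are placeholders for this example, not the loader's actual ones:

```python
import dask.dataframe as dd

# Hypothetical file list and slow (per-train) channel names.
files = ["run_001.parquet", "run_002.parquet"]
slow_channels = ["delayStage", "sampleBias"]

# dask reads multiple parquet files into one dataframe and handles the
# partitioning itself.
df = dd.read_parquet(files)

# Forward-fill only the slow channels; fast (per-event) channels are
# left untouched.
for ch in slow_channels:
    df[ch] = df[ch].ffill()
```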

@coveralls (Collaborator) commented Oct 7, 2023

Pull Request Test Coverage Report for Build 6456380408

  • 18 of 18 (100.0%) changed or added relevant lines in 2 files are covered.
  • 67 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.1%) to 90.456%

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| sed/core/processor.py | 33 | 91.35% |
| sed/loader/flash/loader.py | 34 | 77.73% |

Totals:

  • Change from base Build 6421960344: 0.1%
  • Covered Lines: 4227
  • Relevant Lines: 4673

💛 - Coveralls

@rettigl (Member) commented Oct 7, 2023

I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. Then ffill will not be able to fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. This mechanism does not work with the suggested solution, as far as I can tell.

One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.
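A rough pandas sketch of that option; the names `slow_channels` and `last_values` are hypothetical helpers for illustration, not the loader's actual code:

```python
import pandas as pd

def seed_first_event(df: pd.DataFrame, slow_channels: list[str],
                     last_values: dict[str, float]) -> pd.DataFrame:
    """Seed the first row of each slow channel with the last valid value
    carried over from the previously processed file, so that a later
    ffill always has something to propagate."""
    df = df.copy()
    for ch in slow_channels:
        # If the channel starts with NaN, write the carried-over value
        # into the zeroth event.
        if pd.isna(df[ch].iloc[0]) and ch in last_values:
            df.iloc[0, df.columns.get_loc(ch)] = last_values[ch]
        # Remember the newest valid value for the next file, if any exists.
        valid = df[ch].dropna()
        if not valid.empty:
            last_values[ch] = valid.iloc[-1]
    return df
```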

@zain-sohail (Member, Author) commented

> I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. Then ffill will not be able to fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. This mechanism does not work with the suggested solution, as far as I can tell.
>
> One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.

You are right, that was not the complete solution. Actually, I wrote the fill_nan method long before. Anyway, I came up with another idea (using map_overlap on the dataframe) to do it lazily, but just missed committing it.

@rettigl (Member) commented Oct 7, 2023

> > I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. Then ffill will not be able to fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. This mechanism does not work with the suggested solution, as far as I can tell.
> >
> > One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.
>
> You are right, that was not the complete solution. Actually, I wrote the fill_nan method long before. Anyway, I came up with another idea (using map_overlap on the dataframe) to do it lazily, but just missed committing it.

The idea is good, but I am not sure your implementation does the right thing yet. Looking at the docs of map_overlap, before and after refer to single rows, i.e. entries. It might pick up a filled value of a slow column, or it might not. Anyway, you want to use before, no? Do we have test data that actually allows testing this problem?

@zain-sohail (Member, Author) commented

> > > I think that always worked, no? As far as I understand the issue, the problem addressed by fill_nan is that some of the slow channels might not have any valid value within a given parquet file. Then ffill will not be able to fill in any reasonable value. The solution is to get a value from the previous file, where such a value existed. This mechanism does not work with the suggested solution, as far as I can tell.
> > >
> > > One option might be to split the work of fill_nan up: get a valid value for all slow parameters before storing the files, and save it as the zeroth event value. That way we make sure that this ffill approach will always work later.
> >
> > You are right, that was not the complete solution. Actually, I wrote the fill_nan method long before. Anyway, I came up with another idea (using map_overlap on the dataframe) to do it lazily, but just missed committing it.
>
> The idea is good, but I am not sure your implementation does the right thing yet. Looking at the docs of map_overlap, before and after refer to single rows, i.e. entries. It might pick up a filled value of a slow column, or it might not. Anyway, you want to use before, no? Do we have test data that actually allows testing this problem?

You're right; whether it picks one up depends on whether there is an entry at the end or not. I have been having a deeper look at it.
I think the only way to support an empty chunk, or a series of empty chunks, is to do this operation in a serial manner, unless we have a priori information that there won't be an empty chunk within a certain amount of data (asking Dima is a good start). A sketch of such a serial approach follows.
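For concreteness, a sketch of that serial fallback, assuming a dask dataframe `ddf` and a hypothetical `slow_channels` list; this is an illustration under those assumptions, not the loader's code:

```python
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

def serial_seeded_ffill(ddf: dd.DataFrame, slow_channels: list[str]) -> dd.DataFrame:
    def last_row(part: pd.DataFrame) -> pd.DataFrame:
        # Emit exactly one row per partition: its last valid values, or an
        # all-NaN row if the partition is empty / has no valid entries.
        filled = part.ffill()
        if len(filled) == 0:
            return pd.DataFrame([{c: np.nan for c in part.columns}])
        return filled.tail(1)

    # Small eager result: one row per partition.
    tails = ddf[slow_channels].map_partitions(last_row).compute()
    tails = tails.reset_index(drop=True)

    # Carry values forward across partitions serially, then shift so that
    # partition i receives the seed from partition i-1; empty partitions
    # simply pass the previous seed along.
    seeds = tails.ffill().shift(1)

    def fill(part: pd.DataFrame, seed: pd.Series) -> pd.DataFrame:
        part = part.copy()
        part[slow_channels] = (
            pd.concat([seed.to_frame().T, part[slow_channels]])
            .ffill()
            .iloc[1:]  # drop the seed row again
        )
        return part

    parts = [
        dask.delayed(fill)(part, seeds.iloc[i])
        for i, part in enumerate(ddf.to_delayed())
    ]
    # ddf._meta is dask's empty prototype frame for the result schema.
    return dd.from_delayed(parts, meta=ddf._meta)
```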

@steinnymir (Member) commented
I changed the parameters of the map_overlap function to load n lines from the previous partition, with n = min(partition sizes); see the sketch below.
This seems to work correctly now and adds just a few seconds to the loading time (compared to 10 minutes with the previous fill_nan).
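A minimal sketch of that approach, again assuming a dask dataframe `ddf` and a hypothetical `slow_channels` list (the actual names in the loader may differ):

```python
import dask.dataframe as dd

def overlap_ffill(ddf: dd.DataFrame, slow_channels: list[str]) -> dd.DataFrame:
    # Use the smallest partition as the overlap depth: prepending that many
    # rows from the previous partition gives ffill a chance to find a valid
    # value across the partition boundary.
    n = int(ddf.map_partitions(len).compute().min())

    filled = ddf[slow_channels].map_overlap(
        lambda part: part.ffill(),
        before=n,   # rows taken from the previous partition
        after=0,    # nothing is needed from the following partition
    )
    for ch in slow_channels:
        ddf[ch] = filled[ch]
    return ddf
```

Note the limitation discussed above still applies: if a slow channel has no valid entry within the previous n rows, NaNs remain.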

I also merged this change into my current work-in-progress branch WorkflowManager2.

@steinnymir (Member) left a comment

With the correction I implemented, I think this is good to go.
Haven't checked tests though; maybe we could implement some proper tests for this too.

@rettigl (Member) commented Oct 9, 2023

> I also merged this change into my current work-in-progress branch WorkflowManager2.

No, you merged your changes into this PR. I don't think this is a good idea...

```python
from dask.diagnostics import ProgressBar

# calculate the number of rows in each partition
with ProgressBar():
    print("Computing dataframe shape...")
    nrows = dataframe.map_partitions(len).compute()
```
@zain-sohail (Member, Author) commented Oct 9, 2023

These values should generally be available in the parquet metadata as num_rows. The statistics, which dask also loads, include metadata like null_count etc.:
https://github.com/dask/dask/blob/928a95aa56f60da33a4e724ea2ca97797c612968/dask/dataframe/io/parquet/core.py#L569
That is what I am trying to figure out: how to access those null_counts, which would make it very easy to know whether we have a NaN-only column or not.
Your idea is also very good. If we could directly use the metadata num_rows, there would be no need for a compute.
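For what it's worth, pyarrow exposes both pieces of information straight from the parquet footer, without reading any data; a small sketch (the file name is a placeholder):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("run_001.parquet").metadata

# Total row count, read from the footer -- no compute needed.
print(meta.num_rows)

# Per-column null counts live in the row-group statistics.
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        if chunk.statistics is not None:
            print(chunk.path_in_schema, chunk.statistics.null_count)
```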

(Member) replied:

If you find access to this information, that would be great. I always found it weird that it needs to compute this... but I guess there are lazy operations which could delete rows, so it is not really a fixed value.

For now, all my tests returned 0 NaNs, and computing the length takes a few seconds, but if that goes away, even better!

@zain-sohail (Member, Author) commented

> With the correction I implemented, I think this is good to go. Haven't checked tests though; maybe we could implement some proper tests for this too.

Noticed the changes! I think the merged ones can be rolled back, but your other change works well.
Please read my comment on the file.

Now that we have test data access, it would make sense to write a testing framework around it and use that to test this feature.

@zain-sohail (Member, Author) commented

Closing, as a new PR was created: #165.

@zain-sohail closed this on Oct 9, 2023.
@steinnymir (Member) commented
Oops, I did not mean to merge the workflow-related changes here. My bad! Thanks for taking care of this, @zain-sohail.

@zain-sohail deleted the flash-lazy branch on November 7, 2023.