
Commit

corrected map_overlap length to minimum partition size
steinnymir committed Oct 9, 2023
1 parent 74df4e5 commit 5b69796
Showing 1 changed file with 8 additions and 1 deletion.
9 changes: 8 additions & 1 deletion sed/loader/flash/loader.py
@@ -15,6 +15,7 @@
 from typing import Union
 
 import dask.dataframe as dd
+from dask.diagnostics import ProgressBar
 import h5py
 import numpy as np
 from joblib import delayed
@@ -742,8 +743,14 @@ def forward_fill_partition(df):
 df[channels] = df[channels].ffill()
 return df
 
+# calculate the number of rows in each partition
+with ProgressBar():
+    print("Computing dataframe shape...")
+    nrows = dataframe.map_partitions(len).compute()
+
@zain-sohail (Member) commented on Oct 9, 2023

These values should generally be in the Parquet metadata, as the num_rows statistics. Those statistics, which Dask also loads, include metadata such as null_count:
https://github.com/dask/dask/blob/928a95aa56f60da33a4e724ea2ca97797c612968/dask/dataframe/io/parquet/core.py#L569
That is what I am trying to figure out: how to access those null_count values, which would make it very easy to know whether a column contains only NaNs.
Your idea is also very good. If we can use the metadata num_rows directly, there is no need for a compute.

+max_part_size = min(nrows)
+
 # Use map_overlap to apply forward_fill_partition
-dataframe = dataframe.map_overlap(forward_fill_partition, before=0, after=1)
+dataframe = dataframe.map_overlap(forward_fill_partition, before=max_part_size+1, after=0)
 
 # Remove the NaNs from per_electron channels
 dataframe = dataframe.dropna(

0 comments on commit 5b69796
