Refactored flashloader can't normalize data #459
This is somehow related to the BAM correction. In the timed_dataframe, the delayStage positions are not the same as in the dataframe after BAM correction if the parquets are created with the new flash loader. Currently, no idea why.
The computation time with the parquets generated by the new loader is also more than twice as long.
I think the slowdown is because of this section being offloaded to dask: sed/sed/loader/flash/loader.py, lines 610 to 614 in 86978c0
This was a feature requested by Davide (#307): the timed dataframe should contain all available values from pulses, not just those where electrons are detected.
Indeed, but these should also only contain valid pulses, where actual shots exist.
Yes, this comes from the flash data structure of storing 1000 pulses for each train.
This is just because the NaN drop takes place after loading the buffer files.
That's not the reason. Even if I disable the sector split, it's still slow.
There should be a permanent fix. Probably best to look for the first nonzero entry from the back or so. Another config entry is something I would avoid if possible.
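A minimal sketch of what "looking for the first nonzero entry from the back" could mean, assuming each train carries a fixed block of 1000 pulse slots padded with zeros at the end; the function and variable names are made up for illustration, not the loader's actual code:

```python
import numpy as np

def last_valid_pulse(pulse_values: np.ndarray) -> int:
    """Index one past the last nonzero entry, scanning from the back."""
    nonzero = np.flatnonzero(pulse_values)
    return int(nonzero[-1]) + 1 if nonzero.size else 0

train = np.zeros(1000)          # 1000 pulse slots stored per train
train[:412] = 1.0               # pretend only the first 412 pulses carry data
print(last_valid_pulse(train))  # -> 412
```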
Might be that it's slow because of the indexes.
If directly loading the parquet files, the first thing to notice is that the new loader produces ~6.5 times more entries compared to the old one, and the number of entries and remaining NaNs is also quite different:
New:
Clearly, the performance with the old files is much superior, and their storage size is also a factor of 2 or so smaller. This should be changed again.
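For reference, a comparison of this kind can be reproduced roughly as below; the file names are placeholders, not the actual buffer file paths:

```python
import pandas as pd

old = pd.read_parquet("old_loader_output.parquet")
new = pd.read_parquet("new_loader_output.parquet")

print(len(new) / len(old))   # entry ratio (~6.5 reported above)
print(old.isna().sum())      # remaining NaNs per column, old loader
print(new.isna().sum())      # remaining NaNs per column, new loader
```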
Regarding the timed_dataframe, does it really make a difference if this is computed on the pulse rather than the train level? This would make things much simpler and more performant. I think it's worth checking if that makes a practical difference, e.g. if you take the mean of a train and normalize by that.
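A small sketch of the train-level idea, assuming a toy "monitor" column; "trainId" follows the usual FLASH naming, everything else is made up for illustration:

```python
import pandas as pd

df_timed = pd.DataFrame({
    "trainId": [1, 1, 1, 2, 2, 2],
    "monitor": [0.9, 1.1, 1.0, 2.1, 1.9, 2.0],
})

train_mean = df_timed.groupby("trainId")["monitor"].mean()             # one value per train
normalized = df_timed["monitor"] / df_timed["trainId"].map(train_mean)  # normalize by train mean
print(normalized)
```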
Yes, this is simply because we are not dropping NaNs from electron columns before saving the files (as was done before): sed/sed/loader/flash/loader.py, lines 610 to 614 in 86978c0
If we do that, the dataframe becomes much smaller. This doesn't work if we want to keep the pulse data even without electrons. One solution could be to save a df and a df_timed parquet and then load both per file (see the sketch below). Technically it's also possible to save a parquet for each run and just divide the 'per file' data into row groups.
I can of course bring back the old behavior (saving after dropping NaNs) so we can run tests to see if that is the performance bottleneck.
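A rough sketch of the "two parquets per file" idea, with hypothetical electron column names; this is not the actual loader code:

```python
import pandas as pd

def save_buffer(df: pd.DataFrame, df_timed: pd.DataFrame, stem: str) -> None:
    electron_cols = ["dldPosX", "dldPosY"]            # hypothetical electron channels
    # old behavior: drop rows without detected electrons before writing
    df.dropna(subset=electron_cols).to_parquet(f"{stem}_electron.parquet")
    # keep every valid pulse in the timed dataframe, even without electrons
    df_timed.to_parquet(f"{stem}_timed.parquet")
```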
I thought we just had a discussion today about the monochromator and using the pulse channel for it instead of the train channel. Perhaps I misunderstood, but I can't shed light on whether it's useful or not. @balerion, if you are still using sed, could you explain your use case?
As expected, from my tests, binning performance stays the same if I remove NaNs before saving (the index is then monotonic, like in the main branch version). You can test from branch fix-459.
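The monotonicity mentioned here can be checked directly on a loaded buffer file, e.g. (the path is a placeholder):

```python
import pandas as pd

df = pd.read_parquet("buffer_file.parquet")
print(df.index.is_monotonic_increasing)   # expected True when NaNs are dropped before saving
```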
Closed by #469.