Stack overwrites previous data? (plus other issues) #371
On a similar note re. stacking (subdaily) and real-time processing: if I imagine pulling in one day of data at a time and stacking subdaily, e.g. a 12 hr window with a 1 hr sampling rate, then based on the current implementation, would I expect the first 11 stacks to be constructed using less data? E.g. the first stack from only one hour, the second from only two hours, etc. I see some old comment/code from msnoise 1.6 (below) suggesting that we should be pulling the updated days - mov_stack, but I don't see this in the current get_results_all function. It appears to me, just from the code, that it is reading only the .h5 files for the individual days where flag='T', not any previous ones (so it would not stack properly). Am I wrong?
Also, the current implementation presumably assumes that new CCF jobs are adjacent in time, but that won't necessarily always be the case (it often is, but if someone got access to more data for different time periods, these would all go into the same pandas dataframe prior to the rolling average). I guess simply resampling to corr_duration and filling with NaN prior to applying the rolling mean (stacking) will do the trick.
Re NaN + mov: the question is also whether we re-mask after the roll? I would say yes, we don't want to create data.
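For concreteness, a minimal sketch of that combination (resample onto a regular grid, roll, then re-mask), assuming `ccfs` is a pandas DataFrame of sub-daily CCFs indexed by window start time; the `corr_duration`/`mov_stack` values and the function name are illustrative, not the actual MSNoise implementation:

```python
# Minimal sketch, not MSNoise code: put sub-daily CCFs on a regular time base,
# roll over the moving-stack window, then re-mask so the roll does not invent
# values at times where there was no original data.
import pandas as pd

def moving_stack(ccfs: pd.DataFrame, corr_duration="1h", mov_stack="12h") -> pd.DataFrame:
    # Resample onto a regular corr_duration grid; missing windows become NaN rows.
    regular = ccfs.resample(corr_duration).mean()

    # Time-based rolling mean over the moving-stack length; min_periods=1 so
    # partially filled windows (e.g. the first hours of the archive) still stack.
    stacked = regular.rolling(mov_stack, min_periods=1).mean()

    # Re-mask: only keep stacked values where an original window existed,
    # then drop the all-NaN rows.
    return stacked.where(regular.notna()).dropna(how="all")
```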
Yeah, I think that's a good idea. I was similarly thinking, for fixing the issue of not pulling in prior data for stacking, that we could pull in additional CCFs via get_results_all (by modifying the datetime.datetime list input), and similarly remove the additional 'past' days post-roll.
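A hedged sketch of that proposal, with illustrative names only (`read_ccfs` stands in for a call like get_results_all; the real signature will differ): extend the requested days backwards by the moving-stack length, roll, then drop the padding days from the output.

```python
# Illustrative only: pad the requested days backwards by the moving-stack
# length so the first "new" day sees a full window, roll, then keep only the
# days that were actually in the new job list.
import datetime
import pandas as pd

def stack_new_days(read_ccfs, new_days, mov_stack_days=5):
    first_new = min(new_days)
    padding = [first_new - datetime.timedelta(days=d) for d in range(mov_stack_days, 0, -1)]
    ccfs = read_ccfs(padding + sorted(new_days))   # DataFrame indexed by CCF time (assumed)

    # Roll over the padded data so the earliest new day has full history...
    stacked = ccfs.rolling(f"{mov_stack_days}D", min_periods=1).mean()

    # ...then drop the padding days again before writing results out.
    keep = stacked.index.normalize().isin(pd.to_datetime(sorted(new_days)).normalize())
    return stacked[keep]
```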
In fact, it's okay re. rolling with gaps. I hadn't noticed that a resample line already exists, so it's already filling with NaN and then dropping NaN after. So it's just the issue of using past data that's left to fix.
Hehe ok! And reading "enough" data to allow the rolling stats! |
This seems very problematic also: it looks to me like the reference will be built only with recently processed CCFs. A quick check using a print statement seems to confirm this (printing '_' to see if it contains only newly processed data).
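For illustration only (a hypothetical `read_ccfs` helper, not the actual stack code), the distinction being flagged here is roughly:

```python
# Hypothetical illustration of the concern (read_ccfs stands in for whatever
# loads daily CCFs into a DataFrame indexed by date; not actual MSNoise code).
def build_reference(read_ccfs, days):
    # The reference stack should be the mean over *all* available days,
    # so `days` must cover the whole archive, not just the days whose
    # jobs were flagged 'T' in the current run.
    return read_ccfs(days).mean(axis=0)
```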
Very bad idea indeed :-) As said, the stack2 stuff was written for the "process all archive" case, which is a bad idea. Actually, I thought of moving the ref stack out of the "jobs" world, if you run
I think that's a good idea; running the stack reset between the commands always felt a bit untidy. I can make the change and update the pull request. Actually, I'd forgotten about the existence of the -s job ^^. I guess because it's easy to extract anyway from the rolling output; but I would also assume it's now redundant if the tuple mov_stack is working as intended.
It'd be amazing if you could indeed do all this in that PR! Please make sure to adapt the documentation :) (multiple places, including the how-tos :) )
Taking a look at the stacking code, it seems like the previously stacked data would always be overwritten by new days (e.g. those with flag='T'), since overwrite is set to True by default.
I also ran a quick test to confirm this is happening. Is this intended behaviour? I.e., if you were running in real time, the stacks directory would always just contain stacked CCFs for the most recent data.
I'm just thinking that perhaps the idea is to continue through and process dv/v, updating the output files (assuming these are not overwritten, just have the new results inserted), and not worry about keeping old stacked data (since it is all contained within the cross-correlation directory). But then, the plotting functions should also use the cross-correlation directory, not the stacked data.
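For what it's worth, a sketch of the "update rather than overwrite" behaviour being discussed, assuming the stacks are stored as xarray Datasets in per-pair files (the file layout and saving logic here are assumptions, not MSNoise's actual storage helpers):

```python
# A sketch only (file layout and use of xarray are assumptions, not the actual
# MSNoise storage code): merge newly stacked days into whatever is already on
# disk instead of overwriting the file, so older stacked CCFs are preserved.
import os
import xarray as xr

def save_stack(path: str, new_stack: xr.Dataset) -> None:
    if os.path.exists(path):
        existing = xr.load_dataset(path)            # load fully so the file handle is released
        merged = new_stack.combine_first(existing)  # keep old days, add/replace the new ones
    else:
        merged = new_stack
    merged.to_netcdf(path)
```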