
Stack overwrites previous data? (plus other issues) #371

Open · asyates opened this issue Sep 26, 2024 · 10 comments

Comments

asyates (Contributor) commented Sep 26, 2024

Taking a look at the stacking code, it seems that previously stacked data will always be overwritten by new days (e.g. with flag='T'), since overwrite is set to True by default:

xr_save_ccf(sta1, sta2, components, filterid, mov_stack, taxis, xx, overwrite=True)

I ran a quick test to confirm this is happening. Is this intended behaviour? i.e., if you were running in real time, the stacks directory would only ever contain stacked CCFs for the most recent data.

Perhaps the idea is to continue on and process dv/v, updating the output files (assuming these are not overwritten but have new results inserted), and not worry about keeping old stacked data (since it is all contained within the cross-correlation directory). But in that case, the plotting functions should also use the cross-correlation directory, not the stacked data.
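One alternative to overwriting would be to merge new stacks into the existing ones. A minimal sketch of that idea using xarray (the coordinate names here mirror the MSNoise layout, but `merge_stacks` is a hypothetical helper, not part of the actual API):

```python
# Hypothetical sketch: merge newly stacked CCFs into previously saved ones
# instead of overwriting. Where dates overlap, the new values win.
import numpy as np
import pandas as pd
import xarray as xr

def merge_stacks(existing: xr.DataArray, new: xr.DataArray) -> xr.DataArray:
    """Union of both 'times' axes; overlapping dates take the new values."""
    return new.combine_first(existing).sortby("times")

taxis = np.linspace(-10, 10, 5)
old = xr.DataArray(np.zeros((2, 5)),
                   coords={"times": pd.date_range("2024-09-01", periods=2),
                           "taxis": taxis},
                   dims=("times", "taxis"))
new = xr.DataArray(np.ones((2, 5)),
                   coords={"times": pd.date_range("2024-09-02", periods=2),
                           "taxis": taxis},
                   dims=("times", "taxis"))
merged = merge_stacks(old, new)
print(merged.sizes["times"])  # 3: old day kept, overlapping day updated
```

This keeps the stacks directory cumulative while still letting a re-run refresh any day that was re-processed.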

asyates commented Sep 26, 2024

On a similar note re. stacking (sub-daily) and real-time processing: imagine pulling in one day of data at a time and stacking sub-daily, e.g. 12 h windows with a 1 h sampling rate. Based on the current implementation, would I expect the first 11 stacks to be constructed from less data? e.g. the first stack from only one hour, the second from only two hours, etc.

I see an old comment/code from MSNoise 1.6 (below) suggesting that we should be pulling the updated days minus mov_stack, but I don't see this in the current get_results_all function. It appears, just from the code, that it is reading only the .h5 files for individual days where flag='T', not any previous days (so it would not stack properly). Am I wrong?

                    # TODO: load only the updated dates +- max(mov_stack)+1
                    # Note: this would no longer be needed if the stack is h5
                    nstack, stack_total = get_results(
                        db, sta1, sta2, filterid, components, datelist, format=format, params=params)
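The partial-window concern above is easy to reproduce with pandas' rolling mean, which is what the xarray stacking ultimately relies on (window sizes here are illustrative):

```python
# Sketch of the partial-window concern: with a time-based rolling mean, the
# first windows are built from fewer samples unless min_periods enforces a
# full window.
import numpy as np
import pandas as pd

hours = pd.date_range("2024-09-26", periods=24, freq="h")
ccfs = pd.Series(np.arange(24, dtype=float), index=hours)

partial = ccfs.rolling("12h").mean()            # first 11 stacks use < 12 hrs
full = ccfs.rolling(12, min_periods=12).mean()  # NaN until a full 12-h window

print(int(partial.notna().sum()))  # 24: every stack produced, early ones partial
print(int(full.notna().sum()))     # 13: only full 12-sample windows survive
```

So whether the first stacks of a run are "short" depends entirely on whether the history is loaded and whether min_periods is set to the full window.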

@asyates asyates changed the title Stack overwrites previous data? Stack overwrites previous data? (plus other issues) Sep 26, 2024
asyates commented Sep 26, 2024

Also, the current implementation presumably assumes that new CCF jobs are adjacent in time, which won't necessarily always be the case (it often is, but if someone gained access to more data for different time periods, these would all go into the same pandas DataFrame prior to the rolling average).

I guess simply resampling to corr_duration and filling with NaN prior to applying the rolling mean (stacking) would do the trick.
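The resample-then-roll idea could be sketched as follows (the 1800 s corr_duration is an illustrative value, not read from any config):

```python
# Sketch: resampling to the correlation window length inserts NaN rows for
# missing periods, so a subsequent rolling mean does not mix non-adjacent CCFs.
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2024-09-26 00:00", "2024-09-26 00:30",
                      "2024-09-27 00:00"])        # a one-day gap
ccfs = pd.Series([1.0, 2.0, 3.0], index=idx)

regular = ccfs.resample("1800s").asfreq()          # gap filled with NaN
stacked = regular.rolling(4, min_periods=1).mean() # NaN slots are skipped

print(int(regular.isna().sum()))  # 46 missing half-hour slots in the gap
```

After resampling, the rolling window only ever sees slots that are the right distance apart in time, whether or not data exists for them.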

ThomasLecocq (Member) commented
Re NaN + mov: the question is also whether we re-mask after the roll? I would say yes; we don't want to create data.
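The re-masking idea could be sketched like this (a minimal pandas sketch; window sizes and values are illustrative):

```python
# Sketch of "re-mask after rolling": remember where data was missing, roll
# with min_periods=1 so gaps don't kill neighbouring stacks, then blank the
# stacks that fall on originally-missing dates so no data is invented.
import numpy as np
import pandas as pd

days = pd.date_range("2024-09-20", periods=7, freq="D")
ccfs = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, 7.0], index=days)

missing = ccfs.isna()                           # mask BEFORE rolling
stacked = ccfs.rolling(3, min_periods=1).mean()
stacked[missing] = np.nan                       # re-mask: no stack on empty days

print(int(stacked.notna().sum()))  # 5: the two gap days stay empty
```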

asyates commented Sep 26, 2024

Yeah, I think that's a good idea.

I was similarly thinking, for fixing the issue of not pulling in prior data for stacking, that we could pull in additional CCFs via get_results_all (by modifying the datetime.datetime list input), and similarly remove the additional 'past' days post-roll.
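A hypothetical sketch of that extend-then-trim approach (load_ccfs is a placeholder for whatever wraps get_results_all, not the actual MSNoise API):

```python
# Hypothetical sketch: extend the requested date range backwards by the
# largest mov_stack so the rolling window has its history, then drop the
# extra leading days after stacking.
import pandas as pd

def stack_with_history(load_ccfs, days, mov_stack_days=5):
    """load_ccfs(datelist) -> pd.Series of daily CCFs indexed by date."""
    first = min(days)
    # pull extra past days so the first requested day has a full window
    extended = pd.date_range(first - pd.Timedelta(days=mov_stack_days),
                             max(days), freq="D")
    ccfs = load_ccfs(list(extended))
    stacked = ccfs.rolling(mov_stack_days, min_periods=1).mean()
    return stacked.loc[stacked.index >= pd.Timestamp(first)]  # trim past days

# demo with a stand-in loader returning constant CCF amplitude
demo = stack_with_history(
    lambda dl: pd.Series(1.0, index=pd.DatetimeIndex(dl)),
    days=pd.date_range("2024-09-26", periods=3, freq="D"))
print(len(demo))  # 3: only the requested days remain after trimming
```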

asyates commented Sep 26, 2024

In fact, rolling with gaps is okay. I hadn't noticed that a resample line already exists, so it is already filling with NaN and then dropping NaN afterwards. So only the issue of using past data remains to be fixed.

ThomasLecocq (Member) commented
Hehe ok! And reading "enough" data to allow the rolling stats!

asyates commented Sep 27, 2024

This also seems very problematic:

                c = get_results_all(db, sta1, sta2, filterid, components, days, format="xarray")
                # print(c)
                # dr = xr_save_ccf(sta1, sta2, components, filterid, 1, taxis, c)
                dr = c
                if stype == "ref":
                    start, end, datelist = build_ref_datelist(db)
                    start = np.array(start, dtype=np.datetime64)
                    end = np.array(end, dtype=np.datetime64)
                    _ = dr.where(dr.times >= start, drop=True)
                    _ = _.where(_.times <= end, drop=True)
                    # TODO add other stack methods here! using apply?
                    _ = _.mean(dim="times")
                    xr_save_ref(sta1, sta2, components, filterid, taxis, _)
                    continue

It looks to me like the reference will be built only from recently processed CCFs. A quick check using a print statement seems to confirm this (printing '_' to see whether it contains only newly processed data).
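If get_results_all is fed only the freshly processed days, the where(dr.times >= start) filter cannot restore dates that were never loaded. A fix would request the whole reference period instead of the job days; a minimal sketch (build_ref and load_all are placeholder names, not the actual MSNoise functions):

```python
# Sketch: build the REF from the full reference period, not the job days.
import numpy as np
import pandas as pd
import xarray as xr

def build_ref(load_all, ref_start, ref_end):
    """load_all(datelist) -> xr.DataArray with a 'times' coordinate."""
    datelist = pd.date_range(ref_start, ref_end, freq="D")  # full REF period
    dr = load_all(list(datelist))
    dr = dr.where((dr.times >= np.datetime64(ref_start)) &
                  (dr.times <= np.datetime64(ref_end)), drop=True)
    return dr.mean(dim="times")  # TODO: other stack methods here

# demo with a stand-in loader returning constant CCFs over 4 lag samples
ref = build_ref(
    lambda dl: xr.DataArray(np.ones((len(dl), 4)),
                            coords={"times": pd.DatetimeIndex(dl),
                                    "taxis": np.arange(4)},
                            dims=("times", "taxis")),
    "2024-01-01", "2024-01-10")
print(ref.shape)  # (4,): one mean CCF over the whole reference period
```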

ThomasLecocq (Member) commented
Very bad idea indeed :-) As said, the stack2 stuff was written for the "process the whole archive" case, which is a bad idea. Actually, I thought of moving the ref stack out of the "jobs" world: if you run stack -r you re-compute the REF, that's it, no need to check for jobs or anything else. That way, the STACK jobs will only be used for the -m or -s jobs. (To check: does -S have any meaning now that you can do the ('30d', '30d') mov_stack?)

asyates commented Sep 30, 2024

I think that's a good idea; running the stack reset between the commands always felt a bit untidy. I can make the change and update the pull request.

Actually, I'd forgotten about the existence of the -s job ^^. I guess because it is easy to extract anyway from the rolling output; but I would also assume it is now redundant if the tuple mov_stack is working as intended.

ThomasLecocq (Member) commented
It'd be amazing if you could do all this indeed in that PR! Please make sure to adapt the documentation :) (multiple places, including the how-tos :) )
