Parallel and epoch based association #277
Conversation
New sources analysis needs updating
@srggrs with the recent updates this is ready for you to start looking at and give your thoughts. Along with the ability to split the input sky regions into groups to associate in parallel, as explained above, the user can now also define epochs, i.e. which images count as one 'epoch' and should be treated as the same overall time. For example, without epoch mode a source ends up with duplicate measurements within each epoch, so users have the choice of what they want to do. Currently epoch mode is activated if the user inputs dictionaries with the keys being the epoch number (for all file inputs), e.g.:
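A rough illustration of what the dictionary-style input could look like; the variable names and file paths below are placeholders, not necessarily the pipeline's exact config keys:

```python
# Placeholder sketch of epoch-keyed inputs: keys are the epoch labels and
# each epoch lists the files that should be treated as the same overall time.
# The same structure would be used for all file inputs (images, selavy
# catalogues, noise and background maps).
IMAGE_FILES = {
    "1": ["epoch01_fieldA.fits", "epoch01_fieldB.fits"],
    "2": ["epoch02_fieldA.fits", "epoch02_fieldB.fits"],
}
SELAVY_FILES = {
    "1": ["epoch01_fieldA.selavy.txt", "epoch01_fieldB.selavy.txt"],
    "2": ["epoch02_fieldA.selavy.txt", "epoch02_fieldB.selavy.txt"],
}
```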
Essentially, internally the images are sorted into 'epoch' groups. Importantly, the user sees no difference if they enter just a list of images: in this case the list is converted into a dictionary where each image is its own epoch (sorted in date order), roughly as sketched below. The association then operates on these epoch groups.
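A minimal sketch of that list-to-dictionary conversion, assuming each image object carries an observation datetime; the helper name and attribute are hypothetical, not the pipeline's actual code:

```python
def images_to_epoch_dict(images):
    """Hypothetical helper: when the user supplies a plain list of images,
    convert it into an epoch dictionary where each image is its own epoch,
    with the epochs numbered in observation-date order."""
    # assumes each image object carries an observation datetime attribute
    sorted_images = sorted(images, key=lambda img: img.datetime)
    return {str(i + 1): [img] for i, img in enumerate(sorted_images)}
```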
After association, the only other change when using epoch mode is the calculation of the 'missing' images used in the new source analysis and forced extraction. Here it checks that each epoch in which the source was supposed to be seen is accounted for; if a forced extraction is required it will again always choose the image from the sky region closest to the source (see the sketch below). There are still a few bits that are not clear on how to do.
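A hedged sketch of that check, not the pipeline's actual implementation: for each epoch that covers the source but has no detection, the image whose centre is closest to the source is picked as the forced-extraction target. The function name, data structures and example values are all assumptions:

```python
from astropy.coordinates import SkyCoord

def forced_extraction_targets(source_coord, detected_epochs, epoch_images):
    """Hypothetical sketch: for each epoch that covers the source but has no
    detection, pick the image whose centre is closest to the source as the
    forced extraction target.

    detected_epochs: set of epoch labels in which the source was detected
    epoch_images: dict mapping epoch label -> list of (image_name, SkyCoord centre)
    """
    targets = {}
    for epoch, images in epoch_images.items():
        if epoch in detected_epochs:
            # this epoch is already accounted for in the source
            continue
        # always choose the image whose centre is closest to the source position
        name, _ = min(images, key=lambda im: source_coord.separation(im[1]))
        targets[epoch] = name
    return targets

if __name__ == "__main__":
    # example usage with placeholder coordinates and file names
    src = SkyCoord(ra=322.5, dec=-4.2, unit="deg")
    epoch_images = {
        "1": [("img_epoch01_A.fits", SkyCoord(ra=322.0, dec=-4.0, unit="deg"))],
        "2": [("img_epoch02_A.fits", SkyCoord(ra=322.0, dec=-4.0, unit="deg"))],
    }
    print(forced_extraction_targets(src, {"1"}, epoch_images))
```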
* Fixed the ordering of epochs in the ideal coverage dataframe. Dask seemed to randomise them.
* Added a check such that if an epoch was already in the source but wasn't the ideal image, the pipeline now won't force extract from the same epoch.
Co-authored-by: Serg <[email protected]>
Also a minor template fix
I see there's a switch to turn this on when creating a pipeline run with the UI, but there doesn't appear to be a way to define the epoch dict that it expects. Is that right?
Co-authored-by: Andrew O'Brien <[email protected]>
Yeah, the parallel option can be taken advantage of whether it's epoch based or not, hence that option in the config. For actually using the epoch based option, which is triggered when dictionaries are entered, my idea at the moment is to have a good documentation page on this, and if users want to set it purely through the website they can use the text editor on the job config page. I think a more sophisticated entry method is definitely required, but that could be done after the fact, partly because I know that initially our usage numbers will be low to begin with (as in actually constructing and running custom jobs).
* Validation is now run before images are linked to selavy, noise and bkg.
* Background images are able to be added even if Monitor is False.
* Check added for whether background images are defined in data linking.
Removed unnecessary indexing on corrections.
@srggrs conflicts resolved (it was just the change log).
I've developed a bit of a parallel association method. Essentially, it sorts the input images into groups of sky regions; these sky region groups don't overlap and hence don't need to be associated together. It then ships the groups off to have the association run in parallel and gathers the results at the end.
It combines the results and corrects the source ID numbers and relation information.
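A minimal sketch of this pattern using Dask delayed tasks, purely illustrative and not the pipeline's actual code; the associate function, the per-group DataFrames and the 'source' column are assumptions, and any relation columns would need the same ID offsetting applied:

```python
import dask
import pandas as pd

def parallel_association(skyreg_groups, associate):
    """Illustrative sketch: run association on each non-overlapping
    sky-region group in parallel, then combine the results while keeping
    the source ID numbers unique across groups.

    skyreg_groups: list of per-group measurement DataFrames (assumed)
    associate: single-group association function returning a DataFrame
               with a 'source' ID column starting from 1 (assumed)
    """
    # ship each group off to be associated in parallel
    tasks = [dask.delayed(associate)(group) for group in skyreg_groups]
    results = dask.compute(*tasks)

    # combine the results, offsetting the source IDs so they stay unique;
    # relation information would need the same offset applied
    combined = []
    offset = 0
    for res in results:
        res = res.copy()
        res["source"] += offset
        offset = int(res["source"].max())
        combined.append(res)
    return pd.concat(combined, ignore_index=True)
```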
I've added this as an option because you should only really see speed improvements if you have a lot of images over a number of sky regions. If you only have a small job it's better to keep it switched off (the initialisation time of the scheduler alone pushes it to be slower).
I've kept this as a draft for now as it needs to be tested on a large dataset first; I've tested it locally and it seems to run fine. This is about as parallel as you could get it at the moment without changing the kind of ASKAP data we are getting. Ideally you could envision having one FITS file and one catalogue per epoch, but of course those would be a nightmare (and a silly idea) to deal with, so we have what we have at the moment.

Fixes #259.