Parallel and epoch based association #277

Merged: 62 commits, Oct 13, 2020

Contributor

@ajstewart ajstewart commented Aug 20, 2020

I've developed a parallel association method. Essentially what it does is:

  • Analyses the sky regions to be processed and groups them into overlapping regions. E.g. the attached plot shows the results of this grouping using the full run sky regions information. Each different colour is a different 'group'.

Screen Shot 2020-08-07 at 23 06 04

  • The groups themselves don't overlap one another and hence don't need to be associated together. Instead, each group is shipped off to have the association run in parallel, and the results are gathered at the end.

  • It combines the results and corrects the source ID numbers and relation information.
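A minimal sketch of the grouping and ID-correction steps listed above, not the code in this PR. It assumes circular sky regions given as (RA, Dec, radius) in degrees, treats two regions as overlapping when their centres are closer than the sum of their radii, and assumes each group's results use local source IDs starting at 1. All function and variable names here are hypothetical.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def group_sky_regions(regions):
    """Map each region id to a group number; overlapping regions share one.

    `regions` maps a region id to (ra_deg, dec_deg, radius_deg).
    Overlap criterion (an assumption): centre separation < sum of radii.
    """
    parent = {rid: rid for rid in regions}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(regions)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ra_a, dec_a, rad_a = regions[a]
            ra_b, dec_b, rad_b = regions[b]
            if angular_sep_deg(ra_a, dec_a, ra_b, dec_b) < rad_a + rad_b:
                parent[find(a)] = find(b)

    # Number the groups 1, 2, ... in order of first appearance.
    roots = {}
    return {rid: roots.setdefault(find(rid), len(roots) + 1) for rid in ids}


def combine_group_results(results):
    """Merge per-group results, shifting source ids so they stay unique.

    `results` is a list (one entry per group) of dicts mapping a local
    source id to the list of local ids it is related to.
    """
    combined = {}
    offset = 0
    for group in results:
        for src_id, related in group.items():
            combined[src_id + offset] = [r + offset for r in related]
        offset += max(group, default=0)
    return combined


regions = {"A": (10.0, -30.0, 3.0), "B": (14.0, -30.0, 3.0), "C": (200.0, 5.0, 3.0)}
print(group_sky_regions(regions))
print(combine_group_results([{1: [2], 2: [1]}, {1: []}]))
```

Because no group shares sky with another, each group can be associated independently and only the IDs (and the relations that reference them) need adjusting when the results are gathered.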

I've added this as an option, as you should only really see speed improvements if you have a lot of images over a number of sky regions. If you only have a small job it's better to keep it switched off (the initialisation time of the scheduler alone makes it slower).

I've kept this as a draft for now as it needs to be tested on a large dataset first; I've tested it locally and it seems to run fine. This is about as parallel as you could get it at the moment without changing the kind of ASKAP data we are getting. Ideally you could envision having one FITS file and one catalogue per epoch, but of course these would be a nightmare (and a silly idea) to deal with. So we have what we have at the moment.

Fixes #259.

@ajstewart ajstewart added the enhancement New feature or request label Aug 20, 2020
@ajstewart
Contributor Author

ajstewart commented Aug 26, 2020

@srggrs with the recent updates this is ready for you to start looking at; let me know what you think.

Along with the ability to split the input sky regions into groups to associate in parallel, as explained above, it now also allows the user to define epochs, i.e. which images count as one 'epoch' and should be treated as the same overall time. For example, without epoch mode the source looks like this because of the duplicate measurements within the epochs:
Screen Shot 2020-08-26 at 15 13 41
Using epoch mode results in the following:
Screen Shot 2020-08-26 at 15 13 52

So users can choose which behaviour they want.

Currently epoch mode is activated if the user inputs dictionaries with the keys being the epoch number (for all file inputs), e.g.:

import glob

IMAGE_FILES = {
    # insert image file path(s) here, keyed by epoch number
    1: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH01/COMBINED/STOKESI_IMAGES/*.fits')),
    2: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH02/COMBINED/STOKESI_IMAGES/*.fits')),
    3: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH03x/COMBINED/STOKESI_IMAGES/*.fits')),
    4: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH04x/COMBINED/STOKESI_IMAGES/*.fits')),
    5: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH05x/COMBINED/STOKESI_IMAGES/*.fits')),
    6: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH06x/COMBINED/STOKESI_IMAGES/*.fits')),
    7: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH07x/COMBINED/STOKESI_IMAGES/*.fits')),
    8: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH08/COMBINED/STOKESI_IMAGES/*.fits')),
    9: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH09/COMBINED/STOKESI_IMAGES/*.fits')),
    10: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH10x/COMBINED/STOKESI_IMAGES/*.fits')),
    11: sorted(glob.glob('/import/ada1/askap/PILOT/release/EPOCH11x/COMBINED/STOKESI_IMAGES/*.fits')),
}

Internally, the images are sorted into 'epoch' groups. Importantly, the user sees no difference if they enter just a list of images: in that case the list is converted into a dictionary where each image is its own epoch (sorted in date order).
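To illustrate the list-to-dictionary conversion, here is a rough sketch under some assumptions: the observation date of each image is supplied in a hypothetical `obs_dates` mapping (in practice it would come from the image itself), and the helper names are made up for this example rather than taken from the PR.

```python
from datetime import datetime


def epoch_mode_requested(image_files):
    """Epoch mode is triggered simply by the input being a dictionary."""
    return isinstance(image_files, dict)


def to_epoch_dict(image_files, obs_dates):
    """Return {epoch: [image paths]} for either style of user input.

    A dict is taken to be already keyed by epoch. A plain list is sorted
    by observation date and each image becomes its own epoch, numbered
    from 1. `obs_dates` maps an image path to its datetime (hypothetical).
    """
    if epoch_mode_requested(image_files):
        return {epoch: list(paths) for epoch, paths in image_files.items()}
    ordered = sorted(image_files, key=lambda path: obs_dates[path])
    return {epoch: [path] for epoch, path in enumerate(ordered, start=1)}


dates = {
    "a.fits": datetime(2020, 3, 1),
    "b.fits": datetime(2020, 1, 2),
}
print(to_epoch_dict(["a.fits", "b.fits"], dates))
print(to_epoch_dict({1: ["a.fits", "b.fits"]}, dates))
```

With this shape the rest of the association only ever deals with the dictionary form, whichever way the user supplied the images.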

In the association what happens is:

  • The prep_df for the association now takes into account epochs. It will load all measurements from all the images requested in the epoch.
  • The measurements are then analysed to drop duplicates. A new user config option, duplicate_radius, governs what is considered to be a duplicate source. The most 'central' measurement is always kept as the actual measurement, i.e. the one closest to the centre of its respective sky region (in radio data this is commonly considered to be the most reliable data point).
  • Perform association as normal.
  • If parallel is selected it will do this per sky region group.
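The duplicate-dropping step within one epoch could look roughly like the sketch below. It is an illustration only: it assumes `duplicate_radius` is in arcseconds, that each measurement carries a precomputed `dist_from_centre` (distance to the centre of its own sky region), and that only measurements from different images count as duplicates of each other. The field and function names are hypothetical.

```python
import math


def sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600


def drop_duplicates(measurements, duplicate_radius):
    """Keep the most central measurement of each set of duplicates.

    Measurements are visited from most to least central, so whenever two
    of them fall within `duplicate_radius` the more central one has
    already been kept and the other is dropped.
    """
    kept = []
    for m in sorted(measurements, key=lambda m: m["dist_from_centre"]):
        is_duplicate = any(
            k["image"] != m["image"]
            and sep_arcsec(k["ra"], k["dec"], m["ra"], m["dec"]) <= duplicate_radius
            for k in kept
        )
        if not is_duplicate:
            kept.append(m)
    return kept


epoch_measurements = [
    {"id": "m1", "image": "img1", "ra": 10.0, "dec": -30.0, "dist_from_centre": 2.0},
    {"id": "m2", "image": "img2", "ra": 10.0002, "dec": -30.0, "dist_from_centre": 0.5},
    {"id": "m3", "image": "img1", "ra": 11.0, "dec": -30.0, "dist_from_centre": 1.0},
]
print([m["id"] for m in drop_duplicates(epoch_measurements, duplicate_radius=2.5)])
```

Here m1 and m2 are the same source seen in two overlapping images of one epoch, and m2 wins because it sits closer to the centre of its sky region.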

After association, the only other change when using epoch mode is the calculation of the 'missing' images used in new sources and forced extraction. Here it checks that each epoch the source was supposed to be seen in is accounted for. If a forced extraction is required, it will again always choose the image from the sky region closest to the source.
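As a sketch of that check for a single source (not the pipeline's actual code): assume we already know, per epoch, which images cover the source and how far the source is from each image's sky region centre, plus the set of epochs in which it was detected. The input layout and names below are hypothetical.

```python
def forced_extraction_images(expected, detected_epochs):
    """Pick one image per missing epoch for forced extraction.

    `expected` maps an epoch to a list of (image, dist_from_centre)
    pairs for the images whose sky region covers the source.
    `detected_epochs` is the set of epochs that already hold a
    measurement of the source; those epochs are skipped entirely.
    """
    chosen = {}
    for epoch, candidates in expected.items():
        if epoch in detected_epochs or not candidates:
            continue
        # Always take the image whose sky region centre is closest.
        image, _ = min(candidates, key=lambda c: c[1])
        chosen[epoch] = image
    return chosen


expected = {
    1: [("e1_a.fits", 1.2), ("e1_b.fits", 0.4)],
    2: [("e2_a.fits", 0.9)],
    3: [("e3_a.fits", 2.0), ("e3_b.fits", 2.5)],
}
print(forced_extraction_images(expected, detected_epochs={2}))
```

Working per epoch rather than per image means an epoch that already has a measurement is never force extracted again, even if that measurement did not come from the most central image.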

Bits where it's not clear how to proceed:

  • How to alter the UI such that users can easily use the website to set up such a run. However, I'm not too fussed about this as I consider it an advanced mode that you should use only if you know what you are doing, so allowing the user to edit the config in the website may be enough.
  • Related to the above, there isn't an explicit switch to turn on; it's just activated when the user enters dicts.
  • Pipeline run may need to specify that it was run in epoch mode?

@ajstewart ajstewart requested a review from srggrs August 26, 2020 05:42
* Fixed the ordering of epochs in the ideal coverage dataframe. Dask seemed to randomise them.
* Added a check such that if an epoch was already in the source but wasn't the ideal image, the pipeline now won't force extract from the same epoch.
@ajstewart ajstewart requested a review from marxide October 2, 2020 15:27
@marxide
Contributor

marxide commented Oct 6, 2020

I see there's a switch to turn this on when creating a pipeline run with the UI, but there doesn't appear to be a way to define the epoch dict that it expects. Is that right?

@ajstewart
Contributor Author

> I see there's a switch to turn this on when creating a pipeline run with the UI, but there doesn't appear to be a way to define the epoch dict that it expects. Is that right?

Yeah, the parallel option can be used whether or not the run is epoch based, hence that option in the config.

For actually using the epoch based option, triggered when dictionaries are entered, my idea at the moment is to have a good documentation page on this and if users want to set this (purely through the website) they can use the text editor on the job config page.

I think a more sophisticated entry method is definitely required but could be done after the fact, partly because I know that our usage numbers will be low to begin with (as in actually constructing and running custom jobs).

* Validation is now run before images are linked to selavy, noise and bkg.
* Background images are able to be added even if Monitor is False.
* Check added for whether background images are defined in data linking.
marxide previously approved these changes Oct 12, 2020
@ajstewart
Contributor Author

@srggrs conflicts resolved (it was just the change log).

@ajstewart ajstewart requested a review from srggrs October 13, 2020 04:52
@ajstewart ajstewart merged commit 1abc88f into master Oct 13, 2020
@ajstewart ajstewart deleted the epoch-based-association branch October 13, 2020 07:36
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Implement a user defined 'epoch association' approach
3 participants