Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support HEALpix-indexed AGASC HDF5 files and more #155

Merged
merged 34 commits into from
Oct 3, 2023
Merged

Conversation

taldcroft
Copy link
Member

@taldcroft taldcroft commented Sep 12, 2023

Description

This PR makes a broad set of updates to the agasc package:

  • Support HEALpix-indexed AGASC HDF5 files. This is the default configuration from AGASC version 1.8 and forward.
  • Improve the utility script create_derived_agasc_h5.py to make it more general and make the options independent and atomic.
  • Provide a well-defined file naming and file discovery convention which is exposed via a new public function get_agasc_file.
  • Refactor create_derived_agasc_h5.py and agasc.py into smaller well-documented sub-functions.
  • Add a healpix.py module as the place for most HEALpix related functionality.
  • Add a cache keyword to get_agasc_cone for performance-critical applications requiring repeated cone searches. This will read the AGASC file into memory and use that for subsequent cone searches.

To do:

  • Unit tests
  • Functional testing
    • Performance for get_star() and get_stars() and get_agasc_cone()
    • Memory profiling
    • Integration testing with proseco and sparkles (running unit tests in other packages)
  • Narrative docs

Closes #152

Interface impacts

  • Changed the default AGASC file from miniagasc.h5 to the latest version of proseco_agasc (e.g. proseco_agasc_1p8.h5) in the default AGASC directory.
  • The symlink miniagasc.h5 pointing to the latest miniagasc_1pN.h5 is no longer consider an official part of the AGASC data files.
  • Tests now use new files in $SKA/data/agasc:
    • agasc_healpix_1p7.h5 (HEALpix-ordered version of agasc1p7.h5)
    • proseco_agasc_1p8[rcN].h5

Testing

Unit tests

  • Mac
  • Linux
  • Windows

Independent check of unit tests by Jean

Functional tests

Memory performance (integration with sparkles/proseco)

From within the dev directory in a ska3-dev environment.

Dev (peak use 123 Mb): memray-flamegraph-profile-memory-dev

(ska3-dev) ➜  dev git:(healpix) ✗ memray run -o memray-profile-memory-dev.bin profile_memory.py
Writing profile results into memray-profile-memory-dev.bin
agasc.__version__ = 4.15.1.dev31+g4a5f473
proseco.__version__ = 5.10.1.dev7+g75bed68
sparkles.__version__ = 4.23.1.dev3+g1bc9513
...
Wrote memray-flamegraph-profile-memory-flight.html

Flight (peak use 283 Mb): memray-flamegraph-profile-memory-flight.html

(ska3-dev) ➜  dev git:(healpix) ska3
(ska3) ➜  dev git:(healpix) memray run -o memray-profile-memory-flight.bin profile_memory.py
Writing profile results into memray-profile-memory-flight.bin
agasc.__version__ = 4.15.0
proseco.__version__ = 5.9.0
sparkles.__version__ = 4.22.2
...
Wrote memray-flamegraph-profile-memory-flight.html
JC (for the record prior to further memory optimization)

For brief initial memory profiling, I ran

import agasc
cone = agasc.get_agasc_cone(10, 20,
                            agasc_file='/Users/jean/ska/data/agasc/proseco_agasc_1p7.h5')

and

import agasc
cone = agasc.get_agasc_cone(10, 20,
                            agasc_file='/Users/jean/ska/data/agasc/proseco_agasc_1p8rc4.h5')

with this PR with memray memory profiler. In aggregate, the 1p7 version showed peak resident memory use of 280Mb. The 1p8 healpix version showed peak resident memory use of 154Mb.

Speed performance

From within the agasc git repository:

ska3-kady$ cd dev
Flight agasc package with Dec-ordered AGASC 1.7 file
ska3-kady$ python profile_get_functions.py ~/ska/data/agasc/agasc1p7.h5 --n-cone=100
AGASC module version: 4.15.0
AGASC module file: /proj/sot/ska3/flight/lib/python3.10/site-packages/agasc/__init__.py
AGASC file: /home/aldcroft/ska/data/agasc/agasc1p7.h5
agasc_get_cone: time for 100 queries: 62.55 s
get_star: time for 200 queries: 9.70 s
get_stars: time for 10 queries of 200 stars:  3.26 s
This PR agasc with Dec-ordered AGASC 1.7 file

The expectation is to see similar performance.

ska3-kady$ env PYTHONPATH=.. python profile_get_functions.py ~/ska/data/agasc/agasc1p7.h5 --n-cone=100
AGASC module version: 4.15.1.dev28+g743285d
AGASC module file: /data/baffin/tom/git/agasc/agasc/__init__.py
AGASC file: /home/aldcroft/ska/data/agasc/agasc1p7.h5
agasc_get_cone: time for 100 queries: 60.81 s
get_star: time for 200 queries: 6.48 s
get_stars: time for 10 queries of 200 stars:  4.10 s
This PR agasc with HEALpix-ordered AGASC 1.7 file

This should be faster.

ska3-kady$ env PYTHONPATH=.. python profile_get_functions.py ~/ska/data/agasc/agasc_healpix_1p7.h5 --n-cone=100
AGASC module version: 4.15.1.dev28+g743285d
AGASC module file: /data/baffin/tom/git/agasc/agasc/__init__.py
AGASC file: /home/aldcroft/ska/data/agasc/agasc_healpix_1p7.h5
agasc_get_cone: time for 100 queries: 14.72 s
get_star: time for 200 queries: 7.11 s
get_stars: time for 10 queries of 200 stars:  3.45 s

Caching performance improvement

Using cache=True for 100 cone searches improves the speed by a factor of two. This uses the dev/profile_cache.py script.

kady (slow NFS)
ska3-kady$ python profile_cache.py 
filename=proseco_agasc_1p7.h5 cache=False 18.42
filename=proseco_agasc_1p7.h5 cache=True 7.22
filename=proseco_agasc_1p8rc4.h5 cache=False 16.78
filename=proseco_agasc_1p8rc4.h5 cache=True 7.06

Mac laptop (fast SSD)

filename=proseco_agasc_1p7.h5 cache=False 1.42
filename=proseco_agasc_1p7.h5 cache=True 0.76
filename=proseco_agasc_1p8rc4.h5 cache=False 1.01
filename=proseco_agasc_1p8rc4.h5 cache=True 0.55

@taldcroft taldcroft changed the title WIP: Support HEALpix-indexed AGASC HDF5 files Support HEALpix-indexed AGASC HDF5 files Sep 19, 2023
create_mini_agasc_h5.py Outdated Show resolved Hide resolved
agasc/agasc.py Outdated Show resolved Hide resolved
@taldcroft
Copy link
Member Author

taldcroft commented Sep 22, 2023

@jskrist @jeanconn - I changed the file selection logic and added a bunch of docs and examples.

I'm worried this is too complicated or overdesigned given that it somewhat requires pseudo-code to explain. That's always a bad sign. But I couldn't come up with something simple that handles our key use cases:

  • Select the latest version of any flavor of the AGASC (mini, proseco, full AGASC) using the function kwarg or environment variable.
  • Select a particular version of any flavor with kwarg or env var.
  • Select any arbitrary file.

Note that this version requires that MATLAB set the AGASC_DIR appropriately for that environment (unless $SKA/data/agasc will work).

agasc/agasc.py Outdated Show resolved Hide resolved
@jeanconn
Copy link
Contributor

Originally I had been thinking that the strategy to provide support for promotion of AGASC 1.8 was going to be to change the agasc module default to proseco_agasc_1p8.h5 and then coordinate with the FOT about if AGASC_HDF5_FILE would need to be set or not in the matlab tools (set to proseco_agasc_1p7.h5 explicitly until AGASC 1.8 promotion and then unset or update to proseco_agasc_1p8.h5 at AGASC 1.8 promotion). The new agasc code in this PR to support the "latest" version of whatever type of agasc file is specified is more elegant, but now I'm not sure about the use cases and I'm not 100% sure what makes sense from an implementation perspective on either the SOT or FOT sides. For example, the FOT are currently using a specified default of proseco_agasc_1p7.h5 for the agasc that is also visible and configurable in their setting gui --

Screen Shot 2023-09-22 at 10 56 12 AM

This suggests to me on the FOT side that the easiest thing would still be to move to proseco_agasc_1p8.h5 in an explicit fashion (by adding it to the options here in the gui and setting the AGASC_HDF5_FILE to that value by default) instead of an implicit fashion (by using whatever is the most recent file of type proseco_agasc in the directory). This doesn't preclude the machinery in this PR but I'm not sure if it will make sense to use "latest" from the FOT side. And that's OK either way, I just think we need to define the implementation steps for testing and the expected promotion steps and configuration upon promotion.

@taldcroft
Copy link
Member Author

I agree that changing the default to be the proseco agasc makes sense. This would be proseco_agasc* so that the default will always be the latest. I'm not 100% sure this won't impact some older code or legacy analysis, but all the flight critical functionality will be tested. Then proseco and sparkles can use that default which allows for easily setting the default (e.g. to proseco_agasc_1p7) for regression testing via monkeypatch.

Changing the default would then lead to the question of whether to stop making the miniagasc. Here are the columns that get dropped. I suspect there are no applications that really need those columns which could not use the full AGASC.

    excludes = ['PLX', 'PLX_ERR', 'PLX_CATID',
                'ACQQ1', 'ACQQ2', 'ACQQ3', 'ACQQ4', 'ACQQ5', 'ACQQ6',
                'XREF_ID1', 'XREF_ID2', 'XREF_ID3', 'XREF_ID4', 'XREF_ID5',
                'RSV4', 'RSV5', 'RSV6',
                'POS_CATID', 'PM_CATID',
                'MAG', 'MAG_ERR', 'MAG_BAND', 'MAG_CATID',
                'COLOR1_ERR', 'C1_CATID',  # Keep color1, 2, 3
                'COLOR2_ERR', 'C2_CATID',
                'RSV2',
                'VAR_CATID']

On the MATLAB side, thanks @jeanconn for pointing out this UI which I didn't know about. So one question I have is how the list of available file names is generated and where the default directory is defined. If MATLAB ends up calling Python to get the actual file path (via get_agasc_file) or getting stars via get_agasc_cone then AGASC_DIR would need to be defined and those file names would need to get passed without the .h5 suffix.

And it is a good point that if MATLAB wants to maintain a GUI menu with all available AGASC files then that will require a code change for each release and therefore there is no point in adding a "latest release" option. That said, the current available items only include the latest release, not 1.6, so that implies that MATLAB users don't really need previous releases. In that case changing the menu to include only the different flavors proseco, mini (maybe) and full at the latest version would be the last time that code got touched. So this is for @jskrist .

@jeanconn
Copy link
Contributor

Officially for @jskrist , but tiny answer back that I think the list of files for that is probably the list from the agasc setup script in local which can be modified outside releases. And from that code maybe it will basically work with AGASC_HDF5_FILE set to proseco_agasc.

I don't know if they can call Python at that point to 'resolve' the name, but maybe that's not needed anyway.

@jskrist
Copy link
Member

jskrist commented Sep 23, 2023

@jeanconn is right that MATLAB May not have access to python at some points during startup. We don't currently use python to get the list of agasc files available. I'm not sure about all the constraints on this package and the required features, but from the MATLAB perspective, if we can set the path to be checked via environment variable and specify the selected file in that directory via a separate environment variable, I think we are good. The hiccup I ran into was regarding the special handling of files specifying the .h5 file type. If I just need to strip off the file extension to get everything working on the MATLAB side, then that's fine with me. I'll need to update some code on my side to put the extension back on when needed, but that's not a big deal.

@taldcroft
Copy link
Member Author

The hiccup I ran into was regarding the special handling of files specifying the .h5 file type.

@jskrist - the idea there was that the "special handling" of *.h5 input is really more "normal handling". If you call get_agasc_cone(ra, dec, radius, agasc_file="miniagasc_1p8.h5"), there is a strong expectation that it will look for miniagasc_1p8.h5 in the current directory, not in default_agasc_dir(). So the .h5 means just treat this like any ordinary relative or absolute file path, being consistent with standard file path behavior.

Conversely, an agasc_file without an extension could not be the actual file path, so in that case apply special behavior to prepend the default_agasc_dir() path in front and add the .h5 suffix.

But that behavior doesn't need to apply to the environment variables. So here is a possible new start for the get_agasc_filename function that I think does what you want as long as you don't provide an explicit agasc_file argument and define the AGASC_HDF5_FILE env var. Remember default_agasc_dir() will return AGASC_DIR if that is defined.

    if agasc_file is None:
        if "AGASC_HDF5_FILE" in os.environ:
            return str(default_agasc_dir() / os.environ["AGASC_HDF5_FILE"])
        else:
            agasc_file = "miniagasc*"

@taldcroft taldcroft requested a review from javierggt September 26, 2023 12:40
@taldcroft
Copy link
Member Author

@javierggt @jeanconn - this is now ready for code review. I updated the top description and made a few more changes. There are still some to-do's, but the main code is (in theory) all there in good shape.

@jeanconn
Copy link
Contributor

Guessing these tests in test_agasc_2 use the default agasc and expect it to have MAG_CATID as a field. No biggie.

FAILED agasc/tests/test_agasc_2.py::test_supplement_get_agasc_cone - KeyError: 'MAG_CATID'
FAILED agasc/tests/test_agasc_2.py::test_supplement_get_star - KeyError: 'MAG_CATID'
FAILED agasc/tests/test_agasc_2.py::test_supplement_get_stars - KeyError: 'MAG_CATID'

@taldcroft
Copy link
Member Author

Not sure, but I did note above that tests are not expected to be passing at the moment.

@jeanconn
Copy link
Contributor

That's fine - last comment wasn't specific about which to-dos remain to-dos and I generally start code review by running the tests.

@jeanconn
Copy link
Contributor

We talked in our meeting about handling agasc_file kwarg separately from AGASC_HDF5_FILE, but from the doc update I somewhat thought that defining AGASC_HDF5_FILE with the '*' syntax would resolve with get_agasc_file. For the FOT side that seems not strictly required -- either use the default or specify the file, but maybe agasc_file and AGASC_HDF5_FILE should just do the same thing?

In [2]: agasc.get_agasc_filename()
Out[2]: '/proj/sot/ska3/flight/data/agasc/proseco_agasc_1p7.h5'

In [3]: agasc.get_agasc_filename(agasc_file='proseco_agasc*')
Out[3]: '/proj/sot/ska3/flight/data/agasc/proseco_agasc_1p7.h5'

In [4]: import os

In [5]: os.environ['AGASC_HDF5_FILE'] = 'proseco_agasc*'

In [6]: agasc.get_agasc_filename()
Out[6]: '/proj/sot/ska3/flight/data/agasc/proseco_agasc*'

In [7]: os.environ['AGASC_HDF5_FILE'] = 'miniagasc*'

In [8]: agasc.get_agasc_filename()
Out[8]: '/proj/sot/ska3/flight/data/agasc/miniagasc*'

@jeanconn
Copy link
Contributor

I take back my comment -- the docs are clear that if the user supplies AGASC_HDF5_FILE the agasc will be read from agasc dir / AGASC_HDF5_FILE and there's no advertising support for the "*" syntax with the AGASC_HDF5_FILE variable. And we probably don't have a testing use case. So let's stick with AGASC_HDF5_FILE is a string that ends in '.h5' that references that one file.

@taldcroft
Copy link
Member Author

ca4613b reduces memory from the supplement processing by around 20-30 Mb.

:param pm_filter: Use PM-corrected positions in filtering
:param fix_color1: set COLOR1=COLOR2 * 0.85 for stars with V-I color
:param use_supplement: Use estimated mag from AGASC supplement where available
(default=value of AGASC_SUPPLEMENT_ENABLED env var, or True if not defined)
:param cache: Cache the AGASC data in memory (default=False)
Copy link
Contributor

@jeanconn jeanconn Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, these agasc changes are very well tested and documented, but I'm not sure if there is a cache kwarg test or use case documented. Given the impact that seems fine -- I think the plan is this would only be used by the advanced user to increase speed at the expense of memory. Though I'm not sure about the magnitude of benefit or cost.

Copy link
Member Author

@taldcroft taldcroft Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance gains are now documented in the description with a new profiling script in the dev directory. One use case is to replace get_agasc_cone_fast in kady (of course that requires the matlab_pm_bug so who knows). In retrospect it probably wasn't worth the effort but it is done now, let's be positive! 😄

@taldcroft taldcroft merged commit c84a427 into master Oct 3, 2023
1 check passed
@taldcroft taldcroft deleted the healpix branch October 3, 2023 14:24
@javierggt javierggt mentioned this pull request Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider using healpix (astropy-healpix) to index the AGASC HDF5 files
3 participants