Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Environment / state issues when looking up paths from two different releases #34

Open
andycasey opened this issue Oct 23, 2022 · 7 comments
Assignees

Comments

@andycasey
Copy link
Contributor

andycasey commented Oct 23, 2022

I'm routinely looking up paths for SDSS-5 data and SDSS-4 data.

I use a dictionary to lookup instances of SDSSPath(...) so that I don't have to create a SDSSPath object every time I evaluate a path. It looks something like this:

_sdss_path_instances = {}

def path(release, filetype, **kwargs):
  try:
    p = _sdss_path_instances[release]
  except:
    p = _sdss_path_instances[release] = SDSSPath(release)
  finally:
    return p.full(filetype, **kwargs)

But I am encountering very weird behaviour, which I can only conclude must be because of either a strange state issue, or because sdss_access is changing environment variables behind the scenes.

Here is a minimum reproducible example:

import os
from sdss_access import SDSSPath

# create a DR17 instance
p = SDSSPath("dr17")

kwds = {'obj': '2M07591936+1734091', 'apred': 'dr17', 'field': '204+22', 'apstar': 'stars', 'prefix': 'ap', 'reduction': '2M07591936+1734091', 'telescope': 'apo25m'}

# This part works fine.
original_path = p.full("apStar", **kwds)
print(os.path.exists(original_path), original_path)

# Now let's evaluate a SDSS 5 path, but **we are using a different instance**
fake_path = SDSSPath("sdss5").full("mwmVisit", v_astra="0.2.6", cat_id=1, run2d="v6_0_9", apred="1.0")

# Now let's use the original instance to evaluate a DR17 path.
check_path = p.full("apStar", **kwds)
print(os.path.exists(check_path), check_path)

Here is the output:

True /uufs/chpc.utah.edu/common/home/sdss50/dr17/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits
False /uufs/chpc.utah.edu/common/home/sdss50/sdsswork/mwm/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits

Is this the intended behaviour?

I assumed each instance would evaluate paths for that particular release, and that evaluating one path on a different instance would not impact existing instances. Note that this happens even if the second SDSSPath(...) is created in a totally unrelated scope, which makes me think there is some state sharing within sdss_access, or environment variables being changed.

@joelbrownstein
Copy link
Contributor

joelbrownstein commented Oct 23, 2022

I can reproduce this issue, although I don't have an explanation:

from sdss_access import SDSSPath

path_dr17 = SDSSPath(release = "dr17")
kwd_dr17 = kwds = {'obj': '2M07591936+1734091', 'apred': 'dr17', 'field': '204+22', 'apstar': 'stars', 'prefix': 'ap', 'reduction': '2M07591936+1734091', 'telescope': 'apo25m'}
path_dr17_full0 = path_dr17.full("apStar", **kwd_dr17)
print(path_dr17_full0)

path_astra = SDSSPath(release = "sdss5")
kwd_astra = {'v_astra':"0.2.6", 'cat_id':1, 'run2d':"v6_0_9", 'apred':"1.0"}
path_astra_full = path_astra.full("mwmStar", **kwd_astra)
print(path_astra_full)

path_dr17_full1 = path_dr17.full("apStar", **kwd_dr17)
print(path_dr17_full1)

print("dr17 paths agree: %r" % (path_dr17_full0 == path_dr17_full1))

Output:

/uufs/chpc.utah.edu/common/home/sdss50/dr17/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits
/uufs/chpc.utah.edu/common/home/sdss50/sdsswork/mwm/spectro/astra/0.2.6/v6_0_9-1.0/spectra/star/00/01/mwmStar-0.2.6-1.fits
/uufs/chpc.utah.edu/common/home/sdss50/sdsswork/mwm/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits
dr17 paths agree: False

@joelbrownstein
Copy link
Contributor

@havok2063 can you have a look at this? It looks like one instance is leaking into the other via the value of the tree version, so the first method call to path_dr17.full() correctly uses dr17 for the tree version, whereas the same instance (i.e. path_dr17) flips to using sdsswork in the second call to path_dr17.full().

@havok2063
Copy link
Collaborator

Hmm, I think the tree instance is in the global namespace of the package but it should be re-generated on each instance of Path. So I'll take a look as soon as I can.

@havok2063
Copy link
Collaborator

I can also reproduce the error. I also found a similar error with the Access class. I think it is an issue with the tree instance in the global namespace. I think moving the tree instantiation to inside the Path object will solve it. I'll work on it soon.

from sdss_access import SDSSPath
s=SDSSPath(release='dr17')
s.full('mangacube', drpver='v2_4_3', plate=8485, ifu=1901, wave='LOG')
'/Users/Brian/Work/sdss/sas/dr17/manga/spectro/redux/v2_4_3/8485/stack/manga-8485-1901-LOGCUBE.fits.gz'

s2=SDSSPath(release='sdsswork')
s2.full('mangacube', drpver='v2_4_3', plate=8485, ifu=1901, wave='LOG')
'/Users/Brian/Work/sdss/sas/mangawork/manga/spectro/redux/v2_4_3/8485/stack/manga-8485-1901-LOGCUBE.fits.gz'

s.full('mangacube', drpver='v2_4_3', plate=8485, ifu=1901, wave='LOG')
'/Users/Brian/Work/sdss/sas/mangawork/manga/spectro/redux/v2_4_3/8485/stack/manga-8485-1901-LOGCUBE.fits.gz'

@havok2063
Copy link
Collaborator

havok2063 commented Oct 25, 2022

So this is not as trivial as it seems. The way Tree works is to load the environment variable definitions into the user's Python os.environ, designed to mimic the way it works with the tree modules. This way the environment variables are set once in the global namespace, and can be used throughout other scripts. If you want to change your environment between releases, you need to use the replant_tree method on SDSSPath.

kwds = {'obj': '2M07591936+1734091', 'apred': 'dr17', 'field': '204+22', 'apstar': 'stars', 'prefix': 'ap', 'reduction': '2M07591936+1734091', 'telescope': 'apo25m'}

p = SDSSPath("dr17")
p.full("apStar", **kwds)
'/Users/Brian/Work/sdss/sas/dr17/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits'

p.replant_tree('sdss5')
p.full("apStar", **kwds)
'/Users/Brian/Work/sdss/sas/sdsswork/mwm/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits'

p.replant_tree('dr17')
p.full("apStar", **kwds)
'/Users/Brian/Work/sdss/sas/dr17/apogee/spectro/redux/dr17/stars/apo25m/204+22/apStar-dr17-2M07591936+1734091.fits'

I think to redesign this would require a fundamental change to how sdss-tree works in Python. I recommend we leave the functionality as it is currently.

@andycasey
Copy link
Contributor Author

andycasey commented Oct 26, 2022

Thanks for looking into this. I understand, and it would explain the behaviour.

I'm fine for leaving the functionality as-is for a while, but long term I think this is something that should be changed. If we had each SDSSPath have their own internal Tree, and if each Tree expanded variables using some local dictionary of variables when evaluating paths instead of setting environment variables, then it would mean we'd be able to treat Trees (and SDSSPaths) as independent. My intuition is that a single Tree or SDSSPath instance should give reference to everything within that instance, not alter system-wide environment variables to evaluate a path.

At the moment it fails silently by giving the user an incorrect path. In many cases it might be obvious that the path is incorrect because no file exists, but I can easily imagine situations where a user is loading data from one data release (e.g., IPL-1) and another data release (e.g., IPL-2). In that situation, both files would exist and the wrong file would be silently accessed by the path. For anything use case where data are being accessed from different releases (e.g., DR17 processing in Astra), the current behaviour means either replanting Tree on every path evaluation, or reverting to hand-writing paths to make sure they are correct.

I'd be happy to help make changes to sdss-tree to alter this behaviour because I think I will encounter many of these silent cases. (In fact, I had encountered many of them before but I could never pin down the weirdness for why paths were working sometimes, and missing other times).

What do you think? Do you think we should leave the behaviour as-is indefinitely, or is this something that should be changed (on some unspecified timescale)?

@havok2063
Copy link
Collaborator

I think what you suggest would actually create a tighter dependency between the tree and sdss-access packages, rather than the opposite. SDSSPath would be tied directly to a python instance of a specific Tree configuration, rather than using the os environment altogether. We have to be careful here to not discontinue the original uses of sdss_access which was designed to be used with modules managing the environment, in particular at Utah. Changing this behavior would cause a lot of issues in downstream packages. Using the environment variables also still allow people to specify custom locations in their local .bashrc.

There are use cases we need to retain:

  • when SDSSPath is used with modules, it uses the environment variables set by the module tree version that is set.
  • allow users to use any custom environment variables set for their paths
  • allow users without modules to switch trees and use sdss_access to generate paths from different releases

The current implementation of sdss_access already allows switching between data releases. You don't need to replant the tree on every path evaluation, only once or twice, to generate all the paths you need in one place.

So I would suggest we leave the behavior as it is currently. However, I think we could perform a check when a path is generated to check the path matches the tree/release set and raise an exception when a mismatch occurs. We could also implement a flag on instantiation of SDSSPath that, if set, would do what you want and do a local environment path replacement rather than an os.expandvars. I think we could add that but make it opt in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants