-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAS-1376 - Update HarmonyAdapter to perform many-to-one operations. #18
Conversation
@@ -1,5 +1,5 @@ | |||
autopep8 ~= 1.5 | |||
coverage ~= 4.5 | |||
coverage ~= 6.3.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting thing here - the older version of coverage
was causing test failures for the multiprocessing
module. I think this is the related GitHub issue.
@@ -0,0 +1,126 @@ | |||
""" A module containing utility functions to extract information from | |||
`pystac.Item` instances for Harmony, or to produce an output | |||
`pystac.Catalog`. These functions adapt work from the PO.DAAC concise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So these aren't identical - but they really do lean heavily on the great work done by @frankinspace et al. (Same for the download_utilities.py
module. There might be some stuff that could be migrated into harmony-service-lib-py
(like multi-threaded downloads for many-to-one services). But (as David Auty likes to say) those are "future distractions".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% on multi-threaded downloads. That absolutely should be a shared library - eventually. Or perhaps even better - moved into the HarmonyPy package?
None) | ||
|
||
|
||
def get_output_catalog(input_catalog: Catalog, zarr_root: str) -> Catalog: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also now wondering if we need to have similar code in HOSS - particularly for requests that specify a bounding box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in get_output_bounding_box (below) - So much common code, Harmony Adapter, or HarmonyPy ? Agree we may need to consider a HOSS ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the harmony-service-lib-py
package is probably the place to pop some of this stuff. That's the core Python package that offers common functionality to the backend services, including the HarmonyAdapter
. It's very tempting to just suggest copying this module (almost line-for-line) into that package.
be caused by requesting a file with an unknown protocol. | ||
|
||
""" | ||
with self.assertRaises(RuntimeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test has been a little flaky on my local machine (the harmony.util.download
function is successful for a URL that isn't an obvious local file path or URL). I'm not sure I can see why. I'm hoping it behaves on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem was that, in the first failure, the local file /tmp/<uuid>.nc4
was being created, even in failure. The harmony.util.download
functions sees this file in subsequent test runs.
I've updated this test to put test artefacts in a temporary directory, and then remove that directory in the TestCase.tearDown
method.
I looked at the test failures. It looks like the end-to-end tests. Probably due to swapping out the Update: I also added validation to ensure a STAC catalog was included in the input. The end-to-end tests didn't have one in the arguments. I've tried updating the end-to-end tests with a |
4898523
to
a3385cf
Compare
7950ce9
to
44d62d5
Compare
|
||
@patch('harmony_netcdf_to_zarr.adapter.download_granules') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little bit disappointed to have to mock the download of the input granule as that makes this (and the large file conversion) less of an end-to-end test. However, the mock used to make the multiprocessing
module work with moto
was then causing issues with the multiprocessing.Queue
used to download the input granules.
I spent quite a while looking into this - there was some odd behaviour, indeed! Mocking the download in these two tests was the simplest way I could find to make the tests work in the short term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check
# Mimicking PO.DAAC Concise: for many-to-one the file name is | ||
# "<collection>_merged.zarr". | ||
collection = self._get_item_source(items[0]).collection | ||
output_name = f'{collection}_merged.zarr' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume {collection} is concept_id, not just short_name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to dig around for this in the unit tests for harmony-service-lib-py
to confirm. Looking at the unit tests there, it looks like you are right; the URL will likely contain the collection concept ID, not the short name. (Here's the class method, where it's essentially looking for the parent item in the STAC catalogue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short-Name is preferable for end-user. Is this possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely agree that a short name is more end-user friendly! I'm not sure that the short name is available to us - this is making use of a method on the BaseHarmonyAdapter
class, so outside of the scope of the service itself.
I guess the other thing worth noting is that this is what the PO.DAAC Concise service is doing, so we're being consistent with existing services. I definitely think this is a good question, and think the short name would be preferable if available. This might be a conversation to float up to a higher level, to see if the collection short name can be made available to the services somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addendum (sorry, still mulling this over): I think, mainly, this all stems from what Harmony chooses to put in the input STAC catalogue sent to the backend service.
(An alternative might be to delve into the metadata of one of the granules to find a short name in the global attributes, but then we'd be inconsistent with Concise)
access_token: str, harmony_config: Config, | ||
logger: Logger, process_count: int = None) -> List[str]: | ||
""" A method which scales concurrent downloads to the number of available | ||
CPU cores. For further explaination, see documentation on "multi-track |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explanation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah. That typo is just embarrassing. I'll fix that up! Thanks, Ken.
Update: I've fixed this typo.
# Temporarily retain old version of NetCDF-to-Zarr service | ||
netcdf_to_zarr(local_file_paths[0], zarr_store) | ||
else: | ||
raise NotImplementedError('Concatenation is not yet supported.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused - isn't this where we initiate our aggregation processing? Above code addresses many-to-one options, why raise an exception here? Oh - I got it, from some conversation - this is still a mergeable solution but not the final solution (yet to come). Classic case for a ToDo comment, but not strictly necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah exactly. My aim here was to ensure that we could deploy this and continue service for a single granule input, but we've not yet actually done the bit to write many-to-one input (that's DAS-1379).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a short-lived TODO comment referring to DAS-1379.
44d62d5
to
1e62fa1
Compare
@@ -0,0 +1,126 @@ | |||
""" A module containing utility functions to extract information from | |||
`pystac.Item` instances for Harmony, or to produce an output | |||
`pystac.Catalog`. These functions adapt work from the PO.DAAC concise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% on multi-threaded downloads. That absolutely should be a shared library - eventually. Or perhaps even better - moved into the HarmonyPy package?
None) | ||
|
||
|
||
def get_output_catalog(input_catalog: Catalog, zarr_root: str) -> Catalog: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in get_output_bounding_box (below) - So much common code, Harmony Adapter, or HarmonyPy ? Agree we may need to consider a HOSS ticket.
objects are gone when the child processes quit. This method mocks | ||
`multiprocessing.popen_fork.Popen._launch` to do the work in the | ||
parent process. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
|
||
@patch('harmony_netcdf_to_zarr.adapter.download_granules') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check
This PR updates the main
NetCDFToZarrAdapter
class with a view to many-to-one operations. This includes adding utilities for downloading multiple granules in parallel threads, and for combining the input STAC items into a single output stack item that describes the aggregated output.A lot of this work is heavily inspired by some great, similar functionality in the PO.DAAC Concise service.
For now, I've tried to ensure a single granule request could still operate, but a multiple granule request will raise an exception, because we've not yet implemented that functionality.