[ENH]: Allow for matching on the full path #299

Merged
merged 1 commit into from
Jun 11, 2024

TomAugspurger
Contributor

This adds a parameter to Storage.walk, allowing users to specify whether matches applies just to the filename or to the full path, including the filename.
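
To illustrate the idea, here is a minimal sketch of a walk with such a flag. It is not the pctasks implementation: `walk_matching` is a hypothetical helper over the local filesystem, and it assumes the pattern is applied with `re.search` semantics, which may differ from what `Storage.walk` actually does.

```python
import os
import re
from typing import Iterator, Optional


def walk_matching(
    root: str,
    matches: Optional[str] = None,
    match_full_path: bool = False,
) -> Iterator[str]:
    """Yield "/"-separated file paths, relative to root, that satisfy `matches`.

    Hypothetical stand-in for a storage walk; the name and the use of
    re.search are assumptions made for illustration.
    """
    pattern = re.compile(matches) if matches else None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            rel_path = os.path.relpath(os.path.join(dirpath, name), root)
            rel_path = rel_path.replace(os.sep, "/")
            # The flag only changes which string the pattern is tested against.
            target = rel_path if match_full_path else name
            if pattern is None or pattern.search(target):
                yield rel_path
```

With the flag left at its default, a pattern that contains a `/` can never match, because a bare filename has no path separators.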

The same keyword is added to the task and task input, so that it can be set in the dataset.yaml.

Recent changes to the data published in the ecmwf container require a new feature in pctasks' create-chunks: the ability to match on a full path rather than just a filename.

Data are being published under both of these prefixes:

- <date-prefix>/ifs/...
- <date-prefix>/aifs/...

We list both sets of data, but our stactools package can't handle the aifs data properly yet. So we want to filter it out.

Previously, matches could only target the filename being listed. We need that to filter to specific products. We want to extend that match expression to include the prefix, so that we can filter out the aifs data.

How Has This Been Tested?

In addition to the unit tests, some manual tests to mimic what will run on the server:

# tst.py

from pctasks.dataset.chunks.task import *

input_ = CreateChunksInput.parse_obj(
    {
        "src_uri": "blob://ai4edataeuwest/ecmwf/20240517/00z/",
        "dst_uri": "assets-new",
        "options": {
            "chunk_length": 30000,
            "since": "2024-01-01T10:00:00+00:00",
            "extensions": [".grib2"],
            "matches": "/ifs/(0p25|0p4-beta)/(enfo|oper|waef|wave)(?!-opendata)",
            "match_full_path": True,
            "list_folders": False,
            "chunk_file_name": "uris-list",
            "chunk_extension": ".csv",
        },
    }
)

CreateChunksTask().run(input_, TaskContext(StorageFactory(), "test"))


input_ = CreateChunksInput.parse_obj(
    {
        "src_uri": "blob://ai4edataeuwest/ecmwf/20240517/00z/",
        "dst_uri": "assets-old",
        "options": {
            "chunk_length": 30000,
            "since": "2024-01-01T10:00:00+00:00",
            "extensions": [".grib2"],
            "matches": "/ifs/(0p25|0p4-beta)/(enfo|oper|waef|wave)(?!-opendata)",
            "list_folders": False,
            "chunk_file_name": "uris-list",
            "chunk_extension": ".csv",
        },
    }
)
CreateChunksTask().run(input_, TaskContext(StorageFactory(), "test"))

The first run (match_full_path=True) outputs files, and we can confirm that there aren't any aifs files:

cat assets-new/all/ai4edataeuwest/ecmwf/20240517/00z/0/uris-list.csv | grep aifs | wc -l

The second run (with match_full_path unset, which is equivalent to False, the old behavior) doesn't match anything.
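
The difference between the two runs can be reproduced with the regular expression alone. The sketch below uses the `matches` pattern from the test script against two made-up blob paths (the filenames are hypothetical, not real entries in the container), and again assumes `re.search` semantics.

```python
import posixpath
import re

# The `matches` expression used in tst.py.
PATTERN = re.compile(r"/ifs/(0p25|0p4-beta)/(enfo|oper|waef|wave)(?!-opendata)")

# Hypothetical blob paths, invented for illustration only.
paths = [
    "20240517/00z/ifs/0p25/oper/example-oper-fc.grib2",
    "20240517/00z/aifs/0p25/oper/example-oper-fc.grib2",
]


def selected(path: str, match_full_path: bool) -> bool:
    """Test the pattern against the full path or only the final component."""
    target = path if match_full_path else posixpath.basename(path)
    return PATTERN.search(target) is not None


full_path_hits = [p for p in paths if selected(p, match_full_path=True)]
filename_hits = [p for p in paths if selected(p, match_full_path=False)]

print(full_path_hits)  # only the ifs path
print(filename_hits)   # empty: a bare filename contains no "/ifs/"
```

The aifs path is rejected because the pattern requires a `/` immediately before `ifs`, and the filename-only run selects nothing because the pattern depends on directory components.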

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review
  • Changelog has been updated
  • Documentation has been updated
  • Unit tests pass locally (./scripts/test)
  • Code is linted and styled (./scripts/format)

@TomAugspurger TomAugspurger force-pushed the user/tom/fix/matches-full-path branch from afcf0f8 to d8ce999 Compare June 3, 2024 15:48
@TomAugspurger TomAugspurger force-pushed the user/tom/fix/matches-full-path branch from d8ce999 to 4068eda Compare June 3, 2024 16:28
@TomAugspurger TomAugspurger merged commit 544c735 into main Jun 11, 2024
2 checks passed
@TomAugspurger TomAugspurger deleted the user/tom/fix/matches-full-path branch June 11, 2024 15:16