[s1-grd]: Fixed relative paths in Blob Storage #233

Merged
TomAugspurger merged 1 commit into main from tom/fix/s1-grd-paths
Jul 10, 2023

TomAugspurger
Contributor

The current pipeline is creating items with invalid HREFs. When we update the HREFs of the item created by stactools.sentinel1, we incorrectly remove the nested structure within the .SAFE directory.

To fix this, we'll use relative_to rather than basename.
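To illustrate the difference, here is a minimal sketch rather than the actual pipeline code. The local path, the blob prefix, and the variable names are all hypothetical; the only point is that taking the basename flattens the asset path, while `relative_to` keeps the directories below the .SAFE root.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical local layout of an unpacked .SAFE directory (illustrative only).
safe_dir = PurePosixPath("/tmp/S1A_EXAMPLE.SAFE")
local_asset = safe_dir / "measurement" / "iw-hh.tiff"

# Hypothetical Blob Storage prefix that the item's assets should live under.
blob_prefix = "https://example.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_EXAMPLE"

# basename keeps only the file name, so the "measurement/" directory is lost.
flattened = f"{blob_prefix}/{posixpath.basename(str(local_asset))}"

# relative_to keeps everything below the .SAFE root, so the nesting survives.
nested = f"{blob_prefix}/{local_asset.relative_to(safe_dir).as_posix()}"

print(flattened)
print(nested)
```

The first printed HREF ends in `S1A_EXAMPLE/iw-hh.tiff`, the second in `S1A_EXAMPLE/measurement/iw-hh.tiff`.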

@TomAugspurger
Contributor Author

I'll need to look into what's up with CI. The last one on main passed fine.

@TomAugspurger merged commit 396e70c into main Jul 10, 2023
@TomAugspurger deleted the tom/fix/s1-grd-paths branch July 10, 2023 20:01
@TomAugspurger
Contributor Author

I've also scheduled a job to fix the existing items in the database. First, we need to list the items with the wrong asset links:

import pystac_client
import azure.storage.blob
import azure.identity
import tlz


def main():
    catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = catalog.search(
        collections=["sentinel-1-grd"],
        datetime="2023-07-01/2023-07-11"
    )

    bad = []

    for item in search.items():
        asset = item.assets.get("vv") or item.assets.get("vh") or item.assets.get("hh") or item.assets.get("hv")
        assert asset is not None

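        # A correct measurement HREF has 12 slashes; exactly 11 means the
        # directory nested under the .SAFE root was dropped.
        if asset.href.count("/") == 11:
            path = asset.href.split("/", 4)[-1].rsplit("/", 1)[0]
            bad.append(f"blob://sentinel1euwest/s1-grd/{path}/manifest.safe")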

    out = "\n".join(bad)
    with open("bad.txt", "w") as f:
        f.write(out)

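    # Split into roughly ten chunks; max() guards against a chunk size of 0
    # when fewer than ten URIs were found.
    N = max(1, len(bad) // 10)
    gen = tlz.partition_all(N, bad)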
    for i, partition in enumerate(gen):
        cc = azure.storage.blob.BlobClient.from_blob_url(
            f"https://sentinel1euwest.blob.core.windows.net/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/{i}/uris-list.csv",
            credential=azure.identity.DefaultAzureCredential()
        )
        cc.upload_blob("\n".join(partition).encode(), overwrite=True)
        print("Wrote", i)

if __name__ == "__main__":
    main()
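The detection step can be checked in isolation, without touching the STAC API or Blob Storage. The sketch below pulls the slash-count test and the path rewrite into a helper; the function name `manifest_uri` and its `container` parameter are hypothetical, and the sample HREFs are taken from the example item shown later in this thread.

```python
from typing import Optional


def manifest_uri(
    href: str, container: str = "blob://sentinel1euwest/s1-grd"
) -> Optional[str]:
    """Return the manifest.safe URI for a flattened HREF, or None if it looks fine."""
    # Flattened HREFs have exactly 11 slashes; correct ones have at least 12.
    if href.count("/") != 11:
        return None
    # Drop scheme, host, and container, then drop the file name.
    path = href.split("/", 4)[-1].rsplit("/", 1)[0]
    return f"{container}/{path}/manifest.safe"


scene = (
    "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/"
    "S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2"
)
bad_href = f"{scene}/iw-hh.tiff"
good_href = f"{scene}/measurement/iw-hh.tiff"

print(manifest_uri(bad_href))
print(manifest_uri(good_href))
```

Only the flattened HREF yields a URI; the correctly nested one returns `None` and is skipped.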

Then we need to create a job to reingest those assets:

name: Process items for sentinel-1-grd
tokens: {}
args:
- registry
jobs:
  process-chunk:
    foreach:
      items:
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/0/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/0/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/0/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/1/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/1/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/1/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/2/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/2/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/2/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/3/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/3/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/3/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/4/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/4/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/4/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/5/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/5/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/5/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/6/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/6/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/6/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/7/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/7/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/7/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/8/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/8/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/8/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/9/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/9/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/9/items.ndjson"
        - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/10/uris-list.csv"
          chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/10/uris-list.csv"
          ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/10/items.ndjson"

    id: process-chunk
    tasks:
    - id: create-items
      image: ${{ args.registry }}/pctasks-sentinel-1-grd:20230629.1
      code:
        src: datasets/sentinel-1-grd/s1grd.py
      task: s1grd:S1GRDCollection.create_items_task
      
      args:
        asset_chunk_info:
          uri: "${{item.uri}}"
          chunk_id: "${{item.chunk_id}}"
        item_chunkset_uri: blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items
        collection_id: sentinel-1-grd
        options:
          skip_validation: false
      environment:
        AZURE_TENANT_ID: ${{ secrets.task-tenant-id }}
        AZURE_CLIENT_ID: ${{ secrets.task-client-id }}
        AZURE_CLIENT_SECRET: ${{ secrets.task-client-secret }}
        APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ secrets.task-application-insights-connection-string
          }}
      schema_version: 1.0.0
    - id: ingest-items
      image_key: ingest
      task: pctasks.ingest_task.task:ingest_task
      args:
        content:
          type: Ndjson
          uris:
          - "${{item.ingest_uri}}"
        options:
          insert_group_size: 5000
          insert_only: false
      environment:
        AZURE_TENANT_ID: ${{ secrets.task-tenant-id }}
        AZURE_CLIENT_ID: ${{ secrets.task-client-id }}
        AZURE_CLIENT_SECRET: ${{ secrets.task-client-secret }}
        APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ secrets.task-application-insights-connection-string
          }}
      schema_version: 1.0.0
  
schema_version: 1.0.0
id: sentinel-1-grd-asset-href-fix
dataset: sentinel-1-grd-asset-href-fix

That's running now.
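The `foreach` list in that workflow is repetitive, so it could also be generated. This is a sketch under assumptions, not how the workflow above was produced: the helper names are hypothetical, the `max(1, ...)` guard is an addition, and the input size of 105 URIs is an arbitrary illustration. It also shows why the list has eleven entries (0 through 10): a chunk size of `len(bad) // 10` leaves a remainder chunk whenever the count is not a multiple of ten.

```python
import math

BASE = "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix"
PREFIX = "sentinel1euwest/s1-grd/GRD/2023/7/10"


def chunk_count(n_uris: int) -> int:
    """Number of chunks produced when the chunk size is n_uris // 10."""
    size = max(1, n_uris // 10)  # guard against a zero chunk size (an addition here)
    return math.ceil(n_uris / size)


def foreach_items(n_chunks: int) -> list:
    """Build one uri / chunk_id / ingest_uri entry per uploaded chunk."""
    return [
        {
            "uri": f"{BASE}/assets/all/{PREFIX}/{i}/uris-list.csv",
            "chunk_id": f"{PREFIX}/{i}/uris-list.csv",
            "ingest_uri": f"{BASE}/items/all/{PREFIX}/{i}/items.ndjson",
        }
        for i in range(n_chunks)
    ]


# 105 is an arbitrary illustrative count, not the real number of broken items.
items = foreach_items(chunk_count(105))
print(len(items))
print(items[-1]["chunk_id"])
```

The resulting list of dictionaries could then be dumped to YAML with whichever serializer the project already uses.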

@TomAugspurger
Contributor Author

S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F is an example item that was broken.

Here's a diff of the before and after.

44c44
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/iw-hh.tiff",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/measurement/iw-hh.tiff",
53c53
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/iw-hv.tiff",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/measurement/iw-hv.tiff",
62c62
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/quick-look.png",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/preview/quick-look.png",
80c80
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/noise-iw-hh.xml",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/calibration/noise-iw-hh.xml",
89c89
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/noise-iw-hv.xml",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/calibration/noise-iw-hv.xml",
98c98
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/rfi-iw-hh.xml",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/rfi/rfi-iw-hh.xml",
107c107
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/rfi-iw-hv.xml",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/rfi/rfi-iw-hv.xml",
116c116
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/calibration-iw-hh.xml",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/calibration/calibration-iw-hh.xml",
125c125
<       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/calibration-iw-hv.xml",
---
>       "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/calibration/calibration-iw-hv.xml",
