[s1-grd]: Fixed relative paths in Blob Storage #233
The current pipeline is creating items with invalid HREFs. When we update the HREFs of the item created by `stactools.sentinel1`, we incorrectly remove the nested structure within the `.SAFE` directory. To fix this, we'll use `relative_to` rather than `basename`.
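A minimal sketch of the difference, using `pathlib` on one of the affected assets. The local `.SAFE` root, the file name, and both helper functions are illustrative assumptions, not the pipeline's actual code:

```python
from pathlib import PurePosixPath

# Hypothetical local layout of a downloaded .SAFE directory.
safe_dir = PurePosixPath(
    "/tmp/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2.SAFE"
)
asset_path = safe_dir / "measurement" / "iw-hh.tiff"

# Blob Storage prefix taken from the HREFs shown in this PR.
blob_prefix = (
    "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/"
    "S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2"
)


def href_with_basename(prefix: str, path: PurePosixPath) -> str:
    # Keeps only the file name, so any parent directories are dropped.
    return f"{prefix}/{path.name}"


def href_with_relative_to(prefix: str, path: PurePosixPath, root: PurePosixPath) -> str:
    # Keeps the path relative to the .SAFE root, so the nesting survives.
    return f"{prefix}/{path.relative_to(root).as_posix()}"


flat = href_with_basename(blob_prefix, asset_path)
nested = href_with_relative_to(blob_prefix, asset_path, safe_dir)
print(flat)
print(nested)
```

The first HREF loses the `measurement/` directory, while the second keeps it.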
I'll need to look into what's up with CI. The last one on
I've also scheduled a job to fix the existing items in the database. First, we need to list the items with the wrong asset links:

import pystac_client
import azure.storage.blob
import azure.identity
import tlz


def main():
    catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = catalog.search(
        collections=["sentinel-1-grd"],
        datetime="2023-07-01/2023-07-11"
    )
    bad = []
    for item in search.items():
        # Any one polarization asset is enough to check the item's layout.
        asset = item.assets.get("vv") or item.assets.get("vh") or item.assets.get("hh") or item.assets.get("hv")
        assert asset is not None
        if asset.href.count("/") == 11:
            # too few! The nested directory inside the .SAFE folder is missing.
            path = asset.href.split("/", 4)[-1].rsplit("/", 1)[0]
            bad.append(f"blob://sentinel1euwest/s1-grd/{path}/manifest.safe")

    # Keep a local copy of every affected manifest URI.
    out = "\n".join(bad)
    with open("bad.txt", "w") as f:
        f.write(out)

    # Split into chunks of len(bad) // 10 URIs (assumes at least 10 bad items).
    N = len(bad) // 10
    gen = tlz.partition_all(N, bad)
    for i, partition in enumerate(gen):
        cc = azure.storage.blob.BlobClient.from_blob_url(
            f"https://sentinel1euwest.blob.core.windows.net/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/{i}/uris-list.csv",
            credential=azure.identity.DefaultAzureCredential()
        )
        cc.upload_blob("\n".join(partition).encode(), overwrite=True)
        print("Wrote", i)


if __name__ == "__main__":
    main()

Then we need to create a job to reingest those assets:

name: Process items for sentinel-1-grd
tokens: {}
args:
- registry
jobs:
  process-chunk:
    foreach:
      items:
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/0/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/0/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/0/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/1/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/1/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/1/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/2/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/2/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/2/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/3/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/3/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/3/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/4/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/4/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/4/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/5/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/5/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/5/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/6/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/6/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/6/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/7/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/7/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/7/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/8/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/8/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/8/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/9/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/9/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/9/items.ndjson"
      - uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/assets/all/sentinel1euwest/s1-grd/GRD/2023/7/10/10/uris-list.csv"
        chunk_id: "sentinel1euwest/s1-grd/GRD/2023/7/10/10/uris-list.csv"
        ingest_uri: "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items/all/sentinel1euwest/s1-grd/GRD/2023/7/10/10/items.ndjson"
    id: process-chunk
    tasks:
    - id: create-items
      image: ${{ args.registry }}/pctasks-sentinel-1-grd:20230629.1
      code:
        src: datasets/sentinel-1-grd/s1grd.py
      task: s1grd:S1GRDCollection.create_items_task
      args:
        asset_chunk_info:
          uri: "${{item.uri}}"
          chunk_id: "${{item.chunk_id}}"
        item_chunkset_uri: blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix/items
        collection_id: sentinel-1-grd
        options:
          skip_validation: false
      environment:
        AZURE_TENANT_ID: ${{ secrets.task-tenant-id }}
        AZURE_CLIENT_ID: ${{ secrets.task-client-id }}
        AZURE_CLIENT_SECRET: ${{ secrets.task-client-secret }}
        APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ secrets.task-application-insights-connection-string }}
      schema_version: 1.0.0
    - id: ingest-items
      image_key: ingest
      task: pctasks.ingest_task.task:ingest_task
      args:
        content:
          type: Ndjson
          uris:
          - "${{item.ingest_uri}}"
        options:
          insert_group_size: 5000
          insert_only: false
      environment:
        AZURE_TENANT_ID: ${{ secrets.task-tenant-id }}
        AZURE_CLIENT_ID: ${{ secrets.task-client-id }}
        AZURE_CLIENT_SECRET: ${{ secrets.task-client-secret }}
        APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ secrets.task-application-insights-connection-string }}
      schema_version: 1.0.0
schema_version: 1.0.0
id: sentinel-1-grd-asset-href-fix
dataset: sentinel-1-grd-asset-href-fix
That's running now.
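The `foreach` entries in the workflow follow a fixed pattern, so they could be generated rather than written by hand. The sketch below is one way to do that; the helper names and the item count of 1005 are hypothetical, while the URI layout is copied from the workflow. It also shows why there are 11 entries: a chunk size of `len(bad) // 10` leaves a remainder chunk whenever the count is not a multiple of 10.

```python
import math

# URI layout copied from the workflow definition in this PR.
BASE = "blob://sentinel1euwest/s1-grd-etl-data/pctasks-chunks/sentinel-1-asset-href-fix"
CHUNK_PREFIX = "sentinel1euwest/s1-grd/GRD/2023/7/10"


def n_chunks(n_items: int, chunk_size: int) -> int:
    # Number of partitions produced when splitting n_items into
    # groups of at most chunk_size.
    return math.ceil(n_items / chunk_size)


def foreach_items(count: int) -> list[dict[str, str]]:
    # Build one foreach entry per uploaded uris-list.csv chunk.
    items = []
    for i in range(count):
        chunk_id = f"{CHUNK_PREFIX}/{i}/uris-list.csv"
        items.append(
            {
                "uri": f"{BASE}/assets/all/{chunk_id}",
                "chunk_id": chunk_id,
                "ingest_uri": f"{BASE}/items/all/{CHUNK_PREFIX}/{i}/items.ndjson",
            }
        )
    return items


n_bad = 1005  # hypothetical count, not taken from the PR
chunk_size = n_bad // 10
items = foreach_items(n_chunks(n_bad, chunk_size))
print(len(items))
```

Dumping `items` with a YAML library would give the list under `foreach.items`.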
Here's a diff of the before and after.
44c44
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/iw-hh.tiff",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/measurement/iw-hh.tiff",
53c53
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/iw-hv.tiff",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/measurement/iw-hv.tiff",
62c62
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/quick-look.png",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/preview/quick-look.png",
80c80
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/noise-iw-hh.xml",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/calibration/noise-iw-hh.xml",
89c89
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/noise-iw-hv.xml",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/calibration/noise-iw-hv.xml",
98c98
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/rfi-iw-hh.xml",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/rfi/rfi-iw-hh.xml",
107c107
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/rfi-iw-hv.xml",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/rfi/rfi-iw-hv.xml",
116c116
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/calibration-iw-hh.xml",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/calibration/calibration-iw-hh.xml",
125c125
< "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/calibration-iw-hv.xml",
---
> "href": "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2/annotation/calibration/calibration-iw-hv.xml",
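As a quick sanity check, the slash-count test from the listing script can be run against one before/after pair from the diff. This is only a sketch: the two function names are made up for illustration, and the logic mirrors the script's `count("/") == 11` check and path slicing.

```python
PREFIX = (
    "https://sentinel1euwest.blob.core.windows.net/s1-grd/GRD/2023/7/10/IW/DH/"
    "S1A_IW_GRDH_1SDH_20230710T105755_20230710T105818_049360_05EF7F_07E2"
)
before = f"{PREFIX}/iw-hh.tiff"
after = f"{PREFIX}/measurement/iw-hh.tiff"


def is_flattened(href: str) -> bool:
    # A flattened measurement HREF has exactly 11 slashes; the fixed
    # one has an extra path segment and therefore more.
    return href.count("/") == 11


def manifest_uri(href: str) -> str:
    # Drop the scheme, host, and container, then the file name, and point
    # at the manifest inside the same item directory.
    path = href.split("/", 4)[-1].rsplit("/", 1)[0]
    return f"blob://sentinel1euwest/s1-grd/{path}/manifest.safe"


print(is_flattened(before), is_flattened(after))
print(manifest_uri(before))
```

Only the `before` HREF is flagged, and its manifest URI is the kind of entry the script writes to `uris-list.csv`.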