Display the infrastructure layer #27

Closed · robyngit opened this issue Sep 9, 2022 · 22 comments
Labels

  • data available: The complete dataset is on the datateam server and ready to process
  • Infrastructure: data layer category: infrastructure
  • layer: Displaying a specific data product in the PDG portal
  • pdg: Permafrost Discovery Gateway
  • priority: high

Comments

@robyngit
Member

robyngit commented Sep 9, 2022

The data is associated with the following two papers:

The data are archived with Restricted Access on Zenodo

@robyngit added the pdg (Permafrost Discovery Gateway) and layer (Displaying a specific data product in the PDG portal) labels on Sep 9, 2022
@robyngit
Member Author

I did a test run of this layer. Here are some notes on processing it:

  • The readme included in the dataset gives all the relevant info
  • The single file that we want to process is the SACHI.shp file
  • This file takes ~20 sec to open with GeoPandas (see the sketch after this list). It needs to be tiled to display in Cesium.
  • The workflow that we developed for IWP works well to create web-tiles + 3d tiles. No parallelization is necessary. Since it's just one file, no deduplication is required either.
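As a rough illustration of that first check, here is a minimal sketch of opening the file and timing it with GeoPandas (the path below is a placeholder, not the actual location on the server):

# sketch: time how long the source shapefile takes to open with GeoPandas
import time
import geopandas as gpd

start = time.time()
gdf = gpd.read_file("/path/to/SACHI.shp")  # placeholder path
print(f"Loaded {len(gdf)} features in {time.time() - start:.1f} s")
print(gdf.crs)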

I used the following config options, but we will probably want to create tiles at a higher resolution than z-10 (perhaps z-13):

{
  "z_range": [0, 10],
  "statistics": [
    {
      "name": "coverage",
      "weight_by": "area",
      "property": "area_per_pixel_area",
      "aggregation_method": "sum",
      "resampling_method": "average",
      "val_range": [
        0,
        1
      ],
      "nodata_val": 0,
      "palette": "oryel",
      "nodata_color": "#ffffff00"
    }
  ]
}

Preview in Cesium:
[screenshot, 2022-09-28]

@robyngit
Member Author

robyngit commented Oct 3, 2022

Based on feedback from Annett (below), we should do the following with the next run:

  • change the resampling method from the mean ("average") to "nearest neighbour"
  • colour-code the layer based on type (e.g. road, building, other)

We should also remember to re-create the layer when updated data is available next year.

Details:

If displaying as a raster, I would suggest not doing any resampling that averages, since this is discrete information (yes or no / classes). Otherwise you introduce new information (on size: bright if narrow, dark if it is a larger polygon), which could be misunderstood as thematic content. I would suggest 'nearest neighbour' instead of 'bilinear'. In that case you could also differentiate types (road, building, other).

Note that we are just about to complete an updated version. It covers a slightly larger area and has additional classes (three road types, airstrips and reservoirs as extra/additional objects). But I assume that quality control will not be finished before the end of the year.
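In config terms, Annett's suggestion maps onto the resampling option in the statistics entry used for the test run above; a sketch of the change only (everything else stays as in that config):

# sketch: resample the discrete class values with nearest neighbour instead of
# averaging, so lower z-levels keep original class values rather than blended ones
statistics_entry = {
    "name": "coverage",
    "resampling_method": "nearest",  # was "average" in the first test run
    # remaining options unchanged from the config in the previous comment
}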

@robyngit added the data available (The complete dataset is on the datateam server and ready to process) label on Oct 17, 2022
@julietcohen

Annett has produced a new version of this dataset, now located on Datateam at: /var/data/submission/pdg/bartsch_infrastructure/SACHI_v2/

Old version of the data is now in /var/data/submission/pdg/bartsch_infrastructure/old_version/

Initial notes:

  • resampling method in viz-raster is set to nearest neighbor by default but is overwritten by whatever is specified in the config. See from_rasters()
  • SACHI_v2.shp has only geometries and a DN column, which has unique values: 11, 12, 13, 20, 30, 40, 50
    • need to look into available metadata to determine what these codes represent
  • 2,754,082 rows, 673 MB
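A minimal sketch of those checks (assuming the shapefile sits at the top of the SACHI_v2 submission directory):

# sketch: confirm the columns, unique DN codes, and row count noted above
import geopandas as gpd

data = gpd.read_file(
    "/var/data/submission/pdg/bartsch_infrastructure/SACHI_v2/SACHI_v2.shp"
)
print(list(data.columns))            # expect just 'DN' and 'geometry'
print(sorted(data["DN"].unique()))   # 11, 12, 13, 20, 30, 40, 50
print(len(data))                     # ~2,754,082 rows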

@julietcohen

julietcohen commented Jan 19, 2024

Notes:

  • In order to understand the unique values of the infrastructure codes, I need access to the Zenodo page Robyn linked in the initial issue comment, which contains a README of the description of the data fields. I requested access via the Zenodo form and am awaiting approval.
  • A visual to understand the distribution of the attribute we're mapping: [image]
  • The DN col has no NA values, but the geometry column has 94,924 NA (None) values. That's only ~3.4% of all rows.
  • CRS of input data: [image]
  • all geometries: [image]

@julietcohen

Data after converting to EPSG:4326: [image]
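The reprojection itself is a single GeoPandas call; a minimal sketch, where data is the GeoDataFrame loaded from SACHI_v2.shp:

# sketch: reproject to EPSG:4326 and re-check the CRS and geometries
data_4326 = data.to_crs(epsg=4326)
print(data_4326.crs)  # EPSG:4326
data_4326.plot()      # quick visual check of all geometries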

@julietcohen

I was able to process a first draft of outputs with the visualization workflow. Importantly, in order to produce the web tiles I needed to use the same modification to to_image() in this pull request that I originally used to successfully produce web tiles for the permafrost and ground ice layer (see this comment in that issue). While processing, both datasets gave the same error and failed to write any web tiles until I added a line in to_image() that converts the image data to float. This gives me more confidence that the PR should be merged. A similarity between these 2 datasets is that both visualize categorical variables: in the permafrost dataset we visualize 4 categories of permafrost coverage, and in this infrastructure dataset we visualize 7 types of infrastructure.
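Roughly, the fix amounts to casting the pixel values to float before they are rescaled to the 0-255 image range. A sketch of the concept only (names here are illustrative), not the actual to_image() code in viz-raster:

import numpy as np

def rescale_for_image(pixel_values, min_val, max_val):
    # cast first so integer categorical codes rescale cleanly instead of failing
    pixel_values = np.asarray(pixel_values, dtype=float)
    return (pixel_values - min_val) / (max_val - min_val) * 255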

@julietcohen

On Zenodo, I was granted access to the older version of the dataset, which includes a README with attribute descriptions. I was hoping it would explain the infrastructure codes we see in this newer version of the dataset. Unfortunately, the attribute DN is not described in the older README.

First draft of the data layer with low resolution (set to max z-10):

[images]

To do:

  • reach out to Annett to get a description of the DN attribute
  • find out the resolution of the data to set the proper max zoom level
    • when re-starting the workflow from staging with a higher max z, do it on Delta on a single node so it's faster
    • otherwise, to take advantage of parallelization with either parsl or ray, the input file would need to be split into multiple files because there is only 1
  • play with the config to resolve the visualization issue where infrastructure units appear as 2 different colors in different tiles (e.g. a single road turns from green to blue)
    • examples: change the aggregation method, add a val_range to the config
  • merge the to_image() PR into the viz-raster develop then main branches and make a release before we can link to the version of the software that produced this dataset
  • remind myself where the output warning message comes from, because I know I have seen it before and documented it in another issue
    • RuntimeWarning: divide by zero encountered in scalar divide (255 / (max_val - min_val))
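For reference, the expression in that warning divides by zero whenever max_val equals min_val, e.g. for a tile that contains only a single value; a minimal illustration (not the workflow code):

import numpy as np

min_val, max_val = 11, 11                        # e.g. a tile with a single DN class
scale = np.float64(255) / (max_val - min_val)    # emits the divide-by-zero RuntimeWarning
print(scale)                                     # inf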
script
# process the infrastructure data
# start at lower z-level initially,
# then find the real res of the data and 
# apply the appropriate max z-level

# filepaths
from pathlib import Path
import os

# visual checks & vector data wrangling
import geopandas as gpd

# staging
import pdgstaging
from pdgstaging import TileStager

# rasterization & web-tiling
import pdgraster
from pdgraster import RasterTiler

# logging
from datetime import datetime
import logging
import logging.handlers
from pdgstaging import logging_config

# for transferring the log to workdir
import subprocess
from subprocess import Popen

# --------------------------------------------------------------

data_dir = '/home/jcohen/infrastructure/data/'

# define workflow configuration
config = { 
  "dir_input": data_dir, 
  "ext_input": ".shp",
  "ext_footprints": ".shp",
  "dir_staged": "staged/",
  "dir_geotiff": "geotiff/", 
  "dir_web_tiles": "web_tiles/", 
  "filename_staging_summary": "staging_summary.csv",
  "filename_rasterization_events": "raster_events.csv",
  "filename_rasters_summary": "raster_summary.csv",
  "filename_config": "config",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [
    0,
    10
  ],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "infrastructure_code",
      "weight_by": "area",
      "property": "DN",
      "aggregation_method": "max", # TODO: need to think about this one more, maybe should be "sum" or something else
      "resampling_method": "nearest",
      "palette": [
        "#f48525", # orange
        "#f4e625", # yellow
        "#47f425", # green
        "#25f4e2", # turquoise
        "#2525f4", # blue
        "#f425c3", # pink
        "#f42525" # red
      ],
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    },
  ],
  "deduplicate_at": None,
  "deduplicate_keep_rules": None,
  "deduplicate_method": None
}

# --------------------------------------------------------------

print("Staging initiated.")
# stage the tiles
stager = TileStager(config)
stager.stage_all()

# --------------------------------------------------------------

print("Rasterizing and Web-tiling initiated.")
# rasterize all staged tiles, resample to lower resolutions,
# and produce web tiles
RasterTiler(config).rasterize_all()

# transfer log from /tmp to user dir
# add subdirectories as needed
user = subprocess.check_output("whoami").strip().decode("ascii") 
cmd = ['mv', '/tmp/log.log', f'/home/{user}/infrastructure/']
# initiate the process to run that command
process = Popen(cmd)

print("Script complete.")

@julietcohen

  • Adding a val_range to the config results in more consistency in the palette for each infrastructure unit within the web tiles
    • the val_range spans from the smallest possible infrastructure code (11) to the largest (50)
    • for example: one road will be the same color from one end to the other
  • Pre-processing steps:
    1. Removed the geometries that are NA before inputting in the workflow to avoid the warning output in the first run
    2. split the input data into 6 files to process in parallel using Parsl on Datateam
  • Data layer is now on demo: Infrastructure (SACHI version 2)
  • I reached out to Annett to get clarification about which code represents which infrastructure type
    • could be the 7 classes of infrastructure described in one of the papers Robyn linked at the top of this issue: fishing, agriculture (mostly reindeer herding), gas/oil industry, mining, other use (e.g. transport hub), abandoned, and unknown
clean_infrastructure.py
# Arctic infrastructure data from Annett B.
# clean by removing rows with NA geometries
# and split data into smaller files for processing
# in parallel

import geopandas as gpd
import numpy as np

data_path = "/home/jcohen/infrastructure/data/SACHI_v2.shp"
data = gpd.read_file(data_path)

# remove NA values from the geometry column
data_clean = data[data['geometry'].notna()]

# split the cleaned data into 6 subfiles
# each file will have equal number of rows 
split_gdfs = np.array_split(data_clean, 6)
for i, split_gdf in enumerate(split_gdfs):
    split_gdf.reset_index(inplace = True)
    # Create the filename with a number ranging from 1 to 6
    filename = f'/home/jcohen/infrastructure/data_cleaned_split/SACHI_v2_clean_{i + 1}.gpkg'
    split_gdf.to_file(filename, driver = "GPKG", index = True)
    print(f'Saved {filename}')

print("Script complete.")
parsl script
# Processing the infrastructure data from Annett 
# see issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/27
# staging through web tiling with parsl
# conda environment: viz-local, with local installs
# for viz-staging and viz-raster, but with modified viz-raster
# because need to_image() fix (see PR),
# and pip install for parsl==2023.11.27

import pdgstaging
import pdgraster

from datetime import datetime
import json
import logging
import logging.handlers
from pdgstaging import logging_config
import os

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

import shutil
import subprocess
from subprocess import Popen
user = subprocess.check_output("whoami").strip().decode("ascii")


# start with a fresh directory!
print("Removing old directories and files...")
base_dir = "/home/jcohen/infrastructure/parsl_workflow/"
old_filepaths = [f"{base_dir}staging_summary.csv",
                f"{base_dir}raster_summary.csv",
                f"{base_dir}raster_events.csv",
                f"{base_dir}config__updated.json",
                f"{base_dir}log.log"]
for old_file in old_filepaths:
  if os.path.exists(old_file):
      os.remove(old_file)

# remove dirs from past run
old_dirs = [f"{base_dir}staged",
            f"{base_dir}geotiff",
            f"{base_dir}web_tiles",
            f"{base_dir}runinfo"]
for old_dir in old_dirs:
  if os.path.exists(old_dir) and os.path.isdir(old_dir):
      shutil.rmtree(old_dir)


activate_conda = 'source /home/jcohen/.bashrc; conda activate viz-local'
htex_local = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            worker_debug=False,
            #cores_per_worker=1,
            max_workers=6,
            provider=LocalProvider(
                #channel=LocalChannel(),
                init_blocks=1,
                max_blocks=1,
                worker_init=activate_conda
            ),
        )
    ],
)
parsl.clear()
parsl.load(htex_local)

def run_pdg_workflow(
    workflow_config,
    batch_size = 300
):
    """
    Run the main PDG workflow for the following steps:
    1. staging
    2. raster highest
    3. raster lower
    4. web tiling

    Parameters
    ----------
    workflow_config : dict
        Configuration for the PDG staging workflow, tailored to rasterization and 
        web tiling steps only.
    batch_size: int
        How many staged files, geotiffs, or web tiles should be included in a single creation
        task? (each task is run in parallel) Default: 300
    """

    start_time = datetime.now()

    logging.info("Staging initiated.")

    stager = pdgstaging.TileStager(workflow_config)
    tile_manager = stager.tiles
    config_manager = stager.config

    input_paths = stager.tiles.get_filenames_from_dir('input')
    input_batches = make_batch(input_paths, batch_size)

    # Stage all the input files (each batch in parallel)
    app_futures = []
    for i, batch in enumerate(input_batches):
        app_future = stage(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(input_batches)}')

    # Don't continue to next step until all files have been staged
    [a.result() for a in app_futures]

    logging.info("Staging complete.")

    # ----------------------------------------------------------------

    # Create highest geotiffs 
    rasterizer = pdgraster.RasterTiler(workflow_config)

    # Process staged files in batches
    logging.info(f'Collecting staged file paths to process...')
    staged_paths = tile_manager.get_filenames_from_dir('staged')
    logging.info(f'Found {len(staged_paths)} staged files to process.')
    staged_batches = make_batch(staged_paths, batch_size)
    logging.info(f'Processing staged files in {len(staged_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(staged_batches):
        app_future = create_highest_geotiffs(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(staged_batches)}')

    # Don't move on to next step until all geotiffs have been created
    [a.result() for a in app_futures]

    logging.info("Rasterization highest complete. Rasterizing lower z-levels.")

    # ----------------------------------------------------------------

    # Rasterize composite geotiffs
    min_z = config_manager.get_min_z()
    max_z = config_manager.get_max_z()
    parent_zs = range(max_z - 1, min_z - 1, -1)

    # Can't start lower z-level until higher z-level is complete.
    for z in parent_zs:

        # Determine which tiles we need to make for the next z-level based on the
        # path names of the geotiffs just created
        logging.info(f'Collecting highest geotiff paths to process...')
        child_paths = tile_manager.get_filenames_from_dir('geotiff', z = z + 1)
        logging.info(f'Found {len(child_paths)} highest geotiffs to process.')
        # create empty set for the following loop
        parent_tiles = set()
        for child_path in child_paths:
            parent_tile = tile_manager.get_parent_tile(child_path)
            parent_tiles.add(parent_tile)
        # convert the set into a list
        parent_tiles = list(parent_tiles)

        # Break all parent tiles at level z into batches
        parent_tile_batches = make_batch(parent_tiles, batch_size)
        logging.info(f'Processing highest geotiffs in {len(parent_tile_batches)} batches.')

        # Make the next level of parent tiles
        app_futures = []
        for parent_tile_batch in parent_tile_batches:
            app_future = create_composite_geotiffs(
                parent_tile_batch, workflow_config)
            app_futures.append(app_future)

        # Don't start the next z-level, and don't move to web tiling, until the
        # current z-level is complete
        [a.result() for a in app_futures]

    logging.info("Composite rasterization complete. Creating web tiles.")

    # ----------------------------------------------------------------

    # Process web tiles in batches
    logging.info(f'Collecting file paths of geotiffs to process...')
    geotiff_paths = tile_manager.get_filenames_from_dir('geotiff')
    logging.info(f'Found {len(geotiff_paths)} geotiffs to process.')
    geotiff_batches = make_batch(geotiff_paths, batch_size)
    logging.info(f'Processing geotiffs in {len(geotiff_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(geotiff_batches):
        app_future = create_web_tiles(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(geotiff_batches)}')

    # Don't record end time until all web tiles have been created
    [a.result() for a in app_futures]

    end_time = datetime.now()
    logging.info(f'⏰ Total time to create all z-level geotiffs and web tiles: '
                 f'{end_time - start_time}')

# ----------------------------------------------------------------

# Define the parsl functions used in the workflow:

@python_app
def stage(paths, config):
    """
    Stage a file
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgstaging
    from pdgstaging import logging_config

    stager = pdgstaging.TileStager(config = config, check_footprints = False)
    for path in paths:
        stager.stage(path)
    return True

# Create highest z-level geotiffs from staged files
@python_app
def create_highest_geotiffs(staged_paths, config):
    """
    Create a batch of geotiffs from staged files
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    # rasterize the vectors, highest z-level only
    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.rasterize_vectors(
        staged_paths, make_parents = False)
    # no need to update ranges because manually set val_range in config

# ----------------------------------------------------------------

# Create composite geotiffs from highest z-level geotiffs 
@python_app
def create_composite_geotiffs(tiles, config):
    """
    Create a batch of composite geotiffs from highest geotiffs
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.parent_geotiffs_from_children(
        tiles, recursive = False)

# ----------------------------------------------------------------

# Create a batch of webtiles from geotiffs
@python_app
def create_web_tiles(geotiff_paths, config):
    """
    Create a batch of webtiles from geotiffs
    """

    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.webtiles_from_geotiffs(
        geotiff_paths, update_ranges = False)
        # no need to update ranges because val_range is
        # defined in the config


def make_batch(items, batch_size):
    """
    Create batches of a given size from a list of items.
    """
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# ----------------------------------------------------------------


# run the workflow
workflow_config = '/home/jcohen/infrastructure/parsl_workflow/config.json'
print("Loaded config. Running workflow.")
run_pdg_workflow(workflow_config)
# Shutdown and clear the parsl executor
htex_local.executors[0].shutdown()
parsl.clear()

# transfer log from /tmp to user dir
cmd = ['mv', '/tmp/log.log', f'/home/{user}/infrastructure/parsl_workflow/']
# initiate the process to run that command
process = Popen(cmd)

print("Script complete.")
config
{ 
  "dir_input": "/home/jcohen/infrastructure/data_cleaned_split", 
  "ext_input": ".gpkg",
  "dir_staged": "staged/",
  "dir_geotiff": "geotiff/", 
  "dir_web_tiles": "web_tiles/", 
  "filename_staging_summary": "staging_summary.csv",
  "filename_rasterization_events": "raster_events.csv",
  "filename_rasters_summary": "raster_summary.csv",
  "filename_config": "config",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [
    0,
    12
  ],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "infrastructure_code",
      "weight_by": "area", 
      "property": "DN",
      "aggregation_method": "max", 
      "resampling_method": "nearest",
      "val_range": [
        11,
        50
      ], 
      "palette": [
        "#f48525", 
        "#f4e625", 
        "#47f425", 
        "#25f4e2", 
        "#2525f4", 
        "#f425c3", 
        "#f42525" 
      ],
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    }
  ],
  "deduplicate_at": null,
  "deduplicate_keep_rules": null,
  "deduplicate_method": null,
  "clip_to_footrpint": false
}

@elongano elongano moved this to In Progress in Data Layers Jan 29, 2024
@elongano

Category: Infrastructure

@elongano added the Infrastructure (data layer category: infrastructure) label on Jan 29, 2024
@julietcohen

Annett has provided the README for this dataset. It provides a code for each infrastructure type and is located in /var/data/submission/pdg/bartsch_infrastructure/

@julietcohen

julietcohen commented Mar 6, 2024

The dataset package on the ADC: https://arcticdata.io/catalog/view/urn%3Auuid%3A4e1ea0af-6f7c-4a7a-b69f-4e818e113c43

  • Be sure to navigate to the newest version of the dataset with the link in the top banner.
  • This is restricted to private access until we finish it and Annett approves; then we will assign it the DOI I already generated: 10.18739/A21J97929
  • I released version 0.9.2 of viz-raster, which was used to process this dataset, so we can link to it in the metadata
  • the most recent version of viz-staging (v0.9.1) was used for this dataset, so there is no need to make a new release for this data package

The final dataset has been processed and is archived on datateam:

  • Everything besides the web tiles is located at /var/data/10.18739/A21J97929/
    • The subdir input contains:
      • the raw SACHI_v2 data file from Annett (just 1 shapefile and its accompanying files)
      • the README for this new version of the dataset that Annett emailed
      • the cleaning script for the shapefile, because it contained NA values that needed to be removed
      • the output file of that cleaning script, SACHI_v2_clean.gpkg, which was actually used as input to the viz workflow
  • The web tiles are at /var/data/tiles/10.18739/A21J97929/
    • The tiles are visualized on the demo PDG portal. I have not received any feedback or requested changes to the output, so I will move forward with the data package processing. If anything needs to change moving forward, it would be ideal to get that feedback before we publish the dataset with the DOI.

@julietcohen

viz-workflow v0.9.2 (which was used to produce this data layer) has been released. The only thing blocking this layer from moving to production is that the ADC data package needs to be finished and published. I have been adding more metadata to this package over the past week, so it is on its way to getting assigned its pre-issued DOI!

@julietcohen

Justin from the ADC has helpfully taken over the rest of the metadata documentation for this dataset, starting this week.

@julietcohen

julietcohen commented Mar 26, 2024

Annett has requested that I change the palette of this dataset so that buildings are in red and water is in blue. I also noticed that there were MultiPolygons (only 2% of the geometries) in the input data from Annett. I re-cleaned this input data (still removing NA geoms like last time, with the additional step of exploding those MultiPolygon geoms) and will re-process the data with the viz workflow with the requested palette change as well. This is not a time-consuming operation on Delta.
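A minimal sketch of that extra cleaning step (paths abbreviated; GeoPandas explode() turns each MultiPolygon row into one row per Polygon):

import geopandas as gpd

data = gpd.read_file("SACHI_v2.shp")                      # path abbreviated
data = data[data["geometry"].notna()]                     # drop NA geometries, as before
data = data.explode(index_parts=False).reset_index(drop=True)
data.to_file("SACHI_v2_clean.gpkg", driver="GPKG")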

This new cleaning script will replace the old one that is uploaded to the data package and the staged and geotiff tilesets will replace the ones currently in the pre-assigned DOI dir. I already discussed this with Justin.

For convenience, I am pasting the values for the infrastructure codes here (they are from the README):

  • 11=linear transport infrastructure (asphalt)
  • 12=linear transport infrastructure (gravel)
  • 13=linear transport infrastructure (undefined)
  • 20=buildings (and other constructions such as bridges)
  • 30=other impacted area (includes gravel pads, mining sites)
  • 40=airstrip
  • 50=reservoir or other water body impacted by human activities
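For convenience in code, the same mapping can be kept as a plain lookup dictionary (a sketch for labelling plots or tables, not part of the viz config):

# DN codes from the SACHI_v2 README, as a lookup for labelling
DN_LABELS = {
    11: "linear transport infrastructure (asphalt)",
    12: "linear transport infrastructure (gravel)",
    13: "linear transport infrastructure (undefined)",
    20: "buildings (and other constructions such as bridges)",
    30: "other impacted area (includes gravel pads, mining sites)",
    40: "airstrip",
    50: "reservoir or other water body impacted by human activities",
}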

@julietcohen

Annett also mentioned that she is going to update the dataset on Zenodo with the new version this week, and it already has a DOI there. Justin let her know we will still use our pre-assigned DOI since it is a new dataset, and I went into detail about our derived products. We will mention the Zenodo DOI in the dataset documentation on the ADC.

@julietcohen

Delta decided that Python doesn't exist anymore in any of my environments, so I have been troubleshooting that for the past hour. The next step would be to uninstall and reinstall VS Code, and remove the known hosts for Delta. I already tried uninstalling and reinstalling all Python extensions.

@julietcohen

julietcohen commented Apr 1, 2024

Annett has updated the dataset based on my feedback regarding the geometries. She uploaded the new version of SACHI_v2 to Zenodo. She included the following info:

DOI 10.5281/zenodo.10160636
I made a number of changes based on your feedback:

  1. I had a closer look at the geometry properties. The version which you got had duplicates, and the overlap areas of the different Sentinel-2 source granules were not yet merged. This is now solved, and there are now also no features without attributes any more
  2. I extended the readme file for the meta data:
    S2_date1 to S2_date3 - dates of individual Sentinel-2 images used for averaging
    S1_winter - year(s) of Sentinel-1 images used for averaging (months December and/or January)

Justin and I made it clear that we would be giving the dataset a new DOI (the one I pre-created for this dataset) because our ADC version of the data package differs from the one she has on Zenodo, considering all our derived products.

Since she made changes to SACHI_v2, I will upload the new version to Datateam and reprocess the dataset with the viz workflow.

@julietcohen

I reprocessed the infrastructure dataset with Annett's new version of the SACHI_v2 data as input to the viz workflow. Since she fixed the geometries, I only had to split the MultiPolygons before inputting the data into the viz workflow. I also updated the palette to represent buildings in red and water in blue, as Annett requested, and removed yellow from the palette since we visualize the ice-wedge polygon layer in yellow and we anticipate that users will explore these layers together. All input and output files and the new pre-viz cleaning script have been uploaded to /var/data/10.18739/A21J97929. The output web tiles have been uploaded to /var/data/tiles/10.18739/A21J97929/

I updated the demo portal with this layer:
[image]

@julietcohen

I was able to refine the assignment of the categorical palette so that just 1 color is attributed to 1 infrastructure type. I did this in a similar way to the permafrost and ground ice dataset. Instead of using the attribute DN for the visualized tiles, I created a new attribute, palette_code, that assigns the numbers 1 through 7 to the DN values, because the values used for palette assignment seem to need to be evenly spaced in order for 1 color to correspond to 1 categorical value. The values of DN are numerical but are not evenly spaced.

# add an attribute that codes the categorical DN attribute
# into evenly spaced numbers in order to assign the
# palette correctly to the categories in the web tiles
# (data is the cleaned GeoDataFrame read with GeoPandas):
import numpy as np

conditions = [
    (data['DN'] == 11),
    (data['DN'] == 12),
    (data['DN'] == 13),
    (data['DN'] == 20),
    (data['DN'] == 30),
    (data['DN'] == 40),
    (data['DN'] == 50)
]
choices = [1, 2, 3, 4, 5, 6, 7]
data['palette_code'] = np.select(conditions, choices)

As a result, I see more diversity in the colors of the geometries, as Annett suggested there should be, and I integrated grey for the DN value 30 per her suggestion:

[image]

Taking this approach, I realize that in order for Annett's metadata for the DN values to match the numbers in the raster, I need to include both attributes (DN and palette_code) as bands in the rasters. This is because we will use the web tiles for palette_code as the layer we visualize on the portal, and users will want the option to use either band when they download the rasters for their own analysis.
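In config terms, that means two entries under "statistics", one per band. A sketch (field values follow the earlier config; the palette for palette_code is the custom 7-colour list discussed above):

statistics = [
    {   # band 1: original README codes (DN), kept for analysis
        "name": "infrastructure_code",
        "property": "DN",
        "aggregation_method": "max",
        "resampling_method": "nearest",
        "val_range": [11, 50],
    },
    {   # band 2: evenly spaced recoding used for the web tiles on the portal
        "name": "palette_code",
        "property": "palette_code",
        "aggregation_method": "max",
        "resampling_method": "nearest",
        "val_range": [1, 7],
        # "palette": [ ... custom 7 hex colours ... ],
    },
]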

One note is that I have only tried this dataset with a custom palette, with each of the 7 hex codes specifically assigned rather than a pre-made named palette, because we wanted to include certain colors in a specific order and exclude colors that would too closely match other datasets on the PDG.

@julietcohen

One more consideration for this dataset: the resolution of this data is not clear on the Zenodo page (this new Zenodo link is important because it points to version 2 of the dataset that Annett updated, not version 1, which was originally linked above when this ticket was first created). Sentinel data resolution can vary depending on the bands used. One way to find this may be to dive deeper into the methods, like a paper associated with this dataset, or to ask Annett. I have been using z-level 12 as the max, because that is ~32 m resolution at latitude +/- 31.

@julietcohen

julietcohen commented Apr 24, 2024

Annett has approved the layer to be moved to production. The remaining to-do items, in order:

  • process the infrastructure layer on Delta for the final time
    • with z-13 as highest zoom (significantly increasing the number of tiles)
    • with 2 stats: infrastructure_code (DN) and palette_code
  • delete old tilesets from Datateam
  • move tilesets and other files (like the CSVs and log) to Datateam
  • remove the useless node dir (cn014) from the staged dir (just move everything within cn014 up one level, then remove that dir)
  • update the vector and geotiff tile examples in the ADC package
  • update other metadata details that need to be clarified now that there is 1 extra attribute in the vectors, 2 stats in the geotiffs, and a higher resolution for the highest z geotiffs
  • expand the description for the layer in the left pane
  • on the portal, add the attribution based on the citation for the ADC package
  • get EZID credentials from Matt
  • make EZID for the viz-workflow v0.9.3 release
  • release viz-workflow v0.9.3 with new config used for infrastructure layer processing, update that release version in the ADC package
  • make package entities ghost entities for example files for staged and geotiff
  • one last look over the ADC package and publish
    • need to double-check for the &amp; artifacts that appear in the text when special characters are used
    • make abstract markdown formatting render by editing the XML
    • make sure all file names are updated in methods since files were replaced
  • move layer to production
    • make sure legend is updated
  • inform Annett

@mbjones
Member

mbjones commented Apr 30, 2024

@julietcohen In reading through your correspondence on this ticket, I saw that it originated as a vector dataset. The most accurate way for us to present it would be as a vector layer, rather than raster. Let's please discuss this as I suspect it would provide a much better result. We have a number of vector layers in the portals now, mainly via a geoJSON conversion. I'm not sure if the dataset size would be prohibitive there, but let's discuss please.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Data Layers May 14, 2024