
Add aggregation datasets for lake time series data: Permafrost Database for Northern Alaska and Circumpolar Thermokarst Landscapes #37

initze opened this issue Apr 5, 2023 · 18 comments

Comments

@initze

initze commented Apr 5, 2023

Datasets for lake statistics aggregation

Regional Dataset

Jorgenson: Ecological Mapping and Permafrost Database for Northern Alaska (Final Ecological Mapping Update (2014))

Data Download Link (zipped shp)

Archive/Dataset Link: https://catalog.northslopescience.org/dataset/2236

For data aggregation I would propose using the fields "ECOREGION" and "LITHOLOGY" to start with; I guess once set up we could add others (see the sketch after the field list below).

ECOREGION

LITHOLOGY
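A minimal sketch (not part of the original proposal) of how lake statistics could be aggregated by these fields with geopandas; the file paths and the lake-statistic column name are placeholders, and the same pattern would apply to the "TKThLP" field of the circumpolar dataset below.

# Sketch: aggregate a lake statistic per ECOREGION / LITHOLOGY polygon
# (file paths and the "netchange_perc" column are placeholders)
import geopandas as gpd

ecoregions = gpd.read_file("jorgenson_ecological_mapping_2014.shp")  # hypothetical path
lakes = gpd.read_file("lake_change_statistics.gpkg")                 # hypothetical path
lakes = lakes.to_crs(ecoregions.crs)

# assign each lake to the region polygon its centroid falls in
joined = gpd.sjoin(
    lakes.assign(geometry=lakes.geometry.centroid),
    ecoregions[["ECOREGION", "LITHOLOGY", "geometry"]],
    how="inner",
    predicate="within",
)

# aggregate the placeholder statistic per region
print(joined.groupby("ECOREGION")["netchange_perc"].mean())
print(joined.groupby("LITHOLOGY")["netchange_perc"].mean())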

Pan-Arctic Dataset

Olefeldt: Circumpolar Thermokarst Landscapes, 2015, Circum-Arctic

Data Download Link (GPKG), with fixed geometries (I had some issues with the original file): https://1drv.ms/u/s!AobXXrP933xWh8lqz5It06Zf8AT9JA?e=yBQaDY

Archive/Dataset Link: https://apgc.awi.de/dataset/ctl-arctic

For data aggregation I would propose to use the field "TKThLP"

[image]

@initze

initze commented Apr 5, 2023

Google slides with some simple sketches for data visualizations.

In progress --> will be updated from time to time

https://docs.google.com/presentation/d/1egijizjw9boYxuuMBpfGwE7tX4M4KmYiJq0wbocO3Zc/edit#slide=id.g22b5ba41013_0_0

@initze

initze commented Apr 5, 2023

Lake area time-series

Here we have 2 datasets:
GDrive Link: https://drive.google.com/drive/folders/1RH3G-u6mKRSZ82Fok9I1CPUpK8vj8EtH?usp=share_link

Both datasets can be joined via the ID_merged attribute

  1. INitze_Lakes_merged_v3_PDG_Testset.gpkg
  • GeoPackage file with polygon geometries and static lake change statistics for the period 1999-2014 from Nitze et al. 2018
  • published dataset with some cleaning, merged 4 transects
  2. Lakes_IngmarPDG_annual_area.nc

@julietcohen

Thank you for this dataset, metadata, and suggestions for processing, Ingmar! Very much looking forward to visualizing this on the PDG.

@julietcohen

The 2 files for lake area time series have been uploaded to the NCEAS datateam server: /home/pdg/data/nitze_lake_change/time_series_2023-04-05

@julietcohen

Update: After rearranging Datateam directories, the data is now at: /var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05

@julietcohen added the "data available" and "priority: high" labels on Jul 25, 2023
@julietcohen

julietcohen commented Jul 25, 2023

Initial exploration (a rough sketch of these checks follows the lists below)

GeoPackage:

  • geometries that represent lake data in non-adjacent UTM zones:
image
  • projected CRS: EPSG 3995, WGS 84 Arctic Polar Stereographic
  • 601,486 unique values for ID_merged
  • has 114,821 NA values in each of the columns: raten_cm_y & rateg_cm_y, across 114,821 unique ID values

NetCDF:

  • 601,286 unique values for ID_merged
  • no geometries
  • several years of data for each ID (1984-1988, 2017-2021, ...) with 38 unique years represented
  • data for annual permanent water and seasonal water
  • has 3,800 NA values for both attributes
    • the NA values span across all 38 years, but are only present for 100 ID values (38*100=3,800)
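A rough sketch of how the checks above could be reproduced, using the paths listed earlier in this issue and the column names documented in the merge script further down:

# Sketch: inspect the GeoPackage and NetCDF files
import geopandas as gpd
import xarray as xr

base = "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05"

gdf = gpd.read_file(f"{base}/INitze_Lakes_merged_v3_PDG_Testset.gpkg")
print(gdf.crs)                                          # projected CRS
print(gdf["ID_merged"].nunique())                       # unique lake IDs
print(gdf[["raten_cm_y", "rateg_cm_y"]].isna().sum())   # NA counts per column

ds = xr.open_dataset(f"{base}/Lakes_IngmarPDG_annual_area.nc")
df = ds.to_dataframe().reset_index()
print(df["ID_merged"].nunique())                        # unique lake IDs
print(df["year"].nunique())                             # number of years represented
print(df[["permanent_water", "seasonal_water"]].isna().sum())  # NA counts per column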

@julietcohen

Combined data: Lake Area Time Series

Converted the NetCDF file to a dataframe:

image

Combined it with the GeoPackage using an inner join, retaining only the ID_merged values present in both files. The merged data lives in:
/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/merged_lakes_area.gpkg

merge.py
# Merge the two lake change datasets from Ingmar Nitze
# Author: Juliet Cohen
# Date: 07/26/23
# use conda environment "geospatial"

import geopandas as gpd
import xarray as xar
import pandas as pd

# Lake area time series data GeoPackage
gdf = gpd.read_file("/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/INitze_Lakes_merged_v3_PDG_Testset.gpkg")
# Lake area time series data NetCDF
area = xar.open_dataset("/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/Lakes_IngmarPDG_annual_area.nc")

# convert the NetCDF file into a dataframe, with columns for:
# ID_merged, year, permanent_water, seasonal_water
area_df = area.to_dataframe().reset_index()

# merge the dataframe, retaining only ID_merged values (lakes) that exist in both files
# the spatial dataframe must be the left argument to retain the geometries
merged_data = gdf.merge(right = area_df,
                        how = 'inner', # only retain lakes with data for the measurements for permanent water and seasonal water 
                        on = 'ID_merged')

# Save as a gpkg for input into the viz-workflow
merged_data.to_file("/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/merged_lakes_area.gpkg", driver = "GPKG")

Next step:
Set up a parsl run to process this file for initial testing, with no deduplication and max z-level 11. The gpkg is quite large, so reading it in takes a long time. When working with the entire lake change time series dataset (rather than just these few UTM zones), it is a good idea to separate the merged gpkg files by UTM zone so they are easier to parallelize with batches during staging (a rough sketch below).

@julietcohen added the "layer" label on Aug 7, 2023
@julietcohen

julietcohen commented Dec 8, 2023

Update on lake area time series visualization

  • subset the merged gpkg data by transect (4 files), in order to save them as files and read them in on Datateam in a reasonable amount of time for initial workflow testing
  • since the data is annual and there are 38 unique years, the first 38 polygons have the exact same geometries and many of the same attribute values
  • T2 is the smallest of the 4 subset files, but it still took 40 min to read in (!), so I subset it further to a file in my home dir for workflow testing with an even smaller subset (760 rows, giving 20 unique polygons, with 0 NA values in any column, woohoo)
  • multipolygon geometries were exploded into a single row per polygon for this small subset (see the sketch after this list). This needs to be done for all merged data prior to input into the workflow for the full run. However, it looks like (based on just the sample of rows I'm working with) the polygons were already single polygons but labeled as multipolygons. I haven't seen this before, but it's not important at the moment because explode does not error in this case and works as expected.
  • attribute descriptions would be helpful, since most are acronyms and therefore hard to decipher, and we'll need them eventually for the metadata anyway, even if I don't need them all for visualizing
  • the nodata_val must be 0 until I merge the branch that allows the nodata_val to be a different value, so it is best to choose an attribute to visualize that has all positive values for now
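A small sketch of the explode step mentioned above, assuming a hypothetical per-transect subset file:

# Sketch: explode multipolygons into one row per polygon (geopandas >= 0.10)
import geopandas as gpd

gdf = gpd.read_file("T2_subset.gpkg")  # hypothetical subset file
exploded = gdf.explode(index_parts=False).reset_index(drop=True)
assert (exploded.geom_type == "Polygon").all()  # every row is now a single polygon
exploded.to_file("T2_subset_exploded.gpkg", driver="GPKG")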

@julietcohen

First draft of lake size time series dataset

image
  • attribute: permanent_water (need units from Ingmar)
  • only transect 2 (there are 4 in total)
  • polygon colors are retained as user zooms in ✅
  • need to choose more intuitive palette (discuss with Ingmar)
  • consider using higher z-level than max 11 (discuss with Ingmar)
  • need an explanation of how the years are represented: the polygons are identical across years, but the values change for the attribute(s) of interest, which changes the color
config
{
  "deduplicate_clip_to_footprint": false,
  "deduplicate_method": "None",
  "deduplicate_at": "None",
  "dir_output": "/home/jcohen/lake_change_time_series/parsl_workflow/", 
  "dir_input": "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/exploded/", 
  "ext_input": ".gpkg",
  "dir_staged": "/home/jcohen/lake_change_time_series/parsl_workflow/staged/", 
  "dir_geotiff": "/home/jcohen/lake_change_time_series/parsl_workflow/geotiff/",  
  "dir_web_tiles": "/home/jcohen/lake_change_time_series/parsl_workflow/web_tiles/", 
  "filename_staging_summary": "/home/jcohen/lake_change_time_series/parsl_workflow/staging_summary.csv",
  "filename_rasterization_events": "/home/jcohen/lake_change_time_series/parsl_workflow/raster_events.csv",
  "filename_rasters_summary": "/home/jcohen/lake_change_time_series/parsl_workflow/raster_summary.csv",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [
    0,
    11
  ],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "permanent_water", 
      "weight_by": "area",
      "property": "permanent_water", 
      "aggregation_method": "max",
      "resampling_method": "mode", 
      "val_range": [
        0,
        4000
      ],
      "palette": [
        "#fae500",
        "#00d8db",
        "#2100db",
        "#9a00fa"
      ],
      "nodata_val": 0,
      "nodata_color": "#ffffff00" 
    }
  ]
}
parsl workflow
# Processing the merged lake change time series data sample from Ingmar
# see issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/37
# staging through web tiling with parsl
# conda environment: viz-local, with local installs
# for viz-staging and viz-raster, main branch
# and pip install for parsl==2023.11.27

from datetime import datetime
import json
import logging
import logging.handlers
import os

import pdgstaging
import pdgraster

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

import subprocess
from subprocess import Popen
user = subprocess.check_output("whoami").strip().decode("ascii")

# configure logger
logger = logging.getLogger("logger")
# Remove any existing handlers from the logger
for handler in logger.handlers[:]:
    logger.removeHandler(handler)
# prevent logging statements from being printed to terminal
logger.propagate = False
# set up new handler
handler = logging.FileHandler("/tmp/log.log")
formatter = logging.Formatter(logging.BASIC_FORMAT)
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)


activate_conda = 'source /home/jcohen/.bashrc; conda activate viz-local'
htex_local = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            worker_debug=False,
            #cores_per_worker=1,
            max_workers=10,
            provider=LocalProvider(
                #channel=LocalChannel(),
                init_blocks=1,
                max_blocks=1,
                worker_init=activate_conda
            ),
        )
    ],
)
parsl.clear()
parsl.load(htex_local)

def run_pdg_workflow(
    workflow_config,
    batch_size = 300
):
    """
    Run the main PDG workflow for the following steps:
    1. staging
    2. raster highest
    3. raster lower
    4. web tiling

    Parameters
    ----------
    workflow_config : dict
        Configuration for the PDG staging workflow, tailored to rasterization and 
        web tiling steps only.
    batch_size: int
        How many staged files, geotiffs, or web tiles should be included in a single creation
        task? (each task is run in parallel) Default: 300
    """

    start_time = datetime.now()

    logging.info("Staging initiated.")

    stager = pdgstaging.TileStager(workflow_config)
    #tile_manager = rasterizer.tiles
    tile_manager = stager.tiles
    config_manager = stager.config

    input_paths = stager.tiles.get_filenames_from_dir('input')
    input_batches = make_batch(input_paths, batch_size)

    # Stage all the input files (each batch in parallel)
    app_futures = []
    for i, batch in enumerate(input_batches):
        app_future = stage(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(input_batches)}')

    # Don't continue to next step until all files have been staged
    [a.result() for a in app_futures]

    logging.info("Staging complete.")

    # ----------------------------------------------------------------

    # Create highest geotiffs 
    rasterizer = pdgraster.RasterTiler(workflow_config)

    # Process staged files in batches
    logging.info(f'Collecting staged file paths to process...')
    staged_paths = tile_manager.get_filenames_from_dir('staged')
    logging.info(f'Found {len(staged_paths)} staged files to process.')
    staged_batches = make_batch(staged_paths, batch_size)
    logging.info(f'Processing staged files in {len(staged_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(staged_batches):
        app_future = create_highest_geotiffs(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(staged_batches)}')

    # Don't move on to next step until all geotiffs have been created
    [a.result() for a in app_futures]

    logging.info("Rasterization highest complete. Rasterizing lower z-levels.")

    # ----------------------------------------------------------------

    # Rasterize composite geotiffs
    min_z = config_manager.get_min_z()
    max_z = config_manager.get_max_z()
    parent_zs = range(max_z - 1, min_z - 1, -1)

    # Can't start lower z-level until higher z-level is complete.
    for z in parent_zs:

        # Determine which tiles we need to make for the next z-level based on the
        # path names of the geotiffs just created
        logging.info(f'Collecting highest geotiff paths to process...')
        child_paths = tile_manager.get_filenames_from_dir('geotiff', z = z + 1)
        logging.info(f'Found {len(child_paths)} highest geotiffs to process.')
        # create empty set for the following loop
        parent_tiles = set()
        for child_path in child_paths:
            parent_tile = tile_manager.get_parent_tile(child_path)
            parent_tiles.add(parent_tile)
        # convert the set into a list
        parent_tiles = list(parent_tiles)

        # Break all parent tiles at level z into batches
        parent_tile_batches = make_batch(parent_tiles, batch_size)
        logging.info(f'Processing highest geotiffs in {len(parent_tile_batches)} batches.')

        # Make the next level of parent tiles
        app_futures = []
        for parent_tile_batch in parent_tile_batches:
            app_future = create_composite_geotiffs(
                parent_tile_batch, workflow_config)
            app_futures.append(app_future)

        # Don't start the next z-level, and don't move to web tiling, until the
        # current z-level is complete
        [a.result() for a in app_futures]

    logging.info("Composite rasterization complete. Creating web tiles.")

    # ----------------------------------------------------------------

    # Process web tiles in batches
    logging.info(f'Collecting file paths of geotiffs to process...')
    geotiff_paths = tile_manager.get_filenames_from_dir('geotiff')
    logging.info(f'Found {len(geotiff_paths)} geotiffs to process.')
    geotiff_batches = make_batch(geotiff_paths, batch_size)
    logging.info(f'Processing geotiffs in {len(geotiff_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(geotiff_batches):
        app_future = create_web_tiles(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(geotiff_batches)}')

    # Don't record end time until all web tiles have been created
    [a.result() for a in app_futures]

    end_time = datetime.now()
    logging.info(f'⏰ Total time to create all z-level geotiffs and web tiles: '
                 f'{end_time - start_time}')

# ----------------------------------------------------------------

# Define the parsl functions used in the workflow:

@python_app
def stage(paths, config):
    """
    Stage a file
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgstaging

    # configure logger:
    logger = logging.getLogger("logger")
    # Remove any existing handlers from the logger
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    # prevent logging statements from being printed to terminal
    logger.propagate = False
    # set up new handler
    handler = logging.FileHandler("/tmp/log.log")
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    stager = pdgstaging.TileStager(config)
    for path in paths:
        stager.stage(path)
    return True

# Create highest z-level geotiffs from staged files
@python_app
def create_highest_geotiffs(staged_paths, config):
    """
    Create a batch of geotiffs from staged files
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster

    # configure logger:
    logger = logging.getLogger("logger")
    # Remove any existing handlers from the logger
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    # prevent logging statements from being printed to terminal
    logger.propagate = False
    # set up new handler
    handler = logging.FileHandler("/tmp/log.log")
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # rasterize the vectors, highest z-level only
    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.rasterize_vectors(
        staged_paths, make_parents = False)
    # no need to update ranges because the val_range is set in the config

# ----------------------------------------------------------------

# Create composite geotiffs from highest z-level geotiffs 
@python_app
def create_composite_geotiffs(tiles, config):
    """
    Create a batch of composite geotiffs from highest geotiffs
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster

    # configure logger:
    logger = logging.getLogger("logger")
    # Remove any existing handlers from the logger
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    # prevent logging statements from being printed to terminal
    logger.propagate = False
    # set up new handler
    handler = logging.FileHandler("/tmp/log.log")
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.parent_geotiffs_from_children(
        tiles, recursive = False)

# ----------------------------------------------------------------

# Create a batch of webtiles from geotiffs
@python_app
def create_web_tiles(geotiff_paths, config):
    """
    Create a batch of webtiles from geotiffs
    """

    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster

    # configure logger:
    logger = logging.getLogger("logger")
    # Remove any existing handlers from the logger
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    # prevent logging statements from being printed to terminal
    logger.propagate = False
    # set up new handler
    handler = logging.FileHandler("/tmp/log.log")
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.webtiles_from_geotiffs(
        geotiff_paths, update_ranges = False)
        # no need to update ranges because the val_range is
        # defined in the config for all z-levels


def make_batch(items, batch_size):
    """
    Create batches of a given size from a list of items.
    """
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# ----------------------------------------------------------------

# run the workflow
config_file = '/home/jcohen/lake_change_time_series/parsl_workflow/config.json'
logger.info(f'🗂 Workflow configuration loaded from {config_file}')
print("Loaded config. Running workflow.")
logger.info(f'Starting PDG workflow: staging, rasterization, and web tiling')
run_pdg_workflow(config_file)
# Shutdown and clear the parsl executor
htex_local.executors[0].shutdown()
parsl.clear()

# transfer log from /tmp to user dir
cmd = ['mv', '/tmp/log.log', f'/home/{user}/lake_change_time_series/parsl_workflow/']
# initiate the process to run that command
process = Popen(cmd)

print("Script complete.")

@julietcohen

Producing annual layers for static lake size: permanent_water, 2017-2021

Anna requested that, in anticipation of the Google fellows starting on the team in January, our top priority should be to create annual layers for lake size, for at least permanent_water and potentially also seasonal_water. Ingmar suggested that this be done for the past 5 years of data, because those years have the best quality observations and the fewest NA values for these attributes. I merged the data just as before, plus a step to remove NA values from these attributes that was not present before. I then subset the merged data (all transects) into 5 annual files.

At our meeting, Ingmar helpfully suggested using a mathematical way to distribute the color palette so that all colors are represented, even with skewed data. The way we have done it previously was to plot the input data attribute values in a binned histogram and set the range in the config to the min and max values that encompass the vast majority of the data; outliers are then assigned the same color as the min or max value (depending on which side of the range they fall on), which makes the data much easier to interpret on a map. To introduce a more mathematical approach, I calculated the 95th percentile for each of the annual files (a sketch of the calculation follows the values below). The values are similar for each attribute across the 5 years.

annual percentiles
2017 permanent water 95th percentile: 48.29246735806339
2017 seasonal water 95th percentile: 2.6606105447148294

2018 permanent water 95th percentile: 48.476413012662455
2018 seasonal water 95th percentile: 2.471091755819858

2019 permanent water 95th percentile: 48.25663980988141
2019 seasonal water 95th percentile: 2.6432749690066695

2020 permanent water 95th percentile: 47.909318479470706
2020 seasonal water 95th percentile: 3.0058684271837874

2021 permanent water 95th percentile: 47.84887662863641
2021 seasonal water 95th percentile: 2.8591087319313626
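A sketch of how the NA removal, annual subsetting, and percentiles above could be computed, assuming the merged data carries the year, permanent_water, and seasonal_water columns from the merge script earlier in this issue (file names are hypothetical):

# Sketch: drop NA rows, write one file per year, and report 95th percentiles
import geopandas as gpd

merged = gpd.read_file("merged_lakes_area_all_transects.gpkg")  # hypothetical path
merged = merged.dropna(subset=["permanent_water", "seasonal_water"])

for year in range(2017, 2022):
    annual = merged[merged["year"] == year]
    annual.to_file(f"lakes_{year}.gpkg", driver="GPKG")
    print(year, "permanent water 95th percentile:", annual["permanent_water"].quantile(0.95))
    print(year, "seasonal water 95th percentile:", annual["seasonal_water"].quantile(0.95))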

@julietcohen

julietcohen commented Dec 13, 2023

First version of the 2017 lake time series data

There are 2 new layers: one is "permanent water" and the other is "seasonal water". Units are hectares.

  • recall that this dataset has 4 transect regions, meaning the data is distributed across several noncontiguous regions of the Arctic. This is important because the "zoom to entire dataset" button is not as helpful for these datasets as it is for smaller datasets or datasets with adjacent regions. I did play around with specifying lower zoom-level tile bounds for the zoom feature. When I specified any one of the several regions with polygons, the zoom-to feature works great in the sense that it does zoom the user to that region; unfortunately, it makes the tiles in the other regions invisible. Luckily, zooming into the portal and panning around a little allows you to spot some of the bigger blue lakes pretty easily, and you can zoom into each region individually. Please let me know if I can clarify this issue; I certainly want to resolve this "zoom to entire dataset" issue.
  • if you're having trouble viewing the tiles, please clear your browsing data and reload the page
  • for Ingmar: I did not use deduplication because of the lack of overlap I observed when plotting the data. Please correct me if I misinterpreted this, and I actually should use deduplication.
  • the same blue palette was used for both statistics, but the numerical ranges of the data are different, since generally the permanent water values are larger than the seasonal water values. Should these 2 layers have totally different palettes so users do not have to hover over the legend to understand that the ranges are different?
  • as we get into the more recent years, I expect that more polygons will be present
image

@initze

initze commented Dec 14, 2023

@julietcohen
Could you add the link to the dev version? I just lost it

for Ingmar: I did not use deduplication because of the lack of overlap I observed when plotting the data. Please correct me if I misinterpreted this, and I actually should use deduplication.

In your dataset, the four regions are separate from each other. Technically they were merged and deduplicated in a previous step. So here there should be no need for that.

the same blue palette was used for both statistics, but the numerical ranges of the data are different, since generally the permanent water values are larger than the seasonal water values

Maybe we can increase the max val (upper limit) of the visualization. For my taste the differences are not super great.

Some cool idea

we could calculate the diff of each year versus the previous (or some kind of aggregated number of years), e.g. 2019-2018. Then we should be able to visualize major changes such as lake drainage 😄

@julietcohen

@initze Thank you for your feedback. Here is the link to the PDG demo portal: https://demo.arcticdata.io/portals/permafrost

I'll increase the upper limit of the range in my config file and see if that helps create more diversity in colors associated with polygons of different sizes. For example, I could increase the quantile that I am using to set the max value. I was using the 95th quantile, which resulted in a max range value for permanent water of ~49. Increasing to the 99.99th quantile would be a max value of ~6,088.

Calculating the diff of each year would be very interesting! I do not think it would be difficult either, since I have already produced the annual files. I can ensure that each file has the same geometries, and we can take the difference in the variables of interest for each geometry (a rough sketch below). I can work on that after I visualize the 5 years independently.
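A rough sketch of that year-over-year difference, assuming each annual file has the same ID_merged values and geometries (file names are hypothetical):

# Sketch: difference of lake area attributes between two consecutive years
import geopandas as gpd

y1 = gpd.read_file("lakes_2018.gpkg").set_index("ID_merged")  # hypothetical annual files
y2 = gpd.read_file("lakes_2019.gpkg").set_index("ID_merged")

diff = y2[["geometry"]].copy()
diff["permanent_water_diff"] = y2["permanent_water"] - y1["permanent_water"]
diff["seasonal_water_diff"] = y2["seasonal_water"] - y1["seasonal_water"]

# large negative permanent_water_diff values would flag candidate lake drainages
diff.to_file("lakes_diff_2019_minus_2018.gpkg", driver="GPKG")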

Also, I think maybe a purple palette would be nice for the seasonal_water to contrast the blue that we can keep for the permanent_water. Just considering that the IWP layer is yellow, and the Arctic communities layer is green. Using red for any layer that represents a static attribute such as permanent or seasonal water seems like it would imply decreasing, so we should reserve red for the end of the spectrum for a variable that represents loss of size.

@julietcohen

Another note: I have increased the maximum z-level for this data to z-12 (just 1 level higher) after consulting the metadata for the TMS, because we want the resolution of this data to be ~30 m (correct me if that's wrong). A back-of-the-envelope check is below.
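This is only rough arithmetic, assuming 256 x 256 pixel tiles for the WGS1984Quad TMS and roughly 111.32 km per degree of latitude; it is not output from the viz workflow:

# Sketch: approximate per-pixel resolution of WGS1984Quad tiles by z-level
TILE_SIZE_PX = 256
DEG_PER_TILE_Z0 = 180.0   # a level-0 tile spans 180 degrees
M_PER_DEG_LAT = 111_320   # approximate meters per degree of latitude

for z in (10, 11, 12, 13):
    deg_per_px = DEG_PER_TILE_Z0 / (2 ** z) / TILE_SIZE_PX
    print(f"z={z}: ~{deg_per_px * M_PER_DEG_LAT:.0f} m per pixel (latitude direction)")

# z=11 works out to ~38 m and z=12 to ~19 m, so z-12 is the first level finer than ~30 m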

@julietcohen

julietcohen commented Dec 18, 2023

2017 Lake Data Update

I created web tiles for 2017 with different palettes for permanent water and seasonal water, and used the 99.99th percentile for the highest value in the palette range to show more diversity of colors in the lower values. These are now on the demo portal.

image image

Polygon color diversity

Regarding the polygon colors: changing the range in the workflow config to use the 99.99th percentile for the max value did succeed in showing more diversity for permanent water. In my opinion, the pink & purple palette for seasonal water does not show enough diversity in the smaller lakes, and I should re-process the web tiles with a different percentile for the max value for that attribute. Maybe @initze has feedback on this. We should also keep in mind that the more tailoring we have to do to find the best range for the config values, the further this workflow gets from being automated. Ideally, we would mathematically determine the best range of values for the config to optimize the color diversity in the polygons without having to guess and check each time. I would appreciate input from anyone who has a good way to achieve this.

Legends

The legends for these 2 layers are accurate, as the max value shown for each layer is indeed the max value for that attribute, not the 99.99th percentile. Additionally, both layers have a min value of 0, and this was the min value in the legend. However, the value range for permanent water is so large that we encounter the same issue we ran into with the Arctic Communities layer: when you hover over the legend, it shows scientific notation, which is accurate but not ideal. This was acceptable for the communities layer, so I assume this is acceptable for this layer, too, for now.

Processing other lake data years

I have moved forward to process 2018-2021, and have already completed 2018 (update: the 2018 geotiff data was corrupted during a directory transfer that was cancelled midway when VS Code lost connection; the geotiffs need to be re-created). However, there is no point in processing the web tiles for these years until we determine the best way to set the range for each attribute, based on our guess & checks for 2017. I will continue to process staged tiles and geotiffs for all years, but it would be very time-consuming to guess and check for the optimal range of values to best represent the polygons.

@julietcohen

julietcohen commented Dec 20, 2023

2017 Lake Data Update

We agreed that the 99.99th percentile for the blue palette for permanent water looks good. Increasing the percentile used for the max value of the range for the config reduced the amount of darkest blue color, and increased the amount of lighter shades in the polygons.

For the pink & purple palette for the seasonal water, we had the opposite problem: too much of the lighter shades, and not enough darker shades, so the approach is to reduce the percentile. I tried 92, but that was too much! This resulted in too many polygons with the darkest purple color. See the same region as in the last comment for comparison:

image

I then created web tiles with the 95th percentile for the highest value in the palette range. There is still too much purple:

image

But importantly, if we zoom in more to that same area pictured above, or in a different area such as the north slope of AK, we can see the 95th percentile works pretty well:

image

Comparing percentiles with the seasonal water data distribution

We can visualize 5 different percentiles against the data distribution here (a sketch of how such a plot can be produced follows the image):

image
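A sketch of how such a plot could be made with matplotlib; the data path and the particular percentiles shown are illustrative:

# Sketch: seasonal_water distribution with candidate percentile cutoffs overlaid
import geopandas as gpd
import matplotlib.pyplot as plt

annual = gpd.read_file("lakes_2017.gpkg")  # hypothetical annual file
vals = annual["seasonal_water"].dropna()

fig, ax = plt.subplots()
ax.hist(vals, bins=100, range=(0, vals.quantile(0.99)))
for p in (0.90, 0.92, 0.95, 0.97, 0.99):
    ax.axvline(vals.quantile(p), linestyle="--", label=f"{int(p * 100)}th percentile")
ax.set_xlabel("seasonal_water (ha)")
ax.set_ylabel("lake count")
ax.legend()
fig.savefig("seasonal_water_percentiles.png", dpi=150)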

@julietcohen

I have simplified and added documentation to the script used to clean, merge, and parse Ingmar's 2 input files into the 5 most recent years (merge_explode_annual.py). I ran it to re-produce the annual input files for the visualization workflow. This script has been uploaded to the new ADC package I created for this data. Now that the input files for the viz workflow have been re-produced, I will start again with 2017, going from staging through web tiling in one sweep, with the max value of the range for permanent water set to the 99.99th percentile and the max value for seasonal water set to the 95th percentile. I will do the same for the following 4 years. Keeping these percentiles the same for all years would be ideal.

@julietcohen

All years 2017-2021 have been processed into staged tiles, geotiffs, and web tiles for permanent water and seasonal water. All processing was done on Delta with the ray workflow. The 2017-2018 files have already completed the transfer to Datateam with the pre-issued DOI A28G8FK10, and they are visualized on the demo portal. Years 2019-2021 have not yet fully transferred; Globus has been experiencing NCSA endpoint authentication issues for the past week, making large transfers take >3 days, and sometimes the transfers fail altogether. So those years will be up on the demo portal when they make it through!

@mbjones moved this to In Progress in Data Layers on Jan 11, 2024
@elongano added the "Water" and "pdg" labels on Jan 30, 2024
@julietcohen changed the title from "lake data visualization and aggregation" to "Add aggregation datasets for lake data: Permafrost Database for Northern Alaska and Circumpolar Thermokarst Landscapes" on Mar 19, 2024
@julietcohen changed the title to "Add aggregation datasets for lake time series data: Permafrost Database for Northern Alaska and Circumpolar Thermokarst Landscapes" on Mar 19, 2024