Display the infrastructure layer #27
I did a test run of this layer. Here are some notes on processing it:
I used the following config options, but we will probably want to create tiles at a higher resolution than z-level 10 (perhaps 13):

{
    "z_range": [0, 10],
    "statistics": [
        {
            "name": "coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "nodata_val": 0,
            "palette": "oryel",
            "nodata_color": "#ffffff00"
        }
    ]
}
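For reference, the config above is saved to a JSON file and passed directly to the viz packages; a minimal sketch, mirroring the full parsl script later in this thread:

import pdgstaging
import pdgraster

# config file containing the options shown above
workflow_config = "/home/jcohen/infrastructure/parsl_workflow/config.json"

stager = pdgstaging.TileStager(workflow_config)      # staging step
rasterizer = pdgraster.RasterTiler(workflow_config)  # rasterization and web tiling steps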
Based on feedback from Annett (below), we should do the following with the next run:
We should also remember to re-create the layer when updated data is available next year. Details:
Annett has produced a new version of this dataset, now located on Datateam at:
Old version of the data is now in
Initial notes:
I was able to process a first draft of outputs with the visualization workflow, but importantly, in order to produce the web tiles I needed to use the same modification to
clean_infrastructure.py
# Arctic infrastructure data from Annett B.
# clean by removing rows with NA geometries
# and split data into smaller files for processing
# in parallel

import geopandas as gpd
import numpy as np

data_path = "/home/jcohen/infrastructure/data/SACHI_v2.shp"
data = gpd.read_file(data_path)

# remove NA values from the geometry column
data_clean = data[data['geometry'].notna()]

# split the cleaned data into 6 subfiles,
# each with an approximately equal number of rows
split_gdfs = np.array_split(data_clean, 6)

for i, split_gdf in enumerate(split_gdfs):
    split_gdf.reset_index(inplace = True)
    # create the filename with a number ranging from 1 to 6
    filename = f'/home/jcohen/infrastructure/data_cleaned_split/SACHI_v2_clean_{i + 1}.gpkg'
    split_gdf.to_file(filename, driver = "GPKG", index = True)
    print(f'Saved {filename}')
print("Script complete.") parsl script# Processing the infrastructure data from Annett
# see issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/27
# staging through web tiling with parsl
# conda environment: viz-local, with local installs
# for viz-staging and viz-raster, but with modified viz-raster
# because need to_image() fix (see PR),
# and pip install for parsl==2023.11.27
import pdgstaging
import pdgraster
from datetime import datetime
import json
import logging
import logging.handlers
from pdgstaging import logging_config
import os
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider
import shutil
import subprocess
from subprocess import Popen
user = subprocess.check_output("whoami").strip().decode("ascii")
# start with a fresh directory!
print("Removing old directories and files...")
base_dir = "/home/jcohen/infrastructure/parsl_workflow/"
old_filepaths = [f"{base_dir}staging_summary.csv",
                 f"{base_dir}raster_summary.csv",
                 f"{base_dir}raster_events.csv",
                 f"{base_dir}config__updated.json",
                 f"{base_dir}log.log"]
for old_file in old_filepaths:
    if os.path.exists(old_file):
        os.remove(old_file)

# remove dirs from past run
old_dirs = [f"{base_dir}staged",
            f"{base_dir}geotiff",
            f"{base_dir}web_tiles",
            f"{base_dir}runinfo"]
for old_dir in old_dirs:
    if os.path.exists(old_dir) and os.path.isdir(old_dir):
        shutil.rmtree(old_dir)
activate_conda = 'source /home/jcohen/.bashrc; conda activate viz-local'
htex_local = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            worker_debug=False,
            # cores_per_worker=1,
            max_workers=6,
            provider=LocalProvider(
                # channel=LocalChannel(),
                init_blocks=1,
                max_blocks=1,
                worker_init=activate_conda
            ),
        )
    ],
)
parsl.clear()
parsl.load(htex_local)
def run_pdg_workflow(
    workflow_config,
    batch_size = 300
):
    """
    Run the main PDG workflow for the following steps:
    1. staging
    2. raster highest
    3. raster lower
    4. web tiling

    Parameters
    ----------
    workflow_config : dict
        Configuration for the PDG staging workflow, tailored to rasterization and
        web tiling steps only.
    batch_size : int
        How many staged files, geotiffs, or web tiles should be included in a
        single creation task? (each task is run in parallel) Default: 300
    """
    start_time = datetime.now()

    logging.info("Staging initiated.")
    stager = pdgstaging.TileStager(workflow_config)
    tile_manager = stager.tiles
    config_manager = stager.config
    input_paths = stager.tiles.get_filenames_from_dir('input')
    input_batches = make_batch(input_paths, batch_size)

    # Stage all the input files (each batch in parallel)
    app_futures = []
    for i, batch in enumerate(input_batches):
        app_future = stage(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(input_batches)}')

    # Don't continue to next step until all files have been staged
    [a.result() for a in app_futures]
    logging.info("Staging complete.")

    # ----------------------------------------------------------------
    # Create highest geotiffs
    rasterizer = pdgraster.RasterTiler(workflow_config)

    # Process staged files in batches
    logging.info(f'Collecting staged file paths to process...')
    staged_paths = tile_manager.get_filenames_from_dir('staged')
    logging.info(f'Found {len(staged_paths)} staged files to process.')
    staged_batches = make_batch(staged_paths, batch_size)
    logging.info(f'Processing staged files in {len(staged_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(staged_batches):
        app_future = create_highest_geotiffs(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(staged_batches)}')

    # Don't move on to next step until all geotiffs have been created
    [a.result() for a in app_futures]
    logging.info("Rasterization highest complete. Rasterizing lower z-levels.")

    # ----------------------------------------------------------------
    # Rasterize composite geotiffs
    min_z = config_manager.get_min_z()
    max_z = config_manager.get_max_z()
    parent_zs = range(max_z - 1, min_z - 1, -1)

    # Can't start lower z-level until higher z-level is complete.
    for z in parent_zs:

        # Determine which tiles we need to make for the next z-level based on the
        # path names of the geotiffs just created
        logging.info(f'Collecting highest geotiff paths to process...')
        child_paths = tile_manager.get_filenames_from_dir('geotiff', z = z + 1)
        logging.info(f'Found {len(child_paths)} highest geotiffs to process.')

        # create empty set for the following loop
        parent_tiles = set()
        for child_path in child_paths:
            parent_tile = tile_manager.get_parent_tile(child_path)
            parent_tiles.add(parent_tile)
        # convert the set into a list
        parent_tiles = list(parent_tiles)

        # Break all parent tiles at level z into batches
        parent_tile_batches = make_batch(parent_tiles, batch_size)
        logging.info(f'Processing highest geotiffs in {len(parent_tile_batches)} batches.')

        # Make the next level of parent tiles
        app_futures = []
        for parent_tile_batch in parent_tile_batches:
            app_future = create_composite_geotiffs(
                parent_tile_batch, workflow_config)
            app_futures.append(app_future)

        # Don't start the next z-level, and don't move to web tiling, until the
        # current z-level is complete
        [a.result() for a in app_futures]

    logging.info("Composite rasterization complete. Creating web tiles.")

    # ----------------------------------------------------------------
    # Process web tiles in batches
    logging.info(f'Collecting file paths of geotiffs to process...')
    geotiff_paths = tile_manager.get_filenames_from_dir('geotiff')
    logging.info(f'Found {len(geotiff_paths)} geotiffs to process.')
    geotiff_batches = make_batch(geotiff_paths, batch_size)
    logging.info(f'Processing geotiffs in {len(geotiff_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(geotiff_batches):
        app_future = create_web_tiles(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(geotiff_batches)}')

    # Don't record end time until all web tiles have been created
    [a.result() for a in app_futures]

    end_time = datetime.now()
    logging.info(f'⏰ Total time to create all z-level geotiffs and web tiles: '
                 f'{end_time - start_time}')
# ----------------------------------------------------------------
# Define the parsl functions used in the workflow:
@python_app
def stage(paths, config):
    """
    Stage a batch of input files
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgstaging
    from pdgstaging import logging_config

    stager = pdgstaging.TileStager(config = config, check_footprints = False)
    for path in paths:
        stager.stage(path)
    return True
# Create highest z-level geotiffs from staged files
@python_app
def create_highest_geotiffs(staged_paths, config):
    """
    Create a batch of geotiffs from staged files
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    # rasterize the vectors, highest z-level only
    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.rasterize_vectors(
        staged_paths, make_parents = False)
    # no need to update ranges because manually set val_range in config
# ----------------------------------------------------------------
# Create composite geotiffs from highest z-level geotiffs
@python_app
def create_composite_geotiffs(tiles, config):
    """
    Create a batch of composite geotiffs from highest geotiffs
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.parent_geotiffs_from_children(
        tiles, recursive = False)
# ----------------------------------------------------------------
# Create a batch of webtiles from geotiffs
@python_app
def create_web_tiles(geotiff_paths, config):
    """
    Create a batch of webtiles from geotiffs
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.webtiles_from_geotiffs(
        geotiff_paths, update_ranges = False)
    # no need to update ranges because val_range is
    # defined in the config

def make_batch(items, batch_size):
    """
    Create batches of a given size from a list of items.
    """
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
# ----------------------------------------------------------------
# run the workflow
workflow_config = '/home/jcohen/infrastructure/parsl_workflow/config.json'
print("Loaded config. Running workflow.")
run_pdg_workflow(workflow_config)
# Shutdown and clear the parsl executor
htex_local.executors[0].shutdown()
parsl.clear()
# transfer log from /tmp to user dir
cmd = ['mv', '/tmp/log.log', f'/home/{user}/infrastructure/parsl_workflow/']
# initiate the process to run that command
process = Popen(cmd)
print("Script complete.") config
|
Category: Infrastructure
Annett has provided the README for this dataset. It provides a code for each infrastructure type and is located in
The dataset package on the ADC: https://arcticdata.io/catalog/view/urn%3Auuid%3A4e1ea0af-6f7c-4a7a-b69f-4e818e113c43
The final dataset has been processed and is archived on datateam:
Justin from the ADC has helpfully taken over the rest of the metadata documentation for this dataset, starting this week.
Annett has requested that I change the palette of this dataset so that buildings are in red and water is in blue. I also noticed that there were MultiPolygons (only 2% of the geometries) in the input data from Annett. I re-cleaned this input data (still removing NA geometries like last time, with the additional step of exploding those MultiPolygon geometries, sketched below) and will re-process the data with the viz workflow, applying the requested palette change as well. This is not a time-consuming operation on Delta. This new cleaning script will replace the old one that is uploaded to the data package, and the staged and geotiff tilesets will replace the ones currently in the pre-assigned DOI dir. I already discussed this with Justin. For convenience, I am pasting the values for the infrastructure code here (they are from the README):
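As for the additional cleaning step above, here is a minimal sketch of exploding the MultiPolygon geometries with geopandas (assuming a recent geopandas release; the output path is hypothetical, the input path is the one from the original cleaning script):

import geopandas as gpd

data = gpd.read_file("/home/jcohen/infrastructure/data/SACHI_v2.shp")
# remove NA values from the geometry column, as in the original cleaning script
data_clean = data[data['geometry'].notna()]
# additionally, explode MultiPolygons (~2% of geometries) into single Polygon rows
data_clean = data_clean.explode(index_parts=False).reset_index(drop=True)
# hypothetical output path
data_clean.to_file("/home/jcohen/infrastructure/data_cleaned/SACHI_v2_clean.gpkg", driver="GPKG")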
Annett also mentioned that she is going to update the dataset on Zenodo with the new version this week, and it already has a DOI there. Justin let her know we will still use our pre-assigned DOI since it is a new dataset, and I went into detail about our derived products. We will mention the Zenodo DOI in the dataset documentation on the ADC.
Delta decided that Python doesn't exist anymore in any of my environments, so I have been troubleshooting that for the past hour. The next step would be to uninstall and reinstall VS Code, and to remove the known hosts entry for Delta. I already tried uninstalling and reinstalling all Python extensions.
Annett has updated the dataset based on my feedback regarding the geometries. She uploaded the new version of SACHI_v2 to Zenodo. She included the following info:
Justin and I made it clear that we would be giving the dataset a new DOI (the one I pre-created for this dataset) because our ADC version of the data package differs from the one she has on Zenodo, considering all our derived products. Since she made changes to SACHI_v2, I will upload the new version to Datateam and reprocess the dataset with the viz workflow.
I reprocessed the infrastructure dataset with Annett's new version of the SACHI_v2 data as input to the viz workflow. Since she fixed the geometries, I only had to split the multipolygons before inputting the data into the viz workflow. I also updated the palette to represent buildings in red and water in blue, as Annett requested, and removed yellow from the palette since we visualize the ice-wedge polygon layer in yellow and we anticipate that users will explore these layers together. All input and output files and the new pre-viz cleaning script have been uploaded to
I was able to refine the assignment of the categorical palette so there is just 1 color attributed to 1 infrastructure type. I did this in a similar way to what I did for the permafrost and ground ice dataset. Instead of using the DN attribute directly, I added a new attribute that codes the categorical DN values into evenly spaced numbers:

# add attribute that codes the categorical DN attribute
# into evenly spaced numbers in order to assign
# palette correctly to the categories in the web tiles:
conditions = [
    (data['DN'] == 11),
    (data['DN'] == 12),
    (data['DN'] == 13),
    (data['DN'] == 20),
    (data['DN'] == 30),
    (data['DN'] == 40),
    (data['DN'] == 50)
]
choices = [1, 2, 3, 4, 5, 6, 7]
data['palette_code'] = np.select(conditions, choices)

As a result, I see more diversity in the colors of the geometries, as Annett suggested they should look, and I integrated grey for the 30 value of DN. Taking this approach, I realize that in order for Annett's metadata for the
One note is that I have only tried this dataset with a custom palette with each of the 7 hex codes specifically assigned, rather than a pre-made palette with a name, because we wanted to include certain colors in a specific order and exclude certain colors that would too closely match other datasets on the PDG.
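For reference, here is a minimal sketch of what that custom categorical palette could look like; the hex values are placeholders rather than the final colors, and I am assuming the "palette" field of the statistics config accepts an explicit list of hex codes in place of a named palette:

# hypothetical example: one hex color per palette_code value (1-7),
# ordered to match the DN codes 11, 12, 13, 20, 30, 40, and 50 above
custom_palette = [
    "#d73027",  # palette_code 1 (DN 11)
    "#f46d43",  # palette_code 2 (DN 12)
    "#a50026",  # palette_code 3 (DN 13)
    "#66bd63",  # palette_code 4 (DN 20)
    "#808080",  # palette_code 5 (DN 30), grey as noted above
    "#1a9850",  # palette_code 6 (DN 40)
    "#2166ac",  # palette_code 7 (DN 50)
]
# this list would replace a named palette (like "oryel" in the config at the
# top of this thread) in the "palette" entry of the statistics config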
One more consideration for this dataset: the resolution of this data is not clear on the Zenodo page (this new Zenodo link is important because it points to version 2 of the dataset that Annett updated, not version 1, which was originally linked above when this ticket was first created). Sentinel data resolution can vary depending on the bands used. One way to find this may be to dive deeper into the methods, such as a paper associated with this dataset, or to ask Annett. I have been using z-level 12 as the max, because that is ~32 m resolution at latitudes around ±31°.
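As a sanity check on that number, here is a minimal sketch of the approximate ground resolution per pixel, assuming a standard Web Mercator tiling scheme with 256-pixel tiles (the PDG tiling scheme may differ, so treat this as an approximation):

import math

def meters_per_pixel(zoom, latitude_deg, tile_size = 256):
    # approximate ground resolution of one web-map tile pixel
    earth_circumference = 40075016.686  # meters at the equator (WGS84)
    return earth_circumference * math.cos(math.radians(latitude_deg)) / (tile_size * 2 ** zoom)

print(round(meters_per_pixel(12, 31), 1))  # ~32.8 m per pixel at z-level 12, latitude 31°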
Annett has approved the layer to be moved to production. The remaining to-do items, in order:
@julietcohen In reading through your correspondence on this ticket, I saw that it originated as a vector dataset. The most accurate way for us to present it would be as a vector layer rather than raster. Let's please discuss this, as I suspect it would provide a much better result. We have a number of vector layers in the portals now, mainly via a GeoJSON conversion. I'm not sure if the dataset size would be prohibitive there, but let's discuss please.
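For context, a minimal sketch of the kind of GeoJSON conversion mentioned above, using geopandas (the file paths here are hypothetical):

import geopandas as gpd

# hypothetical paths, for illustration only
data = gpd.read_file("/home/pdg/data/bartsch_infrastructure/SACHI_v2.shp")
# reproject to WGS84 longitude/latitude, which web clients generally expect for GeoJSON
data = data.to_crs(epsg=4326)
data.to_file("SACHI_v2.geojson", driver="GeoJSON")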
datateam.nceas.ucsb.edu:/home/pdg/data/bartsch_infrastructure
The data is associated with the following two papers:
The data are archived with Restricted Access on Zenodo