
Apache OCW overview


The Apache Open Climate Workbench toolkit aims to provide a suite of tools to make climate scientists' lives easier. It does this by providing tools for loading and manipulating datasets, running evaluations, and plotting results. Below follows a quick overview of the OCW API, as well as a quick analysis of the usability of OCW in the light of ECT.

Common Data Abstraction

The equivalent of CDM in OCW is a dataset.Dataset object that consists of a three-dimensional numpy.ndarray containing the values, together with three numpy.ndarray vectors that define the dimensions - lat, lon and time. There are also some miscellaneous attributes, such as the variable name, unit, etc. The Dataset object itself has methods that return the spatial and temporal extents of the dataset, as well as its spatial and temporal resolutions.
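For illustration, a minimal sketch of constructing such a Dataset might look as follows (the variable name 'tas' and the exact accessor method names are assumptions and may differ between OCW versions):

```python
import datetime as dt
import numpy as np
from ocw.dataset import Dataset

# Dimension vectors: lat, lon and time
lats = np.arange(-60.0, 60.0, 1.0)
lons = np.arange(-170.0, 170.0, 1.0)
times = np.array([dt.datetime(2000, month, 1) for month in range(1, 13)])

# Values as a 3D array with shape (time, lat, lon)
values = np.zeros((len(times), len(lats), len(lons)))

ds = Dataset(lats, lons, times, values, variable='tas', units='K', name='example')

# Extent and resolution accessors (method names assumed; check the Dataset
# class for the exact spelling in a given OCW version)
print(ds.spatial_boundaries())
print(ds.temporal_resolution())
```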

Data Sources

There are multiple data loaders that can be used to create a dataset.Dataset object. Sources can be local netCDF files, as well as remote netCDF files accessible through an OpenDAP URL. In general, one netCDF file creates one dataset.Dataset. There is a method for loading multiple netCDF files at once, but this merely returns an array of Datasets. When loading a single netCDF file, the loader can optionally be given the netCDF variable names of the lat, lon and time variables.

There are other loaders for more specific data sources, and the documentation mentions that it is relatively easy to create a custom loader.
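A hedged sketch of how loading might look with the local and OpenDAP loaders (the file path, URL and variable name are placeholders; module and function names are assumed from the OCW data_source package and may differ between versions):

```python
import ocw.data_source.local as local
import ocw.data_source.dap as dap

# Load one variable from a single local netCDF file into a dataset.Dataset
ds_local = local.load_file('/path/to/model_output.nc', 'tas')

# Load the same kind of object from a remote file behind an OpenDAP URL
ds_remote = dap.load('http://example.org/opendap/obs.nc', 'tas')
```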

Dataset manipulations (processors)

The dataset_processor module contains processors that manipulate datasets and return a new dataset as the result of the manipulation; the original datasets stay intact. The included manipulations cover unit conversions, spatial and temporal subsetting, as well as spatial regridding and temporal rebinning. There are no Dataset manipulations that would change the projection of the dataset.
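A sketch of chaining two of these processors, assuming the regridding and rebinning functions described above (spatial/temporal subsetting via a Bounds object works similarly; argument types have changed between OCW releases):

```python
import numpy as np
import ocw.dataset_processor as dsp

# Regrid onto a coarser 2-degree grid; a new Dataset is returned and the
# original 'ds' (e.g. the Dataset from the loading example) stays intact
new_lats = np.arange(-60.0, 60.0, 2.0)
new_lons = np.arange(-170.0, 170.0, 2.0)
regridded = dsp.spatial_regrid(ds, new_lats, new_lons)

# Rebin the time axis to monthly means (newer OCW versions accept a string
# such as 'monthly'; older ones expect a datetime.timedelta)
monthly = dsp.temporal_rebin(regridded, 'monthly')
```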

Dataset metrics

Processors that don't produce a new dataset are grouped into the metrics module. There are two base classes, UnaryMetric and BinaryMetric, that serve as base implementations for unary metrics working on a single dataset and binary metrics working on two datasets, respectively. The only unary metric is Temporal Standard Deviation, which returns an ndarray - presumably a 2D array giving the temporal standard deviation for each lat/lon point.

Binary metrics are more abundant. There are metric processors for the following calculations between two datasets (a sketch of a custom binary metric follows the list):

  • Bias - Calculates the difference.
  • PatternCorrelation - Calculates the correlation coefficient.
  • RMSError - Calculates the RMS error, with the mean calculated over time and space.
  • SpatialPatternTaylorDiagram - Calculates the standard deviation ratio and the pattern correlation coefficient for use in plotting a Taylor diagram.
  • StdDevRatio - Calculates the standard deviation ratio.
  • TemporalCorrelation - Calculates temporal correlation coefficients and confidence levels using Pearson's correlation. Returns two 2D arrays that contain these values for each lat/lon point.
  • TemporalMeanBias - Calculates the mean bias of the two datasets over time.
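As an illustration of the base classes, a minimal custom binary metric might look roughly like this (a sketch assuming the BinaryMetric.run(ref_dataset, target_dataset) interface; MeanAbsoluteError is a hypothetical metric, not part of OCW):

```python
import numpy as np
from ocw.metrics import BinaryMetric

class MeanAbsoluteError(BinaryMetric):
    """Hypothetical binary metric: mean absolute difference of two datasets."""

    def run(self, ref_dataset, target_dataset):
        # Assumes both datasets share the same (time, lat, lon) grid,
        # e.g. after spatial_regrid and temporal_rebin
        return np.mean(np.abs(target_dataset.values - ref_dataset.values))
```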

Evaluations

Evaluations make it easier to run a set of metrics over a set of datasets: an Evaluation is given a reference dataset, one or more target datasets and a list of metrics, and running it applies the metrics to the datasets and collects the results.
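A hedged sketch of how such an evaluation might be assembled (ref_ds and model_ds stand for previously loaded datasets; class and attribute names are taken from the ocw.evaluation module as I understand it and may differ):

```python
from ocw.evaluation import Evaluation
import ocw.metrics as metrics

# ref_ds and model_ds are Dataset objects loaded/regridded earlier (placeholders)
evaluation = Evaluation(ref_ds, [model_ds],
                        [metrics.Bias(), metrics.TemporalStdDev()])
evaluation.run()

print(evaluation.results)        # binary metric results, per target dataset
print(evaluation.unary_results)  # unary metric results
```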

Plotting

There is some built-in plotting functionality.
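A minimal sketch, assuming an ocw.plotter module with a draw_contour_map function (function name and argument order are assumptions; bias_result stands for the 2D output of a Bias metric):

```python
import ocw.plotter as plotter

# Draw a 2D lat/lon field (e.g. the result of the Bias metric) to an image
# file; arguments are assumed: field, lats, lons, output file name
plotter.draw_contour_map(bias_result, ref_ds.lats, ref_ds.lons, 'bias_map')
```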

Analysis in the light of ECT

The data object might be too limited for the intended uses of ECT, as it is constrained to a lat/lon/time representation. There is no implementation of a data loader that would be capable of joining data from multiple netCDF files into a single dataset object. The dataset.Dataset implementation relies on pure NumPy, meaning that it probably cannot handle out-of-memory datasets. There do not seem to be any processors intended for joining multiple datasets together, such as temporal/spatial concatenation or aggregation. It is good that there are processors for temporal rebinning and spatial regridding. There are no capabilities for working with multiple different projections, apart from plotting the lat/lon grid contained in the Dataset in different projections using Matplotlib's Basemap package.

The idea of separating processors that produce new datasets from processors that merely calculate different metrics is a good one. Quite a few of the metrics we would need are already implemented. From this point of view, it makes sense to put some effort into getting OCW to run, even if only in a standalone environment, in order to get inspiration with respect to how the different metrics are implemented there.

In general, OCW seems a lot like what we want to accomplish with ECT. If the Dataset implementation could handle out-of-memory datasets and we were fine with using a rigid 3D lat/lon/time dataset object, it could even make sense to just fork the project and continue from where they are now. Otherwise, this project makes me slightly cautious: it has taken them two years to create something that is simpler than what we have in mind and does less.

EDIT: OK, it's not that scary actually, there's really not much code in the whole OCW. I assume Open Source projects can take time to grow.
