diff --git a/PM_05_Accelerating_Lagrangian_analyses_of_oceanic_data_benchmarking_typical_workflows.ipynb b/PM_05_Accelerating_Lagrangian_analyses_of_oceanic_data_benchmarking_typical_workflows.ipynb index a1e5f9c..b05a0c8 100644 --- a/PM_05_Accelerating_Lagrangian_analyses_of_oceanic_data_benchmarking_typical_workflows.ipynb +++ b/PM_05_Accelerating_Lagrangian_analyses_of_oceanic_data_benchmarking_typical_workflows.ipynb @@ -47,7 +47,7 @@ " \n", @@ -189,6 +189,7 @@ " \n", " \n", " \n", + " \n", "
  • \n", " \n", " 6. Pandas\n", @@ -277,24 +278,24 @@ "\n", "For data, *Lagrangian* refers to oceanic and atmospheric information acquired by observing platforms drifting with the flow they are embedded within, but also refers more broadly to the data originating from uncrewed platforms, vehicles, and animals that gather data along their unrestricted and often complex paths. Because such paths traverse both spatial and temporal dimensions, Lagrangian data often convolve spatial and temporal information that cannot always readily be organized in common data structures and stored in standard file formats with the help of common libraries and standards. As such, for both originators and users, Lagrangian data present challenges that the [EarthCube CloudDrift](https://github.com/Cloud-Drift) project aims to overcome.\n", "\n", - "This notebook consists of systematic evaluations and comparisons of workflows for Lagrangian data, using as a basis the velocity and sea surface temperature datasets emanating from the drifting buoys of the [Global Drifter Program](https://www.aoml.noaa.gov/phod/gdp/) (GDP). Specifically, we consider the interplay between diverse storage file formats ([NetCDF](https://www.unidata.ucar.edu/software/netcdf/), [Parquet](https://github.com/apache/parquet-format)) and the data structure associated with common existing libraries in Python ([xarray](https://docs.xarray.dev/en/stable/), [pandas](https://pandas.pydata.org), and [Awkward Array](https://awkward-array.org/quickstart.html)) in order to test their adequacies for performing three common Lagrangian tasks:\n", + "This notebook consists of systematic evaluations and comparisons of workflows for Lagrangian data, using as a basis the velocity and sea surface temperature datasets emanating from the drifting buoys of the [Global Drifter Program](https://www.aoml.noaa.gov/phod/gdp/) (GDP). 
Specifically, we consider the interplay between diverse storage file formats ([NetCDF](https://www.unidata.ucar.edu/software/netcdf/), [Parquet](https://github.com/apache/parquet-format)) and the data structures associated with common existing libraries in *Python* ([xarray](https://docs.xarray.dev/en/stable/), [pandas](https://pandas.pydata.org), and [Awkward Array](https://awkward-array.org/quickstart.html)) in order to test their adequacy for performing three common Lagrangian tasks:\n", "\n", - "1. binning of a variable on an Eulerian grid (e.g. mean temperature map),\n", + "1. binning of a variable on a spatially fixed grid (e.g. mean temperature map),\n", "2. extracting data within given geographical and/or temporal windows (e.g. Gulf of Mexico),\n", "3. analyses per trajectory (e.g. single statistics, spectral estimation by Fast Fourier Transform).\n", "\n", - "Since the *CloudDrift* project aims at accelerating the use of Lagrangian data for atmospheric, oceanic, and climate sciences, we hope that the users of this notebook will provide us with feedback on its ease of use and the intuitiveness of the proposed methods in order to guide the on-going development of the *clouddrift* python package.\n", + "Since the *CloudDrift* project aims at accelerating the use of Lagrangian data for atmospheric, oceanic, and climate sciences, we hope that the users of this notebook will provide us with feedback on its ease of use and the intuitiveness of the proposed methods in order to guide the on-going development of the *clouddrift* *Python* package.\n", "\n", "## Technical contributions\n", "\n", "- Description of some challenges arising from the analysis of large, heterogeneous Lagrangian datasets.\n", - "- Description of some data formats for Lagrangian analysis with python.\n", - "- Comparison of performances of established and developing python packages and libraries.\n", + "- Description of some data formats for Lagrangian analysis with *Python*.\n", + 
"- Comparison of performances of established and developing *Python* packages and libraries.\n", "\n", "## Methodology\n", "\n", "The notebook proceeds in three steps:\n", - "1. First, we download a subset of the hourly dataset of the GDP. Specifically, we access the version 2.00 (beta) of the dataset that consists of a collection of 17,324 NetCDF files available from a FTP server of the GDP. Alternative methods to download these data are described on the website of the [GDP DAC at NOAA AOML](https://www.aoml.noaa.gov/phod/gdp/hourly_data.php) and includes a newly-formed collection from the NOAA National Centers for Environmental Information with [doi:10.25921/x46c-3620](https://doi.org/10.25921/x46c-3620). We download a subset (which size can be scaled up or down) then proceed to aggregate the data from the individual files in one single file using a suggested format (the contiguous ragged array).\n", + "1. First, we download a subset of the hourly dataset of the GDP. Specifically, we access version 2.00 (beta) of the dataset that consists of a collection of 17,324 NetCDF files, one for each drifting buoy, available from an HTTPS (or FTP) [server](https://www.aoml.noaa.gov/ftp/pub/phod/lumpkin/hourly/v2.00/netcdf/) of the GDP. Alternative methods to download these data are described on the website of the [GDP DAC at NOAA AOML](https://www.aoml.noaa.gov/phod/gdp/hourly_data.php) and include a newly-formed collection from the NOAA National Centers for Environmental Information with [doi:10.25921/x46c-3620](https://doi.org/10.25921/x46c-3620). We download a subset (whose size can be scaled up or down) and then proceed to aggregate the data from the individual files into one single file using a suggested format (the contiguous ragged array).\n", "\n", "2. 
Second, we benchmark three libraries—*xarray*, *Pandas*, and *Awkward Array*—with typical Lagrangian workflow tasks such as the geographical binning of a variable, the extraction of the data for a given region, and operations performed per drifter trajectory.\n", "\n", @@ -304,7 +305,7 @@ "\n", "In terms of data file format, we tested both NetCDF and Parquet file formats but did not find significant performance gain from using one or the other. Because NetCDF is a well-known and established file format in Earth sciences, we save the contiguous ragged array as a single NetCDF archive. \n", "\n", - "In terms of python packages, we find that *Pandas* is intuitive with a simple syntax but does not perform efficiently with large dataset. The complete GDP hourly dataset is currently *only* ~15 GB, but as part of *CloudDrift* we also want to support larger Lagrangian datasets (>100 GB). On the other hand, *xarray* can interface with *Dask* to efficiently *lazy-load* large dataset but it requires custom adaptation to operate on a ragged array. In contrast, *Awkward Array* provides a novel approach by storing alongside the data an offset index in a manner that is transparent to the user, simplifying the analysis of non-uniform Lagrangian datasets. We find that it is also *fast* and can easily interface with *Numba* to further improve performances.\n", + "In terms of *Python* packages, we find that *Pandas* is intuitive with a simple syntax but does not perform efficiently with a large dataset. The complete GDP hourly dataset is currently *only* ~15 GB, but as part of *CloudDrift* we also want to support larger Lagrangian datasets (>100 GB). On the other hand, *xarray* can interface with *Dask* to efficiently *lazy-load* large datasets but it requires custom adaptation to operate on a ragged array. 
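As an illustration of the first benchmark task (geographical binning), the following is a minimal *Pandas* sketch, not the notebook's own code: it uses synthetic stand-ins for the concatenated drifter observations, and all variable names (`lon`, `lat`, `sst`) are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the flat, concatenated drifter observations.
rng = np.random.default_rng(42)
n = 10_000
df = pd.DataFrame({
    "lon": rng.uniform(-180, 180, n),
    "lat": rng.uniform(-90, 90, n),
    "sst": rng.uniform(-2, 32, n),  # sea surface temperature, degrees C
})

# Task 1: assign each observation to a 1x1 degree spatially fixed grid cell,
# then average the variable per cell (e.g. a mean temperature map).
df["lon_bin"] = np.floor(df["lon"]).astype(int)
df["lat_bin"] = np.floor(df["lat"]).astype(int)
mean_sst = df.groupby(["lat_bin", "lon_bin"])["sst"].mean()
```

Note that because the observations are stored as flat arrays (one row per observation, regardless of which trajectory it belongs to), the binning never needs to know the trajectory boundaries.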
In contrast, *Awkward Array* provides a novel approach by storing alongside the data an offset index in a manner that is transparent to the user, simplifying the analysis of non-uniform Lagrangian datasets. We find that it is also *fast* and can easily interface with *Numba* to further improve performances.\n", "\n", "In terms of benchmark speed, each package shows similar results for the geographical binning (test 1) and the operation per trajectory (test 3) benchmarks. For the extraction of a given region (test 2), *xarray* was found to be slower than both *Pandas* and *Awkward Array*. We note that speed performance may not be the deciding factor for all users and we believe that ease of use and a simple, intuitive syntax are also important." ] @@ -443,7 +444,7 @@ "\n", "In the first step of this notebook, we present the current format of the [Global Drifter Program (GDP)](https://www.aoml.noaa.gov/phod/gdp/) dataset, and show how to transform it into a single archival file in which each variable is stored in a ragged fashion.\n", "\n", - "The GDP produces two interpolated datasets of drifter position, velocity and sea surface temperature from more than 20,000 drifters that have been released since 1979. One dataset is at 6-hour resolution ([Hansen and Poulain 1996](http://dx.doi.org/10.1175/1520-0426(1996)013<0900:QCAIOW>2.0.CO;2)) and the other one is at hourly resolution ([Elipot et al. 2016](http://dx.doi.org/10.1002/2016JC011716)). The files, one per drifter identified by its unique identification number (ID), are updated on a quarterly basis and are available via the FTP server of the GDP Data Assembly Center (DAC).\n", + "The GDP produces two interpolated datasets of drifter position, velocity and sea surface temperature from more than 20,000 drifters that have been released since 1979. 
One dataset is at 6-hour resolution ([Hansen and Poulain 1996](http://dx.doi.org/10.1175/1520-0426(1996)013<0900:QCAIOW>2.0.CO;2)) and the other one is at hourly resolution ([Elipot et al. 2016](http://dx.doi.org/10.1002/2016JC011716)). The files, one per drifter identified by its unique identification number (ID), are updated on a quarterly basis and are available via the [HTTPS server](https://www.aoml.noaa.gov/ftp/pub/phod/lumpkin/hourly/v2.00/netcdf/) of the GDP Data Assembly Center (DAC).\n", "\n", "Here we use a subset of the hourly drifter dataset of the GDP by setting the variable `subset_nb_drifters = 500`. The suggested number is large enough to create an interesting dataset, yet without making the downloading cumbersome and the data processing too expensive. Feel free to scale this value down or up (from 1 to 17324), but beware that if you are running this notebook in a binder there are memory limitations (500 should work). " ] @@ -469,8 +470,8 @@ "output_type": "stream", "text": [ "Fetching the 500 requested netCDF files (as a reference ~2min for 500 files).\n", - "CPU times: user 88.3 ms, sys: 24.4 ms, total: 113 ms\n", - "Wall time: 3.03 s\n" + "CPU times: user 53 ms, sys: 15.4 ms, total: 68.4 ms\n", + "Wall time: 2.39 s\n" ] } ], @@ -929,16 +930,16 @@ " acknowledgement: Elipot et al. (2016), Elipot et al. (2021) to...\n", " history: Version 2.00. Metadata from dirall.dat and d...\n", " interpolation_method: \n", - " imei: " + " imei: " ], "text/plain": [ "\n", @@ -992,7 +993,7 @@ "source": [ "## Contiguous Ragged Array\n", "\n", - "In the GDP dataset, the number of observations varies from `len(['obs'])=13` to `len(['obs'])=66417`. 
As such, it seems inefficient to create bidimensional datastructure `['traj', 'obs']`, commonly used by Lagrangian numerical simulation tools such as [Ocean Parcels](https://oceanenDrift](https://opendrift.github.io/) and [OpenDrift](https://opendrift.github.io/) that tend to generate trajectories of equal or similar lengths.\n", + "In the GDP dataset, the number of observations varies from `len(['obs'])=13` to `len(['obs'])=66417`. As such, it seems inefficient to create a bidimensional data structure `['traj', 'obs']`, commonly used by Lagrangian numerical simulation tools such as [Ocean Parcels](https://oceanparcels.org/) and [OpenDrift](https://opendrift.github.io/) that tend to generate trajectories of equal or similar lengths.\n", "\n", "Here, we propose to combine the data from the individual netCDF files into a [*contiguous ragged array*](https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation) eventually written in a single NetCDF file in order to simplify data distribution, decrease metadata redundancies, and efficiently store a Lagrangian data collection of uneven lengths. The aggregation process (conducted with the `create_ragged_array` function found in the module `preprocess.py`) also converts to variables some of the metadata originally stored as attributes in the individual NetCDFs. The final structure contains 21 variables with dimension `['obs']` and 38 variables with dimension `['traj']`." ] @@ -1387,7 +1388,7 @@ " location_type (traj) bool False False False ... True True True\n", " WMO (traj) int32 4400509 1600536 ... 4601712 4601740\n", " expno (traj) int32 9046 9435 7325 ... 21312 21312 21312\n", - " deploy_date (traj) datetime64[ns] 2001-05-01 2001-01-11 ... NaT\n", + " deploy_date (traj) datetime64[ns] 2001-05-01 ... 1970-01-01\n", " deploy_lon (traj) float32 -52.17 71.24 -97.16 ... -151.0 -143.4\n", " ... 
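The contiguous ragged array layout described above can be sketched with plain NumPy. This is an illustration, not the notebook's `create_ragged_array` implementation: the row sizes below are hypothetical stand-ins for the per-drifter observation counts read from the GDP files.

```python
import numpy as np

# Hypothetical observations-per-trajectory counts (the "count" variable of a
# CF contiguous ragged array); in the GDP they range from 13 to 66417.
rowsize = np.array([3, 5, 2])
obs = np.arange(rowsize.sum(), dtype=float)  # all trajectories concatenated flat

# Offsets mark where each trajectory starts and ends in the flat array;
# Awkward Array stores essentially this index alongside the data.
offsets = np.concatenate(([0], np.cumsum(rowsize)))  # [0, 3, 8, 10]

def trajectory(k: int) -> np.ndarray:
    """Return the k-th trajectory as a slice of the flat array."""
    return obs[offsets[k]:offsets[k + 1]]

# A per-trajectory operation (benchmark task 3): one statistic per drifter.
means = np.array([trajectory(k).mean() for k in range(len(rowsize))])
```

No padding to the longest trajectory is needed: storage stays proportional to the total number of observations, which is why this layout suits collections of very uneven lengths.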
...\n", " err_sst (obs) float32 ...\n", @@ -1400,7 +1401,7 @@ " title: Global Drifter Program hourly drifting buoy collection\n", " history: Version 2.00. Metadata from dirall.dat and deplog.dat\n", " Conventions: CF-1.6\n", - " date_created: 2022-04-14T23:14:58.694974\n", + " date_created: 2022-04-15T15:08:31.898904\n", " publisher_name: GDP Drifter DAC\n", " publisher_email: aoml.dftr@noaa.gov\n", " ... ...\n", @@ -1409,31 +1410,31 @@ " contributor_role: Data Acquisition Center\n", " institution: NOAA Atlantic Oceanographic and Meteorological Laboratory\n", " acknowledgement: Elipot et al. (2022) to be submitted. Elipot et al. (2...\n", - " summary: Global Drifter Program hourly data
  • " ], "text/plain": [ "\n", @@ -1463,7 +1464,7 @@ " title: Global Drifter Program hourly drifting buoy collection\n", " history: Version 2.00. Metadata from dirall.dat and deplog.dat\n", " Conventions: CF-1.6\n", - " date_created: 2022-04-14T23:14:58.694974\n", + " date_created: 2022-04-15T15:08:31.898904\n", " publisher_name: GDP Drifter DAC\n", " publisher_email: aoml.dftr@noaa.gov\n", " ... ...\n", @@ -1859,7 +1860,7 @@ "Dimensions without coordinates: traj\n", "Attributes:\n", " long_name: Number of observations per trajectory\n", - " units: -" + " units: -" ], "text/plain": [ "\n", @@ -1904,7 +1905,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the second and next step of the notebook, we benchmark different data science python libraries. For this, the following sections present three typical Lagrangian tasks which are conducted successively using *xarray*, *Pandas*, and finally *Awkward Arrays*." + "In the second and next step of the notebook, we benchmark different data science *Python* libraries. For this, the following sections present three typical Lagrangian tasks which are conducted successively using *xarray*, *Pandas*, and finally *Awkward Arrays*." ] }, { @@ -2319,7 +2320,7 @@ " title: Global Drifter Program hourly drifting buoy collection\n", " history: Version 2.00. Metadata from dirall.dat and deplog.dat\n", " Conventions: CF-1.6\n", - " date_created: 2022-04-14T23:14:58.694974\n", + " date_created: 2022-04-15T15:08:31.898904\n", " publisher_name: GDP Drifter DAC\n", " publisher_email: aoml.dftr@noaa.gov\n", " ... ...\n", @@ -2328,7 +2329,7 @@ " contributor_role: Data Acquisition Center\n", " institution: NOAA Atlantic Oceanographic and Meteorological Laboratory\n", " acknowledgement: Elipot et al. (2022) to be submitted. Elipot et al. (2...\n", - " summary: Global Drifter Program hourly data
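Benchmark task 2 (extracting the data within a geographical window) reduces to boolean masking of the flat coordinate arrays. The sketch below is illustrative rather than the notebook's code: the positions are synthetic and the Gulf of Mexico bounding box is an assumed, approximate one.

```python
import numpy as np

# Synthetic stand-ins for the flat longitude/latitude arrays of the ragged dataset.
rng = np.random.default_rng(0)
lon = rng.uniform(-180, 180, 10_000)
lat = rng.uniform(-90, 90, 10_000)

# Approximate Gulf of Mexico bounding box (illustrative limits).
in_gom = (lon >= -98) & (lon <= -80) & (lat >= 18) & (lat <= 31)

# Keep only the observations inside the window.
lon_gom = lon[in_gom]
lat_gom = lat[in_gom]
```

A temporal window works the same way, by AND-ing an additional mask on the time variable; the relative cost of building and applying such masks is what differs between *xarray*, *Pandas*, and *Awkward Array* in this benchmark.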