data api client design

Emanuel Schmid edited this page Mar 16, 2022 · 20 revisions

Discussion of the data-api client requirements and implementation

Motivation

The data-api client, climada.util.api_client.Client is meant for

  • providing a generic python interface to the public CLIMADA data api
  • creating climada Python objects, such as Exposures, Hazard or ImpactFunc, from dataset files of the CLIMADA data api in a comfortable, easy to use way.

The implemented methods should feel as natural as possible and hide the boilerplate code that downloads files, reads them and converts their content into CLIMADA objects. They should also cache files on the local filesystem in order to spare the resources of the API server.

This discussion is based on the most up-to-date implementation of the API client which can currently be found on branch https://github.com/CLIMADA-project/climada_python/tree/feature/data-api-1.0.

Classes

DataTypeShortInfo

only datatype and group

DataTypeInfo

... plus status, description and properties

DatasetInfo

  • datatype (DataTypeShortInfo), name and version (unique in the db)
  • status, activation date and expiration date
  • uuid
  • description, doi and license
  • files (FileInfo)
  • properties ("name", "value" pairs)

FileInfo

  • dataset uuid and file name (unique in the db)
  • format, size and checksum
  • url
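The class outline above can be pictured as plain data containers. The following is only a sketch: all attribute names and types are assumptions chosen for illustration, and they may differ from what climada.util.api_client actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataTypeShortInfo:
    """Only data type and group (attribute names are assumed)."""
    data_type: str
    data_type_group: str


@dataclass
class DataTypeInfo(DataTypeShortInfo):
    """... plus status, description and properties."""
    status: str
    description: str
    properties: List[dict] = field(default_factory=list)


@dataclass
class FileInfo:
    """One file of a dataset; (uuid, file_name) is unique in the db."""
    uuid: str  # uuid of the dataset the file belongs to
    file_name: str
    file_format: str
    file_size: int
    check_sum: str
    url: str


@dataclass
class DatasetInfo:
    """One dataset; (data_type, name, version) is unique in the db."""
    uuid: str
    data_type: DataTypeShortInfo
    name: str
    version: str
    status: str
    properties: dict  # "name": "value" pairs
    files: List[FileInfo]
    description: str = ""
    doi: Optional[str] = None
    license: str = ""
    activation_date: Optional[str] = None
    expiration_date: Optional[str] = None
```

Keeping DataTypeShortInfo as the base of DataTypeInfo mirrors the "... plus" wording above, so a DatasetInfo only needs to carry the short form.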

Methods

list_data_type_infos

returns: a list of DataTypeInfo objects

arguments:

  • data type group (exposures, hazard, impact_func)

purpose: show what kind of datasets are available from climada.ethz.ch

comments:

  • used to be get_datatypes

suggestions:

get_data_type_info

returns: a DataTypeInfo object

arguments:

  • data type name

purpose: give additional information about a data type: mandatory and optional properties of datasets from this type.

comments:

  • used to be get_data_type
  • was more useful when collecting data type properties was expensive and list_data_type_infos skipped it.
    With CLIMADA Data API 1.0 this is no longer the case.

suggestions:

  • remove the method and add an optional parameter to list_data_type_infos.

get_properties_datatype

returns: a dataframe of property values

arguments:

  • data type
  • fixed properties
  • limit of row numbers

purpose: show properties of a given datatype and a selection of possible values, in order to make meaningful queries

comments:

  • returning a data frame is somewhat misleading, as the data is really a dictionary of independent and differently sized arrays.

suggestions:

  • remove the row number limitation argument, as it merely duplicates what .head already does
  • return a data frame with three columns: property name, property value, number of occurrences
  • make it fast (involves server side extension)
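The three-column result suggested above could be computed roughly as follows. This is a sketch under assumptions: the input is taken to be one properties dictionary per dataset, the function name property_value_counts is hypothetical, and the property names in the example are made up. It uses only the standard library; the resulting rows could be handed to a data frame constructor afterwards.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def property_value_counts(
    datasets_properties: Iterable[Dict[str, str]],
) -> List[Tuple[str, str, int]]:
    """Return sorted (property name, property value, occurrences) rows.

    Hypothetical helper: counts how many datasets carry each
    name/value pair, instead of returning differently sized arrays.
    """
    counts = Counter(
        (name, value)
        for properties in datasets_properties
        for name, value in properties.items()
    )
    return sorted((name, value, n) for (name, value), n in counts.items())


# Example with made-up property names and values
rows = property_value_counts(
    [
        {"country_iso3alpha": "CHE", "fin_mode": "pc"},
        {"country_iso3alpha": "DEU", "fin_mode": "pc"},
    ]
)
for row in rows:
    print(row)
```

Unlike the dictionary of arrays, every row here is meaningful on its own, and the occurrence count hints at which property values narrow a query down the most.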

list_dataset_infos

returns: a list of DatasetInfo objects

arguments:

  • data type
  • a dictionary of properties
  • dataset status

purpose: query climada.ethz.ch for datasets matching given arguments

comments:

  • used to be get_datasets

suggestions:

get_dataset_info

returns: a single DatasetInfo object

arguments: same as list_dataset_infos

purpose: same as list_dataset_infos - but raises a descriptive exception if the query doesn't yield exactly 1 dataset.

comments:

  • used to be get_dataset
  • is somewhat superfluous. However, the method may be used within get_hazard and get_exposures, where it can provide easy to understand feedback if a query is ambiguous or contradictory.
  • Naming is very confusing.

suggestions:

  • Add a method that returns the parameters relevant to a user: the properties and higher-level values needed for get_exposures, get_hazard, get_litpop_default etc. Basically, more or less a wrapper around something like
    list_data_type_infos_exposures = client.list_data_type_infos(data_type_group='exposures')
    {data_type_info.data_type for data_type_info in list_data_type_infos_exposures}
  • The user-facing method should not return values such as the uuid or creation_date by default. There should be an extra method to get this metadata, or methods such as get_exposures or get_dataset could also return a summary of the metadata (CLIMADA version, creation date, etc.) together with the data.
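The "exactly one dataset" behaviour described in the purpose of get_dataset_info can be sketched as a small check on top of a list query. The exception names and the helper exactly_one are hypothetical and only illustrate the idea; they are not claimed to be the client's actual API.

```python
from typing import List, TypeVar

T = TypeVar("T")


class NoResult(Exception):
    """Hypothetical: raised when no dataset matches the query."""


class AmbiguousResult(Exception):
    """Hypothetical: raised when more than one dataset matches the query."""


def exactly_one(matches: List[T], query: dict) -> T:
    """Return the single match or raise a descriptive exception."""
    if not matches:
        raise NoResult(f"no dataset found for query {query}")
    if len(matches) > 1:
        raise AmbiguousResult(
            f"{len(matches)} datasets match query {query}; "
            "add properties to narrow it down"
        )
    return matches[0]
```

Used inside get_hazard or get_exposures, such a check turns an ambiguous or contradictory query into a message the user can act on, rather than silently picking one dataset.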

download_dataset

returns: the path of the target directory and the downloaded files

arguments:

  • dataset
  • target directory, default: SYSTEM_DIR from config
  • organize path hierarchically? 'data group type'/'data type'/'dataset name'/'version', default: yes
  • consistency check method, default: compare sizes

purpose: download the whole data set (all files) into the given target directory - or just point to the files if they have been downloaded before

comments:

  • the default check is perhaps too optimistic. If files on the server are maliciously replaced, a more thorough check (md5) could increase security.
  • the term dataset is rather confusing when there is only a single data file.

suggestions:

  • remove optional arguments?
  • return files only without target directory
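The hierarchical target path and the two consistency checks discussed above (size by default, md5 as the more thorough option) might look like the sketch below. All function names and signatures are assumptions made for illustration; only pathlib and hashlib from the standard library are used.

```python
import hashlib
from pathlib import Path


def dataset_dir(base, group, data_type, name, version, organize=True) -> Path:
    """Build 'data group type'/'data type'/'dataset name'/'version' below base.

    With organize=False the files go directly into base.
    """
    base = Path(base)
    if not organize:
        return base
    return base / group / data_type / name / version


def size_matches(path, expected_size: int) -> bool:
    """Cheap check: the file exists and has the expected size in bytes."""
    path = Path(path)
    return path.is_file() and path.stat().st_size == expected_size


def md5_matches(path, expected_md5: str) -> bool:
    """Thorough check: the file exists and its md5 hex digest matches."""
    path = Path(path)
    if not path.is_file():
        return False
    digest = hashlib.md5()
    with path.open("rb") as stream:
        # read in 1 MiB chunks so large files need not fit into memory
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

The size check only reads file metadata, whereas the md5 check reads the whole file, which is the trade-off behind choosing the default.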

download_file

returns:

  • path

arguments:

  • FileInfo
  • target directory
  • consistency check
  • number of retries

purpose: downloads a single file from a dataset to the given target destination, checking success and retrying in case of failure

comments: used in download_dataset, to_hazard and to_exposures. In the latter two it mainly serves to pass on a target destination and to skip downloads of non-hdf5 files, which is a hypothetical use case.

suggestions:

  • turn it into a private method
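The "check success and retry on failure" behaviour of download_file can be sketched independently of any HTTP library. In this sketch the actual transfer is passed in as a callable; the names download_with_retries, fetch, check and DownloadFailed are hypothetical, and retries is read as the number of additional attempts after the first one.

```python
import time
from pathlib import Path
from typing import Callable


class DownloadFailed(Exception):
    """Hypothetical: raised when all attempts have failed."""


def download_with_retries(
    fetch: Callable[[Path], None],
    target,
    check: Callable[[Path], bool],
    retries: int = 3,
    wait_seconds: float = 0.0,
) -> Path:
    """Fetch a file to target, verify it with check, retry on failure."""
    target = Path(target)
    if check(target):
        # already downloaded and consistent: just point to the file
        return target
    last_error = None
    for attempt in range(retries + 1):
        if attempt > 0:
            time.sleep(wait_seconds)
        try:
            fetch(target)
        except OSError as err:
            last_error = err
            continue
        if check(target):
            return target
        last_error = "consistency check failed"
    raise DownloadFailed(
        f"giving up on {target} after {retries + 1} attempts: {last_error}"
    )
```

Passing the consistency check in as a callable keeps the choice between a size comparison and a checksum comparison out of the retry logic.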

purge_cache

arguments:

  • target destination

purpose: remove inconsistent download paths from the cache db

suggestions:

get_hazard

returns: a Hazard object

arguments:

  • hazard type (i.e., any data type from the 'hazard' data type group)
  • (target directory for file download)
  • arguments from get_dataset_info (without data type or data type group)

purpose: search climada.ethz.ch for a matching hazard, download the file and read it into a Hazard object.

comments:

  • used to include a concatenation of Hazard objects in case more than one dataset matched the requirements and the concatenation was somehow supported (depending on the properties of the datasets). In the current version, concatenation must be done outside of the Client.

suggestions:

  • remove the target directory argument and use download_dataset instead of download_file

get_exposures

returns: an Exposures object

arguments:

  • exposures type (i.e., any data type from the 'exposures' data type group)
  • (target directory for file download)
  • arguments from get_dataset_info (without data type or data type group)

purpose: search climada.ethz.ch for a matching exposures dataset, download the file and read it into an Exposures object.

comments:

  • used to include a concatenation of Exposures objects in case more than one dataset matched the requirements. In the current version, concatenation must be done outside of the Client.

suggestions:

  • remove the target directory argument and use download_dataset instead of download_file

get_litpop_default

returns: a LitPop object

arguments:

  • country
  • (target directory for file download)

purpose: get the global or country litpop exposures object with fin_mode 'pc' and exponents '(1,1)'

comments:

  • the method itself carries the default values for litpop exposures. That is a bit odd.

suggestions:

  • move the default values to the config file
  • introduce default values for all data types, not just litpop? In that case, augment get_exposures and get_hazard with default handling. This would also be more comfortable for the user than getting error messages about ambiguous queries.
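The default handling suggested above could amount to merging per-data-type defaults into the user's query before it is sent. This is a sketch only: DEFAULT_PROPERTIES stands in for values that would come from the config file, with_defaults is a hypothetical helper, and apart from fin_mode 'pc' and exponents '(1,1)' the property names are assumptions.

```python
from typing import Dict, Optional

# Stand-in for defaults read from the config file (structure is assumed)
DEFAULT_PROPERTIES: Dict[str, Dict[str, str]] = {
    "litpop": {"fin_mode": "pc", "exponents": "(1,1)"},
}


def with_defaults(
    data_type: str,
    properties: Optional[Dict[str, str]] = None,
    defaults: Dict[str, Dict[str, str]] = DEFAULT_PROPERTIES,
) -> Dict[str, str]:
    """Fill in default properties; explicit user values take precedence."""
    merged = dict(defaults.get(data_type, {}))
    merged.update(properties or {})
    return merged


# Example: the user only names a country (property name is assumed)
query = with_defaults("litpop", {"country_iso3alpha": "CHE"})
print(query)
```

With such a merge in get_exposures and get_hazard, a dedicated get_litpop_default would no longer need to carry the default values itself.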

to_hazard

returns: a Hazard object

arguments:

  • data set
  • (target dir)

purpose: used in get_hazard

comments:

suggestions:

  • make it private

to_exposures

returns: an Exposures object

arguments:

  • data set
  • (target dir)

purpose: used in get_exposures

comments:

suggestions:

  • make it private