data api client design
Discussion of the data-api client requirements and implementation
The data-api client, climada.util.api_client.Client, is meant for
- providing a generic Python interface to the public CLIMADA data api
- creating climada Python objects, such as Exposures, Hazard or ImpactFunc, from dataset files of the CLIMADA data api in a comfortable, easy-to-use way.
The implemented methods should feel as natural as possible and hide the boilerplate code that downloads files, reads them and converts their content into CLIMADA objects. Additionally, they should take care of caching files on the local filesystem in order to save resources of the api server.
This discussion is based on the most up-to-date implementation of the API client which can currently be found on branch https://github.com/CLIMADA-project/climada_python/tree/feature/data-api-1.0.
only datatype and group
... plus status, description and properties
- datatype (DataTypeShortInfo), name and version (unique in the db)
- status, activation date and expiration date
- uuid
- description, doi and license
- files (FileInfo)
- properties ("name", "value" pairs)
- dataset uuid and file name (unique in the db)
- format, size and checksum
- url
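The record types outlined above could be sketched as plain dataclasses. This is an illustration only: the field names, types and defaults are assumptions derived from the bullet lists, and the assignment of each list to a class name is inferred from the type names mentioned in the text. It does not reproduce the actual classes in climada.util.api_client.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataTypeShortInfo:
    """Only data type and data type group."""
    data_type: str
    data_type_group: str


@dataclass
class DataTypeInfo(DataTypeShortInfo):
    """Short info plus status, description and properties."""
    status: str = "active"          # assumed default
    description: str = ""
    properties: List[dict] = field(default_factory=list)


@dataclass
class FileInfo:
    """Dataset uuid and file name together identify a file."""
    uuid: str
    file_name: str
    file_format: str
    file_size: int
    check_sum: str
    url: str


@dataclass
class DatasetInfo:
    """Data type, name and version together identify a dataset."""
    data_type: DataTypeShortInfo
    name: str
    version: str
    status: str
    uuid: str
    activation_date: Optional[str] = None
    expiration_date: Optional[str] = None
    description: str = ""
    doi: Optional[str] = None
    license: str = ""
    files: List[FileInfo] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)
```

Keeping FileInfo separate from DatasetInfo mirrors the one-to-many relation between a dataset and its files.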
returns: a list of DataTypeInfo objects
arguments:
- data type group (exposures, hazard, impact_func)
purpose: show what kind of datasets are available from climada.ethz.ch
comments:
- used to be get_datatypes
suggestions:
returns: a DataTypeInfo object
arguments:
- data type name
purpose: give additional information about a data type, namely the mandatory and optional properties of datasets of this type.
comments:
- used to be get_data_type
- used to be more useful when collecting data type properties was expensive and list_data_type_infos skipped it. With CLIMADA Data API 1.0 this is no longer the case.
suggestions:
- remove the method and add an optional parameter to list_data_type_infos.
returns: a dataframe of property values
arguments:
- data type
- fixed properties
- limit of row numbers
purpose: show properties of a given datatype and a selection of possible values, in order to make meaningful queries
comments:
- returning a data frame is somewhat misleading, as the data is really a dictionary of independent and differently sized arrays.
suggestions:
- remove the row number limitation argument, as it merely duplicates .head
- return a data frame with three columns: property name, property value, number of occurrences
- make it fast (involves server side extension)
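The three-column layout suggested above can be sketched with the standard library alone. The function name and the sample property dictionaries are hypothetical; in the client, the input would come from the datasets returned by the data api.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def property_value_counts(
    datasets: Iterable[Dict[str, str]],
) -> List[Tuple[str, str, int]]:
    """Return (property name, property value, number of occurrences) rows."""
    counts = Counter(
        (name, value)
        for properties in datasets
        for name, value in properties.items()
    )
    # Sort by name, then value, for a stable and readable listing.
    return sorted((name, value, n) for (name, value), n in counts.items())


# Hypothetical property dictionaries of three datasets
datasets = [
    {"country_iso3alpha": "CHE", "res_arcsec": "150"},
    {"country_iso3alpha": "DEU", "res_arcsec": "150"},
    {"country_iso3alpha": "CHE"},
]
rows = property_value_counts(datasets)
```

The rows can be handed to a data frame constructor with the three column names. Unlike one column per property, this long format does not pad differently sized value lists.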
returns: a list of DatasetInfo objects
arguments:
- data type
- a dictionary of properties
- dataset status
purpose: query climada.ethz.ch for datasets matching given arguments
comments:
- used to be get_datasets
suggestions:
returns: a single DatasetInfo object
arguments: same as list_dataset_infos
purpose: same as list_dataset_infos, but raise a descriptive exception if the query doesn't yield exactly 1 dataset.
comments:
- used to be get_dataset
- is somewhat superfluous. However, the method may be used within get_hazard and get_exposures, and there it may provide easy-to-understand feedback if a query is ambiguous or contradictory.
- Naming is very confusing.
suggestions:
- Make a method that returns the relevant parameters for a user: the properties and higher-level values needed for the get_exposures, hazard, litpop etc. Basically, more or less wrappers of something like
list_data_type_infos_exposures = client.list_data_type_infos(data_type_group='exposures')
set([data_type_info.data_type for data_type_info in list_data_type_infos_exposures])
- The method for the user should not return values such as the uuid or creation_date by default. There should be an extra method to get these 'metadata', or the methods such as get_exposures or get_dataset could also return a summary of the metadata (such as climada version, creation date, etc.) together with the data.
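The "exactly one dataset" contract of this method, with descriptive feedback for empty and ambiguous results, might look like the sketch below. The exception classes and the helper name are hypothetical and only illustrate the behaviour described in the purpose line.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


class NoResult(Exception):
    """Hypothetical: the query matched no dataset."""


class AmbiguousResult(Exception):
    """Hypothetical: the query matched more than one dataset."""


def exactly_one(matches: Sequence[T], query: dict) -> T:
    """Return the single match or raise an error that names the query."""
    if not matches:
        raise NoResult(f"no dataset matches the query {query}")
    if len(matches) > 1:
        raise AmbiguousResult(
            f"{len(matches)} datasets match the query {query}; "
            "add properties to narrow it down"
        )
    return matches[0]
```

Two distinct exception types let callers such as get_hazard tell a contradictory query (no result) from an underspecified one (several results).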
returns: the path of the target directory and the downloaded files
arguments:
- dataset
- target directory, default: SYSTEM_DIR from config
- organize path hierarchically? 'data group type'/'data type'/'dataset name'/'version', default: yes
- consistency check method, default: compare sizes
purpose: download the whole data set (all files) into the given target directory - or just point to the files if they have been downloaded before
comments:
- the default check is perhaps too optimistic. If files on the server were maliciously replaced, a more thorough check (md5) could increase security.
- the term dataset is rather confusing when there is a single data file.
suggestions:
- remove optional arguments?
- return files only without target directory
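The hierarchical target path and the two consistency checks discussed above (file size versus md5) can be sketched as follows. All function names and the sample values are assumptions made for illustration, not the signatures of the actual client.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_dir(target_dir, data_type_group, data_type, name, version,
                organize_path=True):
    """Build <target>/<group>/<type>/<name>/<version>, or use the flat target."""
    root = Path(target_dir)
    if organize_path:
        return root / data_type_group / data_type / name / version
    return root


def size_check(path, expected_size):
    """Cheap check: only compares the file size in bytes."""
    return Path(path).stat().st_size == expected_size


def md5_check(path, expected_md5):
    """Thorough check: hashes the file content in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5


# Demonstration with a temporary directory and made-up dataset names
with tempfile.TemporaryDirectory() as tmp:
    target = dataset_dir(tmp, "hazard", "tropical_cyclone", "some_dataset", "v1")
    target.mkdir(parents=True)
    sample = target / "data.hdf5"
    sample.write_bytes(b"abc")
    size_ok = size_check(sample, 3)
    md5_ok = md5_check(sample, hashlib.md5(b"abc").hexdigest())
    relative = target.relative_to(tmp).as_posix()
```

The size check never reads the file, which makes it fast for large files, but it cannot detect a replaced file of identical length. The md5 check can, at the cost of reading every byte.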
returns:
- path
arguments:
- FileInfo
- target directory
- consistency check
- number of retries
purpose: downloads a single file from a dataset to the given target destination, checking success and retrying in case of failure
comments: used in download_dataset, to_hazard and to_exposures. In the latter two mainly to allow a target destination and to skip downloads of non-hdf5 files, which is a hypothetical use case.
suggestions:
- turn it into a private method
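The "check success and retry on failure" behaviour can be sketched independently of the transport. Here the actual download is a callable passed in by the caller; the function name, the exception class and the flaky stand-in used for the demonstration are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


class DownloadFailed(Exception):
    """Hypothetical: raised when all attempts have failed."""


def download_with_retries(
    fetch: Callable[[Path], None],
    target: Path,
    check: Callable[[Path], bool],
    retries: int = 3,
) -> Path:
    """Call fetch(target) until check(target) passes or retries run out."""
    last_error = "no attempt made"
    for attempt in range(1, retries + 1):
        try:
            fetch(target)
            if check(target):
                return target
            last_error = f"consistency check failed on attempt {attempt}"
        except OSError as err:
            last_error = f"{err} on attempt {attempt}"
        # Remove partial or inconsistent output before retrying (Python 3.8+).
        target.unlink(missing_ok=True)
    raise DownloadFailed(f"could not download {target.name}: {last_error}")


calls = []


def flaky_fetch(path: Path) -> None:
    """Stand-in for a download: fails once, then writes 7 bytes."""
    calls.append(path)
    if len(calls) == 1:
        raise OSError("connection reset")
    path.write_bytes(b"payload")


with tempfile.TemporaryDirectory() as tmp:
    result = download_with_retries(
        flaky_fetch,
        Path(tmp) / "file.hdf5",
        check=lambda p: p.stat().st_size == 7,
    )
    downloaded = result.read_bytes()
```

Passing the consistency check as a callable keeps the retry loop unchanged when the check is switched from size comparison to a checksum.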
arguments:
- target destination
purpose: remove inconsistent download paths from the cache db
suggestions:
returns: a Hazard object
arguments:
- hazard type (i.e., any data type from the 'hazard' data type group)
- (target directory for file download)
- arguments from get_dataset_info (without data type or data type group)
purpose: search climada.ethz.ch for a matching hazard, download the file and read it into a Hazard object.
comments:
- used to include a concatenation of Hazard objects in case more than one dataset matched the requirements and the concatenation was somehow supported (depending on the properties of the datasets). In the current version, concatenation must be done outside of the Client.
suggestions:
- remove the target directory argument and use download_dataset instead of download_file
returns: an Exposures object
arguments:
- exposures type (i.e., any data type from the 'exposures' data type group)
- (target directory for file download)
- arguments from get_dataset_info (without data type or data type group)
purpose: search climada.ethz.ch for matching exposures, download the file and read it into an Exposures object.
comments:
- used to include a concatenation of Exposures objects in case more than one dataset matched the requirements. In the current version, concatenation must be done outside of the Client.
suggestions:
- remove the target directory argument and use download_dataset instead of download_file
returns: a Litpop object
arguments:
- country
- (target directory for file download)
purpose: get the global or country litpop exposures object with fin_mode 'pc' and exponents '(1,1)'
comments:
- the method itself carries the default values for litpop exposures, which is a bit odd.
suggestions:
- move the default values to the config file
- introduce default values for all data types, not just litpop? In this case, augment get_exposures and get_hazard with default handling, which would also be more comfortable for the user than getting error messages about ambiguous queries.
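Default handling per data type could be sketched as a lookup table merged with the user's query. The table below is hypothetical. Only the litpop values (fin_mode 'pc', exponents '(1,1)') are taken from the text, and in the proposal the table would be read from the config file rather than hard-coded.

```python
from typing import Dict, Optional

# Hypothetical defaults per data type; stands in for a config file section.
DEFAULT_PROPERTIES: Dict[str, Dict[str, str]] = {
    "litpop": {"fin_mode": "pc", "exponents": "(1,1)"},
}


def with_defaults(
    data_type: str, properties: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Merge user properties over the defaults of the given data type."""
    merged = dict(DEFAULT_PROPERTIES.get(data_type, {}))
    merged.update(properties or {})  # explicit user values win
    return merged


query = with_defaults("litpop", {"country_iso3alpha": "CHE"})
```

Because explicit properties override the defaults, a user can still request a non-default dataset, while an underspecified query is narrowed down before it reaches the server.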
returns: a Hazard object
arguments:
- data set
- (target dir)
purpose: used in get_hazard
comments:
suggestions:
- make it private
returns: an Exposures object
arguments:
- data set
- (target dir)
purpose: used in get_exposures
comments:
suggestions:
- make it private