ESGF_Architecture
The ESGF architecture is that of a global system of distributed Nodes, which interoperate with each other according to a peer-to-peer paradigm. This means that there is no rigid distinction of roles between different Nodes; rather, each Node can expose different services according to how it is configured, and can act as the provider or the consumer of services depending on the situation. In a peer-to-peer system, Nodes can join or leave the federation dynamically, without affecting the operations of the other Nodes. This is in stark contrast to a traditional client-server architecture, where the server represents a single point of failure for the operations of multiple clients. There are two main characteristics that make ESGF a peer-to-peer system:
-
The modularity and configurability of the ESGF software stack, which allows each Node to expose a graduated set of services depending on the specific site requirements.
-
The establishment of federation protocols that allow the exchange of information from Node to Node on an egalitarian basis, without the existence of special central locations where the information is aggregated.
These two characteristics are described in more detail below.
Figure 1: Schematic representation of the ESGF architecture as a system of Nodes that interact with each other through a peer-to-peer paradigm.
A common ESGF Software Stack is deployed at each Node in the federation to provide services for data, metadata and user management. The installation can be configured to install all or part of the available services, depending on the site needs, and possibly to replicate some of the services across multiple servers at the same site. Specifically, the following flavors of ESGF Node can be installed:
- Data Node : includes services for publishing and serving data, namely:
* The __ESGF Node Manager__ . The ESGF Node Manager is a web application that mediates the peer-to-peer interaction among all the Nodes in the federation. Its main purpose is to create and expose the ESGF Registry, a document that contains critical interoperability information such as the name and type of each Node, its available services and URL endpoints, its CA certificate, etc.
* The __ESGF Publisher__ , and associated Postgres relational database. The ESGF Publisher is a desktop application that allows users to publish data into a Node. The publishing workflow starts with extracting metadata from files on disk, storing it in the Node database, creating THREDDS XML catalogs, and finally publishing the catalogs to the Node publishing service. Postgres is a popular, freely available relational database that is used in ESGF to store all metadata harvested by the ESGF Publisher, as well as user account information.
* The __THREDDS Data Server__ , configured with the ESGF security filters. The THREDDS Data Server (TDS), developed by Unidata, represents the standard mechanism through which an ESGF Node delivers its data to the clients. The TDS includes functionality for serving data in a variety of forms and protocols: full-file HTTP download, OpenDAP subsetting, GIS products via WMS and WCS, etc. The ESGF installation procedure configures the TDS with a set of special ESGF filters that intercept any data request and enforce the access control policies established for that dataset by interacting with the appropriate ESGF Security Services deployed throughout the federation.
* The __ESGF Security Services__ . The ESGF Security framework includes functionality for distributed access control throughout the federation. It is composed of _client-side_ components (the access filters and OpenID Relying Party) that protect access to the data, and _server-side_ components (the Attribute and Authorization services) that can be queried to gather all necessary information to make an access control decision. The framework supports access both by browsers (via OpenID authentication), and desktop clients and libraries (via X509 certificates).
* The __GridFTP__ server. GridFTP, developed by the Globus Alliance, is a high-performance protocol for reliable data transfer. It includes a server, deployed on an ESGF Node, and a client-side library that users must deploy on their desktops.
* The __ESGF Dashboard__ . The ESGF Dashboard is a web application intended for system administrators to monitor the status of all services deployed at each Node.
- IdP Node : includes services for authenticating users:
* The __OpenID Identity Provider__ web application. The OpenID Identity Provider (IdP) allows users to register and authenticate with the system, including Single-Sign-On functionality for browser-based access throughout the federation.
* The __Globus SimpleCA__ and __MyProxy__ server. The MyProxy server, developed by NCSA, is used to issue short-term certificates that client libraries and toolkits can use to authenticate the user during a data product request. The certificates are signed by the locally installed Globus Simple Certificate Authority (CA).
- Index Node : includes the applications necessary to index and search metadata:
* The __Apache Solr__ engine. Apache Solr is a high performance, scalable web application for storing and searching metadata.
* The __ESGF Search__ back-end services. The ESGF Search module includes services for _pulling_ metadata from external repositories (such as the THREDDS XML catalogs produced by the ESGF Publisher) into the local Solr index, for _pushing_ complete metadata records to the local Solr index, and for searching the distributed Solr indexes deployed within the federation.
* The __ESGF Web Portal__ application. The ESGF Web Portal is a web application that contains the user interface to many of the other ESGF modules. It exposes web pages for registering users, searching for data, downloading data etc.
- Compute Node : includes services for data analysis and visualization, namely:
* The __Live Access Server__ . The Live Access Server (LAS), developed by NOAA/PMEL, is an analysis and visualization engine that allows users to request advanced data and imaging products from multiple ESGF Nodes at once. Internally, it relies on the TDS catalogs and OpenDAP services for configuration and remote data access. It can be configured with a pluggable visualization engine such as Ferret (the default), NCL or CDAT.
Figure 2. Detailed representation of the software components that comprise the ESGF software stack.
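The relationship between Node flavors and the services they bundle can be summarized as a simple lookup table. The following is a minimal sketch only: the dictionary layout, the flavor keys (`data`, `idp`, `index`, `compute`) and the function name `services_for` are assumptions made for illustration and are not part of the ESGF installer.

```python
# Illustrative mapping of ESGF Node flavors to the services listed above.
# The keys and the function name are hypothetical; only the service names
# come from the description of the software stack.
NODE_SERVICES = {
    "data": [
        "ESGF Node Manager",
        "ESGF Publisher",
        "THREDDS Data Server",
        "ESGF Security Services",
        "GridFTP",
        "ESGF Dashboard",
    ],
    "idp": ["OpenID Identity Provider", "Globus SimpleCA", "MyProxy"],
    "index": ["Apache Solr", "ESGF Search", "ESGF Web Portal"],
    "compute": ["Live Access Server"],
}


def services_for(flavors):
    """Return the services a site would deploy for the chosen flavors."""
    unknown = [f for f in flavors if f not in NODE_SERVICES]
    if unknown:
        raise ValueError(f"unknown node flavor(s): {unknown}")
    services = []
    for flavor in flavors:
        for service in NODE_SERVICES[flavor]:
            # Keep the first occurrence so the order follows the flavor list.
            if service not in services:
                services.append(service)
    return services
```

For example, `services_for(["idp", "index"])` lists the services of a site that authenticates users and hosts a search index but serves no data itself.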
Interoperability among all Nodes in the ESGF federation is based on a peer-to-peer paradigm for exchanging information about services, trusts, and metadata holdings. Specifically, the following protocols and mechanisms make all the Nodes in the federation work together as a whole:
-
The ESGF Registry . The ESGF Registry contains all relevant information about each Node in the federation: its type, the URL endpoints of the services it exposes, its public certificates, and so on. This information is not kept in a central location, rather it is continually exchanged among all Nodes so that each Node always has a local up-to-date copy of the state of the whole federation.
-
Single-Sign-On . Because all ESGF Nodes trust each other's certificate authority, a user can register and authenticate at any of the Nodes, and be granted credentials that are honored throughout the federation. The type of credentials granted depends on how the user is accessing the system:
* if using a web browser, the _OpenID_ protocol is used to exchange authentication information between the site where the user authenticates, and any other site
* if using a desktop client, an _X509_ short term certificate is transmitted by the client to any server that requests the user to authenticate
-
Distributed Access Control . The data served from each Node may need to be protected by policies that are administered at another Node. For example, CMIP5 data hosted throughout the federation is protected by the _CMIP5 Research_ and _CMIP5 Commercial_ attributes that are administered by PCMDI. The ESGF security infrastructure supports this model by establishing mutual trust among all the constituent Nodes, and by transmitting security information (Attribute and Authorization statements) as signed documents encoded as SAML (the Security Assertion Markup Language).
-
Metadata Exchange . All Nodes in the federation are configured to query (or fully replicate) each other's metadata holdings. As a consequence, when users initiate a search at any one site, they are able to discover resources of interest throughout the whole federation.
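To illustrate how a Registry can stay current without a central location, the sketch below merges the copy received from a peer into a Node's local copy, keeping the most recent entry for each Node. It is a conceptual illustration only, not the actual Node Manager protocol: the `NodeEntry` fields, the logical `updated` counter, and the `merge_registries` function are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeEntry:
    """One Node's record in a (hypothetical) registry copy."""
    hostname: str
    node_type: str
    endpoints: tuple
    updated: int  # assumed logical timestamp of the last change


def merge_registries(local, remote):
    """Merge two registries (dicts keyed by hostname), keeping newest entries."""
    merged = dict(local)
    for hostname, entry in remote.items():
        current = merged.get(hostname)
        # Accept the peer's entry if the Node is unknown or the entry is newer.
        if current is None or entry.updated > current.updated:
            merged[hostname] = entry
    return merged


# Hypothetical hostnames and endpoints, for illustration only.
local = {
    "node-a.example.org": NodeEntry("node-a.example.org", "data", ("/thredds",), 3),
}
remote = {
    "node-a.example.org": NodeEntry("node-a.example.org", "data", ("/thredds",), 5),
    "node-b.example.org": NodeEntry("node-b.example.org", "index", ("/esg-search",), 1),
}
registry = merge_registries(local, remote)
```

Because the merge keeps the entry with the higher counter regardless of which side supplied it, repeated pairwise exchanges of this kind would let every Node converge on the same view of the federation.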
Traditionally, the data and metadata services deployed throughout the ESGF system have been made available to users through a standard web browser client. Increasingly, though, the ESGF collaboration is working towards enabling direct access to these services via rich desktop clients and toolkits, which allow scripted and more powerful access. Specifically, the following clients are being developed:
-
UV-CDAT . UV-CDAT is a high-performance visualization client that allows the user to query the ESGF data catalogs via any metadata category, and either download the selected files or create visualization plots.
-
esgf-pyclient . Python library for searching and downloading ESGF holdings programmatically.
-
esgfpy-publish . Python library for generating and publishing metadata records to the ESGF system.
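Scripted access of this kind ultimately comes down to HTTP requests against a Node's search service. The sketch below only builds a query URL and performs no network access. It assumes the search service is reachable under the path `esg-search/search` and accepts the `type`, `format`, `distrib` and `limit` parameters plus facet constraints such as `project`; the host name `index.example.org` and the helper `build_search_url` are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(index_node, distrib=True, limit=10, **facets):
    """Build a search URL for an Index Node (assumed endpoint and parameters)."""
    params = {
        "type": "Dataset",
        "format": "application/solr+json",
        # distrib=true asks the Node to search the whole federation,
        # not just its local index.
        "distrib": "true" if distrib else "false",
        "limit": limit,
    }
    params.update(facets)  # facet constraints, e.g. project="CMIP5"
    return f"https://{index_node}/esg-search/search?{urlencode(params)}"


# Hypothetical host name, for illustration only.
url = build_search_url("index.example.org", project="CMIP5")
```

The resulting URL could then be fetched with any HTTP client. A library such as esgf-pyclient wraps this step behind a higher-level interface.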