This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Ability to provide search engine using artifacts from Aggregation of Collection IDs #19

Closed
ShaneMill1 opened this issue Mar 20, 2020 · 6 comments

Comments

@ShaneMill1

As a result of Issue #14, the aggregation-of-collections software produces a JSON output containing the metadata for each dataset.

For example, 00z_gfs_100_collections.json (1.00 Degree GFS) contains:

  • each collection ID
  • the unique parameter ID
  • the parameter long name
  • the collection's dimensions and the dimension values.

We have this information for all of the datasets, so we want to create a catalog of this metadata.

We want to start simple, so as a prototype I want to allow the user to search by the GRIB long_name (e.g. 'Temperature') and have every collection ID whose collection contains 'Temperature' returned, along with a link to each collection.

Then, the collection ID should be informative enough to provide the user some information regarding what type of Temperature is contained in the collection:

So for Temperature, possible returns would be the links for:

  • gfs_100_lat_0_lon_0_forecast_time3_Low_cloud_top_level
  • gfs_100_forecast_time0_lat_0_lon_0_Maximum_wind_level
  • gfs_100_lat_0_lon_0_forecast_time3_Middle_cloud_top_level
  • gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL0_Isobaric_surface_Pa
  • etc., including nam, metar, and other datasets

Therefore, from the collection IDs, a user would see that gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL0_Isobaric_surface_Pa contains the isobaric temperature, could go to that collection, and could select the 500 mb temperature (50000 Pa).

@ShaneMill1
Author

So far, I have created a Python module that creates a catalog out of all of the _collection.json files produced by the aggregation process.

I created a dictionary where each key is the model ID and instance (e.g. 00z_gfs_100), which keeps the dictionary keys unique. The value for each key is another dictionary containing the contents of the associated _collection.json file.
For example, {key: gfs_100_00z, value: contents of 00z_gfs_100_collection.json as a dictionary}
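The catalog construction described above can be sketched roughly as follows; the directory layout, file-naming pattern, and key scheme are assumptions based on the examples in this thread, not the actual module:

```python
import json
from pathlib import Path


def build_catalog(collections_dir):
    """Build a catalog dict keyed by model ID and instance (e.g. '00z_gfs_100').

    Each value is the parsed contents of the matching *_collection.json file.
    The naming convention (00z_gfs_100_collection.json -> key '00z_gfs_100')
    is an assumption inferred from the examples in this issue.
    """
    catalog = {}
    for path in Path(collections_dir).glob("*_collection.json"):
        # Strip the trailing '_collection' from the stem to recover the key.
        key = path.stem.replace("_collection", "")
        with open(path) as f:
            catalog[key] = json.load(f)
    return catalog
```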

To recap, the contents of 00z_gfs_100_collection.json include the collection IDs, and the associated dimensions, long_name, level_type, etc.

Now, with this catalog in the form of a dictionary, the Python module takes an input keyword (a string) and searches the long_names for that string. It builds a list of collection IDs that match, and from that list the resulting URL list is created.

For example, a search for "ozone" is executed as:

[dataapi@data-api search_engine]$ ./search.py ozone

And the following list is returned:

['https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_025_forecast_time0_lat_0_lon_0_Entire_atmosphere_considered_as_a_single_layer/instance/00z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_025_forecast_time0_lat_0_lon_0_lv_ISBL12_Isobaric_surface_Pa/instance/00z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_050_forecast_time0_lat_0_lon_0_Entire_atmosphere_considered_as_a_single_layer/instance/00z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_050_forecast_time0_lat_0_lon_0_lv_ISBL12_Isobaric_surface_Pa/instance/00z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_Entire_atmosphere_considered_as_a_single_layer/instance/00z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL12_Isobaric_surface_Pa/instance/00z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_Entire_atmosphere_considered_as_a_single_layer/instance/06z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL12_Isobaric_surface_Pa/instance/06z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_Entire_atmosphere_considered_as_a_single_layer/instance/12z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL12_Isobaric_surface_Pa/instance/12z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_Entire_atmosphere_considered_as_a_single_layer/instance/18z',
 'https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL12_Isobaric_surface_Pa/instance/18z']
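The keyword search over long_names might look something like this minimal sketch; the catalog layout (instance key, 'collections' list, 'parameters' with 'long_name'), the 'collection_id' field, and the URL template are assumptions inferred from the examples here, not the actual module:

```python
def search_catalog(catalog, keyword,
                   base_url="https://data-api.mdl.nws.noaa.gov/EDR-API"):
    """Return collection URLs whose parameter long_name contains keyword.

    The catalog structure assumed here is a guess based on the description
    in this issue; the real _collection.json layout may differ.
    """
    urls = []
    keyword = keyword.lower()
    for instance_key, contents in catalog.items():
        # Assume keys look like '00z_gfs_100', so the instance is the prefix.
        instance = instance_key.split("_")[0]
        for coll in contents.get("collections", []):
            long_names = [p.get("long_name", "")
                          for p in coll.get("parameters", [])]
            if any(keyword in ln.lower() for ln in long_names):
                urls.append(f"{base_url}/collections/"
                            f"{coll['collection_id']}/instance/{instance}")
    return urls
```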

Although this is currently run from the command line, the next step is to import this module into wow_point_server.py and add an interface, either through a /search URL (Flask endpoint) or by modifying the root.html template file to add a Search header and an associated form.
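A minimal sketch of what the /search Flask endpoint could look like; the route name, query parameter, and the stubbed search_catalog helper are all hypothetical stand-ins for the real module that would be imported into wow_point_server.py:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def search_catalog(keyword):
    """Hypothetical stand-in for the real catalog search module."""
    demo = {
        "temperature": [
            "gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL0_Isobaric_surface_Pa"
        ]
    }
    return demo.get(keyword.lower(), [])


@app.route("/search")
def search():
    # e.g. GET /search?q=temperature returns matching collection IDs as JSON.
    keyword = request.args.get("q", "")
    return jsonify(search_catalog(keyword))
```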

@ShaneMill1
Author

ShaneMill1 commented Mar 20, 2020

We now have the search engine (searching by GRIB long name) implemented on our instance:

https://data-api.mdl.nws.noaa.gov/EDR-API

You can search for a keyword, and the collections whose parameters have a long name containing that keyword will be shown. You can click on one of the links and be taken to the point where you can make a query.

The links returned are in a dictionary that is not ordered, so my next move will be to order that dictionary.
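Since Python 3.7, plain dicts preserve insertion order, so one lightweight way to order the results is to rebuild the dictionary from sorted items (the sample entries below are taken from the search output earlier in this thread):

```python
# Sample results keyed by collection ID, values being the collection URLs.
results = {
    "automated_gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL12_Isobaric_surface_Pa":
        "https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_100_forecast_time0_lat_0_lon_0_lv_ISBL12_Isobaric_surface_Pa/instance/00z",
    "automated_gfs_025_forecast_time0_lat_0_lon_0_Entire_atmosphere_considered_as_a_single_layer":
        "https://data-api.mdl.nws.noaa.gov/EDR-API/collections/automated_gfs_025_forecast_time0_lat_0_lon_0_Entire_atmosphere_considered_as_a_single_layer/instance/00z",
}

# Rebuilding from sorted items yields a dict ordered by collection ID.
ordered = dict(sorted(results.items()))
```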

Additional work would be to match the keywords with other metadata attributes, such as the dimension names (e.g. ISBL for isobaric).

@ShaneMill1
Author

Following up on the last comment, I added the ability to search by the GRIB parameter long name as well as the dimension long name, so that a user can search for isobaric,temperature to narrow the results further.
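The combined parameter/dimension keyword match could be sketched like this; the exact matching rule (every comma-separated keyword must appear in at least one long name) is an assumption, not necessarily the module's actual behavior:

```python
def matches(keywords, long_names):
    """True when every comma-separated keyword appears in at least one of
    the collection's long names (parameter or dimension), case-insensitively.

    Sketch of the 'isobaric,temperature' search described above; the real
    matching logic is an assumption.
    """
    names = [n.lower() for n in long_names]
    return all(
        any(kw.strip().lower() in n for n in names)
        for kw in keywords.split(",")
    )
```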

In a previous meeting, it was discussed that we would like to utilize the OpenSearch geo and time extensions. I did some research and found the pycsw module, which offers a Python implementation of OGC CSW as well as the OpenSearch geo and time extensions.
http://docs.pycsw.org/en/stable/introduction.html

Looking at the documentation, I was able to incorporate pycsw into our EDR-API implementation following this approach:
https://docs.pycsw.org/en/latest/api.html

Then, following the steps provided at the link below, I was able to create a compliant blank sqlite database to start from:
http://docs.pycsw.org/en/stable/administration.html#metadata-repository-setup
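For reference, pointing pycsw's repository at an SQLite database like the one created in that setup step amounts to a small config entry along these lines (the database path here is a placeholder, not our actual configuration):

```ini
# Excerpt from a pycsw configuration file; the path to records.db is a
# placeholder for wherever the blank SQLite database was created.
[repository]
database=sqlite:///records.db
table=records
```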

Finally, you can see the beginning of our implementation at the following endpoint:
https://data-api.mdl.nws.noaa.gov/EDR-API/csw?service=CSW&version=2.0.2&request=GetCapabilities

My next steps will be to connect the dots from how I create metadata in the aggregation of collections software and incorporate that metadata into these services.

@ShaneMill1
Author

After discussion with Mark, Chris, and Steve, I will continue to work with the pycsw module and will work on populating the SQLite database with metadata that comes from the aggregation software. The OpenSearch geo and time extensions provide simple queries and responses for the search engine, so the focus will be on this aspect for now. The pycsw module is very extensible, so we can pick which other services we want to expose; for now, though, the focus will be on OpenSearch.

@tomkralidis
Contributor

@ShaneMill1 let's continue the discussion on what/how the formal metadata would look to serve as part of a CSW instance (Dublin Core, ISO, etc.). This will also give us an opportunity to investigate the OGC API - Records work (disclosure: I'm part of this SWG and working on an implementation).

@ShaneMill1
Author

ShaneMill1 commented Apr 3, 2020

@ShaneMill1 let's continue the discussion on what/how the formal metadata would look to serve as part of a CSW instance (Dublin Core, ISO, etc.). This will also give us an opportunity to investigate the OGC API - Records work (disclosure: I'm part of this SWG and working on an implementation).

@tomkralidis Absolutely; I am going to create an issue in the EDR SWG GitHub to continue this discussion. All of that sounds good to me!

located here: opengeospatial/ogcapi-environmental-data-retrieval#40
