REST Services
The following is a basic overview of the SemTK REST Services. They are all built on Spring Boot.
Core SemTK Services:
- Query Service
- Ingestion Service
- Ontology Info Service
- Nodegroup Service
- Nodegroup Store Service
- Nodegroup Execution Service
- Dispatch Service
- Results Service
- Status Service
- Utility Service
SemTK Services for EDC and FDC:
- EDC Query Generation Service
- FDC Cache Service
- FDC Sample Service
SemTK Services that wrap external data stores:
- Athena Service (AWS Athena)
- ArangoDB Service
- Hive Service (Apache Hive)
- File Staging Service
See the diagrams at the bottom for a view of how these services interact.
The Query Service wraps calls to triple stores (e.g. Virtuoso), handling input parameter naming, enforcing an output format, and providing basic error handling. This makes interactions with different triple stores predictable, since their behavior often varies, particularly in default return encapsulation and error handling. Routing all triple store queries through a single point also allows SemTK to provide utility methods for performing common actions without writing SPARQL queries.
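For illustration, a query might be submitted to the Query Service over HTTP as sketched below in Python. This is a minimal sketch: the endpoint path, parameter names, and ports are assumptions, not the service's actual API.

import requests

QUERY_SERVICE = "http://localhost:12050"  # example port; deployment-specific

# Hypothetical endpoint path and parameter names, for illustration only.
response = requests.post(
    f"{QUERY_SERVICE}/sparqlQueryService/query",
    json={
        "serverAndPort": "http://localhost:8890",  # the triple store, e.g. Virtuoso
        "serverType": "virtuoso",
        "graph": "http://my/graph",
        "query": "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
    },
)
response.raise_for_status()
print(response.json())  # results arrive in the service's standard JSON envelope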
The Ingestion Service simplifies the insertion of triples into the triple store. The data for these triples can currently come from either an ODBC connection or a CSV file. The Ingestion Service uses a JSON template to define the structure of the sub-graph to be populated and the transformations needed to map the input values to the output triples.
This service supports both a direct mode (which inserts each record independently) and a pre-check mode (which inserts data only if every input record is verified to be error-free).
More information is here
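As a rough sketch, a CSV ingestion call might look like the Python below. The endpoint path and field names are assumptions for illustration; the real template format is described in the documentation linked above.

import requests

INGESTION_SERVICE = "http://localhost:12070"  # example port; deployment-specific

# Hypothetical endpoint and field names, for illustration only.
# Pre-check mode: nothing is inserted unless every record is error-free.
response = requests.post(
    f"{INGESTION_SERVICE}/ingestion/fromCsvPrecheck",
    json={
        "template": open("my_template.json").read(),  # JSON template mapping CSV columns to triples
        "data": open("my_data.csv").read(),
    },
)
print(response.json())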
The Ontology Info Service provides information about the ontology (semantic model). Among other things, it is used to populate the left-hand pane in SparqlGraph, which presents a hierarchical view of the ontology.
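A minimal sketch of retrieving ontology information, assuming a hypothetical endpoint path and parameter name:

import requests

OINFO_SERVICE = "http://localhost:12057"  # example port; deployment-specific

# Hypothetical endpoint path and parameter name, for illustration only.
connection_json = open("my_connection.json").read()  # SPARQL connection identifying the model
response = requests.post(
    f"{OINFO_SERVICE}/ontologyinfo/getOntologyInfoJson",
    json={"jsonRenderedSparqlConnection": connection_json},
)
print(response.json())  # e.g. class and property hierarchy information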
The Nodegroup Service allows creation and editing of nodegroups. The general flow is for a caller to provide nodegroup JSON and connection information for the ontology, and receive new nodegroup JSON in return.
Note the following:
- The jsonRenderedNodeGroup endpoint parameter may be the JSON from a SparqlGraphJson (including the connection information, NodeGroup, and ImportSpec) or simply a NodeGroup. Endpoints that require the ontology via connection JSON will throw an error if the connection is missing from the jsonRenderedNodeGroup, while those that don't need the ontology will accept either form of the parameter.
- Endpoints that use a conn or jsonRenderedNodeGroup connection to retrieve an ontology will cache the ontology for future calls to the Nodegroup Service. If the ontology changes between calls, it may take several minutes before the service sees the new version. Calling /clearCachedOntology (see the sketch below) clears the cache so that ontology changes take effect immediately.
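For example, the cache could be cleared as sketched below. /clearCachedOntology is the endpoint named above, but the base path and port are assumptions.

import requests

NODEGROUP_SERVICE = "http://localhost:12059"  # example port; deployment-specific

# /clearCachedOntology is described above; the base path is an assumption.
response = requests.post(f"{NODEGROUP_SERVICE}/nodeGroup/clearCachedOntology")
print(response.status_code)  # the next call will re-fetch the ontology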
The Nodegroup Store Service provides storage and retrieval of nodegroups. A nodegroup captures how to generate semantic queries for a subgraph of interest. Nodegroups may also store information such as a SPARQL connection and/or data loading specifications.
More information is here
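A sketch of storing and later retrieving a nodegroup; the endpoint paths and field names below are assumptions, not the service's actual API:

import requests

STORE_SERVICE = "http://localhost:12056"  # example port; deployment-specific

# Hypothetical endpoint paths and field names, for illustration only.
sgjson = open("demoNodegroup.json").read()  # SparqlGraphJson to store
requests.post(
    f"{STORE_SERVICE}/nodeGroupStore/storeNodeGroup",
    json={
        "name": "demoNodegroup",
        "comments": "demo query over the sample graph",
        "creator": "me",
        "jsonRenderedNodeGroup": sgjson,
    },
)

# Later, retrieve the stored nodegroup by its id.
response = requests.post(
    f"{STORE_SERVICE}/nodeGroupStore/getNodeGroupById",
    json={"id": "demoNodegroup"},
)
print(response.json())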
The Nodegroup Execution Service executes a nodegroup to retrieve data programmatically, as an alternative to using the SparqlGraph UI.
More information is here
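As a rough sketch, a stored nodegroup might be executed over HTTP as follows. The endpoint path, parameter names, and port are assumptions; see the linked documentation for the actual API.

import requests

EXEC_SERVICE = "http://localhost:12058"  # example port; deployment-specific

# Hypothetical endpoint and parameter names, for illustration only.
response = requests.post(
    f"{EXEC_SERVICE}/nodeGroupExecution/dispatchSelectById",
    json={
        "nodeGroupId": "demoNodegroup",           # id under which the nodegroup was stored
        "sparqlConnection": "NODEGROUP_DEFAULT",  # reuse the connection stored with it
    },
)
print(response.json())  # typically returns a job id to track via the Status Service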
The Dispatch Service fulfills incoming queries by initiating SPARQL queries to retrieve data and sending the results to the Results Service. It updates the job's status and percent completion using the Status Service.
By default, this service uses the dispatcher class com.ge.research.semtk.sparqlX.asynchronousQuery.AsynchronousNodeGroupDispatcher (see the .env file), which fulfills queries containing plain SPARQL by sending them to the triple store for execution.
The default dispatcher class may be replaced by a custom dispatcher when needed. For example, to use the EDC feature, the Dispatch Service must be configured to use com.ge.research.semtk.sparqlX.dispatch.EdcDispatcher, which accepts and processes SPARQL queries that may involve data external to the triple store.
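As a sketch, the swap might look like the line below in the service's .env file. The property name here is an assumption; consult your deployment's .env file for the real key.

# Hypothetical property name, for illustration only.
DISPATCHER_CLASS_NAME=com.ge.research.semtk.sparqlX.dispatch.EdcDispatcher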
The Results Service enables query results to be written to a cache and subsequently retrieved using a job ID.
The Status Service keeps track of in-progress and completed tasks. Given a job ID, it provides the status and percent complete for the task, and accepts status updates for that job. Together with the Results Service, it enables jobs to be run asynchronously.
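Together these two services support a poll-then-fetch pattern, sketched below. The endpoint paths, payload shapes, and ports are assumptions for illustration.

import time
import requests

STATUS_SERVICE = "http://localhost:12051"   # example ports; deployment-specific
RESULTS_SERVICE = "http://localhost:12052"

def wait_and_fetch(job_id):
    """Poll the Status Service until a job finishes, then fetch its results."""
    # Hypothetical endpoint paths and payload shapes, for illustration only.
    while True:
        status = requests.post(
            f"{STATUS_SERVICE}/status/getPercentComplete",
            json={"jobId": job_id},
        ).json()
        if status.get("percentComplete") == 100:
            break
        time.sleep(1)
    return requests.post(
        f"{RESULTS_SERVICE}/results/getTableResultsJson",
        json={"jobId": job_id},
    ).json()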
The Utility Service provides endpoints for managing SemTK. Currently the only utility endpoints available are for configuring EDC (external data connections), as shown below. In the future, other types of utility endpoints may be added.
The EDC Query Generation Service generates queries to run against an external data store. Its endpoints are described below.
Generates queries to be run against a relational database containing time-coherent time series data. This service assumes that the time series data has a timestamp column of type double, containing a Unix timestamp in seconds (e.g. 1497015968.556 for Fri, 09 Jun 2017 13:46:08). The name of the column is configurable in SemTK. Example queries (for Hive), first unconstrained and then with value and time constraints:
select cast(timestamp_col AS timestamp) as timestamp, pressure2 AS PRESSURE, diameter2 AS DIAMETER, flow2 AS FLOW from my_database.my_table order by timestamp
select cast(timestamp_col AS timestamp) as timestamp, pressure3 AS PRESSURE, diameter3 AS DIAMETER, flow3 AS FLOW
from my_database.my_table
where ( (pressure3 > 80) OR (diameter3 > 40000 AND diameter3 < 70000) )
and ( ( unix_timestamp(to_utc_timestamp(timestamp_col,'Etc/GMT+0'), 'yyyy-MM-dd hh:mm:ss') >= unix_timestamp('10/08/2014 10:00:00 AM','MM/dd/yyyy hh:mm:ss a') )
AND ( unix_timestamp(to_utc_timestamp(timestamp_col,'Etc/GMT+0'), 'yyyy-MM-dd hh:mm:ss') <= unix_timestamp('10/08/2014 11:00:00 AM','MM/dd/yyyy hh:mm:ss a') ) )
order by timestamp
Flags may be used to omit column aliases, or to return the timestamps in their stored format (rather than converting to a human-readable timestamp).
Generates queries to be run against KairosDB. Example query:
{"start_relative":{"value":10,"unit":"YEARS"},"cacheTime":0,"metrics":[{"name":"PRESSURE","tags":{},"group_by":[],"aggregators":[]},{"name":"DIAMETER","tags":{},"group_by":[],"aggregators":[]}]}
Generates "queries" with the necessary information to retrieve a file from a file store (e.g. HDFS). Example:
*hdfs://test1-98231834.img###test1.img
(the file location and the file name, separated by ###)
The following diagrams show the interactions between various SemTK services.