Add documentation (#2229) (#2233)
(cherry picked from commit 45da40f)

Signed-off-by: Vamsi Manohar <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 00ae8ce commit 5fa8e54
Showing 2 changed files with 32 additions and 25 deletions.
22 changes: 14 additions & 8 deletions docs/user/interfaces/asyncqueryinterface.rst
@@ -14,23 +14,29 @@ Async Query Interface Endpoints
Introduction
============

For supporting `S3Glue <../ppl/admin/connectors/s3glue_connector.rst>`_ and Cloudwatch datasources connectors, we have introduced a new execution engine on top of Spark.
To support the `S3Glue <../ppl/admin/connectors/s3glue_connector.rst>`_ datasource connector, we have introduced a new execution engine on top of Spark.
Queries for the Spark execution engine can be submitted only through the Async Query APIs. The sections below list all the new APIs introduced.


Configuration required for Async Query APIs
======================================
Currently, we only support AWS emr serverless as SPARK execution engine. The details of execution engine should be configured under
``plugins.query.executionengine.spark.config`` cluster setting. The value should be a stringified json comprising of ``applicationId``, ``executionRoleARN``,``region``.
Required Spark Execution Engine Config for Async Query APIs
===========================================================
Currently, we support only AWS EMR Serverless as the Spark execution engine. The details of the execution engine should be configured under
``plugins.query.executionengine.spark.config`` in the cluster settings. The value should be a stringified JSON object comprising ``applicationId``, ``executionRoleARN``, ``region`` and ``sparkSubmitParameter``.
Sample Setting Value ::

plugins.query.executionengine.spark.config: '{"applicationId":"xxxxx", "executionRoleARN":"arn:aws:iam::***********:role/emr-job-execution-role","region":"eu-west-1"}'


    plugins.query.executionengine.spark.config:
    '{  "applicationId":"xxxxx",
        "executionRoleARN":"arn:aws:iam::***********:role/emr-job-execution-role",
        "region":"eu-west-1",
        "sparkSubmitParameter": "--conf spark.dynamicAllocation.enabled=false"
    }'

If this setting is not configured during bootstrap, the Async Query APIs are disabled, and a cluster restart is required to enable them again.
We use the default AWS credentials chain to make calls to the EMR Serverless application. Make sure the default credentials
have pass-role permissions for the ``emr-job-execution-role`` mentioned in the engine configuration.

* ``applicationId``, ``executionRoleARN`` and ``region`` are required parameters.
* ``sparkSubmitParameter`` is an optional parameter. It can take the form ``--conf A=1 --conf B=2 ...``.
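
Because the setting value is a JSON object serialized into a string, it is easy to get the quoting wrong when writing it by hand. The following is a minimal sketch, not part of the plugin, of how the value could be generated. The helper name ``build_spark_config``, the ``spark_conf`` dictionary and the account ID in the role ARN are hypothetical placeholders; only the four key names and the ``--conf A=1 --conf B=2`` form come from the text above.

```python
import json

# Keys the documentation lists as required.
REQUIRED_KEYS = ("applicationId", "executionRoleARN", "region")


def build_spark_config(application_id, execution_role_arn, region, spark_conf=None):
    """Return the stringified JSON for plugins.query.executionengine.spark.config.

    spark_conf is a hypothetical convenience: a dict of Spark properties that is
    rendered in the documented "--conf A=1 --conf B=2" form.
    """
    config = {
        "applicationId": application_id,
        "executionRoleARN": execution_role_arn,
        "region": region,
    }
    missing = [key for key in REQUIRED_KEYS if not config[key]]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    if spark_conf:
        # sparkSubmitParameter is optional, so it is added only when provided.
        config["sparkSubmitParameter"] = " ".join(
            f"--conf {name}={value}" for name, value in spark_conf.items()
        )
    return json.dumps(config)


if __name__ == "__main__":
    value = build_spark_config(
        "xxxxx",
        "arn:aws:iam::123456789012:role/emr-job-execution-role",  # placeholder ARN
        "eu-west-1",
        {"spark.dynamicAllocation.enabled": "false"},
    )
    print(f"plugins.query.executionengine.spark.config: '{value}'")
```

Generating the value with ``json.dumps`` keeps the inner double quotes consistent, so the result can be wrapped in single quotes as in the sample setting.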


Async Query Creation API
======================================
35 changes: 18 additions & 17 deletions docs/user/ppl/admin/connectors/s3glue_connector.rst
@@ -14,29 +14,22 @@ S3Glue Connector
Introduction
============

Properties in DataSource Configuration

* name: A unique identifier for the data source within a domain.
* connector: Currently supports the following connectors: s3glue, spark, prometheus, and opensearch.
* resultIndex: Stores the results of queries executed on the data source. If unavailable, it defaults to .query_execution_result.

Glue Connector
========================================================

The s3Glue connector provides a way to query S3 files using Glue as the metadata store and Spark as the execution engine.
This page covers s3Glue datasource configuration and how to query an s3Glue datasource.

Required resources for s3 Glue Connector
========================================
* S3: This is where the data lies.
* Spark Execution Engine: Query Execution happens on spark.
* Glue Metadata store: Glue takes care of table metadata.
* Opensearch: Index for s3 data lies in opensearch and also acts as temporary buffer for query results.
* ``EMRServerless Spark Execution Engine Config Setting``: Since we execute s3Glue queries on top of spark execution engine, we require this configuration.
More details: `ExecutionEngine Config <../../../interfaces/asyncqueryinterface.rst#id2>`_
* ``S3``: This is where the data lies.
* ``Glue`` Metadata store: Glue takes care of table metadata.
* ``Opensearch IndexStore``: The index for S3 data lives in OpenSearch, which also acts as a temporary buffer for query results.

We currently support only emr-serverless as the Spark execution engine and Glue as the metadata store. We will add more support in the future.

Glue Connector Properties.

* ``resultIndex`` is a new parameter specific to the Glue connector. It stores the results of queries executed on the data source. If it is not specified, it defaults to ``.query_execution_result``.
* ``glue.auth.type`` [Required]
* This parameter provides the authentication type information required for the execution engine to connect to Glue.
* The S3 Glue connector currently supports only ``iam_role`` authentication, and the parameters below are required.
@@ -78,11 +71,19 @@ Glue datasource configuration::
"glue.indexstore.opensearch.uri": "http://adsasdf.amazonopensearch.com:9200",
"glue.indexstore.opensearch.auth" :"awssigv4",
"glue.indexstore.opensearch.auth.region" :"awssigv4",
}
},
"resultIndex": "query_execution_result"
}]

Sample s3Glue datasource queries
================================
<To Be Added>
Sample s3Glue datasource queries APIS
=====================================

Sample Queries

* Select Query: ``select * from mys3.default.http_logs limit 1``
* Create Covering Index Query: ``create index clientip_year on my_glue.default.http_logs (clientip, year) WITH (auto_refresh=true)``
* Create Skipping Index: ``create skipping index on mys3.default.http_logs (status VALUE_SET)``

These queries work only on top of the Async Query APIs. Documentation: `Async Query APIs <../../../interfaces/asyncqueryinterface.rst>`_

Documentation for Index Queries: https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md
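
Since the sample queries above have to go through the Async Query APIs, a small sketch of wrapping one in a request body may help. It rests on assumptions not confirmed on this page: that the creation API accepts a JSON body with ``lang`` and ``query`` fields and is exposed at ``POST /_plugins/_async_query``. Check the Async Query APIs documentation for the authoritative request shape. The helper ``build_async_query_request`` is hypothetical, and no network call is made.

```python
import json

# Assumed endpoint of the Async Query Creation API (not confirmed by this page).
ASYNC_QUERY_ENDPOINT = "/_plugins/_async_query"

# Sample queries taken from the list above.
SAMPLE_QUERIES = [
    "select * from mys3.default.http_logs limit 1",
    "create skipping index on mys3.default.http_logs (status VALUE_SET)",
]


def build_async_query_request(query, lang="sql"):
    """Return a JSON request body for one query (field names are assumptions)."""
    query = query.strip()
    if not query:
        raise ValueError("query must not be empty")
    return json.dumps({"lang": lang, "query": query})


if __name__ == "__main__":
    for sample in SAMPLE_QUERIES:
        print("POST", ASYNC_QUERY_ENDPOINT)
        print(build_async_query_request(sample))
```

Building the body with ``json.dumps`` avoids hand-escaping quotes inside the query text, which matters for statements such as ``WITH (auto_refresh=true)`` once string literals are involved.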
