
[FEATURE] Modify integration tests to run with OpenSearch vendors (multi-vendor IT abilities ) #647

Open
ykmr1224 opened this issue Sep 11, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@ykmr1224
Collaborator

Is your feature request related to a problem?
Currently, the integration tests run against an OpenSearch server in Docker.
Since we want to make Flint compatible with OpenSearch Serverless, the integration tests should also be able to run against Serverless so that we can verify its functionality easily.

What solution would you like?
We need the following changes:

  • Mock the metadata storage, since we don't store metadata in Serverless (otherwise we need to come up with a way to avoid issues caused by the WAIT_FOR refresh policy not being supported)
  • Modify FlintSparkSuite and OpenSearchSuite so they can switch the OpenSearch target to a serverless collection (since we cannot run Serverless within Docker, we might not be able to run these tests in a GitHub workflow)
  • Fix the verification method so it does not depend on queryIndex
    • OpenSearch Serverless does not provide the stats API, and the DataFrame-based verification fails because of that. (Another approach is to eliminate the usage of the stats API.)
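A rough sketch of the suite-switching idea above. All names here are illustrative assumptions, not existing Flint settings: the `AOSS_ENDPOINT` variable and the `-D` property names are hypothetical, and FlintSparkSuite would need to be taught to read them.

```shell
#!/usr/bin/env bash
# Hypothetical launcher: pick the integration-test target from the environment.
if [ -n "${AOSS_ENDPOINT:-}" ]; then
  # Serverless collection: no stats API, no WAIT_FOR refresh policy.
  SBT_OPTS="-Dopensearch.endpoint=${AOSS_ENDPOINT} -Dopensearch.serverless=true"
else
  # Default: dockerized open-source OpenSearch.
  SBT_OPTS="-Dopensearch.endpoint=http://localhost:9200"
fi
echo "integration test target: ${SBT_OPTS}"
# sbt ${SBT_OPTS} integtest/test   # actual invocation requires the repo checkout
```

The point of routing the choice through one switch is that the dockerized open-source target stays the default, and the serverless path is opt-in for whoever has a collection to point at.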

What alternatives have you considered?
n/a

Do you have any additional context?
n/a

@ykmr1224 ykmr1224 added enhancement New feature or request untriaged labels Sep 11, 2024
@dblock
Member

dblock commented Sep 30, 2024

Looks like you want to run open source tests against a vendor-specific distribution.

We definitely want to continue running tests here against dockerized open-source OpenSearch.

For vendor-specific features we do have some precedent. There's an issue for running tests against vendor-specific distributions or instances in opensearch-project/opensearch-clients#47 for SigV4.

One thing to be mindful of is that blocking PRs because of a vendor-specific solution being broken doesn't seem right.


@dblock dblock removed the untriaged label Sep 30, 2024
@AmiStrn

AmiStrn commented Sep 30, 2024

@ykmr1224 thanks for the Feature request!

I am alarmed at the possibility that the vendor-neutral project will be tied down in any way by a vendor-specific one.
The following scenario is what I am afraid of:

  1. Contributor wishes to make changes to opensearch-spark
  2. Works on code and submits a PR
  3. PR fails tests. Changes are required in OpenSearch Core (open source) and OpenSearch Serverless (closed source)
  4. Changes are made to Core (now the feature is compatible with the open-source project)
  5. PR still fails tests because the change breaks when running on OpenSearch Serverless
  6. Contributor must ask folks maintaining OpenSearch Serverless to make changes in their code
  7. Feature is stuck until vendor makes changes (IF they agree to the changes)

This enables the vendor to take control of large portions of open-source code de facto. @ykmr1224 @dblock I am curious if there are positive precedents for this type of dependency out there.

This may not be 100% true for this feature, but could be the case for other vendor-owned integrations, or might start to happen once we open the possibility and push a gray line further and further.

@reta

reta commented Oct 1, 2024

Opened up opensearch-project/.github#229 to discuss the issue at large

@normanj-bitquill
Contributor

As mentioned above, issue opensearch-project/.github#229 discusses a better structure. Until that is achieved, there may still be a desire to run tests against Spark EMR. This should be a stopgap measure until the OpenSearch Serverless release is better aligned with the OSS release.

With some initial testing, I was able to run SQL queries against the Spark EMR docker image.
https://gallery.ecr.aws/emr-serverless/spark/emr-7.2.0

Here is what I needed to do:

  1. Create a directory to hold logging configuration. This really just needs two specific shell scripts. In my testing, these were empty shell scripts.
    1. run-adot-collector.sh
    2. run-fluentd-spark.sh
  2. Create a directory that holds the Spark app
  3. Get a local copy of the /etc/spark/conf directory from the Spark EMR Docker image.
  4. Add this line to the end of the file spark-defaults.conf
    spark.sql.legacy.createHiveTableByDefault false
    
  5. Finally run the docker image
    docker run \
        --name emr \
        -v ./spark-logging:/var/loggingConfiguration/spark \
        -v ./app:/app \
        -v ./conf:/etc/spark/conf \
        public.ecr.aws/emr-serverless/spark/emr-7.2.0:20241022 \
        driver \
        --class MyApp \
        /app/myapp_2.12-1.0.jar
    

This is a serverless image, so it will quickly run and then exit. I tested with a Scala app for Spark.

There is still a step missing for including the OpenSearch PPL extension.
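The directory setup in steps 1–4 above can be scripted. A minimal sketch, with the Docker-dependent copy of `/etc/spark/conf` (step 3) left commented out since it needs the image pulled locally:

```shell
# Prepare the three directories mounted into the EMR image above.
mkdir -p spark-logging app conf

# Step 1: the two logging shell scripts the image expects (empty is enough).
touch spark-logging/run-adot-collector.sh spark-logging/run-fluentd-spark.sh
chmod +x spark-logging/run-adot-collector.sh spark-logging/run-fluentd-spark.sh

# Step 3: copy /etc/spark/conf out of the image (requires Docker):
# docker create --name emr-tmp public.ecr.aws/emr-serverless/spark/emr-7.2.0:20241022
# docker cp emr-tmp:/etc/spark/conf/. ./conf/ && docker rm emr-tmp

# Step 4: append the Hive-table default override.
echo "spark.sql.legacy.createHiveTableByDefault false" >> conf/spark-defaults.conf
```

After this, the `docker run` command from step 5 can be executed from the same directory, with `./app` holding the compiled Spark app jar.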

@normanj-bitquill
Contributor

Another consideration is how to actually run tests on the docker image. I have a proposal:

Structure the tests with several files:

  • Data files
  • Setup queries
  • PPL query
  • Expected results

Create a Spark app. It can do some initial setup, such as creating tables. For each test:

  1. Run the setup queries for the test
  2. Run the PPL query
  3. Write the results to a JSON file that is on a mounted volume of the docker image

After the Spark app finishes, go over each test and verify that the expected results match the actual results.

This means that there will be two components for running the tests:

  • A test runner that runs on the host machine and can be started from the SBT build
  • A Spark app that runs the queries and captures the results for each of the tests
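The host-side verification step could be as simple as diffing normalized JSON per test. The directory layout here (`tests/<name>/expected.json` next to the `actual.json` the Spark app writes to the mounted volume) is an assumption, and the fixture at the top only stands in for a real Spark run:

```shell
# Demo fixture: one test whose actual results match (in practice the Spark
# app writes actual.json to the mounted volume).
mkdir -p tests/demo
echo '{"rows": [[1, "a"]], "schema": ["id", "name"]}' > tests/demo/expected.json
echo '{"schema": ["id", "name"], "rows": [[1, "a"]]}' > tests/demo/actual.json

status=0
for expected in tests/*/expected.json; do
  dir=$(dirname "$expected")
  # Normalize key order so semantically equal JSON compares equal.
  python3 -m json.tool --sort-keys "$expected" > "$dir/expected.norm"
  python3 -m json.tool --sort-keys "$dir/actual.json" > "$dir/actual.norm"
  if diff -q "$dir/expected.norm" "$dir/actual.norm" > /dev/null; then
    echo "PASS $dir"
  else
    echo "FAIL $dir"
    status=1
  fi
done
```

Normalizing with `json.tool --sort-keys` makes the comparison insensitive to key ordering, though row ordering would still have to be pinned down by the queries themselves (e.g. with an ORDER BY in the setup).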

@YANG-DB
Member

YANG-DB commented Dec 3, 2024

Create a Spark app. It can do some initial setup, such as creating tables. For each test:

@normanj-bitquill thanks for the suggestions!
I agree we need a dedicated Spark job to run the tests and create a report on the success and performance of the tests being executed.

In addition, I was hoping to use OpenSearch Dashboards' PPL workbench via a Spark data source to access a local Docker Spark cluster running the Flint job. Can you experiment with whether this is achievable, and what the prerequisites are?

@YANG-DB
Member

YANG-DB commented Dec 3, 2024

(quoting @normanj-bitquill's Spark EMR Docker walkthrough above in full)

@normanj-bitquill this looks good. Let's create a tutorial page with all the needed scripts so anyone can experiment with this.

@YANG-DB
Member

YANG-DB commented Dec 3, 2024

(quoting @AmiStrn's comment above in full)

@AmiStrn thanks for the feedback.
I do believe that opening our code to multi-vendor contributions in a well-managed way would actually benefit the community. Some open source projects use this approach, allowing vendor-specific code contributions in a repo of their own (similar to OpenTelemetry's contrib repositories).

We do need to avoid any direct build or runtime dependencies on these contrib components.
We need an additional abstraction layer to allow the specific vendor implementation without explicitly depending on it (the next example is NOT a good separation of concerns, IMO).

@YANG-DB YANG-DB changed the title [FEATURE] Modify integration tests to run with OpenSearch Serverless [FEATURE] Modify integration tests to run with OpenSearch vendors (multi-vendor IT abilities ) Dec 3, 2024