
[FEATURE] Modify integration tests to run with OpenSearch vendors (multi-vendor IT abilities ) #647

Open
ykmr1224 opened this issue Sep 11, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@ykmr1224
Collaborator

Is your feature request related to a problem?
Currently, the integration tests run against an OpenSearch server in Docker.
Since we want to make Flint compatible with OpenSearch Serverless, the integration tests should also be able to run against Serverless so that we can verify its functionality easily.

What solution would you like?
We need the following changes:

  • Mock the metadata storage, since we don't store metadata in Serverless (otherwise we need to come up with a way to avoid issues caused by the WAIT_FOR refresh policy not being supported)
  • Modify FlintSparkSuite and OpenSearchSuite so they can switch the OpenSearch target to a serverless collection (since we cannot run Serverless within Docker, we might not be able to run these tests in a GitHub workflow)
  • Fix the verification method so it does not depend on queryIndex
    • OpenSearch Serverless does not provide the stats API, and the DataFrame-based verification fails because of that. (Another approach is to eliminate the usage of the stats API.)
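A rough sketch of the suite-switching idea above. All names here are illustrative assumptions, not existing Flint settings: the `AOSS_ENDPOINT` variable and the `-D` property names are hypothetical, and FlintSparkSuite would need to be taught to read them.

```shell
#!/usr/bin/env bash
# Hypothetical launcher: pick the integration-test target from the environment.
if [ -n "${AOSS_ENDPOINT:-}" ]; then
  # Serverless collection: no stats API, no WAIT_FOR refresh policy.
  SBT_OPTS="-Dopensearch.endpoint=${AOSS_ENDPOINT} -Dopensearch.serverless=true"
else
  # Default: dockerized open-source OpenSearch.
  SBT_OPTS="-Dopensearch.endpoint=http://localhost:9200"
fi
echo "integration test target: ${SBT_OPTS}"
# sbt ${SBT_OPTS} integtest/test   # actual invocation requires the repo checkout
```

The point of routing the choice through one switch is that the dockerized open-source target stays the default, and the serverless path is opt-in for whoever has a collection to point at.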

What alternatives have you considered?
n/a

Do you have any additional context?
n/a

@ykmr1224 ykmr1224 added enhancement New feature or request untriaged labels Sep 11, 2024
@dblock
Member

dblock commented Sep 30, 2024

Looks like you want to run open source tests against a vendor-specific distribution.

We definitely want to continue running tests here against dockerized open-source OpenSearch.

For vendor-specific features we do have some precedent. There's an issue for running tests against vendor-specific distributions or instances in opensearch-project/opensearch-clients#47 for SigV4.

One thing to be mindful of is that blocking PRs because of a vendor-specific solution being broken doesn't seem right.


@dblock dblock removed the untriaged label Sep 30, 2024
@AmiStrn

AmiStrn commented Sep 30, 2024

@ykmr1224 thanks for the Feature request!

I am alarmed at the possibility that the vendor-neutral project will be tied down in any way by a vendor-specific one.
The following scenario is what I am afraid of:

  1. Contributor wishes to make changes to opensearch-spark
  2. Works on code and submits a PR
  3. PR fails tests. Changes are required in OpenSearch Core (open source) and OpenSearch Serverless (closed source)
  4. Changes are made to Core (now the feature is compatible with the open-source project)
  5. PR still fails tests because the change breaks when running on OpenSearch Serverless
  6. Contributor must ask folks maintaining OpenSearch Serverless to make changes in their code
  7. Feature is stuck until vendor makes changes (IF they agree to the changes)

This enables the vendor to take control of large portions of open-source code de facto. @ykmr1224 @dblock I am curious if there are positive precedents for this type of dependency out there.

This may not be 100% true for this feature, but could be the case for other vendor-owned integrations, or might start to happen once we open the possibility and push a gray line further and further.

@reta

reta commented Oct 1, 2024

Opened up opensearch-project/.github#229 to discuss the issue at large

@normanj-bitquill
Contributor

As mentioned above, issue opensearch-project/.github#229 discusses a better structure. Until that is achieved, there may still be a desire to run tests against Spark EMR. This should be a stopgap measure until the OpenSearch Serverless release is better aligned with the OSS release.

With some initial testing, I was able to run SQL queries against the Spark EMR docker image.
https://gallery.ecr.aws/emr-serverless/spark/emr-7.2.0

Here is what I needed to do:

  1. Create a directory to hold logging configuration. This really just needs two specific shell scripts. In my testing, these were empty shell scripts.
    1. run-adot-collector.sh
    2. run-fluentd-spark.sh
  2. Create a directory that holds the Spark app
  3. Get a local copy of the /etc/spark/conf directory from the Spark EMR Docker image.
  4. Add this line to the end of the file spark-defaults.conf
    spark.sql.legacy.createHiveTableByDefault false
    
  5. Finally run the docker image
    docker run \
        --name emr \
        -v ./spark-logging:/var/loggingConfiguration/spark \
        -v ./app:/app \
        -v ./conf:/etc/spark/conf \
        public.ecr.aws/emr-serverless/spark/emr-7.2.0:20241022 \
        driver \
        --class MyApp \
        /app/myapp_2.12-1.0.jar
    

This is a serverless image, so it will quickly run and then exit. I tested with a Scala app for Spark.

There is still a step missing for including the OpenSearch PPL extension.
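The directory setup in steps 1–4 above can be scripted. A minimal sketch, with the Docker-dependent copy of `/etc/spark/conf` (step 3) left commented out since it needs the image pulled locally:

```shell
# Prepare the three directories mounted into the EMR image above.
mkdir -p spark-logging app conf

# Step 1: the two logging shell scripts the image expects (empty is enough).
touch spark-logging/run-adot-collector.sh spark-logging/run-fluentd-spark.sh
chmod +x spark-logging/run-adot-collector.sh spark-logging/run-fluentd-spark.sh

# Step 3: copy /etc/spark/conf out of the image (requires Docker):
# docker create --name emr-tmp public.ecr.aws/emr-serverless/spark/emr-7.2.0:20241022
# docker cp emr-tmp:/etc/spark/conf/. ./conf/ && docker rm emr-tmp

# Step 4: append the Hive-table default override.
echo "spark.sql.legacy.createHiveTableByDefault false" >> conf/spark-defaults.conf
```

After this, the `docker run` command from step 5 can be executed from the same directory, with `./app` holding the compiled Spark app jar.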

@normanj-bitquill
Contributor

Another consideration is how to actually run tests on the docker image. I have a proposal:

Structure the tests with several files:

  • Data files
  • Setup queries
  • PPL query
  • Expected results

Create a Spark app. It can do some initial setup, such as creating tables. For each test:

  1. Run the setup queries for the test
  2. Run the PPL query
  3. Write the results to a JSON file that is on a mounted volume of the docker image

After the Spark app finishes, go over each test and verify that the expected results match the actual results.

This means that there will be two components for running the tests:

  • A test runner that runs on the host machine and can be started from the SBT build
  • A Spark app that runs the queries and captures the results for each of the tests
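The host-side verification step could be as simple as diffing normalized JSON per test. The directory layout here (`tests/<name>/expected.json` next to the `actual.json` the Spark app writes to the mounted volume) is an assumption, and the fixture at the top only stands in for a real Spark run:

```shell
# Demo fixture: one test whose actual results match (in practice the Spark
# app writes actual.json to the mounted volume).
mkdir -p tests/demo
echo '{"rows": [[1, "a"]], "schema": ["id", "name"]}' > tests/demo/expected.json
echo '{"schema": ["id", "name"], "rows": [[1, "a"]]}' > tests/demo/actual.json

status=0
for expected in tests/*/expected.json; do
  dir=$(dirname "$expected")
  # Normalize key order so semantically equal JSON compares equal.
  python3 -m json.tool --sort-keys "$expected" > "$dir/expected.norm"
  python3 -m json.tool --sort-keys "$dir/actual.json" > "$dir/actual.norm"
  if diff -q "$dir/expected.norm" "$dir/actual.norm" > /dev/null; then
    echo "PASS $dir"
  else
    echo "FAIL $dir"
    status=1
  fi
done
```

Normalizing with `json.tool --sort-keys` makes the comparison insensitive to key ordering, though row ordering would still have to be pinned down by the queries themselves (e.g. with an ORDER BY in the setup).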

@YANG-DB
Member

YANG-DB commented Dec 3, 2024

Create a Spark app. It can do some initial setup, such as creating tables. For each test:

@normanj-bitquill thanks for the suggestions!
I agree we need a dedicated Spark job to run the tests and create a report on the success and performance of the tests being executed.

In addition, I was hoping to use OpenSearch Dashboards' PPL workbench via a Spark data source to access a local Docker Spark cluster running the Flint job. Can you experiment with whether this is achievable, and what the prerequisites are?

@YANG-DB
Member

YANG-DB commented Dec 3, 2024

(quoting @normanj-bitquill's Spark EMR Docker walkthrough above in full)

@normanj-bitquill this looks good. Let's create a tutorial page with all the needed scripts so anyone can experiment with this.

@YANG-DB
Member

YANG-DB commented Dec 3, 2024

(quoting @AmiStrn's comment above in full)

@AmiStrn thanks for the feedback.
I do believe that opening our code to multi-vendor contributions in a well-managed way would actually benefit the community. Some open source projects use this approach, allowing vendor-specific code contributions in a repo of their own (similar to OpenTelemetry's contrib repositories).

We do need to avoid any direct build or runtime dependencies on these contrib components.
We need an additional abstraction layer to allow the specific vendor implementation without explicitly depending on it (the next example is NOT a good separation of concerns, IMO).

@YANG-DB YANG-DB changed the title [FEATURE] Modify integration tests to run with OpenSearch Serverless [FEATURE] Modify integration tests to run with OpenSearch vendors (multi-vendor IT abilities ) Dec 3, 2024