Created docker files for an integ test cluster (#601) #986

Merged
merged 6 commits into from
Dec 17, 2024

normanj-bitquill
Contributor

Description

Created a cluster that can later be used for integration tests. It contains a docker-compose.yml file that can be used to start the whole cluster.

Cluster contains:

  • Spark master
  • Spark worker
  • OpenSearch server
  • OpenSearch dashboards
  • Minio server

Currently the Minio server is unused.

Spark nodes are configured to include the Flint and PPL extensions as well as to be able to query the OpenSearch server.

The OpenSearch dashboards are configured to connect to the OpenSearch server.
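As a rough illustration of starting the whole cluster from the docker-compose.yml file, the sketch below wraps the standard `docker compose` CLI from Python. The helper names and the compose file path are assumptions made for this example and are not taken from the PR.

```python
import subprocess


def compose_command(compose_file: str, *args: str) -> list[str]:
    """Build a `docker compose` command line for the given compose file."""
    return ["docker", "compose", "-f", compose_file, *args]


def start_cluster(compose_file: str) -> None:
    """Start every service in the compose file in the background."""
    subprocess.run(compose_command(compose_file, "up", "-d"), check=True)


def stop_cluster(compose_file: str) -> None:
    """Stop and remove the containers defined in the compose file."""
    subprocess.run(compose_command(compose_file, "down"), check=True)


if __name__ == "__main__":
    # Hypothetical path; point this at the docker-compose.yml added by the PR.
    start_cluster("path/to/docker-compose.yml")
```

Running the module directly requires Docker to be installed; `compose_command` on its own only builds the argument list.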

Related Issues

#601

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Cluster contains:
* Spark master
* Spark worker
* OpenSearch server
* OpenSearch dashboards
* Minio server

Signed-off-by: Norman Jordan <[email protected]>
Member

@YANG-DB YANG-DB left a comment

@normanj-bitquill thanks!!
Let's try to use the existing IT Python scripts.

@normanj-bitquill
Contributor Author

@YANG-DB I am part way through altering the integ test script to run against the docker containers. I have been able to create the indices for http_logs and nested. Those two indices cover about half of the tests.

Some tests now pass when they were expected to fail. This could be caused by more recent changes.

Some tests fail when they were expected to pass. These fall into 3 categories:

I will continue to update the script for running the tests to also get the report at the end.

Member

@YANG-DB YANG-DB left a comment

@normanj-bitquill how would Spark Connect be used?
Will it be via Python or Scala?
Could you please describe the use case?

@normanj-bitquill
Contributor Author

@YANG-DB I have been repurposing the script:
https://github.com/opensearch-project/opensearch-spark/blob/main/integ-test/script/SanityTest.py

With that script, the flow is:
Python script -> Spark Connect -> Spark Master Node

This would be the initial phase, covered by this PR. A follow-up PR would update the Scala integration test framework already in place so that it connects through Spark Connect and runs the tests.
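The Python script -> Spark Connect -> Spark Master flow could look roughly like the sketch below. It uses the PySpark `SparkSession.builder.remote` API (available in PySpark 3.4 and later). The host, the port 15002 (Spark Connect's default), and the helper names are assumptions for illustration; this is not the code in SanityTest.py.

```python
def spark_connect_url(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect URL; host and port here are assumed defaults."""
    return f"sc://{host}:{port}"


def run_query(query: str, url: str):
    """Run one query through Spark Connect and return the collected rows."""
    # Imported lazily so the URL helper works without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(url).getOrCreate()
    try:
        return spark.sql(query).collect()
    finally:
        spark.stop()


if __name__ == "__main__":
    # Requires a running Spark Connect server reachable at the URL.
    for row in run_query("SELECT 1 AS ok", spark_connect_url()):
        print(row)
```

Because the client only speaks to the Spark Connect endpoint, the same script would work against any Spark deployment that exposes that endpoint, which is why EMR support matters in the discussion that follows.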

@YANG-DB
Member

YANG-DB commented Dec 12, 2024

I'm not sure EMR supports spark connect...

@normanj-bitquill
Contributor Author

I doubt that EMR would support Spark Connect. I am keeping that in mind, but I don't have an obvious solution for Spark EMR as yet. In the end the integration tests need to be able to run queries against either standard Spark containers or Spark EMR. The integration tests should not care which they are using.

When I get to creating docker files for integration tests with Spark EMR, I will find a solution to this problem. It may require altering how integration tests connect to run queries, but for now I'd like to get a starting point out.

The Python script for integration tests was updated to run queries against the docker cluster.
The required indices are created as part of the script. The queries for the Python script were
likely out of date. They have been updated where the fix for the query was obvious.

There are still 6 tests that fail.

Signed-off-by: Norman Jordan <[email protected]>
@normanj-bitquill
Contributor Author

@YANG-DB I have updated this PR so that the Python script for integration tests will now run against the docker cluster.

Below is one idea for the long term solution of running integration tests. Let me know what you think and if we should discuss this elsewhere.

Proposal

Create a directory structure for the tests.

integ-test-data
  +- queries
  +- query-plans
  +- expected-results

queries - contains the queries. One query per file.
query-plans - expected query plans with names that correspond to filenames in queries
expected-results - expected results of the queries in queries, with names that correspond to filenames in queries

Create a Spark App that makes use of the integ-test-data directory. It runs each query and places the output into another directory. It also calls EXPLAIN for each query and places the output into another directory.

The Spark container (either the master container or the EMR container) has the following directories mounted:

  • integ-test-data
  • query-results
  • explain-results

The integration tests (run from sbt) will start the docker cluster and then upload the Spark App by either calling spark-submit remotely or using docker to run spark-submit.

After the tests finish, the integration tests (run from sbt) examine the query results and explain results to verify if they match the expected results.
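The final verification step of the proposal could be sketched as follows. The proposal places this check in the sbt-run integration tests; Python is used here only to illustrate the comparison. The sketch assumes that result files carry exactly the same filename as the query file and that results are compared as plain text, neither of which the proposal specifies. The function name is hypothetical.

```python
from pathlib import Path


def compare_results(data_dir: Path, results_dir: Path) -> dict[str, str]:
    """Return a status per query file: 'pass', 'fail', or 'missing'."""
    statuses = {}
    for query_file in sorted((data_dir / "queries").iterdir()):
        name = query_file.name
        expected = data_dir / "expected-results" / name
        actual = results_dir / name
        if not expected.is_file() or not actual.is_file():
            # Either no expectation was recorded or the Spark App wrote no output.
            statuses[name] = "missing"
        elif expected.read_text().strip() == actual.read_text().strip():
            statuses[name] = "pass"
        else:
            statuses[name] = "fail"
    return statuses
```

The same function could be pointed at `query-plans` and `explain-results` to check the EXPLAIN output, given a parameter for the expected-output directory.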

@YANG-DB
Member

YANG-DB commented Dec 13, 2024

@normanj-bitquill
Thanks for the feedback. Let's move this discussion to a dedicated issue (RFC).

@normanj-bitquill
Contributor Author

@YANG-DB I have created issue #992 to continue the discussion of how integration tests could be run on each of the Docker clusters.

Member

@YANG-DB YANG-DB left a comment

@normanj-bitquill
Looks great. Can you please add a link to the ./script/README.md file from our main README.md file, right below this?

- pip install requests pandas openpyxl
+ pip install requests pandas openpyxl pyspark setuptools pyarrow grpcio grpcio-status protobuf

Member

Please also mention that both ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar and flint-spark-integration-assembly-0.7.0-SNAPSHOT.jar need to be built before the docker cluster can run, using:

  • sbt clean sparkSqlApplicationCosmetic/assembly
  • sbt clean sparkPPLCosmetic/assembly
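A pre-flight check for those two jars might look like the sketch below. Only the jar filenames come from the comment above; the function name and the idea of searching recursively under a root directory are assumptions for illustration, since the build output location is not stated here.

```python
from pathlib import Path

# Jar names taken from the review comment; the version is tied to 0.7.0-SNAPSHOT.
REQUIRED_JARS = (
    "ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar",
    "flint-spark-integration-assembly-0.7.0-SNAPSHOT.jar",
)


def missing_jars(search_root: Path) -> list[str]:
    """Return the required jar names that are not found under search_root."""
    found = {path.name for path in search_root.rglob("*.jar")}
    return [jar for jar in REQUIRED_JARS if jar not in found]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_jars(Path("."))
    if missing:
        raise SystemExit("Build these jars with sbt first: " + ", ".join(missing))
```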

Contributor Author

Added this section.

- You need to replace the placeholders with your actual values of URL_ADDRESS, DATASOURCE_NAME and USERNAME, PASSWORD for authentication to your endpoint.
+ You need to replace the placeholders with your actual values of URL_ADDRESS, OPENSEARCH_URL and USERNAME, PASSWORD for authentication to your endpoint.

For more details of the command line parameters, you can see the help manual via command:
```shell
python SanityTest.py --help

usage: SanityTest.py [-h] --base-url BASE_URL --username USERNAME --password PASSWORD --datasource DATASOURCE --input-csv INPUT_CSV
```
Member

What is an example value for ${URL_ADDRESS}? Is it the Spark URL?
Please mention that.

Contributor Author

Fixed this up. It should actually be SPARK_URL. Also provided an example value.

@normanj-bitquill
Contributor Author

> @normanj-bitquill looks great , can you please add some link to the ./script/README.md file from our main readme.md file ? right below this

Added a link in the top-level README.md

@normanj-bitquill
Contributor Author

@YANG-DB I have added a section to the integ test README.md to describe the test indices.

@YANG-DB YANG-DB merged commit 957de4e into opensearch-project:main Dec 17, 2024
4 checks passed