[RFC] Distributed Correctness Testing Framework #3220

Open
Swiddis opened this issue Dec 23, 2024 · 9 comments
Assignees
Labels
enhancement New feature or request RFC Request For Comments testing Related to improving software testing

Comments

@Swiddis
Collaborator

Swiddis commented Dec 23, 2024

[RFC] Distributed Correctness Testing Framework

Date: Dec 27, 2024
Status: In Review

Overview

While the current integration testing framework is effective for development, there have been several bug reports around query correctness due to testing blind-spots. Some common sources for issues include:

In addition to the obvious challenges with predicting these edge cases ahead of time, we have the additional issue that the SQL plugin has multiple runtime contexts:

  • Different query engines such as Legacy, V2, Spark, or (soon) Catalyst.
  • Several configuration options that have effects on the runtime behavior of queries (e.g. pagination mode, size or memory limits, type tolerance).
  • Supporting both SQL and PPL querying.
  • Index-specific settings such as shards or date formats.

The current integration testing process doesn't scale sufficiently to detect edge cases under all of these scenarios. Historically the method has been "fix the bugs as we go", but given the scale of the project nowadays and in-progress refactoring initiatives, a testing suite that can keep up with the scale is needed.

Inspired by database testing initiatives in other projects, especially SQLancer, Google's OSS-Fuzz, and the PostgreSQL BuildFarm, this RFC proposes a project to implement a distributed random testing process that can generate a large number of test scenarios to validate correct behavior. With such a suite, the plugin can be "soak tested" by running a large number of randomized tests and reporting errors.

Glossary

  • Crash Bugs: Bugs that are directly visible as erroneous behavior (e.g. error responses, crashes). Distinct from Logic Bugs.
  • Fuzzing: A testing technique that involves providing invalid, unexpected, or random data as inputs to a system. This usually has a heavier emphasis on asserting the system doesn't crash, as opposed to making sure the responses are actually correct.
  • Logic Bugs: Bugs where the system does not have an obviously incorrect behavior (e.g. error responses, crashes), but nonetheless returns incorrect results. Distinct from Crash Bugs.
  • Property-based testing: A testing methodology that focuses on verifying properties of the system under test for a wide range of randomly-selected inputs. Further reading: What is Hypothesis? and ScalaCheck.

Current State

Our current testing infrastructure consists of several components that serve different purposes, but none of them clearly address the gaps identified here.

  1. The Main Integration Testing Suite:
  • Provides comprehensive coverage for current features, individually.
  • Effective for catching regressions in existing functionality.
  • Limited in its ability to detect edge cases or cross-feature interactions.
  2. Comparison Testing Suite:
  • Attempts to validate query results by comparing against SQLite.
  • Useful for basic SQL functionality shared between our implementation and SQLite.
  • Limited by significant differences in feature sets:
    • Cannot test OpenSearch-specific features (e.g. complex data types, PPL).
    • May produce false positives due to intentional differences in behavior.
  • These tests have not actually been run in a very long time, and seem to currently be out of commission.

We may consider reusing elements of the comparison testing framework for the new suite. In particular, both this framework and the proposed solution connect to OpenSearch via JDBC. The main concern is whether we can parallelize the workload easily.

  3. Mutation Testing Suite (PiTest):
  • Helps validate the effectiveness of our existing test suite under code mutations.
  • Identifies areas where our tests may be insufficient.
  • Still constrained by the scope of our existing tests, as it doesn't generate new test cases.
  • These tests also seem to currently be out of commission.

While these tools provide valuable testing capabilities, they fall short in several key areas:

  • Limited ability to generate diverse, representative test data.
  • Insufficient coverage of complex interactions between different SQL and PPL features.
  • Difficulty in testing across multiple runtime contexts and configurations.
  • Challenges in scaling to test large datasets or high-volume query scenarios.

Analysis of Domain

The most important question to ask is, "why can't we use an existing system?" SQL testing in particular has a lot of existing similar initiatives. A cursory look at these initiatives (ref: SQLancer, Sqllogictest, SQuaLity) reveals a few issues:

  • These suites aren’t designed with PPL or generally non-SQL query languages in mind, so the work to migrate their code to support both languages would be significant.
  • These suites generally assume that the database is not read-only: they include detailed tests for modifying database files directly (including corrupting them), as well as several tests that depend on the behavior of SQL’s write functions (INSERT, UPDATE). Removing this functionality and replacing it with a system that can set up a correctly-configured OpenSearch cluster would be again similar to writing the system from scratch.
  • Since our SQL implementation has several limitations that aren’t shared by other systems, we also would need to remove tests that rely on these limitations not being present. This is nontrivial since many of these systems rely on generating queries in bulk, and may implicitly use those same assumptions even for functionality we support.

Compared to trying to adapt these solutions, it seems like the most reliable method for a long-term solution is to write a custom system. In particular, these methods are well-documented, so we likely can make something that can get a similar degree of effectiveness with less effort than the migration cost. This will also give us a lot of flexibility to toggle specific features under test, such as the assortment of unimplemented features in V2 (#1889, #1718, #1487). The flexibility of supporting OpenSearch-specific semantics and options will also open routes for implementing similar testing on core OpenSearch.

Despite these limitations, all of the linked projects have extensive research behind them that can be used to guide our own implementation. Referencing them and taking what we can is likely valuable.

Goals

Looking at the limitations of our current test cases, I propose the following goals:

  • Scalability: Currently, it's nontrivial to run our integration tests in parallel without multiple clusters due to cross-test interference. To run the tests at the scale necessary for these types of suites to find deep bugs, we need to ensure test isolation from the start. This ties in with the next goal:
  • Cost Effectiveness: The costs of the suite should directly correlate with the number of tests we're running. For the most part, running clusters is the largest cost, so the suite should be able to load balance parallelized tests across a small number of clusters.
  • Model Real Usage: The existing suites use a small number of test indices for specific features. The indices generated for the tests should more closely model real indices.

Implementation Strategy

The goal is to create a separate project that can connect to OpenSearch SQL via a dynamically chosen connector (e.g. JDBC, or REST) -- similar to how the comparison testing suite currently uses JDBC. For the first viable version, only the REST connector will be supported, for simplicity.
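A minimal sketch of what a REST connector could look like, assuming the plugin's `/_plugins/_sql` and `/_plugins/_ppl` endpoints and a local cluster URL. The `RestAdapter` class and its method name are hypothetical and not part of any existing code; the request is only built, not sent.

```python
import json
from dataclasses import dataclass
from urllib.request import Request

# Plugin endpoints per query language. The class and method names below are hypothetical.
ENDPOINTS = {"sql": "/_plugins/_sql", "ppl": "/_plugins/_ppl"}


@dataclass(frozen=True)
class RestAdapter:
    """Hypothetical REST connector; other connectors (e.g. JDBC) would expose the same method."""

    base_url: str  # assumed test cluster address, e.g. "http://localhost:9200"

    def build_request(self, language: str, query: str) -> Request:
        """Build (but do not send) the HTTP request for one query."""
        body = json.dumps({"query": query}).encode("utf-8")
        return Request(
            url=self.base_url.rstrip("/") + ENDPOINTS[language],
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )


req = RestAdapter("http://localhost:9200").build_request("ppl", "source=logs | head 5")
print(req.get_method(), req.full_url)
```

Keeping the language as a parameter of the adapter means the same test suite can target SQL and PPL without knowing which connector is in use.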

The role of the test suite as a system is shown in this System Context diagram. It will run independently of our existing per-PR CI on a specified schedule, informing developers of issues found. The suite will have some config that specifies how many batches it should run during each invocation, similar to how Hypothesis does max_examples.

The testing framework itself needs two main components: the actual test runner, and a tool that can create clusters to test with various configs (for example, toggling cluster settings, whether to use Spark, and other cluster-level options). Making the cluster manager a separate container will help encapsulate a lot of grungy infrastructure code in a dedicated tool. That said, for the initial version, we can just have the pipeline spin up a single test cluster that the runner uses, with a single default configuration, meaning its box can be replaced by the CI pipeline for now.

One unit of work we introduce at this level is test batches, based on the throughput benchmarking below. The idea is that both creating clusters and creating indices are expensive compared to running read-only queries or incremental updates on indices, so we want to reuse created indices wherever possible (aiming for ≥ 100 tests/index). To have a high test throughput, we run tests in cluster-level batches, where each cluster-level batch is made up of several index-level batches. The exact number of tests per batch will be configurable.

  1. When starting a cluster-level batch, configure a cluster. The cluster will be assigned to run a fixed number of index-level batches.
  2. For each index-level batch, configure (1 to N) indices, and populate them with sample data.
  3. For each index-level batch, generate (read-only) queries for the sample data set.
  4. Validate query results on the data set in parallel.
  5. Once the index-level batch is done, delete the associated indices. Once all index-level batches are done, tear down the cluster and consider the cluster-level batch complete.
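The five steps above can be sketched as a nested loop. This is only an illustration under stated assumptions: cluster and index setup are stubbed with in-memory data, the "queries" are plain thresholds, and every name (`run_cluster_batch`, its parameters) is hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_cluster_batch(seed, index_batches=3, tests_per_index=100, workers=8):
    """Run one cluster-level batch and return (passed, failed) counts.

    Cluster and index setup are stubbed out; every name here is hypothetical.
    """
    rng = random.Random(seed)
    passed = failed = 0
    # Step 1: a real runner would configure a cluster here.
    for _ in range(index_batches):
        # Step 2: create and populate indices (stubbed as an in-memory list).
        index = [rng.randint(0, 1000) for _ in range(50)]
        # Step 3: generate read-only "queries" (stubbed as thresholds to filter on).
        queries = [rng.randint(0, 1000) for _ in range(tests_per_index)]

        def check(threshold, index=index):
            # Stub property: matching and non-matching rows must partition the index.
            hits = [v for v in index if v < threshold]
            misses = [v for v in index if not v < threshold]
            return len(hits) + len(misses) == len(index)

        # Step 4: validate in parallel against the shared index.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(check, queries))
        passed += sum(results)
        failed += len(results) - sum(results)
        # Step 5: a real runner would delete this batch's indices here.
    # Once all index-level batches are done, the cluster would be torn down.
    return passed, failed


print(run_cluster_batch(seed=42))
```

The expensive steps (1, 2, and 5) sit outside the inner parallel section, so their cost is amortized over `tests_per_index` checks.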

Zooming into the test runtime, we can look more at its core components. The bulk of the work for "supporting features" happens in the Query Generator and the Data Generator; the rest of the work is mostly plumbing.
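To make the two generators concrete, here is a rough sketch assuming a seeded random source, three illustrative field types, and a single-predicate SQL query shape. The field names, type table, and function names are all hypothetical; a real generator would cover far more of the language and OpenSearch's type system.

```python
import random

# Hypothetical field types and value generators; a real suite would cover OpenSearch's types.
FIELD_TYPES = {
    "integer": lambda rng: rng.randint(-100, 100),
    "keyword": lambda rng: rng.choice(["a", "b", "c"]),
    "boolean": lambda rng: rng.choice([True, False]),
}


def generate_index(rng, n_fields=3, n_docs=5, null_rate=0.2):
    """Return (mapping, docs): a random schema plus documents with occasional missing fields."""
    mapping = {f"f{i}": rng.choice(sorted(FIELD_TYPES)) for i in range(n_fields)}
    docs = []
    for _ in range(n_docs):
        doc = {}
        for name, ftype in mapping.items():
            if rng.random() >= null_rate:  # otherwise leave the field missing
                doc[name] = FIELD_TYPES[ftype](rng)
        docs.append(doc)
    return mapping, docs


def generate_query(rng, index_name, mapping):
    """Return a SQL string filtering on one randomly chosen integer field, if any exists."""
    int_fields = [n for n, t in mapping.items() if t == "integer"]
    if not int_fields:
        return f"SELECT * FROM {index_name}"
    field = rng.choice(int_fields)
    op = rng.choice(["<", "<=", "=", ">=", ">"])
    return f"SELECT * FROM {index_name} WHERE {field} {op} {rng.randint(-100, 100)}"


rng = random.Random(7)  # fixed seed so a failing case can be replayed
mapping, docs = generate_index(rng)
print(mapping)
print(generate_query(rng, "test_index", mapping))
```

Driving both generators from one seeded `random.Random` means a failing batch can be reproduced from its seed alone.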

For properties we can test, the following high-level strategies are available:

  • Property-based testing as a testing methodology is described in the glossary. For a technology choice, ScalaCheck works well with our current JVM-heavy ecosystem as a widely-supported property-based testing framework.
  • Pivoted Query Synthesis (PQS), Non-Optimizing Reference Engine Construction (NoREC), Ternary Logic Partitioning (TLP), Query Plan Guidance (QPG), and Differential Query Plans (DQP) are all described in the SQLancer Papers.
  • SQuaLity also has some techniques aggregated from other database systems which we can take from.
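As one concrete instance of these strategies, the sketch below checks the TLP property: a predicate evaluates to TRUE, FALSE, or NULL for every row, so the three filtered queries together must return exactly the rows of the unfiltered query. SQLite from the Python standard library is used purely as a stand-in engine so the example runs anywhere; the proposed suite would send the same queries to OpenSearch SQL. The table, data, and `tlp_holds` helper are assumptions for illustration.

```python
import sqlite3
from collections import Counter

# Stand-in engine: SQLite from the standard library. The proposed suite would
# send the same four queries to OpenSearch SQL instead; that wiring is not shown.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (2, None), (None, 30), (4, 40), (None, None)],
)


def tlp_holds(conn, table, predicate):
    """Check the TLP property for one predicate.

    Every row makes the predicate TRUE, FALSE, or NULL, so the three
    partitions together must equal the unfiltered result (as a multiset).
    """

    def rows(where=""):
        return Counter(conn.execute(f"SELECT * FROM {table} {where}").fetchall())

    whole = rows()
    parts = (
        rows(f"WHERE {predicate}")
        + rows(f"WHERE NOT ({predicate})")
        + rows(f"WHERE ({predicate}) IS NULL")
    )
    return whole == parts


print(tlp_holds(conn, "t", "a < b"))
```

The property needs no reference implementation: the engine is compared against itself, which is why it can find logic bugs without a second database to diff against.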

Implementation Roadmap

The goal is to get a minimum viable suite running by the end of January that can be extended to support more query features alongside current feature development.

Bootstrapping

  1. Identify specific historic bugs to work toward and identify minimal query and test data features necessary to reproduce them.
  2. Write a CI pipeline that can invoke the test runner after setting up a cluster configuration which reproduces the bugs.
  3. Write an adapter to communicate with the cluster.
  4. For each bug:
    1. Write a test suite to check for a specific property that aligns with the bug. Properties should be broad enough to apply to multiple queries as a general "correctness rule", not bug-specific assertions. In particular, the more general the properties can be while still being related to correctness, the more powerful they are. Query Partitioning is perhaps one of the stronger examples.
    2. Write a data generator that can create an index which reproduces the bug.
    3. Extend the query generator to produce queries that would reproduce the bug.

By the end of this process, all major components in the diagram above should be stubbed and ready to be incrementally added to. The fact that the suite finds known existing bugs can give confidence for identifying bugs in new features that violate the same properties.

Importantly, for bootstrapping we skip writing a full cluster configuration manager, and stick to a minimal adapter and index configuration.

Extending

Once we can find known bugs, the main steps for identifying unknown bugs are similar:

  1. Categorize bugs that come up, since that indicates gaps in the suite. Categories may include run configuration options, languages, or data types (per the issue categories identified in the Overview).
  2. Add any necessary properties that would align with those categories as a test suite.
  3. Add any missing language features related to those categories for query generation.
  4. Add any missing data generation features related to those categories.

Open Questions

  • How will we handle testing of queries that involve very large result sets or long-running operations?
    • We can leave this for the future; a lot of the behavior here can likely be reproduced by shrinking the pagination thresholds in the OpenSearch config.
  • How will we handle testing of queries that involve external resources (e.g., SQL functions that call external APIs)? Spark is the primary example here.
    • The configuration manager is meant to handle this, but it's not entirely clear how. CloudFormation? Terraform? Would like input from someone who's done similar things before, or who knows how we currently do integration testing for Spark.
  • How will we handle testing of security features and access controls within this framework?
    • As part of the cluster configuration.
  • How will we prioritize which features or combinations of features to test first?
    • Either features under active development, or features that have a high frequency of bug reports.
  • What provisions will we make for testing query behavior under different cluster states (e.g., node failures, network partitions)?
    • Leaving this open for the future, not sure if it's really practical or useful right now.

Structurizr DSL for the diagrams

workspace "OpenSearch-SQL" "Correctness Testing" {
!identifiers hierarchical

model {
    user = person "Developer"

    correctness = softwareSystem "Correctness Testing Framework" {
        tester = container "Test Runtime" {
            queries = component "Query Generator" {
                description "Implements the bulk of the logic for toggling language features and building queries"
            }
            scheduler = component "Scheduler" {
                description "Handles the async runtime for parallelizing tests, and managing batch-level configurations"
            }
            index = component "Data Generator" {
                description "Creates sample indices and populates them with type-level data"
            }
            adapter = component "Adapters" {
                description "Defines methods for running queries on a cluster (JDBC, REST, Async Jobs...)"
            }
            test = component "Test Suites" {
                description "Defines correctness properties of interest (TLP, PQS, pagination semantics...)"
            }


            scheduler -> test "Runs with configured connection details, adapter, language features, and suite-scoped indices"
            scheduler -> index "Requests test indices"
            test -> queries "Requests queries with features of interest"
            test -> adapter "Executes queries"
        }

        config = container "Cluster configuration manager"

        cluster = container "OpenSearch Cluster" {
            tags "Database"
            tags "External"
        }

        bin = container "SQL Binary" {
            tags "External"
        }

        tester.adapter -> cluster "Executes queries"
        # config -> tester.scheduler "Connection details"
        tester.scheduler -> config "Requests a new cluster"
        tester.index -> cluster "Creates sample indices"
    }

    results = softwareSystem "Test Report" {
        tags "External"
    }

    ci = softwareSystem "CI Pipeline" {
        tags "External"
    }

    codebase = softwareSystem "Codebase" {
        tags "External"
    }

    user -> ci "Schedules"
    user -> results "Reads"
    user -> codebase "Patches"

    ci -> codebase "Pulls"
    ci -> correctness.tester "Runs"
    ci -> correctness.tester.scheduler "Invokes with config"
    ci -> correctness.bin "Compiles"

    correctness.tester -> correctness.config "Requests cluster for batch"
    correctness.config -> correctness.bin "Runs"
    correctness.config -> correctness.cluster "Configures"
    correctness.tester -> correctness.cluster "Queries via tests"
    correctness.tester -> results "Results"
    correctness.tester.scheduler -> results "Appends results to report"
}

views {
    systemContext correctness "Diagram1" {
        include *
        include user
        include codebase
        autolayout lr
    }

    container correctness "Diagram2" {
        include *
        autolayout lr
    }

    component correctness.tester "Diagram3" {
        include *
        autolayout lr
    }

    styles {
        element "Element" {
            color #ffffff
        }
        element "Person" {
            background #D00000
            shape person
        }
        element "Software System" {
            background #D00000
        }
        element "Container" {
            background #D00000
        }
        element "Component" {
            background #D00000
        }
        element "Database" {
            shape cylinder
        }
        element "External" {
            background #888888
        }
    }
}

configuration {
    scope softwaresystem
}

}

@Swiddis Swiddis added enhancement New feature or request untriaged and removed untriaged labels Dec 23, 2024
@Swiddis Swiddis removed the untriaged label Dec 23, 2024
@Swiddis Swiddis self-assigned this Dec 23, 2024
@RyanL1997

Hi @Swiddis , thanks for putting this together. I know some of the parts are still work in progress, so just wanna leave some general thoughts after reading the current version as my first round of review:

  1. For the edge cases listed in the overview section, is there a severity or priority ranking to help us identify the most critical issues to address first? Or is the intention to resolve all of them simultaneously? This would provide better clarity on the focus and scope of the implementation.

  2. Interactions between our plugin and supported external interfaces such as JDBC and Spark.

I saw that you also mentioned in the open questions: "How will we handle versioning/compatibility across OpenSearch versions?" Especially for these external dependencies, I have a similar question: will backward compatibility testing be a focus of the proposed framework? Are there plans to define specific test cases or scenarios to validate compatibility with older versions, and how will this be managed over time as new features are introduced?

  3. Challenges in scaling to test large datasets or high-volume query scenarios.

Personally, I need to do more homework on the scenario for large datasets. However, from a first look, I think it might be useful to include a plan for benchmarking these scenarios to measure performance impact. (This may not be a P0.)

  4. The tests will more-or-less run independently in the background (likely on EC2), as part of a cluster separate from GitHub actions, and report bugs asynchronously as we make changes. Common bug types can get specialized tests introduced in the main test suite.

I noticed the mention of the above. Does this imply the introduction of another internal test pipeline, distinct from the open-source GitHub-based pipeline? If so, how will the results from this EC2-based pipeline be integrated or communicated back to the broader testing framework?

@RyanL1997

RyanL1997 commented Dec 24, 2024

Also, for this open question:

What metrics will we use to measure the effectiveness of this new testing framework compared to our current testing methods?

Here are some general ideas in my mind:

Coverage Metrics

  • Test Coverage Improvement: Measure the percentage of edge cases, scenarios, or code paths covered by the new framework compared to the current one.
  • Edge Case Handling: Number of previously missed edge cases or bugs identified and addressed by the new framework.

Accuracy and Reliability Metrics

  • False Positive Rate: Track reductions in the rate of incorrect bug reports generated by the framework.
  • Bug Detection Accuracy: Percentage of valid bugs detected out of the total issues flagged by the framework.

Efficiency Metrics

  • Execution Time: Compare the time taken to execute tests in the new framework versus the current one.
  • Test Resource Utilization: Assess compute resource usage (e.g., EC2/GH runner?) of the new testing framework versus existing pipelines.

Outcome-Based Metrics
*This one may be a little tricky since we do have other pre-release procedures, such as sanity testing, to reduce the bug rate.

  • Post-Release Bug Rate: Measure the reduction in bugs reported after release due to better pre-release testing.
  • Severity of Undetected Bugs: Evaluate the severity of any bugs that still slip through the framework compared to before.

Integration and Usability Metrics

  • Integration Time: Time required to integrate the new framework into existing pipelines or workflows.
  • Ease of Debugging: Feedback from developers on how easily the framework helps identify and resolve bugs.

@anasalkouz anasalkouz added the RFC Request For Comments label Dec 26, 2024
@Swiddis
Collaborator Author

Swiddis commented Dec 26, 2024

I set up some crude benchmarking code to estimate how much test throughput we could get, using a Locust script with two scenarios: (1) index-per-test, where many parallel tests each create, update, and delete their own index, and (2) shared indices, where many tests share a small number of indices.

In general, insert and query requests are very fast while index create/delete is slow.

  • Index-per-test: [image]
  • Shared indices: [image]

This gives some data to back up that we should run batches of tests on batch-scoped indices: it's a throughput difference of 25x on equal hardware. It's also more reliable. With a small number of test-scoped index workers there were timeout failures with even just 50 tests in parallel, but batched workers can handle 1000 tests in parallel without any failures (the cluster generally just slows to land at around 1800 requests/second on my machine). For a real soak test we probably want to run at least O(a million) tests total, which my dev machine can do for this benchmark in ~10 minutes.
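As a rough check of that estimate, the arithmetic can be written out. This assumes one request per test and a sustained rate equal to the ~1800 requests/second figure above; both are simplifying assumptions, and the helper name is hypothetical.

```python
def soak_minutes(total_tests, requests_per_second, requests_per_test=1):
    """Estimated wall-clock minutes for a run at a sustained request rate."""
    return total_tests * requests_per_test / requests_per_second / 60


# One million tests at ~1800 requests/second, one request per test (assumed).
print(round(soak_minutes(1_000_000, 1800), 1))  # 9.3
```

Tests that issue several requests each (for example, the multiple queries of a partitioning check) would scale the estimate linearly via `requests_per_test`.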

@Swiddis
Collaborator Author

Swiddis commented Dec 26, 2024

@RyanL1997

For the edge cases listed in the overview section, is there a severity or priority ranking to help us identify the most critical issues to address first? Or is the intention to resolve all of them simultaneously? This would provide better clarity on the focus and scope of the implementation.

I think we can work backwards from specific past bug reports (such as those linked in the overview) to the features the query generator needs to support, then see if we reproduce them. If the process is able to find specific known issues, we can have some confidence in its ability to find more unknown issues as we add more features.

I saw that you also mentioned in the open questions: "How will we handle versioning/compatibility across OpenSearch versions?" Especially for these external dependencies, I have a similar question: will backward compatibility testing be a focus of the proposed framework?

I think we should leave it at first. Since we don't typically update older versions, running tests there would mostly be data collection. I do know there have been some bugs involving specific upgrade flows (example), but I'm not convinced the extra complexity would be paid back.

Personally, I need to do more homework on the scenario for large datasets. However, from a first look, I think it might be useful to include a plan for benchmarking these scenarios to measure performance impact.

We can probably extend the testing to do some sort of benchmarking, but let's not step on the toes of opensearch-benchmark where we don't have to. The focus for the moment is just correctness. Since we'll be running many tests per index (see above comment), we can probably afford to make each index much larger than if we were doing 1 test per index.

Does this imply the introduction of another internal test pipeline, distinct from the open-source GitHub-based pipeline? If so, how will the results from this EC2-based pipeline be integrated or communicated back to the broader testing framework?

I think it's a separately-running job, though I don't dislike having it scheduled in pipelines.

It will probably be reported by some sort of email mailing list, or a dashboard that's periodically checked. We shouldn't automatically publish issues to GH (both to avoid noise and in case there are any issues that are security-relevant). That said, I think my original idea of running this 24/7 and just tracking failures is also complicated: it'd require a whole live notification system. For simplicity, let's make the suite run a configurable finite number of tests with an existing framework, like how Hypothesis does max_examples, and deposit the HTML report somewhere. We can upgrade it if we need to, though I'm not convinced we will.

@anasalkouz
Member

Thanks @Swiddis for putting this proposal together. I have a few generic comments:

These suites generally assume that the database is not read-only: they include detailed tests for modifying database files directly (including corrupting them), as well as several tests that depend on the behavior of SQL’s write functions (INSERT, UPDATE). Removing this functionality and replacing it with a system that can set up a correctly-configured OpenSearch cluster would be again similar to writing the system from scratch.

If this is the main concern with using existing out-of-the-box frameworks, can we explore the option of providing those write functions if possible, even if only for testing purposes?

  • Generate data sets, each containing (potentially multiple) indices (and accelerations for Spark). For each data set:
  • Generate (read-only) queries for that data set, including multiple languages.

How do we make sure we keep data sets and queries up to date with new development on the SQL/PPL languages, since those will most probably be in a separate repository?

@Swiddis
Collaborator Author

Swiddis commented Dec 26, 2024

@anasalkouz

If this is the main concern with using existing out-of-the-box frameworks, can we explore the option of providing those write functions if possible, even if only for testing purposes?

It's possible, but I'm not sure it's efficient. One way that could work is to have these frameworks create a SQL table locally, and we write some logic that transforms the SQL table into an OS index before starting the read queries. Getting all the datatypes to work right would be tricky, and we also would still face the other issues with PPL and OpenSearch-specific semantics.

To clarify: I think these frameworks do have valuable lessons to teach, and I think we would benefit a lot from copying select parts of their code where we can. I just think that building something from scratch will give us flexibility that we can't really have otherwise.

How do we make sure we keep data sets and queries up to date with new development on the SQL/PPL languages, since those will most probably be in a separate repository?

So far the best idea I have is making it a reviewer step: reviewers should enforce whether we need to add test features for the new functionality [1]. This is already done for OSD with the functional test repository.

Footnotes

  1. Not strictly related to this RFC: to mechanize this, I think we would benefit a lot from introducing a full-featured "reviewer checklist" that specifies what to check for. We already have PR checklists for PR authors but they're mostly ignored. Having a reviewer checklist is a practice that seems pretty common for other organizations and may reduce review churn a lot for both authors and reviewers.

@Swiddis Swiddis changed the title [WIP] [RFC] Distributed Correctness Testing Framework [RFC] Distributed Correctness Testing Framework Dec 27, 2024
@Swiddis
Collaborator Author

Swiddis commented Dec 27, 2024

I updated the document with an implementation strategy and diagrams, and left answers for several of the open questions I had (and removed a few that felt like they weren't relevant anymore). Going to call this ready for review; a lot of the remaining grungy specifics of the implementation can be worked out when we actually start writing.

In particular I'd like to go into more specific detail on some of the properties I have in mind, but there's already a lot of literature on this so the value is limited. See the linked papers in the doc, or I particularly recommend this talk by Manuel Rigger, one of the SQLancer authors.

@YANG-DB YANG-DB added the testing Related to improving software testing label Dec 31, 2024
@YANG-DB
Member

YANG-DB commented Dec 31, 2024

@Swiddis It's a great review and document!

My comments regarding the framework's Goals:

I would like to add the following goals or expand the Model Real Usage goal:

  • Allow vendor specific compatibility tests - for example with Spark based PPL
    • Standard Spark based PPL Tests
    • EMR Spark based PPL Tests
    • Add non trivial datasource tests based validations (S3 / parquet / json ) for SQL/PPL correctness validation

These tests should be simple and easy to run using docker-compose or another container-based solution.

  • Allow Performance benchmark & testing as part of this framework capabilities
    • create multiple levels of V^3 dimensions based testing branches:
      • Volume - amount of data
      • Velocity - speed of data being ingested ( window streaming )
      • Variety - test multiple data types both structured and un-structured

@penghuo
Collaborator

penghuo commented Jan 3, 2025

@Swiddis Great document. Couple comments

To clarify: I think these frameworks do have valuable lessons to teach, and I think we would benefit a lot from copying select parts of their code where we can. I just think that building something from scratch will give us flexibility that we can't really have otherwise.

Q1, I assume we aim to build new frameworks to address scalability issues. Since OpenSearch core integration tests currently take 2-3 hours to run, have we considered improving the existing OpenSearch integration test framework as well?
Could you elaborate more on "batches of tests run on batch-scoped indices"? Does the current integration test framework already support this capability?

Q2, Could you elaborate more on the Query Generator and the Data Generator?
Q3, For Spark-SQL/PPL test, we do not need to depend on OpenSearch, right?
