Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensured that schema validation matches nested structures that are in different order #144

Merged
merged 2 commits into from
Jun 11, 2021

Conversation

skestle
Copy link
Contributor

@skestle skestle commented May 14, 2021

Adds test for nested out-of-order structures and resolved with recursive checking.

This (was found / is "necessary" for spark 3.2) because it seems that scala 2.13 has changed the hashing function, and the .groupBy call results in the existing nested validation test failing due to the schema being shuffled while being validated.

Enables nested structures to be compared regardless of order.
@skestle skestle marked this pull request as ready for review May 14, 2021 09:17
@skestle skestle changed the title Highlighted issues with schema validation's order sensitivity of nested structures Ensured that schema validation matches nested structures that are in different order May 14, 2021
@MrPowers
Copy link
Collaborator

MrPowers commented May 18, 2021

@skestle - thanks for submitting this PR.

Is the "Just a test at this stage" comment still the latest?

A high level overview of the current behavior and what this PR changes would be great! Thank you so much for helping!

case reqSchema: StructType =>
namedField.dataType match {
case fieldSchema: StructType =>
diff(reqSchema, fieldSchema).isEmpty
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place basicMatch is necessary - the && should happen here and the value should be lazy. (Nitpick)

@skestle
Copy link
Contributor Author

skestle commented May 19, 2021

Thanks - I'd forgotten to update the description.
It's done now - I've noted that the code could be improved technically, but I won't modify until you've had a look.

@MrPowers
Copy link
Collaborator

@skestle - thanks for following up. I'm still trying to figure out the high level description of the issue. Here's what I can tell from the other PR and this one.

Suppose you have two DataFrames, df1 and df2 with a nested schema.

Here's the schema of df1:

  • foo
    • baz
    • bar

Here's the schema of df2:

  • foo
    • bar
    • baz

DariaValidator.validateSchema(df1, df2) does not consider these schemas equal and throws an error.

You're proposing changing this behavior and considering these schemas true by default. Is that correct?

What's your proposal for schema that are not nested, but also out of order? Does this PR impact that scenario as well?

Thanks for the help!

@skestle
Copy link
Contributor Author

skestle commented May 22, 2021

The Seq.diff that was there before already processed out of order elements. That behaviour is unchanged.

The issue was is that the nested Structs will just be tested for equality. The test verifies that not nested, nested, and double nested out-of-order structures are handled consistently.
If you check out the first commit (having the test without source changes), you'll see it only fails once nesting enters the picture.

So yes, I'm proposing changing the nested behaviour to consider the schemas the same by default - since that makes it consistent with the non-nested schema verification.

@MrPowers MrPowers merged commit 7a6079a into mrpowers-io:main Jun 11, 2021
@MrPowers
Copy link
Collaborator

Thanks and sorry for the delay in getting this merged!

skestle added a commit to skestle/spark-daria that referenced this pull request Jun 20, 2021
…different order (mrpowers-io#144)

* Highlighted issues with schema validation's order sensitivity of nested structures.

* Implemented recursive schema checker

Enables nested structures to be compared regardless of order.
@skestle skestle deleted the deep-df-schema-mismatch branch June 20, 2021 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants