Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write enforce_invariant() function #592

Closed
wjones127 opened this issue Apr 29, 2022 · 3 comments · Fixed by #834
Closed

Write enforce_invariant() function #592

wjones127 opened this issue Apr 29, 2022 · 3 comments · Fixed by #834
Labels
enhancement New feature or request

Comments

@wjones127
Copy link
Collaborator

Description

For both the datafusion and pyarrow-based writers to support writer protocol v2, we'll need to support enforcing invariants. It seems like the following signature could be reused by both implementations:

fn enforce_invariant(batch: RecordBatch, invariants: &Vec<(i32, &str)>) -> Result<(), DatafusionError> {
    // rough implementation:
    for (column_index, sql_invariant) in invariants {
        // ... (run data fusion query)
        // ... If failure, return error indicating which invariant failed.
    }
    Ok(())
}

Then this function could be applied to each record batch that comes in during a write.

We might also need to check whether the column is nullable and make sure we are enforcing that too, either as part of this or part of the schema enforcement. Should add a test for that.

What do you think @roeap?

Related Issue(s)

Related docs

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-invariants
https://books.japila.pl/delta-lake-internals/constraints/Invariants/

@wjones127 wjones127 added the enhancement New feature or request label Apr 29, 2022
@wjones127
Copy link
Collaborator Author

BTW here are the invariant test suites: https://github.com/delta-io/delta/blob/master/core/src/test/scala/org/apache/spark/sql/delta/schema/InvariantEnforcementSuite.scala

Also note "invariants" are distinct from "constraints", so there's a separate test suite for that. Constraint enforcement is part of writer version 3, so don't need to worry about it yet; But it seems like it has a very similar implementation as invariants, so maybe write this function with constraints in mind.

If I understand correctly, the main difference between invariants and constraints is that the former is a property of a single column, whereas a constraint is a property of the table and thus can enforce relationships between columns (example constraint).

@roeap
Copy link
Collaborator

roeap commented May 2, 2022

At first sight this sounds very reasonable :). With regards to testing, maybe it makes sense to combine that effort with the pyspark integration tests, along the lines of do we get the same results (error or not) when writing here or with pyspark?

@wjones127
Copy link
Collaborator Author

Two notes:

wjones127 added a commit that referenced this issue Sep 28, 2022
# Description

Adds support to retrieve invariants from the Delta schema and also a
struct `DeltaDataChecker` to use DataFusion to check them and report
useful errors.

This also hooks it up to the Python bindings, allowing
`write_deltalake()` to support Writer Protocol V2.

I looked briefly at the Rust writer, but then realized we don't want to
introduce a dependency on DataFusion. We should discuss how we want to
design that API. I suspect we'll turn DeltaDataChecker into a trait, so
we can have a DataFusion one available but also allow other engines to
implement it themselves if they don't wish to use DataFusion.

# Related Issue(s)

- closes #592
- closes #575

# Documentation


https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-invariants
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants