Write `enforce_invariant()` function #592
Comments
BTW here are the invariant test suites: https://github.com/delta-io/delta/blob/master/core/src/test/scala/org/apache/spark/sql/delta/schema/InvariantEnforcementSuite.scala Also note that "invariants" are distinct from "constraints", so there's a separate test suite for constraints. Constraint enforcement is part of writer version 3, so we don't need to worry about it yet. But it seems to have a very similar implementation to invariants, so maybe write this function with constraints in mind. If I understand correctly, the main difference between invariants and constraints is that an invariant is a property of a single column, whereas a constraint is a property of the table and thus can enforce relationships between columns (example constraint).
At first sight this sounds very reasonable :). With regard to testing, maybe it makes sense to combine that effort with the pyspark integration tests, along the lines of: do we get the same result (error or not) when writing here as when writing with pyspark?
Two notes:
# Description

Adds support to retrieve invariants from the Delta schema and also a struct `DeltaDataChecker` to use DataFusion to check them and report useful errors. This also hooks it up to the Python bindings, allowing `write_deltalake()` to support Writer Protocol V2.

I looked briefly at the Rust writer, but then realized we don't want to introduce a dependency on DataFusion. We should discuss how we want to design that API. I suspect we'll turn `DeltaDataChecker` into a trait, so we can have a DataFusion one available but also allow other engines to implement it themselves if they don't wish to use DataFusion.

# Related Issue(s)

- closes #592
- closes #575

# Documentation

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-invariants
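For readers who want a concrete picture of "retrieving invariants from the Delta schema", here is a minimal Python sketch. It assumes the layout described in the linked PROTOCOL.md section as I understand it: each field's metadata may carry a `delta.invariants` key whose value is itself a JSON string of the form `{"expression": {"expression": "<SQL>"}}`. The function name `get_invariants` and the example schema are hypothetical, and the sketch only looks at top-level fields. It is not the code from the PR.

```python
import json
from typing import List, Tuple


def get_invariants(schema_string: str) -> List[Tuple[str, str]]:
    """Return (column_name, sql_expression) pairs for top-level fields.

    Hypothetical helper: assumes invariants live under the
    `delta.invariants` metadata key as a nested JSON string.
    """
    schema = json.loads(schema_string)
    found = []
    for field in schema.get("fields", []):
        raw = field.get("metadata", {}).get("delta.invariants")
        if raw is None:
            continue
        # The metadata value is a JSON document encoded as a string.
        expression = json.loads(raw)["expression"]["expression"]
        found.append((field["name"], expression))
    return found


# Made-up schema for illustration only.
example_schema = json.dumps({
    "type": "struct",
    "fields": [
        {
            "name": "x",
            "type": "integer",
            "nullable": True,
            "metadata": {
                "delta.invariants": json.dumps(
                    {"expression": {"expression": "x > 3"}}
                )
            },
        },
        {"name": "y", "type": "string", "nullable": True, "metadata": {}},
    ],
})

invariants = get_invariants(example_schema)
```

A real implementation would also need to walk nested struct fields and hand the SQL text to an engine (DataFusion, in the PR above) for evaluation.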
Description
For both the datafusion and pyarrow-based writers to support writer protocol v2, we'll need to support enforcing invariants. It seems like the following signature could be reused by both implementations:
Then this function could be applied to each record batch that comes in during a write.
We might also need to check whether the column is nullable and make sure we are enforcing that too, either as part of this function or as part of schema enforcement. We should add a test for that.
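The signature proposed above did not survive in this copy of the issue, so the following is not a reconstruction of it. It is only a rough Python sketch of the idea under loud assumptions: a record batch is modelled as a plain dict of column name to list of values, each invariant carries a pre-built Python predicate in place of a SQL expression, and all names (`Invariant`, `InvariantViolation`, `enforce_invariants`, `non_nullable`) are hypothetical. Nulls are skipped by the invariant check and left to the separate nullability check, which is a choice made for this sketch and not something the issue settles.

```python
from typing import Any, Callable, Dict, List, NamedTuple, Sequence

# Stand-in for a record batch: column name -> list of values.
Batch = Dict[str, List[Any]]


class Invariant(NamedTuple):
    column: str
    expression: str                    # SQL text, used only in error messages
    predicate: Callable[[Any], bool]   # hypothetical compiled form of `expression`


class InvariantViolation(ValueError):
    """Raised when a batch breaks an invariant or a nullability rule."""


def enforce_invariants(
    batch: Batch,
    invariants: Sequence[Invariant],
    non_nullable: Sequence[str] = (),
) -> None:
    # Nullability check: non-nullable columns must not contain None.
    for name in non_nullable:
        if any(value is None for value in batch[name]):
            raise InvariantViolation(
                f"column {name!r} is not nullable but contains nulls"
            )
    # Invariant check: every non-null value must satisfy the predicate.
    for inv in invariants:
        bad = [
            value for value in batch[inv.column]
            if value is not None and not inv.predicate(value)
        ]
        if bad:
            raise InvariantViolation(
                f"invariant {inv.expression!r} violated by {len(bad)} "
                f"value(s) in column {inv.column!r}, e.g. {bad[0]!r}"
            )


# Apply the check to each incoming batch during a write.
checks = [Invariant("x", "x > 3", lambda value: value > 3)]
for incoming in [{"x": [4, 5, None]}, {"x": [10]}]:
    enforce_invariants(incoming, checks)
```

Calling `enforce_invariants({"x": [4, 2]}, checks)` raises `InvariantViolation`, and so does passing `non_nullable=["x"]` with a batch that contains `None` in `x`.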
What do you think @roeap?
Related Issue(s)
Related docs
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-invariants
https://books.japila.pl/delta-lake-internals/constraints/Invariants/