Not null invariant incorrectly fails on non-nullable field inside a nullable struct #860

Open
Kimahriman opened this issue Dec 8, 2021 · 24 comments · May be fixed by #1296 or #3752

@Kimahriman
Contributor

An existing test actually encodes what I think is incorrect behavior: https://github.com/delta-io/delta/blob/master/core/src/test/scala/org/apache/spark/sql/delta/schema/InvariantEnforcementSuite.scala#L186

 testQuietly("reject non-nullable nested column") {
    val schema = new StructType()
      .add("top", new StructType()
        .add("key", StringType, nullable = false)
        .add("value", IntegerType))
    testBatchWriteRejection(
      NotNull(Seq("key")),
      schema,
      spark.createDataFrame(Seq(Row(Row("a", 1)), Row(Row(null, 2))).asJava, schema.asNullable),
      "top.key"
    )
    testBatchWriteRejection(
      NotNull(Seq("key")),
      schema,
      spark.createDataFrame(Seq(Row(Row("a", 1)), Row(null)).asJava, schema.asNullable),
      "top.key"
    )
  }

Here the top struct is nullable but the key field is not, and the current invariant checks only care that key is non-nullable, so they assume that selecting that value (through a series of GetStructFields) will never produce null. However, it is valid for top to be null, and I think it's more accurate to say that key is never null when top is not null.

So the first rejection in this test is valid: top is not null but key is. In the second case, however, top itself is null, which I believe should be valid behavior and not throw an exception.

After looking through the code I can see a few ways to simply skip checking key in this case, but it might be better (though more complicated) to check top first, and only if top is not null, check key and fail only in that case.

@scottsand-db added the "acknowledged" (This issue has been read and acknowledged by Delta admins) and "bug" (Something isn't working) labels Dec 8, 2021
@scottsand-db
Collaborator

Hi @Kimahriman, thanks for bringing this to our attention. We will look into this.

@zsxwing
Member

zsxwing commented Jan 13, 2022

@brkyvz do you remember why we picked this behavior?

@brkyvz
Collaborator

brkyvz commented Jan 13, 2022

I believe this is the behavior that most databases follow. The idea was that if foo.bar is not null, then foo should always be not null. If you want foo to be nullable, but whenever foo is not null then bar must also be not null, you should use a check constraint such as:

foo is null or foo.bar is not null
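
Concretely, that kind of constraint can be added to a Delta table with Spark SQL. A minimal sketch, where the table name events and the constraint name are made up:

# Hypothetical Delta table `events` with a nullable struct column `foo`
# whose field `bar` should be non-null whenever `foo` itself is present
spark.sql("""
    ALTER TABLE events
    ADD CONSTRAINT foo_bar_not_null
    CHECK (foo IS NULL OR foo.bar IS NOT NULL)
""")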

@Kimahriman
Contributor Author

Spark seems to allow for that behavior of foo being nullable but foo.bar not

@zsxwing
Member

zsxwing commented Jan 18, 2022

Spark seems to allow for that behavior of foo being nullable but foo.bar not

@Kimahriman This is done by Delta. I think the current behavior makes sense in the SQL world. Even if foo.bar is not nullable, it can still be null if we allow foo to be nullable. For example, if foo is null, SELECT foo.bar will return null even if foo.bar is not nullable. The current behavior will ensure you never see null returned by foo.bar.

@Kimahriman
Contributor Author

Kimahriman commented Jan 18, 2022

Yeah I'm just saying this is valid Spark behavior, so you can hit an exception for valid Spark code. Simplest reproduction:

import pyspark.sql.functions as F
from delta.tables import DeltaTable

# Rows with id >= 5 get a null struct; the struct's id field is non-nullable
test = spark.range(10).withColumn('nested', F.when(F.col('id') < 5, F.struct('id')))
test.printSchema()

root
 |-- id: long (nullable = false)
 |-- nested: struct (nullable = true)
 |    |-- id: long (nullable = false)

DeltaTable.createOrReplace().location('file:///tmp/test').addColumns(test.schema).execute()
test.write.format('delta').mode('append').save('file:///tmp/test')
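
(The rows where id >= 5 have a null nested struct, so the append on the last line fails with a NOT NULL invariant violation on nested.id, even though this is valid Spark data.)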

@Kimahriman
Contributor Author

And Spark correctly updates the nullability when it needs to:

test.select('nested.*').printSchema()

root
 |-- id: long (nullable = true)

@zsxwing
Member

zsxwing commented May 10, 2022

@Kimahriman Sorry, I missed your latest update. Thanks for the example. Yep, Spark will update the nullability automatically. However, for Delta, if users set a not-null constraint, they will likely expect never to see a null value when writing their code (such as a UDF that handles a nested column value). Hence, we would like to keep the existing behavior.

I'm going to close this as this is the intended behavior. Feel free to reopen this if you have more feedback 😄

@zsxwing closed this as completed May 10, 2022
@Kimahriman
Contributor Author

Kimahriman commented May 10, 2022

if users set a not-null constraint, they will likely expect never to see a null value when writing their code

That's the main issue: not-null constraints are automatically applied with no way to disable them. I have been slowly playing around with ways to update the not-null constraint checking to operate in line with how Spark works, though I haven't made much progress yet. At the very least it should be an opt-in thing to get around this. I still consider it a bug 😅

@Kimahriman
Contributor Author

Hah, triggered a pipeline failure by commenting on a closed issue I guess

@zsxwing
Member

zsxwing commented May 10, 2022

We prefer to avoid creating non-standard behaviors. Could you clarify what use cases might require this non-standard behavior?

@zsxwing reopened this May 10, 2022
@Kimahriman
Contributor Author

I would argue that the current behavior is non-standard, and this ticket is to make it (Spark) standard 😛

Basically, what actually triggers this for us is a Scala UDF that takes a nullable string and returns a struct. One of the fields in the struct is non-nullable, because it always has a value iff the string input is not null. The end result in Spark is a nullable struct with a non-nullable struct field.

@zsxwing
Member

zsxwing commented May 24, 2022

it always has a value iff the string input is not null

Can you set the nested field nullable and add a check constraint such as nested is null or nested.id is not null to the table instead?

@Kimahriman
Contributor Author

Yeah, there are workarounds, they're just less than ideal. You lose the 0.01% speedup you would get by avoiding null checks on the field, hah. I'm less concerned with enforcing exactly the right not-null checks than with being able to opt in to skipping the check altogether, because I know the nullability is correct.
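
For anyone hitting this, one such workaround is to relax the DataFrame's schema so that everything under a nullable struct is itself nullable before creating the table and writing. A minimal sketch, with a made-up helper name, that only recurses into structs (not arrays or maps):

from pyspark.sql.types import StructType, StructField

def as_nullable(schema: StructType) -> StructType:
    # Mark every field nullable, recursing into nested structs
    return StructType([
        StructField(
            f.name,
            as_nullable(f.dataType) if isinstance(f.dataType, StructType) else f.dataType,
            nullable=True,
        )
        for f in schema.fields
    ])

# Rebuild the earlier `test` DataFrame with the relaxed schema; the Delta table
# must also be created with this schema for the write to pass the invariant checks
relaxed = spark.createDataFrame(test.rdd, as_nullable(test.schema))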

@zsxwing
Member

zsxwing commented May 24, 2022

Adding a new behavior, even if it's opt-in, would also add extra work when we extend the not-null constraint behavior to other engines, and I don't feel it's worth it right now.

I will keep this open to hear more voices.

@megri

megri commented Oct 31, 2022

I was bitten by this today. I'm using Scala and have this model:

case class MyRow(id: Int, value: String)

case class Changes(before: Option[MyRow], after: MyRow)

modelling changes between MyRows with the same id. I expected this to work, but I get org.apache.spark.sql.delta.schema.DeltaInvariantViolationException: NOT NULL constraint violated for column: before.id.

I can work around it but it feels like incorrect behaviour.

@Kimahriman
Contributor Author

Do we know how other engines like Presto/Trino handle nullability of structs? Because Delta essentially doesn't support struct nullability. One option would be, when creating a table or adding a new field to the schema, to force everything under a nullable struct to be nullable as well. This seems to be the behavior Delta expects.

@zsxwing
Member

zsxwing commented Dec 8, 2022

Yep. It would be great if anyone familiar with Presto/Trino could provide feedback.

@Kimahriman
Contributor Author

My team keeps getting bitten by this, and we have to find workarounds to force certain fields to be nullable even when they don't have to be. I also keep seeing other people bring this up (someone just did on Slack).

Any reconsideration of respecting struct nullability? I still don't see any valid way to conclude that a non-nullable field inside a nullable struct must always be not null; that's an invalid interpretation of the struct being nullable. It isn't even consistent with how Parquet files treat nullable structs, and according to the user on Slack, at least Kafka interprets this correctly.

@PhillHenry

Yeah, there are workarounds, they're just less than ideal. You lose the 0.01% speedup you would get by avoiding null checks on the field, hah.

You'd be surprised. Typically, code will check a single pointer for null, and that pointer might have been assigned to something that lives in the CPU's cache (i.e., access to it will be very fast). Checking for null over big data generally pulls large amounts from RAM (relatively slow) and pushes other data out of the CPU cache. If you do this for a large amount of data, it starts to show in terms of performance.

@jqmtor

jqmtor commented Sep 30, 2024

This is a problem for my team as well, and I think the following example demonstrates it well.

We are using Delta to transport data in Debezium CDC payloads. In that format, the top-level before and after fields need to be nullable: a delete event is denoted by a null after field, and a create event is denoted by a null before field.

However, both of those fields follow the same schema specification and, more importantly, the schema of the before and after fields is independent of the schema of the envelope that encloses them. We define the expected schema of the envelope by composing the schema of the entity (the before and after fields) with the schema of the envelope (the op, source, etc. fields).

This feature effectively ties the schema of the entity to the schema of the enclosing envelope. For our use cases, this coupling is problematic: the schema of the transported entities is the most important thing, and it is used not only by Delta but by several other systems, so preserving it when data travels in a Debezium CDC payload is what lets us preserve the entities' integrity.
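
To make the shape concrete, here is roughly what the envelope looks like as a Spark schema; the envelope field names come from Debezium's format, while the entity fields are illustrative:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Entity schema shared across systems; `id` is genuinely never null in an entity
entity = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

# Debezium-style envelope: `before` is null for creates, `after` is null for
# deletes, yet Delta's invariant still demands before.id/after.id be non-null
envelope = StructType([
    StructField("before", entity, nullable=True),
    StructField("after", entity, nullable=True),
    StructField("op", StringType(), nullable=False),
])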

@Kimahriman
Contributor Author

Ironically, that's exactly how the Delta transaction log itself is stored: add will only be non-null if it's an add action, remove will only be non-null if it's a remove action, etc.

Based on the way Delta currently handles nullability, there is no way to actually express the equivalent of "if the action is an add, then add.path must be non-null". Trying to do that makes Delta interpret it as add.path must be non-null even when the action is not an add.

@Kimahriman linked a pull request Oct 4, 2024 that will close this issue
@Kimahriman
Contributor Author

Also, FWIW, this isn't an issue in delta-rs (it doesn't do any implicit null checking, AFAIK). So you can write this "valid" data through other writers and then read it in Spark; you just can't write it from Spark right now.

@jqmtor

jqmtor commented Oct 8, 2024

I forgot to mention before, but using flink-delta yields the same result as delta-rs: the data can be written from Flink and then read by Spark. However, this prevents using the maintenance jobs, which is how we learned about this limitation: we tried to run the optimize job on a Delta table we had written from Flink, and the issue surfaced.
