feat!: new create_one ExpressionHandler API #662

Open
wants to merge 11 commits into
base: main

Collaborator

@zachschuermann zachschuermann commented Jan 24, 2025

What changes are proposed in this pull request?

Adds a new create_one API for creating single-row EngineData. It implements a SchemaTransform that turns the given schema + leaf values into a single-row ArrowEngineData.

  1. Adds the new fn create_one to our ExpressionHandler trait (breaking)
  2. Implements create_one for ArrowExpressionHandler

This PR affects the following public APIs

A new create_one method is now required on ExpressionHandler, and StructType gains a new len() method.

How was this change tested?

Bunch of new unit tests.

@zachschuermann
Collaborator Author

note I'll be cleaning up/adding more tests. wanted to get some eyes on this approach first

@github-actions github-actions bot added the breaking-change label (Change that will require a version bump) Jan 24, 2025
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

One thing I want to consider/discuss about this approach is that we require that the expression struct hierarchy matches the schema one. So a schema Struct(Struct(Scalar(Int))) requires an expression Struct(Struct(Literal(int))). This code wouldn't allow a Literal(int) expression.

Idk if we want to enforce that requirement in the long run? It's very common for kernel to flatten out the fields of a schema (ex: in a visitor), so I don't see why we shouldn't allow flattened expressions.

Perhaps this acts as a safety thing. Kernel is the only one calling create_one, and it ensures that things are nested as we expected.
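
To make the "flattened" view concrete, here is a minimal sketch of walking a nested schema depth first and listing its leaf paths, which is the order in which a flat list of leaf values would line up with the schema. The `DataType`, `Field`, `field`, and `leaf_paths` names are hypothetical stand-ins for illustration, not kernel's actual types.

```rust
// Hypothetical stand-ins for schema types; names are illustrative only.
#[derive(Debug)]
enum DataType {
    Int,
    Struct(Vec<Field>),
}

#[derive(Debug)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Collect the dotted path of every leaf (non-struct) field, depth first.
fn leaf_paths(fields: &[Field], prefix: &str, out: &mut Vec<String>) {
    for f in fields {
        let path = if prefix.is_empty() {
            f.name.clone()
        } else {
            format!("{prefix}.{}", f.name)
        };
        match &f.data_type {
            // Structs contribute no value of their own, only their leaves.
            DataType::Struct(children) => leaf_paths(children, &path, out),
            DataType::Int => out.push(path),
        }
    }
}
```

Under this model, a schema shaped like Struct(Struct(Scalar(Int))) has exactly one leaf path, so a single flat literal would be enough to fill it.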

Collaborator

@scovich scovich left a comment

quick pass, couple reactions
(overall approach looks good)


codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 92.74448% with 46 lines in your changes missing coverage. Please review.

Project coverage is 84.44%. Comparing base (06d8dbb) to head (a4237b2).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/engine/arrow_expression.rs 92.48% 41 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #662      +/-   ##
==========================================
+ Coverage   84.14%   84.44%   +0.30%     
==========================================
  Files          77       77              
  Lines       17710    18383     +673     
  Branches    17710    18383     +673     
==========================================
+ Hits        14902    15524     +622     
- Misses       2096     2143      +47     
- Partials      712      716       +4     


@zachschuermann zachschuermann changed the title feat: new create_one ExpressionHandler API feat!: new create_one ExpressionHandler API Jan 31, 2025
Collaborator

@scovich scovich left a comment

Flushing comments from an interrupted-and-forgotten review...

Comment on lines 578 to 581
match self.stack.pop() {
    Some(array) => Ok(array),
    None => Err(Error::generic("didn't build array")),
}
Collaborator

Relating to the other FIXME about panicking:

Suggested change
- match self.stack.pop() {
-     Some(array) => Ok(array),
-     None => Err(Error::generic("didn't build array")),
- }
+ let Some(array) = self.stack.pop() else {
+     return Err(Error::generic("didn't build array"));
+ };
+ let Some(array) = array.as_struct_opt() else {
+     return Err(Error::generic("not a struct"));
+ };
+ Ok(array)

Collaborator Author

I think as_struct_opt will return an &StructArray - and I want to avoid having to clone that

Collaborator Author

just ended up checking array.data_type() though I wonder if it would be better to actually return an Arc<StructArray> instead of the trait object ArrayRef?

Comment on lines 736 to 753
for (child, field) in child_arrays.iter().zip(struct_type.fields()) {
    if !field.is_nullable() && child.is_null(0) {
        // if we have a null child array for a not-nullable field, either all other
        // children must be null (and we make a null struct) or error
        if child_arrays.iter().all(|c| c.is_null(0))
            && self.nullability_stack.iter().any(|n| *n)
        {
            self.stack.push(Arc::new(StructArray::new_null(fields, 1)));
            return Some(Cow::Borrowed(struct_type));
        } else {
            self.set_error(Error::Generic(format!(
                "Non-nullable field {} is null in single-row struct",
                field.name()
            )));
            return None;
        }
    }
}
Collaborator

Hmm i'm not convinced by this. Pls correct me if I'm missing something! We're keeping track of the parent nullability with the nullability stack. Seems that we allow a nullability violation if any ancestor node is nullable and all the children are null. But I may have found a counter example:
Consider this schema

{
  x(nullable): {
    a (non-nullable),
    b (non-nullable) {
      c (non-nullable)
    }
  }
}

Suppose we get the leaf scalars [1, NULL], i.e. a = 1 and c = NULL.

When we're processing struct b, we'll iterate over all of its fields. We'll find that c is null when it's non-nullable. At b I think the nullability stack would be [true, false] from x and b respectively.

Given all these, we don't return an error. We allowed c to be null because we thought its ancestor x is null. That's this check

if child_arrays.iter().all(|c| c.is_null(0)) && self.nullability_stack.iter().any(|n| *n)

But if x is null, then a should also be null, which it isn't.

Collaborator

@OussamaSaoudi OussamaSaoudi Feb 1, 2025

Assuming I'm not missing something, I thought up an alternate solution.

Definitions

We should fail if there is a nullability violation. Nullability violations can happen in 2 cases:

  • Base case: a leaf field is non-nullable, but the value is null.
  • Struct case: A struct has a nullability violation if both hold:
    1. at least one of its children has a nullability violation
    2. The struct does not resolve the nullability violation.

A nullability violation for a struct node is resolved when both hold:
1) all of its children are null
2) the node is nullable.

This is the case where the entire struct is null. All of its children may be null, and violations can be safely ignored.

Solution

We keep track of 2 variables for each node:

  • null_subtree: true if the node and all of its descendants are null.
  • null_violation: true if the node has a nullability violation (as defined above).

And an additional variable for struct nodes:

  • is_resolved: true if the node is nullable and its null_subtree is true.

Base case:

  • null_subtree = True if the leaf is null
  • null_violation = True if the field is non-nullable, but the value is null

Inductive case:

  • null_subtree = True if all the children are null
  • is_resolved = True if null_subtree and current node is nullable
  • null_violation = True if (any child has null_violation) and !(is_resolved)

Return an error if null_violation == true at the top level.
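
The base and inductive cases above can be sketched as a small bottom-up pass. This is only an illustration under simplifying assumptions: `Node`, `Status`, `status`, and `check` are hypothetical names, each leaf carries just a nullable flag and an is-null flag for the single row, and a struct's null_subtree is taken to mean that every child's null_subtree is true.

```rust
// Hypothetical single-row model; not kernel's or arrow's types.
enum Node {
    Leaf { nullable: bool, is_null: bool },
    Struct { nullable: bool, children: Vec<Node> },
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Status {
    null_subtree: bool,
    null_violation: bool,
}

fn status(node: &Node) -> Status {
    match node {
        // Base case: a leaf violates nullability if it is non-nullable but null.
        Node::Leaf { nullable, is_null } => Status {
            null_subtree: *is_null,
            null_violation: !*nullable && *is_null,
        },
        // Inductive case: combine the statuses of the children.
        Node::Struct { nullable, children } => {
            let child: Vec<Status> = children.iter().map(status).collect();
            let null_subtree = child.iter().all(|s| s.null_subtree);
            let is_resolved = null_subtree && *nullable;
            let null_violation =
                child.iter().any(|s| s.null_violation) && !is_resolved;
            Status { null_subtree, null_violation }
        }
    }
}

/// Fail if a violation survives all the way up to the top level.
fn check(root: &Node) -> Result<(), String> {
    if status(root).null_violation {
        Err("nullability violation".to_string())
    } else {
        Ok(())
    }
}
```

With the earlier schema (x nullable, a and b and c non-nullable), the input a = 1, c = NULL leaves a violation at x, while a = NULL, c = NULL is resolved at x because x is nullable and its whole subtree is null.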

Collaborator

I may have found a counter example:

Ignoring all code for a moment, and tweaking slightly to add d as a sibling to c:

x(nullable): {
  a (non-nullable),
  b (non-nullable) {
    c (non-nullable),
    d (nullable)
  }
}

Analysis

At the time we encounter NULL for c, there are only two possible outcomes:

  1. b is non-NULL => definitely an error
  2. b is NULL => possibly allowed (depending on whether b is allowed to be NULL, which in turn depends on whether x is NULL)

However, we are doing a depth-first traversal. So at the time we process e.g. c we have not even seen d yet, let alone processed parent b and grandparent x. The stack is [a:<whatever>, c:NULL].

Since we cannot yet know the correct handling of c, we just push its NULL value on the stack and move on to d (which we also just push onto the stack). Once the recursion unwinds to b, we have two possibilities:

  1. [a:<whatever>, c:NULL, d:NULL] -- because all children of b are NULL (c and d), and at least one of those children is "immediately" non-nullable, we assume the intent was to express (by transitivity) the fact that b itself is NULL (recall that b is not a leaf so we can't represent its nullness directly). Result: [a:<whatever>, b:NULL]. Whether that's good or bad is still to be determined transitively as the recursion unwinds.
  2. [a:<whatever>, c:NULL, d:<something>] -- because d is non-NULL, we know b cannot be NULL and therefore it is an error for "immediately" non-nullable c to be NULL. Result: **ERROR**.

Assuming we did not already error out, we again have two possibilities:

  1. [a:NULL, b:NULL] -- as before, all children of x are NULL (a and b), and at least one of those children is "immediately" non-nullable, so we assume the intent was to express x is NULL. Since x is immediately nullable, this is totally legitimate and the recursion completes successfully.
  2. [a:<something>, b: NULL] -- again as before, x cannot be NULL because it has a non-NULL child a. So NULL value for "immediately" non-nullable b is illegal and the recursion errors out.

Coming back to code:

The recursive algorithm would seem to be:

  • For all leaf values, accept NULL values unconditionally, deferring correctness checks to the parent.
  • Whenever the recursion unwinds to reach a (now complete) struct node, examine the children. We have several possible child statuses:
    • All children non-NULL -- No problem, nothing to see here, move on.
    • All children NULL.
      • If all children are nullable, this is fine, and we interpret the parent as non-NULL with all-null children.
      • Otherwise, we interpret this as an indirect way of expressing that the parent itself is NULL. As with a leaf value, we accept that NULL value unconditionally, deferring correctness checks to the parent.
    • Otherwise, we have a mix of NULL and non-NULL children. The parent thus cannot be NULL.
      • If any of the NULL children are immediately non-nullable => ERROR
      • Otherwise, no problem, nothing to see here, move on.

If we consider all combos of the above schema that involve at least one NULL:

  • [a:<something>, c:<something>, d:NULL] - OK (x.b.d is nullable)
  • [a:<something>, c:NULL, d:<something>] - ERROR (x.b.c is non-nullable, detected by b)
  • [a:NULL, c:<something>, d:<something>] - ERROR (x.a is non-nullable, detected by x)
  • [a:<something>, c:NULL, d:NULL] - ERROR (x.b is non-nullable, detected by x)
  • [a:NULL, c:<something>, d:NULL] - ERROR (x.a is non-nullable, detected by x)
  • [a:NULL, c:NULL, d:<something>] - ERROR (x.b.c is non-nullable, detected by b)
  • [a:NULL, c:NULL, d:NULL] - OK (x is nullable)

Notably, I don't think we need a stack to track nullability -- each parent just verifies its direct children for correct match-up of their nullability (and NULL values) vs. its own nullability. If there is no obvious local conflict, it makes itself either NULL or non-null as appropriate and then trusts its parent to do the same checking as needed.
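
As a rough check of the table above, the local rule can be modeled without arrow at all. The sketch below is an assumption-laden simplification: `Node`, `resolve`, and `schema` are hypothetical names, leaves only record whether the single-row value is NULL, and x is wrapped in a non-nullable root struct so that something plays the role of x's parent.

```rust
// Hypothetical single-row model of a schema plus leaf values; not kernel's types.
enum Node {
    Leaf { name: &'static str, nullable: bool, is_null: bool },
    Struct { name: &'static str, nullable: bool, children: Vec<Node> },
}

impl Node {
    fn name(&self) -> &'static str {
        match self {
            Node::Leaf { name, .. } | Node::Struct { name, .. } => *name,
        }
    }
    fn nullable(&self) -> bool {
        match self {
            Node::Leaf { nullable, .. } | Node::Struct { nullable, .. } => *nullable,
        }
    }
}

/// Ok(true) means the node resolves to NULL, Ok(false) means non-NULL.
fn resolve(node: &Node) -> Result<bool, String> {
    match node {
        // Leaves accept NULL unconditionally; the parent checks correctness.
        Node::Leaf { is_null, .. } => Ok(*is_null),
        Node::Struct { children, .. } => {
            let mut all_null = true;
            let mut non_nullable_null: Option<&'static str> = None;
            for child in children {
                if !resolve(child)? {
                    all_null = false;
                } else if !child.nullable() && non_nullable_null.is_none() {
                    non_nullable_null = Some(child.name());
                }
            }
            match non_nullable_null {
                // No non-nullable child is NULL: the struct itself is non-NULL.
                None => Ok(false),
                // All children NULL: read it as "the struct itself is NULL".
                Some(_) if all_null => Ok(true),
                // Mixed NULL and non-NULL children: local conflict.
                Some(name) => Err(format!("non-nullable field {name} is NULL")),
            }
        }
    }
}

/// The example schema, wrapped in a non-nullable root that acts as x's parent.
fn schema(a_null: bool, c_null: bool, d_null: bool) -> Node {
    let leaf = |name, nullable, is_null| Node::Leaf { name, nullable, is_null };
    let b = Node::Struct {
        name: "b",
        nullable: false,
        children: vec![leaf("c", false, c_null), leaf("d", true, d_null)],
    };
    let x = Node::Struct {
        name: "x",
        nullable: true,
        children: vec![leaf("a", false, a_null), b],
    };
    Node::Struct { name: "root", nullable: false, children: vec![x] }
}
```

In this model only the first and last rows of the table resolve without error, and each failing row reports the same field the table names. Note that no nullability stack is threaded through the recursion.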

Code
fn transform_struct(&mut self, struct_type: &'a StructType) -> Option<Cow<'a, StructType>> {
    // NOTE: This is an optimization; the other early-return suffices to produce correct behavior.
    if self.error.is_some() {
        return None;
    }
    
    // Only consume newly-added entries (if any). There could be fewer than expected if
    // the recursion encountered an error.
    let mark = self.stack.len();
    let _ = self.recurse_into_struct(struct_type);
    let field_values = self.stack.split_off(mark);
    if self.error.is_some() {
        return None;
    }
    
    require!(field_values.len() == struct_type.len(), ...);
    let mut found_non_nullable_null = false;
    let mut all_null = true;
    for (f, v) in struct_type.fields().zip(&field_values) {
        if v.is_valid(0) {
            all_null = false;
        } else if !f.is_nullable() {
            found_non_nullable_null = true;
        }
    }
    
    let null_buffer = found_non_nullable_null.then(|| {
        // The struct had a non-nullable NULL. This is only legal if all fields were NULL, which we
        // interpret as the struct itself being NULL.
        require!(all_null, ...);
        
        // We already have the all-null columns we need, just need a null buffer
        NullBuffer::new_null(1)
    });
    
    // Assemble the struct normally but mark it NULL? Or make a NULL struct directly?
    let sa = match StructArray::try_new(..., null_buffer) { ... };
    self.stack.push(sa);
    None
}  

Collaborator

For completeness of testing, we probably need a schema that exercises every possible combo of fields, along with one set of leaf scalars for every possible combo of NULL and non-NULL.

There are six "interesting" combos (n = nullable, ! = non-null):

n { n, n }
n { n, ! }
n { !, ! }
! { n, n }
! { n, ! }
! { !, ! }

Each one can have 4 distinct input value combinations, for a total of 6x4 = 24 cases to test.
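
One way to avoid hand-writing the 24 cases is to generate them. The sketch below is a hypothetical illustration (the `Outcome`, `Case`, `classify`, and `all_cases` names are made up here) that enumerates every schema/value combination for a two-field struct and classifies the struct's local outcome using the all-null/non-nullable-null rule discussed above. The parent's own nullability is recorded but not judged, since that check belongs to the grandparent.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    NonNull, // struct is assembled as a non-NULL value
    Null,    // all children NULL and at least one is non-nullable
    Error,   // a non-nullable child is NULL next to a non-NULL sibling
}

#[derive(Debug, Clone, Copy)]
struct Case {
    parent_nullable: bool,
    child_nullable: [bool; 2],
    child_is_null: [bool; 2],
    outcome: Outcome,
}

/// Local classification of a two-field struct for a single row.
fn classify(child_nullable: [bool; 2], child_is_null: [bool; 2]) -> Outcome {
    let all_null = child_is_null.iter().all(|n| *n);
    let found_non_nullable_null = child_nullable
        .iter()
        .zip(child_is_null.iter())
        .any(|(nullable, is_null)| *is_null && !*nullable);
    match (found_non_nullable_null, all_null) {
        (false, _) => Outcome::NonNull,
        (true, true) => Outcome::Null,
        (true, false) => Outcome::Error,
    }
}

/// 2 parent nullabilities x 3 child-nullability pairs x 4 value patterns.
fn all_cases() -> Vec<Case> {
    let mut cases = Vec::new();
    for parent_nullable in [true, false] {
        for child_nullable in [[true, true], [true, false], [false, false]] {
            for child_is_null in [[false, false], [false, true], [true, false], [true, true]] {
                cases.push(Case {
                    parent_nullable,
                    child_nullable,
                    child_is_null,
                    outcome: classify(child_nullable, child_is_null),
                });
            }
        }
    }
    cases
}
```

A table-driven test could then iterate over `all_cases()` and compare each `outcome` against what the real transform produces for the same schema and scalars.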

Collaborator

aside: What happens when a struct is non-nullable, but all its children are nullable? Does this mean that we enforce that at least one of the children is non-null?

Collaborator Author

lots of great discussion :) I have the 24 tests written so we can certainly continue to use those (will push in a sec)

two bits to close on:

> If we're comfortable always interpreting all-fields-null as struct-null -- and I'm not aware of any specific reason to forbid it -- then the code simplifies a lot more [...]

I think this is reasonable - I can't think of a reason not to? Wanted to highlight that either way we choose we are basically 'disallowing' one case (i.e. we don't support creating it with create_one)

  1. (original case) we don't support constructing a NULL struct if all fields are null (and allowed null)
  2. (new case) we don't support constructing a struct with all-null fields

my strawman: do the new case since it is simpler and has straightforward semantics

lastly:

> then the only other solution I can think of would be to define a utility method that turns the list of scalars into a Expression::Struct containing other struct and literal expressions, which then becomes the input the engine sees

this feels similar to the approach we had before in which we wanted to traverse the struct/schema but then require some specific semantics for the input expression (no column references, etc.) and it was a similarly complex (maybe even more so) implementation.

Collaborator Author

One last thing I wanted to flag: my original 'nullability stack' started off with 'false' for the root (the root struct array must not be null in order to create a RecordBatch out of it). In the new approach, it's slightly more general and could produce a NULL top-level StructArray which is unable to become a RecordBatch so I've introduced just a simple one-off check that will cause create_one to fail if the transform hands back a NULL StructArray.

aside: I'm not sure why there isn't just an easy API for StructArray to RecordBatch that doesn't panic..? Am I missing it?

Collaborator Author

@zachschuermann zachschuermann Feb 4, 2025

what is the expected behavior in this case? do we need to treat the top-level NULLs differently? I would expect the following to fail but it seems that arrow disagrees...

x: (not_null) {
  a: (nullable) LONG,
  b: (not_null) LONG,
}

if values = [Null, Null], we get the "all null" struct collapsing at level a,b.
this gives x: (not_null) { NULL }

if we consider all-null children to always be safe, this will also simplify to just a single top-level NULL (feels incorrect)
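
The case above can be traced with a tiny model. This is a sketch under stated assumptions, not the PR's code: `struct_is_null` and `check_root` are hypothetical helpers, each field is reduced to a (nullable, is_null) pair for the single row, and x is treated as the top-level struct. The local rule collapses [Null, Null] into a NULL struct, and only a separate root-level guard turns that into a failure.

```rust
/// Each field is (nullable, is_null) for a single row. Hypothetical model.
/// Ok(true) means the struct collapses to NULL, Ok(false) means non-NULL.
fn struct_is_null(fields: &[(bool, bool)]) -> Result<bool, String> {
    let all_null = fields.iter().all(|(_, is_null)| *is_null);
    let found_non_nullable_null = fields
        .iter()
        .any(|(nullable, is_null)| *is_null && !*nullable);
    match (found_non_nullable_null, all_null) {
        (false, _) => Ok(false),
        (true, true) => Ok(true),
        (true, false) => Err("non-nullable field is NULL".to_string()),
    }
}

/// One-off guard for the top level: a NULL root cannot become a record batch.
fn check_root(fields: &[(bool, bool)]) -> Result<(), String> {
    if struct_is_null(fields)? {
        return Err("top-level struct resolved to NULL".to_string());
    }
    Ok(())
}
```

For a (nullable), b (not_null) with values [Null, Null], `struct_is_null` reports a NULL struct rather than an error, so rejecting the input is left to `check_root` (or, for a nested struct, to the parent's own check).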

Collaborator Author

for some additional context it seems arrow will happily create a StructArray with a not-null field if the null buffer passed in to try_new contains all of the corresponding child array's nulls.

Collaborator

> my original 'nullability stack' started off with 'false' for the root (the root struct array must not be null in order to create a RecordBatch out of it). In the new approach, it's slightly more general and could produce a NULL top-level StructArray which is unable to become a RecordBatch

That's definitely annoying, and possibly a good reason to keep old behavior that all-null only translates to null struct if some fields are non-nullable...

> it seems arrow will happily create a StructArray with a not-null field if the null buffer passed in to try_new contains all of the corresponding child array's nulls.

Right, this is similar to our recursive algo -- whether that null top-level value is bad depends on the parent. For example, record batch as a parent does not like top-level NULL, but a nullable field as a parent is totally fine.

@roeap
Collaborator

roeap commented Feb 3, 2025

Just thinking out loud here...

I do think we can already do a lot of data generation using the existing expression API. The main thing that is missing is the ability to communicate the desired number of rows in evaluate.

The code below produces data much like we want it to.

let add_expr = Expression::struct_from([
    Expression::literal("file:///path"),
    Expression::literal(100),
    Expression::literal(Scalar::Null(DeltaDataTypes::INTEGER)),
]);
let schema = StructType::new(vec![
    StructField::new("path", DeltaDataTypes::STRING, false),
    StructField::new("size", DeltaDataTypes::INTEGER, false),
    StructField::new("size_null", DeltaDataTypes::INTEGER, true),
]);

let dummy_schema = Schema::new(vec![Field::new("a", DataType::Boolean, false)]);
let dummy_batch = RecordBatch::try_new(
    Arc::new(dummy_schema),
    vec![Arc::new(BooleanArray::from(vec![true]))],
)
.unwrap();

let handler = ArrowExpressionHandler {};
let evaluator = handler.get_evaluator(schema.clone().into(), add_expr, schema.into());

let data = Box::new(ArrowEngineData::new(dummy_batch));

let result = evaluator.evaluate(data.as_ref()).unwrap();
let result = result
    .any_ref()
    .downcast_ref::<ArrowEngineData>()
    .unwrap()
    .record_batch()
    .clone();

print_batches(&[result]).unwrap();

Since expression evaluation is something we already expect engines to implement, I wonder if it is simpler for the engine if we use the expression mechanics and maybe add a method evaluate_one(&self) ... which tells the engine to evaluate an expression over an empty batch with one row?

The current approach here feels more explicit, but would also incur more work for engines wanting to adopt?

@scovich
Collaborator

scovich commented Feb 4, 2025

Interesting. If I try to distill/refine the idea, is it basically this?

  1. Define a new API whose only job is to produce a "dummy" engine data (***) with the requested number of rows
  2. Kernel uses the result of that API call as the input to an otherwise unremarkable expression evaluation

(***) The ideal "dummy" engine data would have no columns, but arrow probably doesn't allow that. So the next best would wrap a NullArray in a RecordBatch with an unpredictable field name. A uuid would work nicely for example.
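
A toy model of that idea, to show why the input only needs to carry a row count: literal-only expressions are broadcast to however many rows the dummy input claims to have. Everything here (`Scalar`, `Expr`, `Column`, `DummyBatch`, `evaluate`) is a hypothetical stand-in written for illustration and does not reflect the kernel or arrow APIs.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int(i64),
    Str(String),
    Null,
}

// Only literal and struct expressions; no column references are needed.
enum Expr {
    Literal(Scalar),
    Struct(Vec<Expr>),
}

#[derive(Debug, PartialEq)]
enum Column {
    Values(Vec<Scalar>),
    Struct(Vec<Column>),
}

/// Hypothetical "dummy" input: it carries a row count and nothing else.
struct DummyBatch {
    num_rows: usize,
}

/// Evaluate a literal-only expression by repeating each literal num_rows times.
fn evaluate(expr: &Expr, input: &DummyBatch) -> Column {
    match expr {
        Expr::Literal(s) => Column::Values(vec![s.clone(); input.num_rows]),
        Expr::Struct(fields) => {
            Column::Struct(fields.iter().map(|e| evaluate(e, input)).collect())
        }
    }
}
```

With `num_rows` set to 1 this yields the single-row result discussed in this thread. The trade-off raised above still applies: the engine reuses its existing evaluation path, but the kernel has to build a well-formed expression tree first.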

4 participants