
Implement predicate pruning for like expressions (prefix matching) #12978

Merged (8 commits) on Dec 30, 2024

Conversation

@adriangb (Contributor) commented Oct 16, 2024

The idea is that we can push certain like expressions down into statistics pruning.
For example, a filter like url LIKE 'https://www.google.com%' can (with some caveats and other tricks) be used to build the range filter url_min <= 'https://www.google.com' AND 'https://www.google.com' <= url_max, so that a row group containing only https://www.example.com would be excluded.
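A minimal Python sketch of that idea (the helper name `maybe_contains_prefix` is hypothetical, not DataFusion's API):

```python
# Illustrative sketch: skip a row group when its [min, max] statistics
# cannot contain any value starting with the LIKE prefix.
def maybe_contains_prefix(col_min: str, col_max: str, prefix: str) -> bool:
    # Append the highest code point so values like 'AB' still fall
    # inside the range implied by prefix 'A'.
    upper = prefix + "\U0010ffff"
    return col_min <= upper and prefix <= col_max

# A row group holding only example.com URLs is excluded for the
# google.com prefix, while a google.com row group is kept:
p = "https://www.google.com"
maybe_contains_prefix("https://www.example.com", "https://www.example.com", p)  # False
maybe_contains_prefix("https://www.google.com/a", "https://www.google.com/z", p)  # True
```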

Closes #507
Closes #13253

@github-actions bot added the `core` (Core DataFusion crate) label on Oct 16, 2024
@adriangb (Contributor Author)

cc @alamb

@alamb (Contributor) commented Oct 16, 2024

This is very clever -- I will review it tomorrow

Comment on lines 1499 to 1501
(false, true) => Operator::ILikeMatch,
(true, true) => Operator::NotILikeMatch,
Contributor

I think this is dead code, as if like_expr.case_insensitive() { catches the case-insensitive case

I think this code would be clearer if it just matched on like_expr.negated() (or alternately returned unhandled_hook.handle(expr); directly for these last 2 cases)

Contributor Author

Sure, will change it to do so. I think I was getting a bit ahead of myself trying to implement ILIKE support, which per the comment should be possible; maybe you can show me how to construct the physical expression to call lower() and upper() on another expression.

@alamb (Contributor) left a comment

Thank you @adriangb -- this is very cool.

I think there are a few more tests needed but otherwise the implementation looks very solid. Thank you

@@ -1610,6 +1625,93 @@ fn build_statistics_expr(
Ok(statistics_expr)
}

fn extract_string_literal(expr: &Arc<dyn PhysicalExpr>) -> Result<&String> {
if let Some(lit) = expr.as_any().downcast_ref::<phys_expr::Literal>() {
if let ScalarValue::Utf8(Some(s)) = lit.value() {
Contributor

I think you should probably also handle ScalarValue::LargeUtf8 and ScalarValue::Utf8View as well.

Contributor Author

And Dictionary!

if prefix.is_empty() {
return plan_err!("Empty prefix in LIKE expression");
}
Ok(Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(
Contributor

Do we also have to test if there are other occurrences of % in the string 🤔 (like foo%bar%)?

Contributor Author

The logic is pretty simple (truncate at the first one) but I agree another test would be nice.

// column LIKE '%foo%' => min <= '' && '' <= max => true
// column LIKE 'foo' => min <= 'foo' && 'foo' <= max

// I *think* that ILIKE could be handled by making the min lowercase and max uppercase
Contributor

👍 I agree. Figuring out how to make those calls would be the trick

fn build_like_match(
expr_builder: &mut PruningExpressionBuilder,
) -> Result<Arc<dyn PhysicalExpr>> {
// column LIKE literal => (min, max) LIKE literal split at % => min <= split literal && split literal <= max
Contributor

I don't understand the LIKE literal split at % part

column LIKE literal is the same as column = literal if there are no wildcards, so you should be able to use the same rules as equality I think

Contributor Author

Right, that's the point: by splitting it at the first % we are able to apply the same rules as equality:

column LIKE literal -> (min, max) LIKE (literal split at %) -> min <= split literal && split literal <= max
vs
column = literal -> (min, max) = literal -> min <= literal && literal <= max
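The "split at %" step above can be sketched in Python (the helper name `like_to_prefix` is hypothetical):

```python
# Sketch of "literal split at %": truncate the LIKE pattern at its first
# wildcard, then treat the remaining prefix like an equality literal.
def like_to_prefix(pattern: str) -> str:
    for i, c in enumerate(pattern):
        if c in "%_":  # ANSI SQL wildcards
            return pattern[:i]
    return pattern  # no wildcard: behaves like plain equality

like_to_prefix("foo%bar%")  # -> "foo"
like_to_prefix("foo")       # -> "foo"
```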

let expected_ret = &[true, true, false, false, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Contributor

the comments say s1 LIKE A% but the code builds a different expression

Contributor Author

Okay, I've tried to get all of these comments right. Honestly I think they could be removed, since the expressions are pretty self-explanatory, but I've left them for now. Let me know if you'd prefer that I remove them, or, if you want to keep them, whether any are obviously wrong.

let expected_ret = &[true, true, false, false, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Contributor

Can you also please add tests for other combinations:

  • s1 LIKE 'A'
  • s1 LIKE '%A%'
    I think it is important to verify that the matching does the right thing

I also think it is important to cover cases for NOT LIKE as well

let expected_ret = &[true, true, true, true, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Contributor

Suggested change
// s1 LIKE 'A%'
// s1 LIKE '%A'

Contributor

this one is still wrong

Contributor Author

Yeah, I haven't pushed yet, pending discussion of the correctness of the general approach.

@adriangb (Contributor Author) commented Oct 17, 2024

@alamb can these stats be truncated? I know stats in pages truncate large strings, e.g. if the min value is "B" could it be that the actual min value is "BA"? If so I think this approach may not work at all. Imagine we have a row group with data ["BA", "ZD"] which generates min/max stats ["B", "Z"]. Now we want to know if col LIKE '%A%' is possible. Clearly the answer should be yes but if we convert it to the predicate form we get 'B' <= '' AND '' <= 'Z' which gives false 😞. I think this could be resolved by truncating the stats column to be the same length as the prefix?

@alamb (Contributor) commented Oct 18, 2024

@adriangb in theory I think parquet statistics can be truncated.

Now we want to know if col LIKE '%A%'

I don't think we can use statistics for substring match -- we can only use statistics for equality and prefix matching

so like col LIKE 'A%'

The predicate would be transformed into 'B' <= 'A' AND 'A' <= 'Z' which I do think is correct

@adriangb (Contributor Author)

Consider the values ["ABC", "XYZ"] with stats ["AB", "XY"] and the filter col like 'A%'. This becomes 'AB' <= 'A' AND 'A' <= 'XY' which is false, but we need true. To fix this we'd need to truncate the stats to the length of the filter to get 'A' <= 'A' AND 'A' <= 'XY' which then gives the right result. %A% is just an obvious case because you get '' as the prefix which gives obvious issues.
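The proposed fix could be sketched like this (the helper name `keep_with_truncated_stats` is hypothetical; this is not the PR's code):

```python
# Truncate the statistics to the prefix length before comparing, so that
# stats truncated by the writer cannot falsely exclude matching data.
def keep_with_truncated_stats(col_min: str, col_max: str, prefix: str) -> bool:
    n = len(prefix)
    return col_min[:n] <= prefix and prefix <= col_max[:n]

# Values ["ABC", "XYZ"] with writer-truncated stats ["AB", "XY"], filter LIKE 'A%':
keep_with_truncated_stats("AB", "XY", "A")  # True (correct)
# Without truncation the naive check wrongly excludes the row group:
"AB" <= "A" and "A" <= "XY"                 # False
```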

@adriangb (Contributor Author) commented Oct 19, 2024

Okay @alamb I pushed a pretty big rework. Lots of new test cases, lots of comments explaining what's going on. I removed the not like part; I'm thinking this is complex enough as is and most of the benefit (maybe even in ClickBench?) will come from like. We can tackle not like, ilike and not ilike in future PRs. Especially since it's going to be important to evaluate each test case carefully.

I will note that I am a bit concerned about the interaction of truncated stats and how we apply these filters. Take the (absurd) case of stats that were truncated so that all you have is "","". You basically know nothing about the data, there could be anything in there. Yet col = 'A' transforms into '' <= 'A' and 'A' <= '' which is false. Substitute in non-truncated stats and 'A' <= 'A' and 'A' <= 'Z' is true. I added a test case for this behavior on the existing code. It doesn't differentiate between "ABC" truncated to "" and "" actually being the min string but it shows the behavior which would be the same in both cases.

This is important because for the case of a max stat of "A" and a filter col like 'A_' if the stat might have been truncated from "AB" I need to let it pass, if I know for a fact that "A" is the max string in the column I can indeed reject the column.

@adriangb (Contributor Author)

Argh, the only ClickBench query this could maybe improve is Q23:

WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%'

Since the like starts with a wildcard this won't help. Maybe we can get not like to do something smart there, tbd...

@Dandandan (Contributor) commented Oct 19, 2024

I am wondering whether simple like patterns are already converted to startswith etc., such that pruning is already applied, or whether we need to implement that case instead of like?

@adriangb (Contributor Author)

I am wondering whether simple like patterns are already converted to startswith etc., such that pruning is already applied.

Good question. Based on the performance of some queries I saw, I'd say no, but it's worth double-checking. Any suggestions as to a definitive, easy way to check? I guess I can run datafusion-cli against a very large parquet file in object storage (high latency) with a query that should filter (col like 'A%') and one that can't?

I don't see where startswith or any other transformations (lower, upper, etc.) are handled in the pruning transformation.

@Dandandan (Contributor)

I would take a look at the physical plan of queries involving like first, to see if it still uses like or is transformed into another function.

@adriangb (Contributor Author) commented Oct 19, 2024

I made a big parquet file as follows:

import random
import string
import polars as pl

df = pl.DataFrame({'col': ["A" + "".join(random.choices(string.ascii_letters, k=1_000)) for _ in range(1_000_000)]})
df.write_parquet('data.parquet', compression='uncompressed')

This came out to ~1GB. I then uploaded it to a GCS bucket.

I ran queries col = 'Z' and col like 'Z' against it and got 2s and 23s respectively. IMO that means it's not getting pushed down.

The explain plans reflect that as well:

ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 = Z, pruning_predicate=CASE WHEN col_null_count@2 = col_row_count@3 THEN false ELSE col_min@0 <= Z AND Z <= col_max@1 END, required_guarantees=[col in (Z)], metrics=[output_rows=0, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=19368790, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=3, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=18.748µs, time_elapsed_opening=7.717746249s, time_elapsed_processing=64.457827ms, page_index_eval_time=10.134µs, pushdown_eval_time=20ns, time_elapsed_scanning_total=19.21µs]
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 LIKE Z, metrics=[output_rows=1000000, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=1006955145, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=49.346124581s, time_elapsed_opening=2.18377s, time_elapsed_processing=1.545583231s, page_index_eval_time=20ns, pushdown_eval_time=20ns, time_elapsed_scanning_total=49.654700084s]

The like query also has:

FilterExec: col@0 LIKE Z, metrics=[output_rows=0, elapsed_compute=1.878551ms]

So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning.

I also think pruning should handle whatever that produces (startswith in the case of like 'A%' or = in the case of like 'A') as well as additional simple cases like upper(), lower(), etc.

@adriangb changed the title from "Implement predicate pruning for LIKE and NOT LIKE" to "Implement predicate pruning for like expressions" on Oct 19, 2024
@alamb (Contributor) commented Oct 21, 2024

So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning.

I also think pruning should handle whatever that produces (startswith in the case of like 'A%' or = in the case of like 'A') as well as additional simple cases like upper(), lower(), etc.

That certainly makes sense to me

Perhaps we could implement some sort of simplification in https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs for LIKE into starts_with https://docs.rs/datafusion/latest/datafusion/functions/expr_fn/fn.starts_with.html (though since it is a function figuring out how to make the rewrite work might be tricky)

Then we can implement the rules in pruning predicate for starts_with 🤔

@adriangb (Contributor Author) commented Oct 21, 2024

At most we could simplify the col like 'A%' case but we can't simplify 'A%B' so I think it's still worth it to implement pruning rewrites for both.

Do you have any thoughts on my concerns for possibly truncated stats, in particular how = may even be wrong as of today if the stats are truncated enough?

@Dandandan (Contributor)

I think it’s fine to support like for now and leave the further combination / optimization for future work. I see that only simplifying to starts_with won't get all the benefits.

Of course, the pruning needs to be correct :). We could add (negative) test cases / add issues if the already implemented logic for = pruning is incorrect.

@adriangb (Contributor Author) commented Oct 21, 2024

Well, there are no tests for hypothetical cases with truncated stats. All of the tests are against the stats themselves, with no indication of how those are meant to correspond with the original data. There were no unit tests of Utf8 filtering at all as far as I can tell.

The current implementation of = is certainly not wrong in the real world, but I'm not sure if that's because it's not used in situations where stats are truncated, if truncation only happens at extremes like a 10MB value where practically it's not a problem, etc.

@alamb (Contributor) commented Oct 21, 2024

I suggest we (I can help tomorrow):

  1. File a ticket to simplify like to = and starts_with when possible (to help follow on optimizations like this)
  2. File / find a ticket about testing with truncated statistics
  3. Determine what, if anything, is left to add directly to the pruning predicate

@adriangb (Contributor Author)

Sounds good.

All of that said, I think this PR is currently as correct as = and has pretty good test coverage. Does it need to wait on those tasks or can it proceed in parallel?

let (min_lit, max_lit) = if let Some(wildcard_index) = first_wildcard_index {
let prefix = &s[..wildcard_index];
let prefix_min_lit = Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(
format!("{prefix}\u{10ffff}"),
Member

The 'highest character' should be appended to the max range, not the min.

Member

When implementing a similar thing for Trino (trinodb/trino@6aea881), I decided to stay within ASCII characters to avoid potential issues due to misinterpretation of "difficult" code points.

Contributor Author

Hmm my intuition was that you want to add the highest character to the upper lower bound of the min value such that 'A%' can match 'AB'. Assuming a column with only 1 row "AB" and the query 'AB' like 'A%':

  • 'AB' <= 'A\u{10ffff}' and 'A' <= 'AB' -> true
  • 'AB' <= 'A' and 'A\u{10ffff}' <= 'AB' -> false

Right?
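That intuition is easy to check in Python (illustrative only; a single-row column "AB" and pattern 'A%'):

```python
hi = "\U0010ffff"  # highest Unicode code point
# Appending hi to the bound compared against col_min keeps the row group:
kept = ("AB" <= "A" + hi) and ("A" <= "AB")
# Appending it to the other side wrongly excludes it:
wrong = ("AB" <= "A") and ("A" + hi <= "AB")
print(kept, wrong)  # True False
```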

Contributor Author

What does it mean to stay within ASCII? As far as I know everything here is Utf8 so I'm not sure how we can restrict it to ASCII?

Contributor Author

I'm wondering if comment is related to the confusing naming in https://github.com/apache/datafusion/pull/12978/files#r1810513072?

Member

What does it mean to stay within ASCII? As far as I know everything here is Utf8 so I'm not sure how we can restrict it to ASCII?

We can if we want to. The code will do whatever we ask it to do.
Then the question is whether we want to. If we apply the predicate locally in memory only, then no need to be cautious, no need for "stay within ASCII". If we later interop with other systems (eg send a plan somewhere or TableProvider calls remote system), then it might be beneficial to restrict ourselves.

Hmm my intuition was that you want to add the highest character to the upper lower bound

i agree with this

... of the min value

min value of what?
all column values need to be in the range [like_constant_prefix, like_constant_prefix[0..-1] + \u10ffff)

Member

... of the min value

min value of what?

i get it now

So for a column we have stats: min value and max value. Let's call them col_min and col_max.
For like AB% we derive lower and upper bound (AB and AB\u10ffff which is actually incorrect, will comment about this elsewhere).

For pruning we need to check whether [col_min, col_max] ∩ [lower_bound, upper_bound) is non-empty (note the upper_bound will be non-inclusive)
It's empty when upper_bound <= col_min OR col_max < lower_bound
It's non-empty when upper_bound > col_min AND col_max >= lower_bound
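That emptiness check can be written as a small Python sketch (hypothetical names; pattern 'AB%' with derived bounds ["AB", "AC")):

```python
# [col_min, col_max] ∩ [lower, upper) is non-empty iff
# upper > col_min AND col_max >= lower (upper bound is exclusive).
def ranges_intersect(col_min: str, col_max: str, lower: str, upper: str) -> bool:
    return upper > col_min and col_max >= lower

ranges_intersect("AA", "AZ", "AB", "AC")  # True  -> cannot prune
ranges_intersect("B",  "C",  "AB", "AC")  # False -> prune
```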

Member

For correct upper bound and why exclusive see #12978 (comment)

// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' should be true!
// Credit to https://stackoverflow.com/a/35881551 for inspiration on this approach.
// ANSI SQL specifies two wildcards: % and _. % matches zero or more characters, _ matches exactly one character.
let first_wildcard_index = s.find(['%', '_']);
Member

we do not support escape characters, right?

Contributor Author

> SELECT '' LIKE '' ESCAPE '%';
Execution error: LIKE does not support escape_char

So I think not.
But just for reference, how would you suggest escape characters be handled? I've never used them in practice (I think at that point I'd just go for a regex).


Comment on lines 1661 to 1664
// **IMPORTANT** we need to make sure that the min and max are in the range of the prefix
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' should be true!
Member

This is important, but a bit difficult to follow.

Contributor Author

I'm open to clarifications on the wording. The point I'm trying to make is why we have to append characters.

Member

I'd remove this whole comment. If someone understands how LIKE works, they will get the code without the comment.
If someone doesn't understand how LIKE works, they won't understand the code even with the comment.

let min_expr = Arc::new(phys_expr::BinaryExpr::new(
min_column_expr.clone(),
Operator::LtEq,
min_lit,
Member

min_lit actually represents max value (upper bound)
would you consider swapping the naming of the variables?

Also, the upper bound has the 'highest codepoint' appended at the end, so it can be compared with Lt without the Eq part

Contributor Author

min_lit actually represents max value (upper bound)
would you consider swapping the naming of the variables?

I'm open to any suggestions on naming but I do think it is confusing because min_lit is the upper bound on col_min and max_lit is the lower bound on col_max 🤯

Also, the upper bound has added "Z" ('highest codepoint') at the end, so can be compared with Lt without Eq part

👍🏻

Member

it is confusing because min_lit is the upper bound on col_min and max_lit is the lower bound on col_max 🤯

Yes, it is.

i understand now that you're fitting this into terminology of existing code.
i am not sure what the right naming would be. maybe neutral: lower_bound and upper_bound?

Comment on lines 1678 to 1700
let prefix_lit =
Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(s.clone()))));
(prefix_lit.clone(), prefix_lit)
Member

In such case we should produce single Eq check

Contributor Author

Not sure what you mean by a single eq check. We are basically saying col like 'constant' -> col = 'constant' -> col_min <= 'constant' and 'constant' <= col_max

let s = extract_string_literal(scalar_expr)?;
// **IMPORTANT** we need to make sure that the min and max are in the range of the prefix
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
Member

for A% pattern the lower bound is A (obvious)
what should be the upper bound?

A\u{10ffff} is not a correct upper bound since A\u{10ffff}\u{10ffff} is even bigger but still matches A% input.
The correct upper bound would be:

  • A\u{10ffff}\u{10ffff}\u{10ffff}...\u{10ffff} inclusive -- up to the max length of the column, so potentially very long, and absolutely not practical
  • B (exclusive).

Thus to calculate upper bound you need (pseudo-code)

let s = extract_string_literal(scalar_expr)?;
let first_wildcard_index = ...;
let prefix = &s[..wildcard_index];
let last_incrementable_character = /* find last code point of `prefix` that can be incremented
   if we choose to stay within ascii, this will be a code point < 127
   otherwise it will be any code point != the max code point (0x10FFFF) */;
if last_incrementable_character not found {
  // For `%`, or `\u{10ffff}...\u{10ffff}%` patterns, we cannot calculate an upper bound
  return None
}
let upper_bound = 
   prefix[..last_incrementable_character-1] +  // take prefix of the prefix up to  and excluding the last character that can be incremented
   str(prefix[last_incrementable_character] + 1) // take last character and increment it
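The pseudo-code above could be rendered in runnable Python roughly as follows (a sketch under the reviewer's description; the function name is hypothetical, and a real Rust implementation would also have to skip code points that don't stay valid UTF-8, such as surrogates):

```python
MAX_CODE_POINT = 0x10FFFF

def like_prefix_upper_bound(prefix: str):
    """Exclusive upper bound for values matching `prefix + '%'`:
    find the last incrementable code point, increment it, drop the tail.
    Returns None when nothing can be incremented (e.g. an empty prefix
    or an all-\\U0010ffff prefix)."""
    for i in range(len(prefix) - 1, -1, -1):
        cp = ord(prefix[i])
        if cp < MAX_CODE_POINT:
            return prefix[:i] + chr(cp + 1)
    return None  # cannot compute an upper bound

like_prefix_upper_bound("A")            # -> "B" (exclusive)
like_prefix_upper_bound("A\U0010ffff")  # -> "B"
like_prefix_upper_bound("")             # -> None
```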

Contributor Author

I'm a bit confused about this explanation. Maybe you can provide a failing test case that would help me understand? There is already a test case for 'A%' and it is as far as I can tell doing the correct thing.

Member

try add a test case for 'A%' where stats are min=AB max=A\u{10ffff}\u{10ffff}\u{10ffff}

Contributor Author

Thanks for the suggestion. I added a test case in e29ed50. It worked as expected, let me know if I got the expected outcomes wrong or missed something.

Member

See #12978 (comment) on how to make the test expose the problem

// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' should be true!
// Credit to https://stackoverflow.com/a/35881551 for inspiration on this approach.
Member

This link isn't useful. That page conveniently avoids any details that are important. Please remove the link.

Contributor Author

Do you have a better resource or name for this transformation? I'm happy to point at any documentation or prior art in Trino.

Member

I am just not finding this link useful beyond being inspirational. But the source of inspiration doesn't need to be reflected in the code comment 🙂
I am not asking for it to be replaced with any other link.

I personally find trinodb/trino@6aea881 valuable because (1) I know this code as its author and (2) it might actually be correct. Once we get the code here correct, that link wouldn't be useful either.

@adriangb force-pushed the like-prune branch 2 times, most recently from a0a1c37 to ae3426d, on October 26, 2024 21:46
@adriangb (Contributor Author)

@alamb I re-arranged some of the comments on assertions in ae3426d which I feel like helped a lot with readability of the tests. There's a couple other tests with a similar pattern that I think could benefit.

I was also thinking about doing some more black-box testing: given any min and max you can always build a RecordBatch with an array of the form [min, max], and the pruning should never say the array can be excluded when the array actually has matches. Does that make sense? Maybe this could even be fuzz tested?
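The property could be fuzz-tested along these lines (an illustrative Python sketch, not DataFusion's fuzz harness; restricted to ASCII uppercase so the \U0010ffff upper-bound edge case discussed earlier doesn't apply):

```python
import random
import string

# Pruning predicate under test: may keep too much, but must never
# drop a container that holds a match.
def keep(col_min: str, col_max: str, prefix: str) -> bool:
    return col_min <= prefix + "\U0010ffff" and prefix <= col_max

random.seed(0)
for _ in range(10_000):
    a = "".join(random.choices(string.ascii_uppercase, k=random.randint(0, 3)))
    b = "".join(random.choices(string.ascii_uppercase, k=random.randint(0, 3)))
    col_min, col_max = min(a, b), max(a, b)
    prefix = "".join(random.choices(string.ascii_uppercase, k=random.randint(0, 2)))
    # The "array" is [col_min, col_max]; if either value matches the
    # prefix, pruning must keep the container.
    if col_min.startswith(prefix) or col_max.startswith(prefix):
        assert keep(col_min, col_max, prefix)
```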

@alamb (Contributor) commented Dec 20, 2024

Clearly I failed to review this -- I will do so hopefully later today but may be tomorrow

@alamb (Contributor) commented Dec 23, 2024

This is still on my list, hopefully other people can check it out too

@alamb (Contributor) commented Dec 23, 2024

This is my top priority after DF 44 is released.

@alamb (Contributor) left a comment

Thanks @adriangb -- I think this PR is ready to go

One thing I noticed is that the fuzz test takes over a minute on my machine:

        SLOW [> 60.000s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
        PASS [  65.772s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
------------
     Summary [  72.749s] 47 tests run: 47 passed (1 slow), 0 skipped
andrewlamb@Mac:~/Software/datafusion$

Is there some way to make it faster? Maybe with multiple threads, or crank down the number of things to test?

"~",
"ß",
"℣",
"%", // this one is useful for like/not like tests since it will result in randomly inserted wildcards
Contributor

👍

datafusion/core/tests/fuzz_cases/pruning.rs (outdated, resolved)
/// of "fo" that may have originally been "foz" or anything else with the prefix "fo".
/// E.g. `increment_utf8("foo") >= "foo"` and `increment_utf8("foo") >= "fooz"`
/// In this example `increment_utf8("foo") == "fop"`
fn increment_utf8(data: &str) -> Option<String> {
Contributor

would it be ok to potentially replace this with the implementation from @etseidl in apache/arrow-rs#6870 ?

If so, I can file a ticket to do so as a follow on

Contributor Author

I haven't reviewed that implementation but yes I think we should consider it!

@adriangb
Copy link
Contributor Author

Thanks @adriangb -- I think this PR is ready to go

One thing I noticed is that the fuzz test takes over a minute on my machine:

        SLOW [> 60.000s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
        PASS [  65.772s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
------------
     Summary [  72.749s] 47 tests run: 47 passed (1 slow), 0 skipped
andrewlamb@Mac:~/Software/datafusion$

Is there some way to make it faster? Maybe with multiple threads or crank down the number of things to test?

Yeah this is what I was hinting at in #12978 (comment).

I'm happy to throw threads at it for a start, and restricting the search space might be necessary but I think requires a more careful eye to minimize how much valuable testing is discarded. The other thing that I think we can do is speed up the tests themselves, in particular minimizing unnecessary round trips to Parquet, but I'm not sure where the right places to hook in would be that still give us a realistic test but remove the need to re-parse the same data over and over again.
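As a sketch of the "throw threads at it" option (names here are hypothetical; `run_case` stands in for one generate/write/prune/compare round trip of the real fuzz test):

```rust
use std::thread;

// Stand-in for one fuzz round trip (generate data, write Parquet, prune, compare).
fn run_case(case: u32) -> u32 {
    case % 7
}

// Split independent fuzz cases across OS threads and combine their results.
fn run_cases_in_parallel(cases: &[u32], workers: usize) -> u32 {
    // Chunk the case list so each worker gets a roughly equal share.
    let chunk = cases.len().div_ceil(workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = cases
            .chunks(chunk)
            .map(|slice| s.spawn(move || slice.iter().map(|&c| run_case(c)).sum::<u32>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

Since the cases are independent, a parallel run must produce the same result as a serial one; the same idea applies with tokio tasks if the test harness is async.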

@alamb
Copy link
Contributor

alamb commented Dec 28, 2024

Yeah this is what I was hinting at in #12978 (comment).

I'm happy to throw threads at it for a start, and restricting the search space might be necessary but I think requires a more careful eye to minimize how much valuable testing is discarded. The other thing that I think we can do is speed up the tests themselves, in particular minimizing unnecessary round trips to Parquet, but I'm not sure where the right places to hook in would be that still give us a realistic test but remove the need to re-parse the same data over and over again.

Awesome -- I'll try and find time later today or tomorrow to give it a critical eye. Otherwise I'll plan to merge this PR later today or tomorrow as well.

@alamb
Copy link
Contributor

alamb commented Dec 30, 2024

Yeah this is what I was hinting at in #12978 (comment).

I'm happy to throw threads at it for a start, and restricting the search space might be necessary but I think requires a more careful eye to minimize how much valuable testing is discarded. The other thing that I think we can do is speed up the tests themselves, in particular minimizing unnecessary round trips to Parquet, but I'm not sure where the right places to hook in would be that still give us a realistic test but remove the need to re-parse the same data over and over again.

@alamb alamb merged commit fb1d4bc into apache:main Dec 30, 2024
27 checks passed
@alamb alamb added the performance Make DataFusion faster label Dec 30, 2024
@alamb
Copy link
Contributor

alamb commented Dec 30, 2024

Thank you again @adriangb for bearing with us -- I know this took a long time

However, I am pretty stoked that we now have this optimization and it is an example of the very careful engineering required for this kind of optimization. The fact we are at this point in DataFusion is pretty sweet in my mind

🚀

wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jan 16, 2025
…pache#12978)

* Implement predicate pruning for like expressions

* add function docstring

* re-order bounds calculations

* fmt

* add fuzz tests

* fix clippy

* Update datafusion/core/tests/fuzz_cases/pruning.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
// column LIKE '%foo%' => min <= '' && '' <= max => true
// column LIKE 'foo' => min <= 'foo' && 'foo' <= max

fn unpack_string(s: &ScalarValue) -> Option<&String> {
Copy link
Contributor


I proposed pulling this function out into its own API here: #14167
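For context, the prefix extraction that feeds these bounds can be sketched as follows (a hypothetical standalone sketch assuming `\` as the escape character; the real code operates on `PhysicalExpr`s and `ScalarValue`s rather than plain strings):

```rust
/// Return the literal prefix of a LIKE pattern: the characters before the
/// first unescaped wildcard (`%` or `_`), or `None` if there is no usable
/// prefix. `url LIKE 'https://www.google.com%'` can then be pruned with
/// bounds on that prefix against the column's min/max statistics.
fn like_prefix(pattern: &str) -> Option<String> {
    let mut prefix = String::new();
    let mut chars = pattern.chars();
    while let Some(ch) = chars.next() {
        match ch {
            '%' | '_' => break,          // wildcard ends the literal prefix
            '\\' => match chars.next() { // escaped character is literal
                Some(esc) => prefix.push(esc),
                None => break,
            },
            _ => prefix.push(ch),
        }
    }
    if prefix.is_empty() { None } else { Some(prefix) }
}
```

A pattern with no leading literal (e.g. `'%foo%'`) yields `None`, matching the `min <= '' && '' <= max => true` case above, and a pattern with no wildcard at all degenerates to an equality-style bound.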

jayzhan211 added a commit that referenced this pull request Jan 20, 2025
* Minor: Use `div_ceil`

* Fix hash join with sort push down (#13560)

* fix: join with sort push down

* chore:
insert some value

* apply suggestion

* recover handle_costom_pushdown change

* apply suggestion

* add more test

* add partition

* Improve substr() performance by avoiding using owned string (#13688)

Co-authored-by: zhangli20 <[email protected]>

* reinstate down_cast_any_ref (#13705)

* Optimize performance of `character_length` function (#13696)

* Optimize performance of `character_length` function

Signed-off-by: Tai Le Manh <[email protected]>

* Add pre-check array is null

* Fix clippy warnings

---------

Signed-off-by: Tai Le Manh <[email protected]>

* Update prost-build requirement from =0.13.3 to =0.13.4 (#13698)

Updates the requirements on [prost-build](https://github.com/tokio-rs/prost) to permit the latest version.
- [Release notes](https://github.com/tokio-rs/prost/releases)
- [Changelog](https://github.com/tokio-rs/prost/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tokio-rs/prost/compare/v0.13.3...v0.13.4)

---
updated-dependencies:
- dependency-name: prost-build
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Minor: Output elapsed time for sql logic test (#13718)

* Minor: Output elapsed time for sql logic test

* refactor: simplify the `make_udf_function` macro (#13712)

* refactor: replace `Vec` with `IndexMap` for expression mappings in `ProjectionMapping` and `EquivalenceGroup` (#13675)

* refactor: replace Vec with IndexMap for expression mappings in ProjectionMapping and EquivalenceGroup

* chore

* chore: Fix CI

* chore: comment

* chore: simplify

* Handle alias when parsing sql(parse_sql_expr) (#12939)

* fix: Fix parse_sql_expr not handling alias

* cargo fmt

* fix parse_sql_expr example(remove alias)

* add testing

* add SUM udaf to TestContextProvider and modify test_sql_to_expr_with_alias for function

* revert change on example `parse_sql_expr`

* Improve documentation for TableProvider (#13724)

* Reveal implementing type and return type in simple UDF implementations (#13730)

Debug trait is useful for understanding what something is and how it's
configured, especially if the implementation is behind dyn trait.

* minor: Extract tests for `EXTRACT` AND `date_part` to their own file (#13731)

* Support unparsing `UNNEST` plan to `UNNEST` table factor SQL (#13660)

* add `unnest_as_table_factor` and `UnnestRelationBuilder`

* unparse unnest as table factor

* fix typo

* add tests for the default configs

* add a static const for unnest_placeholder

* fix tests

* fix tests

* Update to apache-avro 0.17, fix compatibility changes schema handling  (#13727)

* Update apache-avro requirement from 0.16 to 0.17

---
updated-dependencies:
- dependency-name: apache-avro
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix compatibility changes schema handling apache-avro 0.17

- Handle ArraySchema struct
- Handle MapSchema struct
- Map BigDecimal => LargeBinary
- Map TimestampNanos => Timestamp(TimeUnit::Nanosecond, None)
- Map LocalTimestampNanos => todo!()
- Add Default to FixedSchema test

* Update Cargo.lock file for apache-avro 0.17

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marc Droogh <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* Minor: Add doc example to RecordBatchStreamAdapter (#13725)

* Minor: Add doc example to RecordBatchStreamAdapter

* Update datafusion/physical-plan/src/stream.rs

Co-authored-by: Berkay Şahin <[email protected]>

---------

Co-authored-by: Berkay Şahin <[email protected]>

* Implement GroupsAccumulator for corr(x,y) aggregate function (#13581)

* Implement GroupsAccumulator for corr(x,y)

* feedbacks

* fix CI MSRV

* review

* avoid collect in accumulation

* add back cast

* fix union serialisation order in proto (#13709)

* fix union serialisation order in proto

* clippy

* address comments

* Minor: make unsupported `nanosecond` part a real (not internal) error (#13733)

* Minor: make unsupported `nanosecond` part a real (not internal) error

* fmt

* Improve wording to refer to date part

* Add tests for date_part on columns + timestamps with / without timezones (#13732)

* Add tests for date_part on columns + timestamps with / without timezones

* Add tests from https://github.com/apache/datafusion/pull/13372

* remove trailing whitespace

* Optimize performance of `initcap` function (~2x faster) (#13691)

* Optimize performance of initcap (~2x faster)

Signed-off-by: Tai Le Manh <[email protected]>

* format

---------

Signed-off-by: Tai Le Manh <[email protected]>

* Minor: Add documentation explaining that initcap only works for ASCII (#13749)

* Support sqllogictest --complete with postgres (#13746)

Before the change, the request to use PostgreSQL was simply ignored when
`--complete` flag was present.

* doc-gen: migrate window functions documentation to attribute based (#13739)

* doc-gen: migrate window functions documentation

Signed-off-by: zjregee <[email protected]>

* fix: update Cargo.lock

---------

Signed-off-by: zjregee <[email protected]>

* Minor: Remove memory reservation in `JoinLeftData` used in HashJoin (#13751)

* Refactor JoinLeftData structure by removing unused memory reservation field in hash join implementation

* Add Debug and Clone derives for HashJoinStreamState and ProcessProbeBatchState enums

This commit enhances the HashJoinStreamState and ProcessProbeBatchState structures by implementing the Debug and Clone traits, allowing for easier debugging and cloning of these state representations in the hash join implementation.

* Update to bigdecimal 0.4.7 (#13747)

* Add big decimal formatting test cases with potential trailing zeros

* Rename and simplify decimal rendering functions

- add `decimal` to function name
- drop `precision` parameter as it is not supposed to affect the result

* Update to bigdecimal 0.4.7

Utilize new `to_plain_string` function

* chore: clean up dependencies (#13728)

* CI: Warn on unused crates

* CI: Warn on unused crates

* CI: Warn on unused crates

* CI: Warn on unused crates

* CI: Clean up dependencies

* CI: Clean up dependencies

* fix: Implicitly plan `UNNEST` as lateral (#13695)

* plan implicit lateral if table factor is UNNEST

* check for outer references in `create_relation_subquery`

* add sqllogictest

* fix lateral constant test to not expect a subquery node

* replace sqllogictest in favor of logical plan test

* update lateral join sqllogictests

* add sqllogictests

* fix logical plan test

* Minor: improve the Deprecation / API health guidelines (#13701)

* Minor: improve the Deprecation / API health policy

* prettier

* Update docs/source/library-user-guide/api-health.md

Co-authored-by: Jonah Gao <[email protected]>

* Add version guidance and make more copy/paste friendly

* prettier

* better

* rename to guidelines

---------

Co-authored-by: Jonah Gao <[email protected]>

* fix: specify roottype in substrait fieldreference (#13647)

* fix: specify roottype in fieldreference

Signed-off-by: MBWhite <[email protected]>

* Fix formatting

Signed-off-by: MBWhite <[email protected]>

* review suggestion

Signed-off-by: MBWhite <[email protected]>

---------

Signed-off-by: MBWhite <[email protected]>

* Simplify type signatures using `TypeSignatureClass` for mixed type function signature (#13372)

* add type sig class

Signed-off-by: jayzhan211 <[email protected]>

* timestamp

Signed-off-by: jayzhan211 <[email protected]>

* date part

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* taplo format

Signed-off-by: jayzhan211 <[email protected]>

* tpch test

Signed-off-by: jayzhan211 <[email protected]>

* msrc issue

Signed-off-by: jayzhan211 <[email protected]>

* msrc issue

Signed-off-by: jayzhan211 <[email protected]>

* explicit hash

Signed-off-by: jayzhan211 <[email protected]>

* Enhance type coercion and function signatures

- Added logic to prevent unnecessary casting of string types in `native.rs`.
- Introduced `Comparable` variant in `TypeSignature` to define coercion rules for comparisons.
- Updated imports in `functions.rs` and `signature.rs` for better organization.
- Modified `date_part.rs` to improve handling of timestamp extraction and fixed query tests in `expr.slt`.
- Added `datafusion-macros` dependency in `Cargo.toml` and `Cargo.lock`.

These changes improve type handling and ensure more accurate function behavior in SQL expressions.

* fix comment

Signed-off-by: Jay Zhan <[email protected]>

* fix signature

Signed-off-by: Jay Zhan <[email protected]>

* fix test

Signed-off-by: Jay Zhan <[email protected]>

* Enhance type coercion for timestamps to allow implicit casting from strings. Update SQL logic tests to reflect changes in timestamp handling, including expected outputs for queries involving nanoseconds and seconds.

* Refactor type coercion logic for timestamps to improve readability and maintainability. Update the `TypeSignatureClass` documentation to clarify its purpose in function signatures, particularly regarding coercible types. This change enhances the handling of implicit casting from strings to timestamps.

* Fix SQL logic tests to correct query error handling for timestamp functions. Updated expected outputs for `date_part` and `extract` functions to reflect proper behavior with nanoseconds and seconds. This change improves the accuracy of test cases in the `expr.slt` file.

* Enhance timestamp handling in TypeSignature to support timezone specification. Updated the logic to include an additional DataType for timestamps with a timezone wildcard, improving flexibility in timestamp operations.

* Refactor date_part function: remove redundant imports and add missing not_impl_err import for better error handling

---------

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>

* Minor: Add some more blog posts to the readings page (#13761)

* Minor: Add some more blog posts to the readings page

* prettier

* prettier

* Update docs/source/user-guide/concepts-readings-events.md

---------

Co-authored-by: Oleks V <[email protected]>

* docs: update GroupsAccumulator instead of GroupAccumulator (#13787)

Fixing `GroupsAccumulator` trait name in its docs

* Improve Deprecation Guidelines more (#13776)

* Improve deprecation guidelines more

* prettier

* fix: add `null_buffer` length check to `StringArrayBuilder`/`LargeStringArrayBuilder` (#13758)

* fix: add `null_buffer` check for `LargeStringArray`

Add a safety check to ensure that the alignment of buffers cannot be
overflowed. This introduces a panic if they are not aligned through a
runtime assertion.

* fix: remove value_buffer assertion

These buffers can be misaligned and it is not problematic, it is the
`null_buffer` which we care about being of the same length.

* feat: add `null_buffer` check to `StringArray`

This is in a similar vein to `LargeStringArray`, as the code is the
same, except for `i32`'s instead of `i64`.

* feat: use `row_count` var to avoid drift

* Revert the removal of reservation in HashJoin (#13792)

* fix: restore memory reservation in JoinLeftData for accurate memory accounting in HashJoin

This commit reintroduces the `_reservation` field in the `JoinLeftData` structure to ensure proper tracking of memory resources during join operations. The absence of this field could lead to inconsistent memory usage reporting and potential out-of-memory issues as upstream operators increase their memory consumption.

* fmt

Signed-off-by: Jay Zhan <[email protected]>

---------

Signed-off-by: Jay Zhan <[email protected]>

* added count aggregate slt (#13790)

* Update documentation guidelines for contribution content (#13703)

* Update documentation guidelines for contribution content

* Apply suggestions from code review

Co-authored-by: Piotr Findeisen <[email protected]>
Co-authored-by: Oleks V <[email protected]>

* clarify discussions and remove requirements note

* prettier

* Update docs/source/contributor-guide/index.md

Co-authored-by: Piotr Findeisen <[email protected]>

---------

Co-authored-by: Piotr Findeisen <[email protected]>
Co-authored-by: Oleks V <[email protected]>

* Add Round trip tests for Array <--> ScalarValue (#13777)

* Add Round trip tests for Array <--> ScalarValue

* String dictionary test

* remove unecessary value

* Improve comments

* fix: Limit together with pushdown_filters (#13788)

* fix: Limit together with pushdown_filters

* Fix format

* Address new comments

* Fix testing case to hit the problem

* Minor: improve Analyzer docs (#13798)

* Minor: cargo update in datafusion-cli (#13801)

* Update datafusion-cli toml to pin home=0.5.9

* update Cargo.lock

* Fix `ScalarValue::to_array_of_size` for DenseUnion (#13797)

* fix: enable pruning by bloom filters for dictionary columns (#13768)

* Handle empty rows for `array_distinct` (#13810)

* handle empty array distinct

* ignore

* fix

---------

Co-authored-by: Cyprien Huet <[email protected]>

* Fix get_type for higher-order array functions (#13756)

* Fix get_type for higher-order array functions

* Fix recursive flatten

The fix is covered by recursive flatten test case in array.slt

* Restore "keep LargeList" in Array signature

* clarify naming in the test

* Chore: Do not return empty record batches from streams (#13794)

* do not emit empty record batches in plans

* change function signatures to Option<RecordBatch> if empty batches are possible

* format code

* shorten code

* change list_unnest_at_level for returning Option value

* add documentation
take concat_batches into compute_aggregates function again

* create unit test for row_hash.rs

* add test for unnest

* add test for unnest

* add test for partial sort

* add test for bounded window agg

* add test for window agg

* apply simplifications and fix typo

* apply simplifications and fix typo

* Handle possible overflows in StringArrayBuilder / LargeStringArrayBuilder (#13802)

* test(13796): reproducer of overflow on capacity

* fix(13796): handle overflows with proper max capacity number which is valid for MutableBuffer

* refactor: use simple solution and provide panic

* fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema (#13750)

* fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema

* clippy

* fix csv and json tests

* add testing for parquet

* cleanup

* fix parquet tests

* document describe_partition, add back repartition options to one of the csv empty files tests

* Support Null regex override in csv parser options. (#13228)

Co-authored-by: Andrew Lamb <[email protected]>

* Minor: Extend ScalarValue::new_zero() (#13828)

* Update mod.rs

* Update mod.rs

* Update mod.rs

* Update mod.rs

* chore: temporarily disable windows flow (#13833)

* feat: `parse_float_as_decimal` supports scientific notation and Decimal256 (#13806)

* feat: `parse_float_as_decimal` supports scientific notation and Decimal256

* Fix test

* Add test

* Add test

* Refine negative scales

* Update comment

* Refine bigint_to_i256

* UT for bigint_to_i256

* Add ut for parse_decimal

* Replace `BooleanArray::extend` with `append_n` (#13832)

* Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments (#13817)

* Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments

* Apply suggestions from code review

Co-authored-by: Piotr Findeisen <[email protected]>

* improve docs

---------

Co-authored-by: Piotr Findeisen <[email protected]>

* [bugfix] ScalarFunctionExpr does not preserve the nullable flag on roundtrip (#13830)

* [test] coalesce round trip schema mismatch

* [proto] added the nullable flag in PhysicalScalarUdfNode

* [bugfix] propagate the nullable flag for serialized scalar UDFS

* Add example of interacting with a remote catalog (#13722)

* Add example of interacting with a remote catalog

* Update datafusion/core/src/execution/session_state.rs

Co-authored-by: Berkay Şahin <[email protected]>

* Apply suggestions from code review

Co-authored-by: Jonah Gao <[email protected]>
Co-authored-by: Weston Pace <[email protected]>

* Use HashMap to hold tables

---------

Co-authored-by: Berkay Şahin <[email protected]>
Co-authored-by: Jonah Gao <[email protected]>
Co-authored-by: Weston Pace <[email protected]>

* Update substrait requirement from 0.49 to 0.50 (#13808)

* Update substrait requirement from 0.49 to 0.50

Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version.
- [Release notes](https://github.com/substrait-io/substrait-rs/releases)
- [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.49.0...v0.50.0)

---
updated-dependencies:
- dependency-name: substrait
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix compilation

* Add expr test

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <[email protected]>

* typo: remove extraneous "`" in doc comment, fix header (#13848)

* typo: extraneous "`" in doc comment

* Update datafusion/execution/src/runtime_env.rs

* Update datafusion/execution/src/runtime_env.rs

---------

Co-authored-by: Oleks V <[email protected]>

* typo: remove extra "`" interfering with doc formatting (#13847)

* Support n-ary monotonic functions in ordering equivalence (#13841)

* Support n-ary monotonic functions in `discover_new_orderings`

* Add tests for n-ary monotonic functions in `discover_new_orderings`

* Fix tests

* Fix non-monotonic test case

* Fix unintended simplification

* Minor comment changes

* Fix tests

* Add `preserves_lex_ordering` field

* Use `preserves_lex_ordering` on `discover_new_orderings()`

* Add `output_ordering` and `output_preserves_lex_ordering` implementations for `ConcatFunc`

* Update tests

* Move logic to UDF

* Cargo fmt

* Refactor

* Cargo fmt

* Simply use false value on default implementation

* Remove unnecessary import

* Clippy fix

* Update Cargo.lock

* Move dep to dev-dependencies

* Rename output_preserves_lex_ordering to preserves_lex_ordering

* minor

---------

Co-authored-by: berkaysynnada <[email protected]>

* Replace `execution_mode` with `emission_type` and `boundedness` (#13823)

* feat: update execution modes and add bitflags dependency

- Introduced `Incremental` execution mode alongside existing modes in the DataFusion execution plan.
- Updated various execution plans to utilize the new `Incremental` mode where applicable, enhancing streaming capabilities.
- Added `bitflags` dependency to `Cargo.toml` for better management of execution modes.
- Adjusted execution mode handling in multiple files to ensure compatibility with the new structure.

* add exec API

Signed-off-by: Jay Zhan <[email protected]>

* replace done but has stackoverflow

Signed-off-by: Jay Zhan <[email protected]>

* exec API done

Signed-off-by: Jay Zhan <[email protected]>

* Refactor execution plan properties to remove execution mode

- Removed the `ExecutionMode` parameter from `PlanProperties` across multiple physical plan implementations.
- Updated related functions to utilize the new structure, ensuring compatibility with the changes.
- Adjusted comments and cleaned up imports to reflect the removal of execution mode handling.

This refactor simplifies the execution plan properties and enhances maintainability.

* Refactor execution plan to remove `ExecutionMode` and introduce `EmissionType`

- Removed the `ExecutionMode` parameter from `PlanProperties` and related implementations across multiple files.
- Introduced `EmissionType` to better represent the output characteristics of execution plans.
- Updated functions and tests to reflect the new structure, ensuring compatibility and enhancing maintainability.
- Cleaned up imports and adjusted comments accordingly.

This refactor simplifies the execution plan properties and improves the clarity of memory handling in execution plans.

* fix test

Signed-off-by: Jay Zhan <[email protected]>

* Refactor join handling and emission type logic

- Updated test cases in `sanity_checker.rs` to reflect changes in expected outcomes for bounded and unbounded joins, ensuring accurate test coverage.
- Simplified the `is_pipeline_breaking` method in `execution_plan.rs` to clarify the conditions under which a plan is considered pipeline-breaking.
- Enhanced the emission type determination logic in `execution_plan.rs` to prioritize `Final` over `Both` and `Incremental`, improving clarity in execution plan behavior.
- Adjusted join type handling in `hash_join.rs` to classify `Right` joins as `Incremental`, allowing for immediate row emission.

These changes improve the accuracy of tests and the clarity of execution plan properties.

* Implement emission type for execution plans

- Updated multiple execution plan implementations to replace `unimplemented!()` with `EmissionType::Incremental`, ensuring that the emission type is correctly defined for various plans.
- This change enhances the clarity and functionality of the execution plans by explicitly specifying their emission behavior.

These updates contribute to a more robust execution plan framework within the DataFusion project.

* Enhance join type documentation and refine emission type logic

- Updated the `JoinType` enum in `join_type.rs` to include detailed descriptions for each join type, improving clarity on their behavior and expected results.
- Modified the emission type logic in `hash_join.rs` to ensure that `Right` and `RightAnti` joins are classified as `Incremental`, allowing for immediate row emission when applicable.

These changes improve the documentation and functionality of join operations within the DataFusion project.

* Refactor emission type logic in join and sort execution plans

- Updated the emission type determination in `SortMergeJoinExec` and `SymmetricHashJoinExec` to utilize the `emission_type_from_children` function, enhancing the accuracy of emission behavior based on input characteristics.
- Clarified comments in `sort.rs` regarding the conditions under which results are emitted, emphasizing the relationship between input sorting and emission type.
- These changes improve the clarity and functionality of the execution plans within the DataFusion project, ensuring more robust handling of emission types.

* Refactor emission type handling in execution plans

- Updated the `emission_type_from_children` function to accept an iterator instead of a slice, enhancing flexibility in how child execution plans are passed.
- Modified the `SymmetricHashJoinExec` implementation to utilize the new function signature, improving code clarity and maintainability.

These changes streamline the emission type determination process within the DataFusion project, contributing to a more robust execution plan framework.

* Enhance execution plan properties with boundedness and emission type

- Introduced `boundedness` and `pipeline_behavior` methods to the `ExecutionPlanProperties` trait, improving the handling of execution plan characteristics.
- Updated the `CsvExec`, `SortExec`, and related implementations to utilize the new methods for determining boundedness and emission behavior.
- Refactored the `ensure_distribution` function to use the new boundedness logic, enhancing clarity in distribution decisions.
- These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project.

* Refactor execution plans to enhance boundedness and emission type handling

- Updated multiple execution plan implementations to incorporate `Boundedness` and `EmissionType`, improving the clarity and functionality of execution plans.
- Replaced instances of `unimplemented!()` with appropriate emission types, ensuring that plans correctly define their output behavior.
- Refactored the `PlanProperties` structure to utilize the new boundedness logic, enhancing decision-making in execution plans.
- These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project.

* Refactor memory handling in execution plans

- Updated the condition for checking memory requirements in execution plans from `has_finite_memory()` to `boundedness().requires_finite_memory()`, improving clarity in memory management.
- This change enhances the robustness of execution plans within the DataFusion project by ensuring more accurate assessments of memory constraints.

* Refactor boundedness checks in execution plans

- Updated conditions for checking boundedness in various execution plans to use `is_unbounded()` instead of `requires_finite_memory()`, enhancing clarity in memory management.
- Adjusted the `PlanProperties` structure to reflect these changes, ensuring more accurate assessments of memory constraints across the DataFusion project.
- These modifications contribute to a more robust and maintainable execution plan framework, improving the handling of boundedness in execution strategies.

* Remove TODO comment regarding unbounded execution plans in `UnboundedExec` implementation

- Eliminated the outdated comment suggesting a switch to unbounded execution with finite memory, streamlining the code and improving clarity.
- This change contributes to a cleaner and more maintainable codebase within the DataFusion project.

* Refactor execution plan boundedness and emission type handling

- Updated the `is_pipeline_breaking` method to use `requires_finite_memory()` for improved clarity in determining pipeline behavior.
- Enhanced the `Boundedness` enum to include detailed documentation on memory requirements for unbounded streams.
- Refactored `compute_properties` methods in `GlobalLimitExec` and `LocalLimitExec` to directly use the input's boundedness, simplifying the logic.
- Adjusted emission type determination in `NestedLoopJoinExec` to utilize the `emission_type_from_children` function, ensuring accurate output behavior based on input characteristics.

These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project, improving clarity and functionality in handling boundedness and emission types.

* Refactor emission type and boundedness handling in execution plans

- Removed the `OptionalEmissionType` struct from `plan_properties.rs`, simplifying the codebase.
- Updated the `is_pipeline_breaking` function in `execution_plan.rs` for improved readability by formatting the condition across multiple lines.
- Adjusted the `GlobalLimitExec` implementation in `limit.rs` to directly use the input's boundedness, enhancing clarity in memory management.

These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, improving the handling of emission types and boundedness.

* Refactor GlobalLimitExec and LocalLimitExec to enhance boundedness handling

- Updated the `compute_properties` methods in both `GlobalLimitExec` and `LocalLimitExec` to replace `EmissionType::Final` with `Boundedness::Bounded`, reflecting that limit operations always produce a finite number of rows.
- Changed the input's boundedness reference to `pipeline_behavior()` for improved clarity in execution plan properties.

These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, enhancing the handling of boundedness in limit operations.

* Review Part1

* Update sanity_checker.rs

* addressing reviews

* Review Part 1

* Update datafusion/physical-plan/src/execution_plan.rs

* Update datafusion/physical-plan/src/execution_plan.rs

* Shorten imports

* Enhance documentation for JoinType and Boundedness enums

- Improved descriptions for the Inner and Full join types in join_type.rs to clarify their behavior and examples.
- Added explanations regarding the boundedness of output streams and memory requirements in execution_plan.rs, including specific examples for operators like Median and Min/Max.

---------

Signed-off-by: Jay Zhan <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Preserve ordering equivalencies on `with_reorder` (#13770)

* Preserve ordering equivalencies on `with_reorder`

* Add assertions

* Return early if filtered_exprs is empty

* Add clarify comment

* Refactor

* Add comprehensive test case

* Add comment for exprs_equal

* Cargo fmt

* Clippy fix

* Update properties.rs

* Update exprs_equal and add tests

* Update properties.rs

---------

Co-authored-by: berkaysynnada <[email protected]>

* replace CASE expressions in predicate pruning with boolean algebra (#13795)

* replace CASE expressions in predicate pruning with boolean algebra

* fix merge

* update tests

* add some more tests

* add some more tests

* remove duplicate test case

* Update datafusion/physical-optimizer/src/pruning.rs

* swap NOT for !=

* replace comments, update docstrings

* fix example

* update tests

* update tests

* Apply suggestions from code review

Co-authored-by: Andrew Lamb <[email protected]>

* Update pruning.rs

Co-authored-by: Chunchun Ye <[email protected]>

* Update pruning.rs

Co-authored-by: Chunchun Ye <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Chunchun Ye <[email protected]>
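
The entry above replaces CASE-guarded pruning predicates with plain boolean algebra. A minimal Python sketch of the idea; the `null_count == row_count` guard shape is an assumption based on typical statistics-pruning predicates, not the exact DataFusion rewrite:

```python
def case_form(null_count, row_count, min_matches):
    # CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= v END
    return False if null_count == row_count else min_matches

def boolean_form(null_count, row_count, min_matches):
    # Equivalent boolean algebra: x_null_count != x_row_count AND x_min <= v
    return null_count != row_count and min_matches

def forms_agree():
    # The two forms produce identical results for every combination tried.
    return all(
        case_form(n, r, m) == boolean_form(n, r, m)
        for n in (0, 5, 10)
        for r in (10,)
        for m in (True, False)
    )
```

Dropping the CASE wrapper lets the resulting predicate simplify and compose with AND/OR like any other boolean expression.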

* enable DF's nested_expressions feature by in datafusion-substrait tests to make them pass (#13857)

fixes #13854

Co-authored-by: Arttu Voutilainen <[email protected]>

* Add configurable normalization for configuration options and preserve case for S3 paths (#13576)

* Do not normalize values

* Fix tests & update docs

* Prettier

* Lowercase config params

* Unify transform and parse

* Fix tests

* Rename `default_transform` and relax boundaries

* Make `compression` case-insensitive

* Comment to new line

* Deprecate and ignore `enable_options_value_normalization`

* Update datafusion/common/src/config.rs

* fix typo

---------

Co-authored-by: Oleks V <[email protected]>

* Improve`Signature` and `comparison_coercion` documentation (#13840)

* Improve Signature documentation more

* Apply suggestions from code review

Co-authored-by: Piotr Findeisen <[email protected]>

---------

Co-authored-by: Piotr Findeisen <[email protected]>

* feat: support normalized expr in CSE (#13315)

* feat: support normalized expr in CSE

* feat: support normalize_eq in cse optimization

* feat: support cumulative binary expr result in normalize_eq

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Upgrade to sqlparser `0.53.0` (#13767)

* chore: Update to sqlparser 0.53.0

* Update for new sqlparser API

* more api updates

* Avoid serializing query to SQL string unless it is necessary

* Box wildcard options

* chore: update datafusion-cli Cargo.lock

* Minor: Use `resize` instead of `extend` for adding static values in SortMergeJoin logic (#13861)

Thanks @Dandandan

* feat(function): add `least` function (#13786)

* start adding least fn

* feat(function): add least function

* update function name

* fix scalar smaller function

* add tests

* run Clippy and Fmt

* Generated docs using `./dev/update_function_docs.sh`

* add comment why `descending: false`

* update comment

* Update least.rs

Co-authored-by: Bruce Ritchie <[email protected]>

* Update scalar_functions.md

* run ./dev/update_function_docs.sh to update docs

* merge greatest and least implementation to one

* add header

---------

Co-authored-by: Bruce Ritchie <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* Improve SortPreservingMerge::enable_round_robin_repartition  docs (#13826)

* Clarify SortPreservingMerge::enable_round_robin_repartition  docs

* tweaks

* Improve comments more

* clippy

* fix doc link

* Minor: Unify `downcast_arg` method (#13865)

* Implement `SHOW FUNCTIONS` (#13799)

* introduce rid for different signature

* implement show functions syntax

* add syntax example

* avoid duplicate join

* fix clippy

* show function_type instead of routine_type

* add some doc and comments

* Update bzip2 requirement from 0.4.3 to 0.5.0 (#13740)

* Update bzip2 requirement from 0.4.3 to 0.5.0

Updates the requirements on [bzip2](https://github.com/trifectatechfoundation/bzip2-rs) to permit the latest version.
- [Release notes](https://github.com/trifectatechfoundation/bzip2-rs/releases)
- [Commits](https://github.com/trifectatechfoundation/bzip2-rs/compare/0.4.4...v0.5.0)

---
updated-dependencies:
- dependency-name: bzip2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix test

* Fix CLI cargo.lock

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <[email protected]>

* Fix build (#13869)

* feat(substrait): modular substrait consumer (#13803)

* feat(substrait): modular substrait consumer

* feat(substrait): include Extension Rel handlers in default consumer

Include SerializerRegistry based handlers for Extension Relations in the
DefaultSubstraitConsumer

* refactor(substrait) _selection -> _field_reference

* refactor(substrait): remove SubstraitPlannerState usage from consumer

* refactor: get_state() -> get_function_registry()

* docs: elide imports from example

* test: simplify test

* refactor: remove Arc from DefaultSubstraitConsumer

* doc: add ticket for API improvements

* doc: link DefaultSubstraitConsumer to from_substrait_plan

* refactor: remove redundant Extensions parsing

* Minor: fix: Include FetchRel when producing LogicalPlan from Sort (#13862)

* include FetchRel when producing LogicalPlan from Sort

* add suggested test

* address review feedback

* Minor: improve error message when ARRAY literals can not be planned (#13859)

* Minor: improve error message when ARRAY literals can not be planned

* fmt

* Update datafusion/sql/src/expr/value.rs

Co-authored-by: Oleks V <[email protected]>

---------

Co-authored-by: Oleks V <[email protected]>

* Add documentation for `SHOW FUNCTIONS` (#13868)

* Support unicode character for `initcap` function (#13752)

* Support unicode character for 'initcap' function

Signed-off-by: Tai Le Manh <[email protected]>

* Update unit tests

* Fix clippy warning

* Update sqllogictests - initcap

* Update scalar_functions.md docs

* Add suggestions change

Signed-off-by: Tai Le Manh <[email protected]>

---------

Signed-off-by: Tai Le Manh <[email protected]>

* [minor] make recursive package dependency optional  (#13778)

* make recursive optional

* add to default for common package

* cargo update

* added to readme

* make test conditional

* reviews

* cargo update

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Minor: remove unused async-compression `futures-io` feature (#13875)

* Minor: remove unused async-compression feature

* Fix cli cargo lock

* Consolidate Example: dataframe_output.rs into dataframe.rs (#13877)

* Restore `DocBuilder::new()` to avoid breaking API change (#13870)

* Fix build

* Restore DocBuilder::new(), deprecate

* cmt

* clippy

* Improve error messages for incorrect zero argument signatures (#13881)

* Improve error messages for incorrect zero argument signatures

* fix errors

* fix fmt

* Consolidate Example: simplify_udwf_expression.rs into advanced_udwf.rs (#13883)

* minor: fix typos in  comments / structure names (#13879)

* minor: fix typo error in datafusion

* fix: fix rebase error

* fix: format HashJoinExec doc

* doc: recover thiserror/preemptively

* fix: other typo error fixed

* fix: directories to dir_entries in catalog example

* Support 1 or 3 arg in generate_series() UDTF (#13856)

* Support 1 or 3 args in generate_series() UDTF

* address comment

* Support (order by / sort) for DataFrameWriteOptions (#13874)

* Support (order by / sort) for DataFrameWriteOptions

* Fix fmt

* Fix import

* Add insert into example

* Update sort_merge_join.rs (#13894)

* Update join_selection.rs (#13893)

* Fix `recursive-protection` feature flag (#13887)

* Fix recursive-protection feature flag

* rename feature flag to be consistent

* Make default

* taplo format

* Fix visibility of swap_hash_join (#13899)

* Minor: Avoid emitting empty batches in partial sort (#13895)

* Update partial_sort.rs

* Update partial_sort.rs

* Update partial_sort.rs

* Prepare for 44.0.0 release: version and changelog (#13882)

* Prepare for 44.0.0 release: version and changelog

* update changelog

* update configs

* update before release

* Support unparsing implicit lateral `UNNEST` plan to SQL text (#13824)

* support unparsing the implicit lateral unnest plan

* cargo clippy and fmt

* refactor for `check_unnest_placeholder_with_outer_ref`

* add const for the prefix string of unnest and outer reference column

* fix case_column_or_null with nullable when conditions (#13886)

* fix case_column_or_null with nullable when conditions

* improve sqllogictests for case_column_or_null

---------

Co-authored-by: zhangli20 <[email protected]>

* Fixed Issue #13896 (#13903)

The URL to the external website was returning a 404. Presumably due to recent changes in the external website's structure, the required data was moved to a different URL. This commit updates the link to point at the new URL.

* Introduce `UserDefinedLogicalNodeUnparser` for User-defined Logical Plan unparsing (#13880)

* make ast builder public

* introduce udlp unparser

* add documents

* add examples

* add negative tests and fmt

* fix the doc

* rename udlp to extension

* apply the first unparsing result only

* improve the doc

* separate the enum for the unparsing result

* fix the doc

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Preserve constant values across union operations (#13805)

* Add value tracking to ConstExpr for improved union optimization

* Update PartialEq impl

* Minor change

* Add docstring for ConstExpr value

* Improve constant propagation across union partitions

* Add assertion for across_partitions

* fix fmt

* Update properties.rs

* Remove redundant constant removal loop

* Remove unnecessary mut

* Set across_partitions=true when both sides are constant

* Extract and use constant values in filter expressions

* Add initial SLT for constant value tracking across UNION ALL

* Assign values to ConstExpr where possible

* Revert "Set across_partitions=true when both sides are constant"

This reverts commit 3051cd470b0ad4a70cd8bd3518813f5ce0b3a449.

* Temporarily take value from literal

* Lint fixes

* Cargo fmt

* Add get_expr_constant_value

* Make `with_value()` accept optional value

* Add todo

* Move test to union.slt

* Fix changed slt after merge

* Simplify constexpr

* Update properties.rs

---------

Co-authored-by: berkaysynnada <[email protected]>

* chore(deps): update sqllogictest requirement from 0.23.0 to 0.24.0 (#13902)

* fix RecordBatch size in topK (#13906)

* ci improvements, update protoc (#13876)

* Fix md5 return_type to only return Utf8 as per current code impl.

* ci improvements

* Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash.

* Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash.

* Revert nextest change until action is approved.

* Exclude requires workspace

* Fixing minor typo to verify ci caching of builds is working as expected.

* Updates from PR review.

* Adding issue link for disabling intel mac build

* improve performance of running examples

* remove cargo check

* Introduce LogicalPlan invariants, begin automatically checking them (#13651)

* minor(13525): perform LP validation before and after each possible mutation

* minor(13525): validate unique field names on query and subquery schemas, after each optimizer pass

* minor(13525): validate union after each optimizer passes

* refactor: make explicit what is an invariant of the logical plan, versus assertions made after a given analyzer or optimizer pass

* chore: add link to invariant docs

* fix: add new invariants module

* refactor: move all LP invariant checking into LP, delineate executable (valid semantic plan) vs basic LP invariants

* test: update test for slight error message change

* fix: push_down_filter optimization pass can push a IN(<subquery>) into a TableScan's filter clause

* refactor: move collect_subquery_cols() to common utils crate

* refactor: clarify the purpose of assert_valid_optimization(), runs after all optimizer passes, except in debug mode it runs after each pass.

* refactor: based upon performance tests, run the maximum number of checks without impact:
* assert_valid_optimization can run each optimizer pass
* remove the recursive check_fields, which caused the performance regression
* the full LP Invariants::Executable can only run in debug

* chore: update error naming and terminology used in code comments

* refactor: use proper error methods

* chore: more cleanup of error messages

* chore: handle option trailer to error message

* test: update sqllogictests tests to not use multiline

* Correct return type for initcap scalar function with utf8view (#13909)

* Set utf8view as return type when input type is the same

* Verify that the returned type from call to scalar function matches the return type specified in the return_type function

* Match return type to utf8view

* Consolidate example: simplify_udaf_expression.rs into advanced_udaf.rs (#13905)

* Implement maintains_input_order for AggregateExec (#13897)

* Implement maintains_input_order for AggregateExec

* Update mod.rs

* Improve comments

---------

Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: mertak-synnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Move join type input swapping to pub methods on Joins (#13910)

* doc-gen: migrate scalar functions (string) documentation 3/4 (#13926)

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)

* Update sqllogictest requirement from 0.24.0 to 0.25.0

Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version.
- [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases)
- [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.25.0)

---
updated-dependencies:
- dependency-name: sqllogictest
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Remove labels

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <[email protected]>

* Consolidate Examples: memtable.rs and parquet_multiple_files.rs (#13913)

* doc-gen: migrate scalar functions (crypto) documentation (#13918)

* doc-gen: migrate scalar functions (crypto) documentation

* doc-gen: fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* doc-gen: migrate scalar functions (datetime) documentation 1/2 (#13920)

* doc-gen: migrate scalar functions (datetime) documentation 1/2

* fix: fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* fix RecordBatch size in hash join (#13916)

* doc-gen: migrate scalar functions (array) documentation 1/3 (#13928)

* doc-gen: migrate scalar functions (array) documentation 1/3

* fix: remove unsed import, fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* doc-gen: migrate scalar functions (math) documentation 1/2 (#13922)

* doc-gen: migrate scalar functions (math) documentation 1/2

* fix: fix typo

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* doc-gen: migrate scalar functions (math) documentation 2/2 (#13923)

* doc-gen: migrate scalar functions (math) documentation 2/2

* fix: fix typo

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* doc-gen: migrate scalar functions (array) documentation 3/3 (#13930)

* doc-gen: migrate scalar functions (array) documentation 3/3

* fix: import doc and macro, fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* doc-gen: migrate scalar functions (array) documentation 2/3 (#13929)

* doc-gen: migrate scalar functions (array) documentation 2/3

* fix: import doc and macro, fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* doc-gen: migrate scalar functions (string) documentation 4/4 (#13927)

* doc-gen: migrate scalar functions (string) documentation 4/4

* fix: fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* Support explain query when running dfbench with clickbench (#13942)

* Support explain query when running dfbench

* Address comments

* Consolidate example to_date.rs into dataframe.rs (#13939)

* Consolidate example to_date.rs into dataframe.rs

* Assert results using assert_batches_eq

* clippy

* Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" (#13945)

* Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)"

This reverts commit 0989649214a6fe69ffb33ed38c42a8d3df94d6bf.

* add comment

* Implement predicate pruning for `like` expressions (prefix matching) (#12978)

* Implement predicate pruning for like expressions

* add function docstring

* re-order bounds calculations

* fmt

* add fuzz tests

* fix clippy

* Update datafusion/core/tests/fuzz_cases/pruning.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>

* doc-gen: migrate scalar functions (string) documentation 1/4 (#13924)

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* consolidate dataframe_subquery.rs into dataframe.rs (#13950)

* migrate btrim to user_doc macro (#13952)

* doc-gen: migrate scalar functions (datetime) documentation 2/2 (#13921)

* doc-gen: migrate scalar functions (datetime) documentation 2/2

* fix: fix typo and update function docs

* doc: update function docs

* doc-gen: remove slash

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>

* Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests (#13936)

* Fix md5 return_type to only return Utf8 as per current code impl.

* Add support for sqlite test files to sqllogictest

* Force version 0.24.0 of sqllogictest dependency until issue with labels is fixed.

* Removed workaround for bug that was fixed.

* Git submodule update ... err update, link to sqlite tests.

* Git submodule update

* Readd submodule

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Supporting writing schema metadata when writing Parquet in parallel (#13866)

* refactor: make ParquetSink tests a bit more readable

* chore(11770): add new ParquetOptions.skip_arrow_metadata

* test(11770): demonstrate that the single threaded ParquetSink is already writing the arrow schema in the kv_meta, and allow disablement

* refactor(11770): replace with new method, since the kv_metadata is inherent to TableParquetOptions and therefore we should explicitly make it apparent in the API that you have to include the arrow schema or not

* fix(11770): fix parallel ParquetSink to encode arrow schema into the file metadata, based on the ParquetOptions

* refactor(11770): provide deprecation warning for TryFrom

* test(11770): update tests with new default to include arrow schema

* refactor: including partitioning of arrow schema inserted into kv_metadata

* test: update tests for new config prop, as well as the new file partition offsets based upon larger metadata

* chore: avoid cloning in tests, and update code docs

* refactor: return to the WriterPropertiesBuilder::TryFrom<TableParquetOptions>, and separately add the arrow_schema to the kv_metadata on the TableParquetOptions

* refactor: require the arrow_schema key to be present in the kv_metadata, if is required by the configuration

* chore: update configs.md

* test: update tests to handle the (default) required arrow schema in the kv_metadata

* chore: add reference to arrow-rs upstream PR

* chore: Create devcontainer.json (#13520)

* Create devcontainer.json

* update devcontainer

* remove useless features

* Minor: consolidate ConfigExtension example into API docs (#13954)

* Update examples README.md

* Minor: consolidate ConfigExtension example into API docs

* more docs

* Remove update

* clippy

* Fix issue with ExtensionsOptions docs

* Parallelize pruning utf8 fuzz test (#13947)

* Add swap_inputs to SMJ (#13984)

* fix(datafusion-functions-nested): `arrow-distinct` now work with null rows (#13966)

* added failing test

* fix(datafusion-functions-nested): `arrow-distinct` now work with null rows

* Update datafusion/functions-nested/src/set_ops.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update set_ops.rs

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Update release instructions for 44.0.0 (#13959)

* Update release instructions for 44.0.0

* update macros and order

* add functions-table

* Add datafusion python 43.1.0 blog post to doc. (#13974)

* Include license and notice files in more crates (#13985)

* Extract postgres container from sqllogictest, update datafusion-testing pin (#13971)

* Add support for sqlite test files to sqllogictest

* Removed workaround for bug that was fixed.

* Refactor sqllogictest to extract postgres functionality into a separate file. Removed dependency on once_cell in favour of LazyLock.

* Add missing license header.

* Update rstest requirement from 0.23.0 to 0.24.0 (#13977)

Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version.
- [Release notes](https://github.com/la10736/rstest/releases)
- [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md)
- [Commits](https://github.com/la10736/rstest/compare/v0.23.0...v0.23.0)

---
updated-dependencies:
- dependency-name: rstest
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Move hash collision test to run only when merging to main. (#13973)

* Update itertools requirement from 0.13 to 0.14 (#13965)

* Update itertools requirement from 0.13 to 0.14

Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.13.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix build

* Simplify

* Update CLI lock

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <[email protected]>

* Change trigger, rename `hash_collision.yml` to `extended.yml` and add comments (#13988)

* Rename hash_collision.yml to extended.yml and add comments

* Adjust schedule, add comments

* Update job, rerun

* doc-gen: migrate scalar functions (string) documentation 2/4 (#13925)

* doc-gen: migrate scalar functions (string) documentation 2/4

* doc-gen: update function docs

* doc: fix related udf order for upper function in documentation

* Update datafusion/functions/src/string/concat_ws.rs

* Update datafusion/functions/src/string/concat_ws.rs

* Update datafusion/functions/src/string/concat_ws.rs

* doc-gen: update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com>
Co-authored-by: Oleks V <[email protected]>

* Update substrait requirement from 0.50 to 0.51 (#13978)

Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version.
- [Release notes](https://github.com/substrait-io/substrait-rs/releases)
- [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.50.0...v0.51.0)

---
updated-dependencies:
- dependency-name: substrait
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update release README for datafusion-cli publishing (#13982)

* Enhance LastValueAccumulator logic and add SQL logic tests for last_value function (#13980)

- Updated LastValueAccumulator to include requirement satisfaction check before updating the last value.
- Added SQL logic tests to verify the behavior of the last_value function with merge batches and ensure correct aggregation in various scenarios.
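
A rough Python sketch of the described requirement check; the exact ordering semantics are an assumption (here modeled as "incoming ordering key must be at least the stored key"), not the real accumulator's logic:

```python
class LastValueSketch:
    """Keeps the value with the greatest ordering key seen so far, only
    replacing the stored value when the incoming row satisfies the
    ordering requirement."""

    def __init__(self):
        self.value = None
        self.key = None

    def update(self, value, key):
        # Requirement satisfaction check before updating the last value.
        if self.key is None or key >= self.key:
            self.value, self.key = value, key

    def merge(self, other):
        # Merging partial batches applies the same check, so a stale
        # "last" value from another batch cannot overwrite a newer one.
        if other.key is not None:
            self.update(other.value, other.key)
```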

* Improve deserialize_to_struct example (#13958)

* Cleanup deserialize_to_struct example

* prettier

* Apply suggestions from code review

Co-authored-by: Jonah Gao <[email protected]>

---------

Co-authored-by: Jonah Gao <[email protected]>

* Update docs (#14002)

* Optimize CASE expression for "expr or expr" usage. (#13953)

* Apply optimization for ExprOrExpr.

* Implement optimization similar to existing code.

* Add sqllogictest.
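
One plausible reading of the pattern in this entry is rewriting `CASE WHEN a THEN true ELSE b END` into `a OR b`. A three-valued-logic sketch in Python (using `None` for SQL NULL) shows why such a rewrite is only safe when the condition cannot be NULL, which is also what the earlier `case_column_or_null` nullability fix guards against:

```python
def sql_or(a, b):
    """SQL three-valued OR: None models NULL."""
    if a is True or b is True:
        return True
    if a is False and b is False:
        return False
    return None

def case_when(a, b):
    """CASE WHEN a THEN true ELSE b END: a NULL condition falls to ELSE."""
    return True if a is True else b

def mismatches():
    """Input pairs where the rewrite changes the result."""
    vals = [True, False, None]
    return [(a, b) for a in vals for b in vals if sql_or(a, b) != case_when(a, b)]
```

The only disagreement is `a = NULL, b = FALSE` (CASE yields FALSE, OR yields NULL), so the rewrite is valid whenever the condition is non-nullable.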

* feat(substrait): introduce consume_rel and consume_expression (#13963)

* feat(substrait): introduce consume_rel and consume_expression

Route calls to from_substrait_rel and from_substrait_rex through the
SubstraitConsumer in order to allow users to provide their own behaviour

* feat(substrait): consume nulls of user-defined types

* docs(substrait): consume_rel and consume_expression docstrings

* Consolidate csv_opener.rs and json_opener.rs into a single example (#… (#13981)

* Consolidate csv_opener.rs and json_opener.rs into a single example (#13955)

* Update datafusion-examples/examples/csv_json_opener.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion-examples/README.md

Co-authored-by: Andrew Lamb <[email protected]>

* Apply code formatting with cargo fmt

---------

Co-authored-by: Sergey Zhukov <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* FIX : Incorrect NULL handling in BETWEEN expression (#14007)

* submodule update

* FIX : Incorrect NULL handling in BETWEEN expression

* Revert "submodule update"

This reverts commit 72431aadeaf33a27775a88c41931572a0b66bae3.

* fix incorrect unit test

* move sqllogictest to expr
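
For context on the fix above: `x BETWEEN low AND high` desugars to `x >= low AND x <= high` under SQL's three-valued logic, where a comparison against NULL yields NULL rather than FALSE. A small Python model of those semantics (using `None` for NULL; this is a sketch of standard SQL behavior, not DataFusion's internal code):

```python
def sql_ge(x, y):
    """SQL >=: any NULL operand makes the comparison NULL."""
    return None if x is None or y is None else x >= y

def sql_and(a, b):
    """SQL three-valued AND."""
    if a is False or b is False:
        return False  # FALSE dominates, even against NULL
    if a is True and b is True:
        return True
    return None

def sql_between(x, low, high):
    # x BETWEEN low AND high == (x >= low) AND (high >= x)
    return sql_and(sql_ge(x, low), sql_ge(high, x))
```

Note that a NULL bound does not always make the result NULL: if the other comparison is already FALSE, the AND short-circuits to FALSE.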

* feat(substrait): modular substrait producer (#13931)

* feat(substrait): modular substrait producer

* refactor(substrait): simplify col_ref_offset handling in producer

* refactor(substrait): remove column offset tracking from producer

* docs(substrait): document SubstraitProducer

* refactor: minor cleanup

* feature: remove unused SubstraitPlanningState

BREAKING CHANGE: SubstraitPlanningState is no longer available

* refactor: cargo fmt

* refactor(substrait): consume_ -> handle_

* refactor(substrait): expand match blocks

* refactor: DefaultSubstraitProducer only needs serializer_registry

* refactor: remove unnecessary warning suppression

* fix(substrait): route expr conversion through handle_expr

* cargo fmt

* fix: Avoid re-wrapping planning errors  Err(DataFusionError::Plan) for use in plan_datafusion_err (#14000)

* fix: unwrapping Err(DataFusionError::Plan) for use in plan_datafusion_err

* test: add tests for error formatting during planning

* feat: support `RightAnti` for `SortMergeJoin` (#13680)

* feat: support `RightAnti` for `SortMergeJoin`

* feat: preserve session id when using cxt.enable_url_table() (#14004)

* Return error message during planning when inserting into a MemTable with zero partitions. (#14011)

* Minor: Rewrite LogicalPlan::max_rows for Join and Union, made it easier to understand (#14012)

* Refactor max_rows for join plan, made it easier to understand

* Simplified max_rows for Union

* Chore: update wasm-supported crates, add tests (#14005)

* Chore: update wasm-supported crates

* format

* Use workspace rust-version for all workspace crates (#14009)

* [Minor] refactor: make ArraySort public for broader access (#14006)

* refactor: make ArraySort public for broader access

Changes the visibility of the ArraySort struct from super to public. This allows broader access to the struct, enabling its use in other modules and promoting better code reuse.

* clippy and docs

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Update sqllogictest requirement from =0.24.0 to =0.26.0 (#14017)

* Update sqllogictest requirement from =0.24.0 to =0.26.0

Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version.
- [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases)
- [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.26.0)

---
updated-dependencies:
- dependency-name: sqllogictest
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* remove version pin and note

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Eduard Karacharov <[email protected]>

* `url` dependency update (#14019)

* `url` dependency update

* `url` version update for datafusion-cli

* Minor: Improve zero partition check when inserting into `MemTable` (#14024)

* Improve zero partition check when inserting into `MemTable`

* update err msg

* refactor: make structs public and implement Default trait (#14030)

* Minor: Remove redundant implementation of `StringArrayType` (#14023)

* Minor: Remove redundant implementation of StringArrayType

Signed-off-by: Tai Le Manh <[email protected]>

* Deprecate rather than remove StringArrayType

---------

Signed-off-by: Tai Le Manh <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* Added references to IDE documentation for dev containers along with a small note about why one may choose to do development using a dev container. (#14014)

* Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (#13995)

* Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema

* Implement aggregate functions with spill handling in tests

* Add tests for aggregate functions with and without spill handling

* Move test related imports into mod test

* Rename spill pool test functions for clarity and consistency

* Refactor aggregate function imports to use fully qualified paths

* Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream

* Update aggregate test to use AVG instead of MAX

* assert spill count

* Refactor partial aggregate schema creation to use create_schema function

* Refactor partial aggregation schema creation and remove redundant function

* Remove unused import of Schema from arrow::datatypes in row_hash.rs

* move spill pool testing for aggregate functions to physical-plan/src/aggregates

* Use Arc::clone for schema references in aggregate functions

* Encapsulate fields of `EquivalenceProperties` (#14040)

* Encapsulate fields of `EquivalenceGroup` (#14039)

* Fix error on `array_distinct` when input is empty #13810 (#14034)

* fix

* add test

* oops

---------

Co-authored-by: Cyprien Huet <[email protected]>

* Update petgraph requirement from 0.6.2 to 0.7.1 (#14045)

* Update petgraph requirement from 0.6.2 to 0.7.1

Updates the requirements on [petgraph](https://github.com/petgraph/petgraph) to permit the latest version.
- [Changelog](https://github.com/petgraph/petgraph/blob/master/RELEASES.rst)
- [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.7.1)

---
updated-dependencies:
- dependency-name: petgraph
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update datafusion-cli/Cargo.lock

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Lamb <[email protected]>

* Encapsulate fields of `OrderingEquivalenceClass` (make field non pub) (#14037)

* Complete encapsulating `OrderingEquivalenceClass` (make fields non pub)

* fix doc

* Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options (#14021)

* chore: add test to verify that schema is inferred as expected

* chore: add comment to method as suggested

* chore: restructure to avoid need to clone

* chore: fix flaw in rewrite

* feat(optimizer): Enable filter pushdown on window functions (#14026)

* feat(optimizer): Enable filter pushdown on window functions

Ensures selections can be pushed past window functions similarly
to what is already done with aggregations, when possible.

* fix: Add missing dependency

* minor(optimizer): Use 'datafusion-functions-window' as a dev dependency

* docs(optimizer): Add example to filter pushdown on LogicalPlan::Window

* Unparsing optimized (> 2 inputs) unions (#14031)

* tests and optimizer in testing queries

* unparse optimized unions

* format Cargo.toml

* format Cargo.toml

* revert test

* rewrite test to avoid cyclic dep

* remove old test

* cleanup

* comments and error handling

* handle union with fewer than 2 inputs

* Minor: Document output schema of LogicalPlan::Aggregate and LogicalPlan::Window (#14047)

* Simplify error handling in case.rs (#13990) (#14033)

* Simplify error handling in case.rs (#13990)

* Fix issues causing GitHub checks to fail

* Update datafusion/physical-expr/src/expressions/case.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Sergey Zhukov <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs (#13800)

* Add asynchronous catalog traits to help users that have asynchronous catalogs

* Apply clippy suggestions

* Address PR reviews

* Remove allow_unused exceptions

* Update remote catalog example to demonstrate new helper structs

* Move schema_name / catalog_name parameters into resolve f…
jayzhan211 added a commit that referenced this pull request Jan 23, 2025
* Handle alias when parsing sql(parse_sql_expr) (#12939)

* fix: Fix parse_sql_expr not handling alias

* cargo fmt

* fix parse_sql_expr example(remove alias)

* add testing

* add SUM udaf to TestContextProvider and modify test_sql_to_expr_with_alias for function

* revert change on example `parse_sql_expr`

* Improve documentation for TableProvider (#13724)

* Reveal implementing type and return type in simple UDF implementations (#13730)

Debug trait is useful for understanding what something is and how it's
configured, especially if the implementation is behind a dyn trait.

* minor: Extract tests for `EXTRACT` AND `date_part` to their own file (#13731)

* Support unparsing `UNNEST` plan to `UNNEST` table factor SQL (#13660)

* add `unnest_as_table_factor` and `UnnestRelationBuilder`

* unparse unnest as table factor

* fix typo

* add tests for the default configs

* add a static const for unnest_placeholder

* fix tests

* fix tests

* Update to apache-avro 0.17, fix compatibility changes schema handling  (#13727)

* Update apache-avro requirement from 0.16 to 0.17

---
updated-dependencies:
- dependency-name: apache-avro
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix compatibility changes schema handling apache-avro 0.17

- Handle ArraySchema struct
- Handle MapSchema struct
- Map BigDecimal => LargeBinary
- Map TimestampNanos => Timestamp(TimeUnit::Nanosecond, None)
- Map LocalTimestampNanos => todo!()
- Add Default to FixedSchema test

* Update Cargo.lock file for apache-avro 0.17

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marc Droogh <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* Minor: Add doc example to RecordBatchStreamAdapter (#13725)

* Minor: Add doc example to RecordBatchStreamAdapter

* Update datafusion/physical-plan/src/stream.rs

Co-authored-by: Berkay Şahin <[email protected]>

---------

Co-authored-by: Berkay Şahin <[email protected]>

* Implement GroupsAccumulator for corr(x,y) aggregate function (#13581)

* Implement GroupsAccumulator for corr(x,y)

* feedbacks

* fix CI MSRV

* review

* avoid collect in accumulation

* add back cast

* fix union serialisation order in proto (#13709)

* fix union serialisation order in proto

* clippy

* address comments

* Minor: make unsupported `nanosecond` part a real (not internal) error (#13733)

* Minor: make unsupported `nanosecond` part a real (not internal) error

* fmt

* Improve wording to refer to date part

* Add tests for date_part on columns + timestamps with / without timezones (#13732)

* Add tests for date_part on columns + timestamps with / without timezones

* Add tests from https://github.com/apache/datafusion/pull/13372

* remove trailing whitespace

* Optimize performance of `initcap` function (~2x faster) (#13691)

* Optimize performance of initcap (~2x faster)

Signed-off-by: Tai Le Manh <[email protected]>

* format

---------

Signed-off-by: Tai Le Manh <[email protected]>

* Minor: Add documentation explaining that initcap only works for ASCII (#13749)
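
As a rough illustration of the ASCII-only behavior documented above (a sketch, not DataFusion's actual implementation, which operates on Arrow string arrays):

```rust
// Sketch: ASCII-only initcap — uppercase the first alphanumeric character of
// each word, lowercase the rest; anything non-alphanumeric starts a new word
// and passes through unchanged. Non-ASCII letters are not special-cased.
fn initcap_ascii(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut start_of_word = true;
    for c in s.chars() {
        if c.is_ascii_alphanumeric() {
            if start_of_word {
                out.push(c.to_ascii_uppercase());
            } else {
                out.push(c.to_ascii_lowercase());
            }
            start_of_word = false;
        } else {
            out.push(c);
            start_of_word = true;
        }
    }
    out
}
```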

* Support sqllogictest --complete with postgres (#13746)

Before the change, the request to use PostgreSQL was simply ignored when
the `--complete` flag was present.

* doc-gen: migrate window functions documentation to attribute based (#13739)

* doc-gen: migrate window functions documentation

Signed-off-by: zjregee <[email protected]>

* fix: update Cargo.lock

---------

Signed-off-by: zjregee <[email protected]>

* Minor: Remove memory reservation in `JoinLeftData` used in HashJoin (#13751)

* Refactor JoinLeftData structure by removing unused memory reservation field in hash join implementation

* Add Debug and Clone derives for HashJoinStreamState and ProcessProbeBatchState enums

This commit enhances the HashJoinStreamState and ProcessProbeBatchState structures by implementing the Debug and Clone traits, allowing for easier debugging and cloning of these state representations in the hash join implementation.

* Update to bigdecimal 0.4.7 (#13747)

* Add big decimal formatting test cases with potential trailing zeros

* Rename and simplify decimal rendering functions

- add `decimal` to function name
- drop `precision` parameter as it is not supposed to affect the result

* Update to bigdecimal 0.4.7

Utilize new `to_plain_string` function

* chore: clean up dependencies (#13728)

* CI: Warn on unused crates

* CI: Warn on unused crates

* CI: Warn on unused crates

* CI: Warn on unused crates

* CI: Clean up dependencies

* CI: Clean up dependencies

* fix: Implicitly plan `UNNEST` as lateral (#13695)

* plan implicit lateral if table factor is UNNEST

* check for outer references in `create_relation_subquery`

* add sqllogictest

* fix lateral constant test to not expect a subquery node

* replace sqllogictest in favor of logical plan test

* update lateral join sqllogictests

* add sqllogictests

* fix logical plan test

* Minor: improve the Deprecation / API health guidelines (#13701)

* Minor: improve the Deprecation / API health policy

* prettier

* Update docs/source/library-user-guide/api-health.md

Co-authored-by: Jonah Gao <[email protected]>

* Add version guidance and make more copy/paste friendly

* prettier

* better

* rename to guidelines

---------

Co-authored-by: Jonah Gao <[email protected]>

* fix: specify roottype in substrait fieldreference (#13647)

* fix: specify roottype in fieldreference

Signed-off-by: MBWhite <[email protected]>

* Fix formatting

Signed-off-by: MBWhite <[email protected]>

* review suggestion

Signed-off-by: MBWhite <[email protected]>

---------

Signed-off-by: MBWhite <[email protected]>

* Simplify type signatures using `TypeSignatureClass` for mixed type function signature (#13372)

* add type sig class

Signed-off-by: jayzhan211 <[email protected]>

* timestamp

Signed-off-by: jayzhan211 <[email protected]>

* date part

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* taplo format

Signed-off-by: jayzhan211 <[email protected]>

* tpch test

Signed-off-by: jayzhan211 <[email protected]>

* msrc issue

Signed-off-by: jayzhan211 <[email protected]>

* msrc issue

Signed-off-by: jayzhan211 <[email protected]>

* explicit hash

Signed-off-by: jayzhan211 <[email protected]>

* Enhance type coercion and function signatures

- Added logic to prevent unnecessary casting of string types in `native.rs`.
- Introduced `Comparable` variant in `TypeSignature` to define coercion rules for comparisons.
- Updated imports in `functions.rs` and `signature.rs` for better organization.
- Modified `date_part.rs` to improve handling of timestamp extraction and fixed query tests in `expr.slt`.
- Added `datafusion-macros` dependency in `Cargo.toml` and `Cargo.lock`.

These changes improve type handling and ensure more accurate function behavior in SQL expressions.

* fix comment

Signed-off-by: Jay Zhan <[email protected]>

* fix signature

Signed-off-by: Jay Zhan <[email protected]>

* fix test

Signed-off-by: Jay Zhan <[email protected]>

* Enhance type coercion for timestamps to allow implicit casting from strings. Update SQL logic tests to reflect changes in timestamp handling, including expected outputs for queries involving nanoseconds and seconds.

* Refactor type coercion logic for timestamps to improve readability and maintainability. Update the `TypeSignatureClass` documentation to clarify its purpose in function signatures, particularly regarding coercible types. This change enhances the handling of implicit casting from strings to timestamps.

* Fix SQL logic tests to correct query error handling for timestamp functions. Updated expected outputs for `date_part` and `extract` functions to reflect proper behavior with nanoseconds and seconds. This change improves the accuracy of test cases in the `expr.slt` file.

* Enhance timestamp handling in TypeSignature to support timezone specification. Updated the logic to include an additional DataType for timestamps with a timezone wildcard, improving flexibility in timestamp operations.

* Refactor date_part function: remove redundant imports and add missing not_impl_err import for better error handling

---------

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>

* Minor: Add some more blog posts to the readings page (#13761)

* Minor: Add some more blog posts to the readings page

* prettier

* prettier

* Update docs/source/user-guide/concepts-readings-events.md

---------

Co-authored-by: Oleks V <[email protected]>

* docs: update GroupsAccumulator instead of GroupAccumulator (#13787)

Fixing `GroupsAccumulator` trait name in its docs

* Improve Deprecation Guidelines more (#13776)

* Improve deprecation guidelines more

* prettier

* fix: add `null_buffer` length check to `StringArrayBuilder`/`LargeStringArrayBuilder` (#13758)

* fix: add `null_buffer` check for `LargeStringArray`

Add a safety check to ensure that the alignment of buffers cannot be
overflowed. This introduces a panic if they are not aligned through a
runtime assertion.

* fix: remove value_buffer assertion

These buffers can be misaligned and it is not problematic, it is the
`null_buffer` which we care about being of the same length.

* feat: add `null_buffer` check to `StringArray`

This is in a similar vein to `LargeStringArray`, as the code is the
same, except for `i32`'s instead of `i64`.

* feat: use `row_count` var to avoid drift

* Revert the removal of reservation in HashJoin (#13792)

* fix: restore memory reservation in JoinLeftData for accurate memory accounting in HashJoin

This commit reintroduces the `_reservation` field in the `JoinLeftData` structure to ensure proper tracking of memory resources during join operations. The absence of this field could lead to inconsistent memory usage reporting and potential out-of-memory issues as upstream operators increase their memory consumption.

* fmt

Signed-off-by: Jay Zhan <[email protected]>

---------

Signed-off-by: Jay Zhan <[email protected]>

* added count aggregate slt (#13790)

* Update documentation guidelines for contribution content (#13703)

* Update documentation guidelines for contribution content

* Apply suggestions from code review

Co-authored-by: Piotr Findeisen <[email protected]>
Co-authored-by: Oleks V <[email protected]>

* clarify discussions and remove requirements note

* prettier

* Update docs/source/contributor-guide/index.md

Co-authored-by: Piotr Findeisen <[email protected]>

---------

Co-authored-by: Piotr Findeisen <[email protected]>
Co-authored-by: Oleks V <[email protected]>

* Add Round trip tests for Array <--> ScalarValue (#13777)

* Add Round trip tests for Array <--> ScalarValue

* String dictionary test

* remove unecessary value

* Improve comments

* fix: Limit together with pushdown_filters (#13788)

* fix: Limit together with pushdown_filters

* Fix format

* Address new comments

* Fix testing case to hit the problem

* Minor: improve Analyzer docs (#13798)

* Minor: cargo update in datafusion-cli (#13801)

* Update datafusion-cli toml to pin home=0.5.9

* update Cargo.lock

* Fix `ScalarValue::to_array_of_size` for DenseUnion (#13797)

* fix: enable pruning by bloom filters for dictionary columns (#13768)

* Handle empty rows for `array_distinct` (#13810)

* handle empty array distinct

* ignore

* fix

---------

Co-authored-by: Cyprien Huet <[email protected]>

* Fix get_type for higher-order array functions (#13756)

* Fix get_type for higher-order array functions

* Fix recursive flatten

The fix is covered by recursive flatten test case in array.slt

* Restore "keep LargeList" in Array signature

* clarify naming in the test

* Chore: Do not return empty record batches from streams (#13794)

* do not emit empty record batches in plans

* change function signatures to Option<RecordBatch> if empty batches are possible

* format code

* shorten code

* change list_unnest_at_level for returning Option value

* add documentation
take concat_batches into compute_aggregates function again

* create unit test for row_hash.rs

* add test for unnest

* add test for unnest

* add test for partial sort

* add test for bounded window agg

* add test for window agg

* apply simplifications and fix typo

* apply simplifications and fix typo

* Handle possible overflows in StringArrayBuilder / LargeStringArrayBuilder (#13802)

* test(13796): reproducer of overflow on capacity

* fix(13796): handle overflows with proper max capacity number which is valid for MutableBuffer

* refactor: use simple solution and provide panic
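
The guard described above can be sketched as a capacity computation that refuses to overflow (a hypothetical sketch — the merged fix panics on the invalid case rather than returning an error, and the real builder works on Arrow's `MutableBuffer`):

```rust
// Hypothetical sketch of overflow-safe capacity growth for a byte buffer.
// Rust allocations are capped at isize::MAX bytes, so any capacity past that
// (or an addition that wraps usize) is rejected instead of wrapping around.
fn grow_capacity(current: usize, additional: usize) -> Option<usize> {
    let required = current.checked_add(additional)?; // None on usize overflow
    if required > isize::MAX as usize {
        return None; // no Rust allocator can satisfy this request
    }
    // grow by at least doubling, but never past the allocator's hard limit
    Some(required.max(current.saturating_mul(2)).min(isize::MAX as usize))
}
```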

* fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema (#13750)

* fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema

* clippy

* fix csv and json tests

* add testing for parquet

* cleanup

* fix parquet tests

* document describe_partition, add back repartition options to one of the csv empty files tests

* Support Null regex override in csv parser options. (#13228)

Co-authored-by: Andrew Lamb <[email protected]>

* Minor: Extend ScalarValue::new_zero() (#13828)

* Update mod.rs

* Update mod.rs

* Update mod.rs

* Update mod.rs

* chore: temporarily disable windows flow (#13833)

* feat: `parse_float_as_decimal` supports scientific notation and Decimal256 (#13806)

* feat: `parse_float_as_decimal` supports scientific notation and Decimal256

* Fix test

* Add test

* Add test

* Refine negative scales

* Update comment

* Refine bigint_to_i256

* UT for bigint_to_i256

* Add ut for parse_decimal
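
The scientific-notation handling above can be illustrated with a small sketch (hypothetical names, not the actual `parse_decimal` code): a literal is decomposed into an unscaled integer plus a scale, where a negative scale means the digits are followed by zeros (e.g. 15 at scale -2 represents 1500).

```rust
// Hypothetical sketch: decompose a decimal literal, possibly in scientific
// notation, into (unscaled digits, scale). The value is digits * 10^(-scale).
fn to_unscaled(s: &str) -> Option<(i128, i32)> {
    let (mantissa, exp) = match s.split_once(&['e', 'E'][..]) {
        Some((m, e)) => (m, e.parse::<i32>().ok()?),
        None => (s, 0),
    };
    let (int_part, frac_part) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    let digits: i128 = format!("{int_part}{frac_part}").parse().ok()?;
    // each fractional digit raises the scale; the exponent lowers it
    Some((digits, frac_part.len() as i32 - exp))
}
```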

* Replace `BooleanArray::extend` with `append_n` (#13832)

* Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments (#13817)

* Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments

* Apply suggestions from code review

Co-authored-by: Piotr Findeisen <[email protected]>

* improve docs

---------

Co-authored-by: Piotr Findeisen <[email protected]>

* [bugfix] ScalarFunctionExpr does not preserve the nullable flag on roundtrip (#13830)

* [test] coalesce round trip schema mismatch

* [proto] added the nullable flag in PhysicalScalarUdfNode

* [bugfix] propagate the nullable flag for serialized scalar UDFS

* Add example of interacting with a remote catalog (#13722)

* Add example of interacting with a remote catalog

* Update datafusion/core/src/execution/session_state.rs

Co-authored-by: Berkay Şahin <[email protected]>

* Apply suggestions from code review

Co-authored-by: Jonah Gao <[email protected]>
Co-authored-by: Weston Pace <[email protected]>

* Use HashMap to hold tables

---------

Co-authored-by: Berkay Şahin <[email protected]>
Co-authored-by: Jonah Gao <[email protected]>
Co-authored-by: Weston Pace <[email protected]>

* Update substrait requirement from 0.49 to 0.50 (#13808)

* Update substrait requirement from 0.49 to 0.50

Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version.
- [Release notes](https://github.com/substrait-io/substrait-rs/releases)
- [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.49.0...v0.50.0)

---
updated-dependencies:
- dependency-name: substrait
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix compilation

* Add expr test

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <[email protected]>

* typo: remove extraneous "`" in doc comment, fix header (#13848)

* typo: extraneous "`" in doc comment

* Update datafusion/execution/src/runtime_env.rs

* Update datafusion/execution/src/runtime_env.rs

---------

Co-authored-by: Oleks V <[email protected]>

* typo: remove extra "`" interfering with doc formatting (#13847)

* Support n-ary monotonic functions in ordering equivalence (#13841)

* Support n-ary monotonic functions in `discover_new_orderings`

* Add tests for n-ary monotonic functions in `discover_new_orderings`

* Fix tests

* Fix non-monotonic test case

* Fix unintended simplification

* Minor comment changes

* Fix tests

* Add `preserves_lex_ordering` field

* Use `preserves_lex_ordering` on `discover_new_orderings()`

* Add `output_ordering` and `output_preserves_lex_ordering` implementations for `ConcatFunc`

* Update tests

* Move logic to UDF

* Cargo fmt

* Refactor

* Cargo fmt

* Simply use false value on default implementation

* Remove unnecessary import

* Clippy fix

* Update Cargo.lock

* Move dep to dev-dependencies

* Rename output_preserves_lex_ordering to preserves_lex_ordering

* minor

---------

Co-authored-by: berkaysynnada <[email protected]>

* Replace `execution_mode` with `emission_type` and `boundedness` (#13823)

* feat: update execution modes and add bitflags dependency

- Introduced `Incremental` execution mode alongside existing modes in the DataFusion execution plan.
- Updated various execution plans to utilize the new `Incremental` mode where applicable, enhancing streaming capabilities.
- Added `bitflags` dependency to `Cargo.toml` for better management of execution modes.
- Adjusted execution mode handling in multiple files to ensure compatibility with the new structure.

* add exec API

Signed-off-by: Jay Zhan <[email protected]>

* replace done but has stackoverflow

Signed-off-by: Jay Zhan <[email protected]>

* exec API done

Signed-off-by: Jay Zhan <[email protected]>

* Refactor execution plan properties to remove execution mode

- Removed the `ExecutionMode` parameter from `PlanProperties` across multiple physical plan implementations.
- Updated related functions to utilize the new structure, ensuring compatibility with the changes.
- Adjusted comments and cleaned up imports to reflect the removal of execution mode handling.

This refactor simplifies the execution plan properties and enhances maintainability.

* Refactor execution plan to remove `ExecutionMode` and introduce `EmissionType`

- Removed the `ExecutionMode` parameter from `PlanProperties` and related implementations across multiple files.
- Introduced `EmissionType` to better represent the output characteristics of execution plans.
- Updated functions and tests to reflect the new structure, ensuring compatibility and enhancing maintainability.
- Cleaned up imports and adjusted comments accordingly.

This refactor simplifies the execution plan properties and improves the clarity of memory handling in execution plans.

* fix test

Signed-off-by: Jay Zhan <[email protected]>

* Refactor join handling and emission type logic

- Updated test cases in `sanity_checker.rs` to reflect changes in expected outcomes for bounded and unbounded joins, ensuring accurate test coverage.
- Simplified the `is_pipeline_breaking` method in `execution_plan.rs` to clarify the conditions under which a plan is considered pipeline-breaking.
- Enhanced the emission type determination logic in `execution_plan.rs` to prioritize `Final` over `Both` and `Incremental`, improving clarity in execution plan behavior.
- Adjusted join type handling in `hash_join.rs` to classify `Right` joins as `Incremental`, allowing for immediate row emission.

These changes improve the accuracy of tests and the clarity of execution plan properties.

* Implement emission type for execution plans

- Updated multiple execution plan implementations to replace `unimplemented!()` with `EmissionType::Incremental`, ensuring that the emission type is correctly defined for various plans.
- This change enhances the clarity and functionality of the execution plans by explicitly specifying their emission behavior.

These updates contribute to a more robust execution plan framework within the DataFusion project.

* Enhance join type documentation and refine emission type logic

- Updated the `JoinType` enum in `join_type.rs` to include detailed descriptions for each join type, improving clarity on their behavior and expected results.
- Modified the emission type logic in `hash_join.rs` to ensure that `Right` and `RightAnti` joins are classified as `Incremental`, allowing for immediate row emission when applicable.

These changes improve the documentation and functionality of join operations within the DataFusion project.

* Refactor emission type logic in join and sort execution plans

- Updated the emission type determination in `SortMergeJoinExec` and `SymmetricHashJoinExec` to utilize the `emission_type_from_children` function, enhancing the accuracy of emission behavior based on input characteristics.
- Clarified comments in `sort.rs` regarding the conditions under which results are emitted, emphasizing the relationship between input sorting and emission type.
- These changes improve the clarity and functionality of the execution plans within the DataFusion project, ensuring more robust handling of emission types.

* Refactor emission type handling in execution plans

- Updated the `emission_type_from_children` function to accept an iterator instead of a slice, enhancing flexibility in how child execution plans are passed.
- Modified the `SymmetricHashJoinExec` implementation to utilize the new function signature, improving code clarity and maintainability.

These changes streamline the emission type determination process within the DataFusion project, contributing to a more robust execution plan framework.

* Enhance execution plan properties with boundedness and emission type

- Introduced `boundedness` and `pipeline_behavior` methods to the `ExecutionPlanProperties` trait, improving the handling of execution plan characteristics.
- Updated the `CsvExec`, `SortExec`, and related implementations to utilize the new methods for determining boundedness and emission behavior.
- Refactored the `ensure_distribution` function to use the new boundedness logic, enhancing clarity in distribution decisions.
- These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project.

* Refactor execution plans to enhance boundedness and emission type handling

- Updated multiple execution plan implementations to incorporate `Boundedness` and `EmissionType`, improving the clarity and functionality of execution plans.
- Replaced instances of `unimplemented!()` with appropriate emission types, ensuring that plans correctly define their output behavior.
- Refactored the `PlanProperties` structure to utilize the new boundedness logic, enhancing decision-making in execution plans.
- These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project.

* Refactor memory handling in execution plans

- Updated the condition for checking memory requirements in execution plans from `has_finite_memory()` to `boundedness().requires_finite_memory()`, improving clarity in memory management.
- This change enhances the robustness of execution plans within the DataFusion project by ensuring more accurate assessments of memory constraints.

* Refactor boundedness checks in execution plans

- Updated conditions for checking boundedness in various execution plans to use `is_unbounded()` instead of `requires_finite_memory()`, enhancing clarity in memory management.
- Adjusted the `PlanProperties` structure to reflect these changes, ensuring more accurate assessments of memory constraints across the DataFusion project.
- These modifications contribute to a more robust and maintainable execution plan framework, improving the handling of boundedness in execution strategies.

* Remove TODO comment regarding unbounded execution plans in `UnboundedExec` implementation

- Eliminated the outdated comment suggesting a switch to unbounded execution with finite memory, streamlining the code and improving clarity.
- This change contributes to a cleaner and more maintainable codebase within the DataFusion project.

* Refactor execution plan boundedness and emission type handling

- Updated the `is_pipeline_breaking` method to use `requires_finite_memory()` for improved clarity in determining pipeline behavior.
- Enhanced the `Boundedness` enum to include detailed documentation on memory requirements for unbounded streams.
- Refactored `compute_properties` methods in `GlobalLimitExec` and `LocalLimitExec` to directly use the input's boundedness, simplifying the logic.
- Adjusted emission type determination in `NestedLoopJoinExec` to utilize the `emission_type_from_children` function, ensuring accurate output behavior based on input characteristics.

These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project, improving clarity and functionality in handling boundedness and emission types.

* Refactor emission type and boundedness handling in execution plans

- Removed the `OptionalEmissionType` struct from `plan_properties.rs`, simplifying the codebase.
- Updated the `is_pipeline_breaking` function in `execution_plan.rs` for improved readability by formatting the condition across multiple lines.
- Adjusted the `GlobalLimitExec` implementation in `limit.rs` to directly use the input's boundedness, enhancing clarity in memory management.

These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, improving the handling of emission types and boundedness.

* Refactor GlobalLimitExec and LocalLimitExec to enhance boundedness handling

- Updated the `compute_properties` methods in both `GlobalLimitExec` and `LocalLimitExec` to replace `EmissionType::Final` with `Boundedness::Bounded`, reflecting that limit operations always produce a finite number of rows.
- Changed the input's boundedness reference to `pipeline_behavior()` for improved clarity in execution plan properties.

These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, enhancing the handling of boundedness in limit operations.

* Review Part1

* Update sanity_checker.rs

* addressing reviews

* Review Part 1

* Update datafusion/physical-plan/src/execution_plan.rs

* Update datafusion/physical-plan/src/execution_plan.rs

* Shorten imports

* Enhance documentation for JoinType and Boundedness enums

- Improved descriptions for the Inner and Full join types in join_type.rs to clarify their behavior and examples.
- Added explanations regarding the boundedness of output streams and memory requirements in execution_plan.rs, including specific examples for operators like Median and Min/Max.

---------

Signed-off-by: Jay Zhan <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
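
The commit messages above describe the new surface area; a minimal sketch of the emission property and the child-combining rule they mention (variant and field details beyond those named in the messages are assumptions, not the exact DataFusion definitions):

```rust
// Sketch of one of the two properties replacing the old ExecutionMode.
#[derive(Clone, Copy, PartialEq, Debug)]
enum EmissionType {
    Incremental, // can emit rows as input arrives
    Final,       // must consume all input before emitting (e.g. a full sort)
    Both,        // emits some rows early and some only at the end
}

// Combine children mirroring "prioritize Final over Both and Incremental":
// a single Final child makes the combination Final, then Both, else Incremental.
fn emission_type_from_children(
    children: impl IntoIterator<Item = EmissionType>,
) -> EmissionType {
    let mut result = EmissionType::Incremental;
    for c in children {
        result = match (result, c) {
            (EmissionType::Final, _) | (_, EmissionType::Final) => EmissionType::Final,
            (EmissionType::Both, _) | (_, EmissionType::Both) => EmissionType::Both,
            _ => EmissionType::Incremental,
        };
    }
    result
}
```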

* Preserve ordering equivalencies on `with_reorder` (#13770)

* Preserve ordering equivalencies on `with_reorder`

* Add assertions

* Return early if filtered_exprs is empty

* Add clarify comment

* Refactor

* Add comprehensive test case

* Add comment for exprs_equal

* Cargo fmt

* Clippy fix

* Update properties.rs

* Update exprs_equal and add tests

* Update properties.rs

---------

Co-authored-by: berkaysynnada <[email protected]>

* replace CASE expressions in predicate pruning with boolean algebra (#13795)

* replace CASE expressions in predicate pruning with boolean algebra

* fix merge

* update tests

* add some more tests

* add some more tests

* remove duplicate test case

* Update datafusion/physical-optimizer/src/pruning.rs

* swap NOT for !=

* replace comments, update docstrings

* fix example

* update tests

* update tests

* Apply suggestions from code review

Co-authored-by: Andrew Lamb <[email protected]>

* Update pruning.rs

Co-authored-by: Chunchun Ye <[email protected]>

* Update pruning.rs

Co-authored-by: Chunchun Ye <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Chunchun Ye <[email protected]>
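
The soundness argument for replacing CASE guards with plain conjunctions rests on SQL's Kleene three-valued logic, where a false operand dominates AND even against NULL. A minimal model of that AND (NULL as `None`) — a sketch, not DataFusion's actual expression code:

```rust
// SQL (Kleene) three-valued AND over Option<bool>, modeling NULL as None.
// A false operand forces the result to false even when the other side is
// NULL, which is what lets a pruning guard collapse into a conjunction.
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None, // NULL AND TRUE, TRUE AND NULL, NULL AND NULL
    }
}
```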

* enable DF's nested_expressions feature by in datafusion-substrait tests to make them pass (#13857)

fixes #13854

Co-authored-by: Arttu Voutilainen <[email protected]>

* Add configurable normalization for configuration options and preserve case for S3 paths (#13576)

* Do not normalize values

* Fix tests & update docs

* Prettier

* Lowercase config params

* Unify transform and parse

* Fix tests

* Rename `default_transform` and relax boundaries

* Make `compression` case-insensitive

* Comment to new line

* Deprecate and ignore `enable_options_value_normalization`

* Update datafusion/common/src/config.rs

* fix typo

---------

Co-authored-by: Oleks V <[email protected]>

* Improve`Signature` and `comparison_coercion` documentation (#13840)

* Improve Signature documentation more

* Apply suggestions from code review

Co-authored-by: Piotr Findeisen <[email protected]>

---------

Co-authored-by: Piotr Findeisen <[email protected]>

* feat: support normalized expr in CSE (#13315)

* feat: support normalized expr in CSE

* feat: support normalize_eq in cse optimization

* feat: support cumulative binary expr result in normalize_eq

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Upgrade to sqlparser `0.53.0` (#13767)

* chore: Update to sqlparser 0.53.0

* Update for new sqlparser API

* more api updates

* Avoid serializing query to SQL string unless it is necessary

* Box wildcard options

* chore: update datafusion-cli Cargo.lock

* Minor: Use `resize` instead of `extend` for adding static values in SortMergeJoin logic (#13861)

Thanks @Dandandan

* feat(function): add `least` function (#13786)

* start adding least fn

* feat(function): add least function

* update function name

* fix scalar smaller function

* add tests

* run Clippy and Fmt

* Generated docs using `./dev/update_function_docs.sh`

* add comment why `descending: false`

* update comment

* Update least.rs

Co-authored-by: Bruce Ritchie <[email protected]>

* Update scalar_functions.md

* run ./dev/update_function_docs.sh to update docs

* merge greatest and least implementation to one

* add header

---------

Co-authored-by: Bruce Ritchie <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* Improve SortPreservingMerge::enable_round_robin_repartition  docs (#13826)

* Clarify SortPreservingMerge::enable_round_robin_repartition  docs

* tweaks

* Improve comments more

* clippy

* fix doc link

* Minor: Unify `downcast_arg` method (#13865)

* Implement `SHOW FUNCTIONS` (#13799)

* introduce rid for different signature

* implement show functions syntax

* add syntax example

* avoid duplicate join

* fix clippy

* show function_type instead of routine_type

* add some doc and comments

* Update bzip2 requirement from 0.4.3 to 0.5.0 (#13740)

* Update bzip2 requirement from 0.4.3 to 0.5.0

Updates the requirements on [bzip2](https://github.com/trifectatechfoundation/bzip2-rs) to permit the latest version.
- [Release notes](https://github.com/trifectatechfoundation/bzip2-rs/releases)
- [Commits](https://github.com/trifectatechfoundation/bzip2-rs/compare/0.4.4...v0.5.0)

---
updated-dependencies:
- dependency-name: bzip2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix test

* Fix CLI cargo.lock

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <[email protected]>

* Fix build (#13869)

* feat(substrait): modular substrait consumer (#13803)

* feat(substrait): modular substrait consumer

* feat(substrait): include Extension Rel handlers in default consumer

Include SerializerRegistry based handlers for Extension Relations in the
DefaultSubstraitConsumer

* refactor(substrait) _selection -> _field_reference

* refactor(substrait): remove SubstraitPlannerState usage from consumer

* refactor: get_state() -> get_function_registry()

* docs: elide imports from example

* test: simplify test

* refactor: remove Arc from DefaultSubstraitConsumer

* doc: add ticket for API improvements

* doc: link DefaultSubstraitConsumer to from_subtrait_plan

* refactor: remove redundant Extensions parsing

* Minor: fix: Include FetchRel when producing LogicalPlan from Sort (#13862)

* include FetchRel when producing LogicalPlan from Sort

* add suggested test

* address review feedback

* Minor: improve error message when ARRAY literals can not be planned (#13859)

* Minor: improve error message when ARRAY literals can not be planned

* fmt

* Update datafusion/sql/src/expr/value.rs

Co-authored-by: Oleks V <[email protected]>

---------

Co-authored-by: Oleks V <[email protected]>

* Add documentation for `SHOW FUNCTIONS` (#13868)

* Support unicode character for `initcap` function (#13752)

* Support unicode character for 'initcap' function

Signed-off-by: Tai Le Manh <[email protected]>

* Update unit tests

* Fix clippy warning

* Update sqllogictests - initcap

* Update scalar_functions.md docs

* Add suggestions change

Signed-off-by: Tai Le Manh <[email protected]>

---------

Signed-off-by: Tai Le Manh <[email protected]>
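
The idea behind unicode-aware `initcap` can be sketched as follows (an illustrative standalone function only — the actual DataFusion implementation operates on Arrow string arrays; the helper name here is made up):

```rust
/// Capitalize the first alphanumeric character of each word and lowercase
/// the rest, treating any non-alphanumeric character as a word boundary.
/// `char::to_uppercase`/`to_lowercase` return iterators because some
/// characters map to multiple characters, so we use `String::extend`.
fn initcap(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut capitalize_next = true;
    for c in s.chars() {
        if c.is_alphanumeric() {
            if capitalize_next {
                out.extend(c.to_uppercase());
            } else {
                out.extend(c.to_lowercase());
            }
            capitalize_next = false;
        } else {
            out.push(c);
            capitalize_next = true;
        }
    }
    out
}
```

Working on `char`s rather than bytes is what makes this handle non-ASCII input such as `éloïse` correctly.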

* [minor] make recursive package dependency optional  (#13778)

* make recursive optional

* add to default for common package

* cargo update

* added to readme

* make test conditional

* reviews

* cargo update

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Minor: remove unused async-compression `futures-io` feature (#13875)

* Minor: remove unused async-compression feature

* Fix cli cargo lock

* Consolidate Example: dataframe_output.rs into dataframe.rs (#13877)

* Restore `DocBuilder::new()` to avoid breaking API change (#13870)

* Fix build

* Restore DocBuilder::new(), deprecate

* cmt

* clippy

* Improve error messages for incorrect zero argument signatures (#13881)

* Improve error messages for incorrect zero argument signatures

* fix errors

* fix fmt

* Consolidate Example: simplify_udwf_expression.rs into advanced_udwf.rs (#13883)

* minor: fix typos in  comments / structure names (#13879)

* minor: fix typo error in datafusion

* fix: fix rebase error

* fix: format HashJoinExec doc

* doc: recover thiserror/preemptively

* fix: other typo error fixed

* fix: directories to dir_entries in catalog example

* Support 1 or 3 arg in generate_series() UDTF (#13856)

* Support 1 or 3 args in generate_series() UDTF

* address comment

* Support (order by / sort) for DataFrameWriteOptions (#13874)

* Support (order by / sort) for DataFrameWriteOptions

* Fix fmt

* Fix import

* Add insert into example

* Update sort_merge_join.rs (#13894)

* Update join_selection.rs (#13893)

* Fix `recursive-protection` feature flag (#13887)

* Fix recursive-protection feature flag

* rename feature flag to be consistent

* Make default

* taplo format

* Fix visibility of swap_hash_join (#13899)

* Minor: Avoid emitting empty batches in partial sort (#13895)

* Update partial_sort.rs

* Update partial_sort.rs

* Update partial_sort.rs

* Prepare for 44.0.0 release: version and changelog (#13882)

* Prepare for 44.0.0 release: version and changelog

* update changelog

* update configs

* update before release

* Support unparsing implicit lateral `UNNEST` plan to SQL text (#13824)

* support unparsing the implicit lateral unnest plan

* cargo clippy and fmt

* refactor for `check_unnest_placeholder_with_outer_ref`

* add const for the prefix string of unnest and outer reference column

* fix case_column_or_null with nullable when conditions (#13886)

* fix case_column_or_null with nullable when conditions

* improve sqllogictests for case_column_or_null

---------

Co-authored-by: zhangli20 <[email protected]>

* Fixed Issue #13896 (#13903)

The URL to the external website was returning a 404. The required data has apparently been moved to a different URL, presumably due to recent changes in the external website's structure. The commit ensures the new URL is used.

* Introduce `UserDefinedLogicalNodeUnparser` for User-defined Logical Plan unparsing (#13880)

* make ast builder public

* introduce udlp unparser

* add documents

* add examples

* add negative tests and fmt

* fix the doc

* rename udlp to extension

* apply the first unparsing result only

* improve the doc

* separate the enum for the unparsing result

* fix the doc

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Preserve constant values across union operations (#13805)

* Add value tracking to ConstExpr for improved union optimization

* Update PartialEq impl

* Minor change

* Add docstring for ConstExpr value

* Improve constant propagation across union partitions

* Add assertion for across_partitions

* fix fmt

* Update properties.rs

* Remove redundant constant removal loop

* Remove unnecessary mut

* Set across_partitions=true when both sides are constant

* Extract and use constant values in filter expressions

* Add initial SLT for constant value tracking across UNION ALL

* Assign values to ConstExpr where possible

* Revert "Set across_partitions=true when both sides are constant"

This reverts commit 3051cd470b0ad4a70cd8bd3518813f5ce0b3a449.

* Temporarily take value from literal

* Lint fixes

* Cargo fmt

* Add get_expr_constant_value

* Make `with_value()` accept optional value

* Add todo

* Move test to union.slt

* Fix changed slt after merge

* Simplify constexpr

* Update properties.rs

---------

Co-authored-by: berkaysynnada <[email protected]>

* chore(deps): update sqllogictest requirement from 0.23.0 to 0.24.0 (#13902)

* fix RecordBatch size in topK (#13906)

* ci improvements, update protoc (#13876)

* Fix md5 return_type to only return Utf8 as per current code impl.

* ci improvements

* Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash.

* Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash.

* Revert nextest change until action is approved.

* Exclude requires workspace

* Fixing minor typo to verify ci caching of builds is working as expected.

* Updates from PR review.

* Adding issue link for disabling intel mac build

* improve performance of running examples

* remove cargo check

* Introduce LogicalPlan invariants, begin automatically checking them (#13651)

* minor(13525): perform LP validation before and after each possible mutation

* minor(13525): validate unique field names on query and subquery schemas, after each optimizer pass

* minor(13525): validate union after each optimizer passes

* refactor: make explicit what is an invariant of the logical plan, versus assertions made after a given analyzer or optimizer pass

* chore: add link to invariant docs

* fix: add new invariants module

* refactor: move all LP invariant checking into LP, delineate executable (valid semantic plan) vs basic LP invariants

* test: update test for slight error message change

* fix: push_down_filter optimization pass can push a IN(<subquery>) into a TableScan's filter clause

* refactor: move collect_subquery_cols() to common utils crate

* refactor: clarify the purpose of assert_valid_optimization(), runs after all optimizer passes, except in debug mode it runs after each pass.

* refactor: based upon performance tests, run the maximum number of checks without impact:
* assert_valid_optimization can run each optimizer pass
* remove the recursive check_fields, which caused the performance regression
* the full LP Invariants::Executable can only run in debug

* chore: update error naming and terminology used in code comments

* refactor: use proper error methods

* chore: more cleanup of error messages

* chore: handle option trailer to error message

* test: update sqllogictests tests to not use multiline

* Correct return type for initcap scalar function with utf8view (#13909)

* Set utf8view as return type when input type is the same

* Verify that the returned type from call to scalar function matches the return type specified in the return_type function

* Match return type to utf8view

* Consolidate example: simplify_udaf_expression.rs into advanced_udaf.rs (#13905)

* Implement maintains_input_order for AggregateExec (#13897)

* Implement maintains_input_order for AggregateExec

* Update mod.rs

* Improve comments

---------

Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: mertak-synnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Move join type input swapping to pub methods on Joins (#13910)

* doc-gen: migrate scalar functions (string) documentation 3/4 (#13926)

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)

* Update sqllogictest requirement from 0.24.0 to 0.25.0

Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version.
- [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases)
- [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.25.0)

---
updated-dependencies:
- dependency-name: sqllogictest
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Remove labels

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <[email protected]>

* Consolidate Examples: memtable.rs and parquet_multiple_files.rs (#13913)

* doc-gen: migrate scalar functions (crypto) documentation (#13918)

* doc-gen: migrate scalar functions (crypto) documentation

* doc-gen: fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* doc-gen: migrate scalar functions (datetime) documentation 1/2 (#13920)

* doc-gen: migrate scalar functions (datetime) documentation 1/2

* fix: fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* fix RecordBatch size in hash join (#13916)

* doc-gen: migrate scalar functions (array) documentation 1/3 (#13928)

* doc-gen: migrate scalar functions (array) documentation 1/3

* fix: remove unsed import, fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* doc-gen: migrate scalar functions (math) documentation 1/2 (#13922)

* doc-gen: migrate scalar functions (math) documentation 1/2

* fix: fix typo

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* doc-gen: migrate scalar functions (math) documentation 2/2 (#13923)

* doc-gen: migrate scalar functions (math) documentation 2/2

* fix: fix typo

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* doc-gen: migrate scalar functions (array) documentation 3/3 (#13930)

* doc-gen: migrate scalar functions (array) documentation 3/3

* fix: import doc and macro, fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* doc-gen: migrate scalar functions (array) documentation 2/3 (#13929)

* doc-gen: migrate scalar functions (array) documentation 2/3

* fix: import doc and macro, fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* doc-gen: migrate scalar functions (string) documentation 4/4 (#13927)

* doc-gen: migrate scalar functions (string) documentation 4/4

* fix: fix typo and update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* Support explain query when running dfbench with clickbench (#13942)

* Support explain query when running dfbench

* Address comments

* Consolidate example to_date.rs into dataframe.rs (#13939)

* Consolidate example to_date.rs into dataframe.rs

* Assert results using assert_batches_eq

* clippy

* Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" (#13945)

* Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)"

This reverts commit 0989649214a6fe69ffb33ed38c42a8d3df94d6bf.

* add comment

* Implement predicate pruning for `like` expressions (prefix matching) (#12978)

* Implement predicate pruning for like expressions

* add function docstring

* re-order bounds calculations

* fmt

* add fuzz tests

* fix clippy

* Update datafusion/core/tests/fuzz_cases/pruning.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
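
The prefix-matching trick behind this change can be sketched as follows (illustrative Rust only, not DataFusion's actual `PruningPredicate` code; the helper names are made up). A `LIKE 'prefix%'` filter implies every matching value starts with the literal prefix, which can be checked conservatively against a container's min/max statistics:

```rust
/// Literal prefix of a LIKE pattern, up to the first wildcard (`%` or `_`).
/// Returns None when the pattern starts with a wildcard, in which case
/// prefix-based pruning is not possible.
fn like_prefix(pattern: &str) -> Option<String> {
    let prefix: String = pattern
        .chars()
        .take_while(|c| *c != '%' && *c != '_')
        .collect();
    if prefix.is_empty() {
        None
    } else {
        Some(prefix)
    }
}

/// Conservative check: can a container with the given min/max statistics
/// hold a value matching `prefix`? Every matching value `v` starts with
/// the prefix, so `prefix <= v` lexicographically; prune when `max < prefix`,
/// or when `min` is already past every string starting with the prefix.
fn may_match(prefix: &str, min: &str, max: &str) -> bool {
    if max < prefix {
        return false;
    }
    min <= prefix || min.starts_with(prefix)
}
```

For example, with the filter `url LIKE 'https://www.google.com%'`, a row group whose statistics are `min = "https://www.aaa"`, `max = "https://www.bbb"` is pruned, while one whose range spans the prefix is kept.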

* doc-gen: migrate scalar functions (string) documentation 1/4 (#13924)

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* consolidate dataframe_subquery.rs into dataframe.rs (#13950)

* migrate btrim to user_doc macro (#13952)

* doc-gen: migrate scalar functions (datetime) documentation 2/2 (#13921)

* doc-gen: migrate scalar functions (datetime) documentation 2/2

* fix: fix typo and update function docs

* doc: update function docs

* doc-gen: remove slash

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>

* Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests (#13936)

* Fix md5 return_type to only return Utf8 as per current code impl.

* Add support for sqlite test files to sqllogictest

* Force version 0.24.0 of sqllogictest dependency until issue with labels is fixed.

* Removed workaround for bug that was fixed.

* Git submodule update ... err update, link to sqlite tests.

* Git submodule update

* Readd submodule

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Support writing schema metadata when writing Parquet in parallel (#13866)

* refactor: make ParquetSink tests a bit more readable

* chore(11770): add new ParquetOptions.skip_arrow_metadata

* test(11770): demonstrate that the single threaded ParquetSink is already writing the arrow schema in the kv_meta, and allow disablement

* refactor(11770): replace  with new method, since the kv_metadata is inherent to TableParquetOptions and therefore we should explicitly make the API apparant that you have to include the arrow schema or not

* fix(11770): fix parallel ParquetSink to encode arrow  schema into the file metadata, based on the ParquetOptions

* refactor(11770): provide deprecation warning for TryFrom

* test(11770): update tests with new default to include arrow schema

* refactor: include partitioning of arrow schema inserted into kv_metadata

* test: update tests for new config prop, as well as the new file partition offsets based upon larger metadata

* chore: avoid cloning in tests, and update code docs

* refactor: return to the WriterPropertiesBuilder::TryFrom<TableParquetOptions>, and separately add the arrow_schema to the kv_metadata on the TableParquetOptions

* refactor: require the arrow_schema key to be present in the kv_metadata, if is required by the configuration

* chore: update configs.md

* test: update tests to handle the (default) required arrow schema in the kv_metadata

* chore: add reference to arrow-rs upstream PR

* chore: Create devcontainer.json (#13520)

* Create devcontainer.json

* update devcontainer

* remove useless features

* Minor: consolidate ConfigExtension example into API docs (#13954)

* Update examples README.md

* Minor: consolidate ConfigExtension example into API docs

* more docs

* Remove update

* clippy

* Fix issue with ExtensionsOptions docs

* Parallelize pruning utf8 fuzz test (#13947)

* Add swap_inputs to SMJ (#13984)

* fix(datafusion-functions-nested): `arrow-distinct` now work with null rows (#13966)

* added failing test

* fix(datafusion-functions-nested): `arrow-distinct` now work with null rows

* Update datafusion/functions-nested/src/set_ops.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update set_ops.rs

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Update release instructions for 44.0.0 (#13959)

* Update release instructions for 44.0.0

* update macros and order

* add functions-table

* Add datafusion python 43.1.0 blog post to doc. (#13974)

* Include license and notice files in more crates (#13985)

* Extract postgres container from sqllogictest, update datafusion-testing pin (#13971)

* Add support for sqlite test files to sqllogictest

* Removed workaround for bug that was fixed.

* Refactor sqllogictest to extract postgres functionality into a separate file. Removed dependency on once_cell in favour of LazyLock.

* Add missing license header.

* Update rstest requirement from 0.23.0 to 0.24.0 (#13977)

Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version.
- [Release notes](https://github.com/la10736/rstest/releases)
- [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md)
- [Commits](https://github.com/la10736/rstest/compare/v0.23.0...v0.23.0)

---
updated-dependencies:
- dependency-name: rstest
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Move hash collision test to run only when merging to main. (#13973)

* Update itertools requirement from 0.13 to 0.14 (#13965)

* Update itertools requirement from 0.13 to 0.14

Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.13.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix build

* Simplify

* Update CLI lock

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jonahgao <[email protected]>

* Change trigger, rename `hash_collision.yml` to `extended.yml` and add comments (#13988)

* Rename hash_collision.yml to extended.yml and add comments

* Adjust schedule, add comments

* Update job, rerun

* doc-gen: migrate scalar functions (string) documentation 2/4 (#13925)

* doc-gen: migrate scalar functions (string) documentation 2/4

* doc-gen: update function docs

* doc: fix related udf order for upper function in documentation

* Update datafusion/functions/src/string/concat_ws.rs

* Update datafusion/functions/src/string/concat_ws.rs

* Update datafusion/functions/src/string/concat_ws.rs

* doc-gen: update function docs

---------

Co-authored-by: Cheng-Yuan-Lai <[email protected]>
Co-authored-by: Oleks V <[email protected]>

* Update substrait requirement from 0.50 to 0.51 (#13978)

Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version.
- [Release notes](https://github.com/substrait-io/substrait-rs/releases)
- [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.50.0...v0.51.0)

---
updated-dependencies:
- dependency-name: substrait
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update release README for datafusion-cli publishing (#13982)

* Enhance LastValueAccumulator logic and add SQL logic tests for last_value function (#13980)

- Updated LastValueAccumulator to include requirement satisfaction check before updating the last value.
- Added SQL logic tests to verify the behavior of the last_value function with merge batches and ensure correct aggregation in various scenarios.

* Improve deserialize_to_struct example (#13958)

* Cleanup deserialize_to_struct example

* prettier

* Apply suggestions from code review

Co-authored-by: Jonah Gao <[email protected]>

---------

Co-authored-by: Jonah Gao <[email protected]>

* Update docs (#14002)

* Optimize CASE expression for "expr or expr" usage. (#13953)

* Apply optimization for ExprOrExpr.

* Implement optimization similar to existing code.

* Add sqllogictest.

* feat(substrait): introduce consume_rel and consume_expression (#13963)

* feat(substrait): introduce consume_rel and consume_expression

Route calls to from_substrait_rel and from_substrait_rex through the
SubstraitConsumer in order to allow users to provide their own behaviour

* feat(substrait): consume nulls of user-defined types

* docs(substrait): consume_rel and consume_expression docstrings

* Consolidate csv_opener.rs and json_opener.rs into a single example (#… (#13981)

* Consolidate csv_opener.rs and json_opener.rs into a single example (#13955)

* Update datafusion-examples/examples/csv_json_opener.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion-examples/README.md

Co-authored-by: Andrew Lamb <[email protected]>

* Apply code formatting with cargo fmt

---------

Co-authored-by: Sergey Zhukov <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* FIX : Incorrect NULL handling in BETWEEN expression (#14007)

* submodule update

* FIX : Incorrect NULL handling in BETWEEN expression

* Revert "submodule update"

This reverts commit 72431aadeaf33a27775a88c41931572a0b66bae3.

* fix incorrect unit test

* move sqllogictest to expr

* feat(substrait): modular substrait producer (#13931)

* feat(substrait): modular substrait producer

* refactor(substrait): simplify col_ref_offset handling in producer

* refactor(substrait): remove column offset tracking from producer

* docs(substrait): document SubstraitProducer

* refactor: minor cleanup

* feature: remove unused SubstraitPlanningState

BREAKING CHANGE: SubstraitPlanningState is no longer available

* refactor: cargo fmt

* refactor(substrait): consume_ -> handle_

* refactor(substrait): expand match blocks

* refactor: DefaultSubstraitProducer only needs serializer_registry

* refactor: remove unnecessary warning suppression

* fix(substrait): route expr conversion through handle_expr

* cargo fmt

* fix: Avoid re-wrapping planning errors  Err(DataFusionError::Plan) for use in plan_datafusion_err (#14000)

* fix: unwrapping Err(DataFusionError::Plan) for use in plan_datafusion_err

* test: add tests for error formatting during planning

* feat: support `RightAnti` for `SortMergeJoin` (#13680)

* feat: support `RightAnti` for `SortMergeJoin`

* feat: preserve session id when using cxt.enable_url_table() (#14004)

* Return error message during planning when inserting into a MemTable with zero partitions. (#14011)

* Minor: Rewrite LogicalPlan::max_rows for Join and Union, made it easier to understand (#14012)

* Refactor max_rows for join plan, made it easier to understand

* Simplified max_rows for Union

* Chore: update wasm-supported crates, add tests (#14005)

* Chore: update wasm-supported crates

* format

* Use workspace rust-version for all workspace crates (#14009)

* [Minor] refactor: make ArraySort public for broader access (#14006)

* refactor: make ArraySort public for broader access

Changes the visibility of the ArraySort struct from super to public. This allows broader access to the struct, enabling its use in other modules and promoting better code reuse.

* clippy and docs

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Update sqllogictest requirement from =0.24.0 to =0.26.0 (#14017)

* Update sqllogictest requirement from =0.24.0 to =0.26.0

Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version.
- [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases)
- [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.26.0)

---
updated-dependencies:
- dependency-name: sqllogictest
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* remove version pin and note

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Eduard Karacharov <[email protected]>

* `url` dependency update (#14019)

* `url` dependency update

* `url` version update for datafusion-cli

* Minor: Improve zero partition check when inserting into `MemTable` (#14024)

* Improve zero partition check when inserting into `MemTable`

* update err msg

* refactor: make structs public and implement Default trait (#14030)

* Minor: Remove redundant implementation of `StringArrayType` (#14023)

* Minor: Remove redundant implementation of StringArrayType

Signed-off-by: Tai Le Manh <[email protected]>

* Deprecate rather than remove StringArrayType

---------

Signed-off-by: Tai Le Manh <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* Added references to IDE documentation for dev containers along with a small note about why one may choose to do development using a dev container. (#14014)

* Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (#13995)

* Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema

* Implement aggregate functions with spill handling in tests

* Add tests for aggregate functions with and without spill handling

* Move test related imports into mod test

* Rename spill pool test functions for clarity and consistency

* Refactor aggregate function imports to use fully qualified paths

* Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream

* Update aggregate test to use AVG instead of MAX

* assert spill count

* Refactor partial aggregate schema creation to use create_schema function

* Refactor partial aggregation schema creation and remove redundant function

* Remove unused import of Schema from arrow::datatypes in row_hash.rs

* move spill pool testing for aggregate functions to physical-plan/src/aggregates

* Use Arc::clone for schema references in aggregate functions

* Encapsulate fields of `EquivalenceProperties` (#14040)

* Encapsulate fields of `EquivalenceGroup` (#14039)

* Fix error on `array_distinct` when input is empty #13810 (#14034)

* fix

* add test

* oops

---------

Co-authored-by: Cyprien Huet <[email protected]>

* Update petgraph requirement from 0.6.2 to 0.7.1 (#14045)

* Update petgraph requirement from 0.6.2 to 0.7.1

Updates the requirements on [petgraph](https://github.com/petgraph/petgraph) to permit the latest version.
- [Changelog](https://github.com/petgraph/petgraph/blob/master/RELEASES.rst)
- [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.7.1)

---
updated-dependencies:
- dependency-name: petgraph
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update datafusion-cli/Cargo.lock

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Lamb <[email protected]>

* Encapsulate fields of `OrderingEquivalenceClass` (make field non pub) (#14037)

* Complete encapsulatug `OrderingEquivalenceClass` (make fields non pub)

* fix doc

* Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options (#14021)

* chore: add test to verify that schema is inferred as expected

* chore: add comment to method as suggested

* chore: restructure to avoid need to clone

* chore: fix flaw in rewrite

* feat(optimizer): Enable filter pushdown on window functions (#14026)

* feat(optimizer): Enable filter pushdown on window functions

Ensures selections can be pushed past window functions similarly
to what is already done with aggregations, when possible.

* fix: Add missing dependency

* minor(optimizer): Use 'datafusion-functions-window' as a dev dependency

* docs(optimizer): Add example to filter pushdown on LogicalPlan::Window

* Unparsing optimized (> 2 inputs) unions (#14031)

* tests and optimizer in testing queries

* unparse optimized unions

* format Cargo.toml

* format Cargo.toml

* revert test

* rewrite test to avoid cyclic dep

* remove old test

* cleanup

* comments and error handling

* handle union with fewer than 2 inputs

* Minor: Document output schema of LogicalPlan::Aggregate and LogicalPlan::Window (#14047)

* Simplify error handling in case.rs (#13990) (#14033)

* Simplify error handling in case.rs (#13990)

* Fix issues causing GitHub checks to fail

* Update datafusion/physical-expr/src/expressions/case.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Sergey Zhukov <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs (#13800)

* Add asynchronous catalog traits to help users that have asynchronous catalogs

* Apply clippy suggestions

* Address PR reviews

* Remove allow_unused exceptions

* Update remote catalog example to demonstrate new helper structs

* Move schema_name / catalog_name parameters into resolve function and out of trait

* Custom scalar to sql overrides support for DuckDB Unparser dialect (#13915)

* Allow adding custom scalar to sql overrides for DuckDB (#68)

* Add unit test: custom_scalar_overrides_duckdb

* Move `with_custom_scalar_overrides` definition on `Dialect` trait level

* Improve performance of `reverse` function (#14025)

* Improve performance of 'reverse' function

Signed-off-by: Tai Le Manh <[email protected]>

* Apply suggested change

* Fix typo

---------

Signed-off-by: Tai Le Manh <[email protected]>

* docs(ci): use up-to-date protoc with docs.rs (#14048)

* fix (#14042)

Co-authored-by: Cyprien Huet <[email protected]>

* Re-export TypeSignatureClass from the datafusion-expr package (#14051)

* Fix clippy for Rust 1.84 (#14065)

* fix: incorrect error message of function_length_check (#14056)

* minor fix

* add ut

* remove check for 0 arg

* test: Add plan execution during tests for bounded source (#14013)

* Bump `ctor` to `0.2.9` (#14069)

* Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone (#14038)

* Move collapse_lex_ordering to Lexordering::collapse

* reduce diff

* avoid clone, cleanup

* Introduce LexRequirement::collapse

* Improve performance of collapse, from @akurmustafa

https://github.com/alamb/datafusion/pull/26

fix formatting

* Revert "Improve performance of collapse, from @akurmustafa"

This reverts commit a44acfdb3af5bf0082c277de6ee7e09e92251a49.

* remove incorrect comment

---------

Co-authored-by: Mustafa Akur <[email protected]>
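
For context on the `collapse` refactor above: collapsing a lexicographic ordering drops sort expressions that already appeared earlier, since they add no further ordering information. A minimal sketch of that behavior (illustrative only; DataFusion operates on physical sort expressions, not strings):

```rust
/// Keep only the first occurrence of each sort expression, preserving order.
fn collapse(ordering: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for expr in ordering {
        if !out.iter().any(|seen| seen == expr) {
            out.push(expr.to_string());
        }
    }
    out
}

fn main() {
    // ORDER BY a, b, a, c is equivalent to ORDER BY a, b, c.
    assert_eq!(collapse(&["a", "b", "a", "c"]), vec!["a", "b", "c"]);
}
```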

* Bump `wasm-bindgen` and `wasm-bindgen-futures` (#14068)

* update (#14070)

* fix: make get_valid_types handle TypeSignature::Numeric correctly (#14060)

* fix get_valid_types with TypeSignature::Numeric

* f…
Labels
core (Core DataFusion crate), optimizer (Optimizer rules), performance (Make DataFusion faster)

Successfully merging this pull request may close these issues.

Support pruning on string columns using LIKE
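
The prefix-pruning idea this PR implements for the issue above: a filter like `url LIKE 'https://www.google.com%'` can only match strings in the range `[prefix, successor(prefix))`, so a row group whose min/max statistics do not intersect that range can be skipped. A minimal sketch of the containment test (illustrative Rust, assumes ASCII prefixes; not DataFusion's actual code):

```rust
/// Smallest string greater than every string starting with `prefix`:
/// bump the last byte, dropping trailing 0xFF bytes. ASCII-only sketch.
fn successor(prefix: &str) -> Option<String> {
    let mut bytes = prefix.as_bytes().to_vec();
    while let Some(&last) = bytes.last() {
        if last < 0xFF {
            let n = bytes.len();
            bytes[n - 1] = last + 1;
            return String::from_utf8(bytes).ok();
        }
        bytes.pop(); // carry: this byte cannot be incremented
    }
    None // no finite upper bound exists
}

/// A row group with string stats [min, max] may contain a match for
/// `LIKE 'prefix%'` only if [min, max] intersects [prefix, successor(prefix)).
fn may_match(prefix: &str, min: &str, max: &str) -> bool {
    if max < prefix {
        return false; // every value sorts before the prefix
    }
    match successor(prefix) {
        Some(succ) => min < succ.as_str(),
        None => true, // no upper bound: cannot rule the group out
    }
}

fn main() {
    // A row group holding only example.com URLs is pruned away...
    assert!(!may_match(
        "https://www.google.com",
        "https://www.example.com",
        "https://www.example.org",
    ));
    // ...while one that can hold google.com URLs is kept.
    assert!(may_match(
        "https://www.google.com",
        "https://www.google.com/a",
        "https://www.google.com/z",
    ));
}
```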