Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datafusion] NOW() function support #288

Merged
merged 27 commits into from
May 14, 2021
Merged

Conversation

msathis
Copy link
Contributor

@msathis msathis commented May 7, 2021

Which issue does this PR close?

Closes #251 .

Rationale for this change

Adding Postgres compatible NOW function

What changes are included in this PR?

  • Add NOW
  • Multiple NOW should return same time
  • Fix non-optimized plan case (Thanks to @alamb )

Are there any user-facing changes?

N/A

@codecov-commenter
Copy link

codecov-commenter commented May 9, 2021

Codecov Report

Merging #288 (1987ac3) into master (5f6024d) will decrease coverage by 0.61%.
The diff coverage is 96.48%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #288      +/-   ##
==========================================
- Coverage   76.80%   76.19%   -0.62%     
==========================================
  Files         133      142       +9     
  Lines       23284    23927     +643     
==========================================
+ Hits        17884    18231     +347     
- Misses       5400     5696     +296     
Impacted Files Coverage Δ
datafusion/src/physical_plan/planner.rs 80.00% <80.00%> (-0.37%) ⬇️
datafusion/src/optimizer/constant_folding.rs 91.63% <90.74%> (-0.32%) ⬇️
datafusion/src/execution/context.rs 92.49% <97.36%> (-0.14%) ⬇️
...ta/rust/core/src/serde/physical_plan/from_proto.rs 47.80% <100.00%> (+0.25%) ⬆️
datafusion/src/optimizer/eliminate_limit.rs 89.18% <100.00%> (+0.30%) ⬆️
datafusion/src/optimizer/filter_push_down.rs 97.74% <100.00%> (+<0.01%) ⬆️
datafusion/src/optimizer/hash_build_probe_order.rs 62.06% <100.00%> (-0.54%) ⬇️
datafusion/src/optimizer/limit_push_down.rs 97.84% <100.00%> (+0.02%) ⬆️
datafusion/src/optimizer/projection_push_down.rs 98.74% <100.00%> (+0.07%) ⬆️
datafusion/src/optimizer/utils.rs 49.80% <100.00%> (+0.39%) ⬆️
... and 60 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f6024d...1987ac3. Read the comment docs.

@msathis
Copy link
Contributor Author

msathis commented May 9, 2021

This PR is already big. To reduce the review burden, Will add CURRENT_TIMESTAMP and CURRENT_TIME in a separate PR.

@msathis msathis marked this pull request as ready for review May 9, 2021 07:50
@msathis msathis changed the title [Datafusion] NOW, CURRENT_TIMESTAMP, CURRENT_TIME functions [Datafusion] NO functions May 9, 2021
@msathis msathis changed the title [Datafusion] NO functions [Datafusion] NOW() function support May 9, 2021
@Dandandan
Copy link
Contributor

@msathis

Thanks for this PR!

I believe we should do it a bit differently and not calculate the timestamp inside the optimization rule.

One could disable the optimization rule, or execute an un-optimized plan.
The result of the query needs to be correct in those cases too.

My suggestion would be to set field to the ExecutionContextState (something like queryExecutionStartTime) at the start of a query and pass it to the optimizers / evaluation code. This requires maybe some more changes to the code, but is needed to correctly run the queries.

For the optimizer you could use utils::expr_sub_expressions (see use in filter push down) to recurse into the expressions and replace all cases of the NOW with the timestamp of the start of the query.

@msathis
Copy link
Contributor Author

msathis commented May 9, 2021

Thanks @Dandandan That makes sense. I'll modify the PR as per your suggestion 👍

@alamb alamb added the api change Changes the API exposed to users of the crate label May 10, 2021
@returnString
Copy link
Contributor

I am not sure we should do this: we are changing all functions' signatures because of a single function's requirement.

I think we should add a node specifically for NOW and leave all the other functions' alone.

We'll probably hit up against similar issues for current_date, current_time, current_timestamp, etc, in case that helps formulate what said node would look like.

Interestingly, this is an instance where it would be much easier if slightly less performant to implement as a UDF for certain use cases, because you could just close over some external state when building your execution ctx.

@jorgecarleitao
Copy link
Member

AFAIK current_* are all derived from now; imo the differentiator aspect here is that there is some state X that is being shared.

It seems to me that the use-case here is that we want to preserve state across nodes, so that their execution depends on said state. NOW is an example, but in reality, random is also an example; we "cheated" a bit by not allowing users to select a seed. If they want that, we hit the same problem as NOW.

IMO a natural construct here is something like struct StatefulFunction<T: Send + Sync>, where T is the state, and Arc<T> is inside of it, and that implements PhysicalExpr. During planning, the initial state is passed to it from the planner, and we are ready to fly.

For Ballista, we would probably need T to also be "proto-able", but we can handle that on a future PR when we add support for this node to Ballista.

IMO this would enable us to implement now, but also a lot of other functions that share state (e.g. pass T: Mutex<something> and we have a function that can change state during evaluation, which AFAIK is what we need for window functions, whereby the state is updated as batches pass through it).

The ScalarFunction construct was meant to be stateless because it makes it very easy to develop, and it also makes it obvious that is stateless. Trying to couple execution state to them is imo going beyond its scope.

IMO a PhysicalExpr is not really that difficult to implement; it is just a struct with evaluate, schema, and a new match on the physical planner.

@returnString
Copy link
Contributor

@jorgecarleitao that all sounds reasonable! In Postgres, this sort of corresponds to the function volatility categories (https://www.postgresql.org/docs/13/xfunc-volatility.html) which might be a useful basis for any future definition of different function types.

  • immutable: pure function, can only use arguments and internal constants (example: basic math ops). Optimiser can do lots here
  • stable: can refer to shared state but must return the same value for the same arguments within a given statement (example: now). Optimiser is allowed to unify all references into one call per unique set of arguments
  • volatile: no rules, no optimiser potential! Must always be evaluated exactly as initially planned (example: random)

@Dandandan
Copy link
Contributor

The stateful function seems needed at some point.
I am not sure a stateful function would be needed here, at this moment, and might go a bit beyond the scope of this PR/feature(?).

Adding a node for node for timestamps (which can be used by current_date, NOW(), etc.) which only has access to the query start time seems like a more natural route to me for now. WDYT?

@alamb alamb dismissed their stale review May 12, 2021 21:06

has changed significantly

@alamb
Copy link
Contributor

alamb commented May 12, 2021

The stateful function seems needed at some point. I am not sure a stateful function would be needed here, at this moment, and might go a bit beyond the scope of this PR/feature(?).

I agree with @Dandandan -- I would like to get a basic implementation of now(). I feel @msathis had this PR really close but then the addition of ExecutionProps to all scalar functions has caused issues

Adding a node for node for timestamps (which can be used by current_date, NOW(), etc.) which only has access to the query start time seems like a more natural route to me for now. WDYT?

Just to be clear, are you suggesting adding a new Expr variant, something like Expr::now() that could be handled specially?

I guess I was hoping that part of the translation from Expr --> PhysicalExpr could somehow capture state that was available at plan time (e.g. ExecutionProps) without having to change the signature of all scalar functions. Perhaps we could use a closure or something to capture this statel

@msathis
Copy link
Contributor Author

msathis commented May 13, 2021

I think we should create another category of functions called StatefulFunction as @jorgecarleitao explained above. This seems like a clean solution without any hacks from my point of view. But open for other suggestions as well. I would like to give this issue another try once we have the agreement on which way to go.

@alamb
Copy link
Contributor

alamb commented May 13, 2021

I am not opposed to the idea of StatefulFunction but it seems like a large overkill for the now() function.

Here is an alternate proposal (in #335): 261e769 -- basically use a closure to capture the value for now() during plan time. It doesn't require changing any other function signatures and I think implements the semantics of now() correctly.

What do you think @msathis @jorgecarleitao @Dandandan and @returnString ?

If we still want to do StatefulFunction that is cool too, but perhaps it would better be done as part of another PR?

@returnString
Copy link
Contributor

returnString commented May 13, 2021

Yeah on the basis that now is quite a common/useful function and this PR plus @alamb's closure change resolves this in a neat way, my suggestion would be:

Especially at the logical level, I think there are more potential benefits here than just supporting stateful functions outright, like enabling better plan optimisation, and we could probably do with some upfront design work on that first 😄

Edit: also, if/when we land some work for stateful functions, there's nothing that stops us migrating this across later!

@jorgecarleitao
Copy link
Member

Just to understand, @alamb and @returnString , is it fair to say that there is some urgency in the NOW? In this case, I agree that the closure is a great stop-gap and we should go for it.

Other than that, if you want to take a stab at stateful functions, @msathis , an idea is to create an issue and a design document / draft, so that others could comment on before you commit to a large chunk of work?

@alamb
Copy link
Contributor

alamb commented May 13, 2021

@jorgecarleitao I would like now() to be added to datafusion in the next few weeks -- my usecase is that I want to be able to write queries with predicates like "in the last 5 minutes" (which are very common in timeseries queries). The standard sql I know to do this is

WHERE ts > now() - interval '5 minutes'

However, that query also requires timestamp arithmetic -- part of #194 -- but I can work around the lack of #194 with casting. I can't think of any way to work around the lack of now()

I also am not sure I would call now() "stateful" in the sense that it has state that changes during the execution of the query. It is more like "parameterized" or something with a parameter that is filled in prior to execution

@alamb
Copy link
Contributor

alamb commented May 13, 2021

@jorgecarleitao Also, I am not convinced about how valuable a general purpose StatefulFunction will be (though of course it depends on a proposal that is not yet written ;) )

This is due to:

  1. You can write such a function today as a ScalarFunction -- via a closure with an Arc<...> with shared state. I don't see much advantage to building that into DataFusion. Window functions I think are going to have to be their own thing (like SortExpr and AggregateExpr.
  2. Given there is no way to know what order DataFusion will pass inputs to your functions (across threads, repartitioned, etc) I can't think of a usecase at this moment 🤔

@returnString
Copy link
Contributor

returnString commented May 13, 2021

I also am not sure I would call now() "stateful" in the sense that it has state that changes during the execution of the query. It is more like "parameterized" or something with a parameter that is filled in prior to execution

Yeah, in Postgres parlance this kind of function is "stable" (read: consistent over the course of a single txn given the same inputs) as opposed to "immutable" (which can't reference anything outside its args/constants) - still not the best breakdown imo, but a little bit closer maybe.

Cache invalidation and naming things, right? 😅

Also, I am not convinced about how valuable a general purpose StatefulFunction will be (though of course it depends on a proposal that is not yet written ;) )

Off the top of my head I think it'll open up some potential for generalised optimisation passes over function usage in queries according to function class, i.e. the optimiser rule used for the initial implementation of this PR but applicable to arbitrary functions provided they indicate themselves to be "stable".

is it fair to say that there is some urgency in the NOW? In this case, I agree that the closure is a great stop-gap and we should go for it.

Personally I think so, it's a pretty generally useful function and opens up lots of good time-series use cases as @alamb mentioned.

Edit: also, I'm happy to assist in defining this proposed future work or even drive it outright; not just trying to generate tasks for other people ;)

@msathis
Copy link
Contributor Author

msathis commented May 14, 2021

I have reverted the last commit & added @alamb approach from #335. Took care of the review comments as well. @alamb @jorgecarleitao Can you please give another look at the PR?

@msathis msathis requested a review from jorgecarleitao May 14, 2021 06:19
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @msathis for this and for your patience 💯

@msathis
Copy link
Contributor Author

msathis commented May 14, 2021

Thanks @jorgecarleitao! It was a great learning experience for me. 👌

@Dandandan
Copy link
Contributor

Yes, amazing work @msathis

@alamb alamb mentioned this pull request May 14, 2021
@alamb
Copy link
Contributor

alamb commented May 14, 2021

I filed #340 to track possible future work on stateful functions

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I LIKE IT! Nice work @msathis -- and I echo the sentiment of thanks for bearing with us.

@alamb alamb merged commit b096539 into apache:master May 14, 2021
@msathis msathis deleted the add-now-function branch May 14, 2021 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Postgres compatible now() function
6 participants