Hello!

I'm attempting to use DataFusion to implement an embedded query engine for a tool that occasionally needs to perform CPU-intensive, long-running operations, such as sorting a dataset that is too large to fit in memory. The debug logging produced while the physical plan executes during these operations is useful to me (as a developer), but for my users I would prefer to render a progress bar instead, using something like

Before I jump into the DataFusion internals, I'm wondering if anyone has suggestions on the best way to go about this. (Do you think it's even possible in the first place?) Are there any existing hooks for progress events? Or extension points for alternate implementations of, say, an

Thanks in advance. And thanks for all the work that's gone into DataFusion! I've found it to be an extraordinarily useful library.

Any progress with this?

Another possibility is to call Using row counts could give you some estimate of how far your query has progressed. Estimating how much longer a query has to run is tricky, I think, because the input size is often not known exactly, and some of the operators (like HashAggregate, Joins, and Sorts) don't produce any output until they have seen all their input.
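
To make the row-count idea concrete, here is a minimal sketch using only the Rust standard library. It is not DataFusion code: a plain `Iterator` stands in for the record batch stream, and `Batch`, `ProgressIter`, and `fraction_done` are hypothetical names invented for illustration. The estimated total is assumed to come from wherever you can get one (it may be missing or wrong), so the fraction is optional and clamped.

```rust
// Hypothetical stand-in for a record batch: only the row count matters here.
struct Batch {
    num_rows: usize,
}

/// Wraps any iterator of batches and reports the cumulative row count
/// to a callback each time a batch passes through.
struct ProgressIter<I, F> {
    inner: I,
    rows_seen: usize,
    on_progress: F,
}

impl<I, F> ProgressIter<I, F>
where
    I: Iterator<Item = Batch>,
    F: FnMut(usize),
{
    fn new(inner: I, on_progress: F) -> Self {
        ProgressIter { inner, rows_seen: 0, on_progress }
    }
}

impl<I, F> Iterator for ProgressIter<I, F>
where
    I: Iterator<Item = Batch>,
    F: FnMut(usize),
{
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        // Pull the next batch from the wrapped source; stop when it is exhausted.
        let batch = self.inner.next()?;
        self.rows_seen += batch.num_rows;
        (self.on_progress)(self.rows_seen);
        Some(batch)
    }
}

/// Fraction complete, clamped to 1.0 because the estimate may be too low.
/// Returns None when no usable estimate is available.
fn fraction_done(rows_seen: usize, estimated_total: Option<usize>) -> Option<f64> {
    match estimated_total {
        Some(total) if total > 0 => Some((rows_seen as f64 / total as f64).min(1.0)),
        _ => None,
    }
}
```

The callback could drive a progress bar when `fraction_done` returns `Some`, and fall back to a spinner or a raw row counter when it returns `None`. Note that a counter placed on the output of an operator that buffers all of its input would stay at zero until the very end, so where the counter sits matters.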
I think I've got a sketch of a working approach. Here's the general idea:
- `ExecutionPlan` offers a couple of relevant functions: `statistics`, which provides metrics for the physical plan; and `execute`, which produces a stream of record batches that constitute the output of the plan.
- An `ExecutionPlan` that wraps a normal plan and delegates every function except for `execute`, which I'm hijacking to produce a wrapped progress reader stream.
- `statistics`
…