FlatMapGroupsWithStateExec is a unary physical operator (aka UnaryExecNode) that is created when the FlatMapGroupsWithStateStrategy execution planning strategy plans a FlatMapGroupsWithState logical operator for execution.
Note: The FlatMapGroupsWithState logical operator is created as the result of the flatMapGroupsWithState operator.
import java.sql.Timestamp
import org.apache.spark.sql.streaming.GroupState
// Emits one (key, number of values) pair per group and leaves the state untouched
val stateFunc = (key: Long, values: Iterator[(Timestamp, Long)], state: GroupState[Long]) => {
  Iterator((key, values.size))
}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
val rateGroups = spark.
  readStream.
  format("rate").
  load.
  withWatermark(eventTime = "timestamp", delayThreshold = "10 seconds"). // required for EventTimeTimeout
  as[(Timestamp, Long)]. // leave DataFrame for Dataset
  groupByKey { case (time, value) => value % 2 }. // creates two groups
  flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(stateFunc) // EventTimeTimeout requires watermark (defined above)
// Check out the physical plan with FlatMapGroupsWithStateExec
scala> rateGroups.explain
== Physical Plan ==
*SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#35L, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#36]
+- FlatMapGroupsWithState <function3>, value#30: bigint, newInstance(class scala.Tuple2), [value#30L], [timestamp#20-T10000ms, value#21L], obj#34: scala.Tuple2, StatefulOperatorStateInfo(<unknown>,63491721-8724-4631-b6bc-3bb1edeb4baf,0,0), class[value[0]: bigint], Update, EventTimeTimeout, 0, 0
   +- *Sort [value#30L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(value#30L, 200)
         +- AppendColumns <function1>, newInstance(class scala.Tuple2), [input[0, bigint, false] AS value#30L]
            +- EventTimeWatermark timestamp#20: timestamp, interval 10 seconds
               +- StreamingRelation rate, [timestamp#20, value#21L]
// Execute the streaming query
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val sq = rateGroups.
  writeStream.
  format("console").
  trigger(Trigger.ProcessingTime(10.seconds)).
  outputMode(OutputMode.Update). // Append is not supported
  start
// Eventually...
sq.stop
Name | Description |
---|---|
numTotalStateRows | Number of keys in the StateStore. Updated after the StateStore commits, once all the output rows have been consumed. |
 | Memory used by the StateStore |
FlatMapGroupsWithStateExec is an ObjectProducerExec that…FIXME
FlatMapGroupsWithStateExec is a StateStoreWriter that…FIXME
FlatMapGroupsWithStateExec supports watermark which is…FIXME
Note: StatefulOperatorStateInfo, batchTimestampMs, and eventTimeWatermark are defined when
When executed, FlatMapGroupsWithStateExec requires that the optional values are properly defined given timeoutConf:
- batchTimestampMs for ProcessingTimeTimeout
- eventTimeWatermark and watermarkExpression for EventTimeTimeout
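The requirement above can be sketched in plain Scala. This is a minimal illustration, not Spark's code: `TimeoutConf` and its case objects are hypothetical stand-ins for the `GroupStateTimeout` values, `checkOptionalValues` is an invented helper name, and the watermark expression is modelled as a plain `String`.

```scala
object TimeoutRequirementsSketch {
  // Hypothetical stand-ins for Spark's GroupStateTimeout values (not the real classes)
  sealed trait TimeoutConf
  case object NoTimeout extends TimeoutConf
  case object ProcessingTimeTimeout extends TimeoutConf
  case object EventTimeTimeout extends TimeoutConf

  // Throws IllegalArgumentException when a value needed by the timeout mode is missing
  def checkOptionalValues(
      timeoutConf: TimeoutConf,
      batchTimestampMs: Option[Long],
      eventTimeWatermark: Option[Long],
      watermarkExpression: Option[String]): Unit = timeoutConf match {
    case ProcessingTimeTimeout =>
      require(batchTimestampMs.nonEmpty, "batchTimestampMs is required for ProcessingTimeTimeout")
    case EventTimeTimeout =>
      require(eventTimeWatermark.nonEmpty, "eventTimeWatermark is required for EventTimeTimeout")
      require(watermarkExpression.nonEmpty, "watermarkExpression is required for EventTimeTimeout")
    case NoTimeout =>
      () // no optional value is needed without a timeout
  }
}
```

Failing fast with `require` surfaces a misconfigured operator before any state is touched, which is the point of checking the optional values up front.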
Caution: FIXME Where are the optional values defined?
Tip: Enable logging for FlatMapGroupsWithStateExec to see what happens inside. Refer to Logging.
doExecute(): RDD[InternalRow]
Note: doExecute is part of the SparkPlan contract to produce the result of a physical operator as an RDD of internal binary rows (i.e. InternalRow).
Internally, doExecute initializes metrics.
doExecute then executes the child physical operator and creates a StateStoreRDD with a storeUpdateFunction that:
- Creates a StateStoreUpdater
- Filters out rows from Iterator[InternalRow] that match watermarkPredicateForData (when defined and timeoutConf is EventTimeTimeout)
- Generates an output Iterator[InternalRow] with elements from StateStoreUpdater's updateStateForKeysWithData and updateStateForTimedOutKeys
- In the end, storeUpdateFunction creates a CompletionIterator that executes a completion function (aka completionFunction) after it has successfully iterated through all the elements (i.e. when a client has consumed all the rows). The completion function requests the StateStore to commit and then updates the numTotalStateRows metric with the number of keys in the state store.
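The flow of storeUpdateFunction can be sketched without Spark. Everything below is a hypothetical model: `SketchStore` stands in for a StateStore, `SketchCompletionIterator` for CompletionIterator, rows are `(key, eventTimeMs)` pairs, and a row is assumed to be late when its event time is at or below the watermark. Timed-out keys are left as an empty iterator to keep the sketch short.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

object StoreUpdateSketch {
  // Hypothetical stand-in for a state store: only the keys and the commit flag are modelled
  final class SketchStore {
    private val keys = mutable.Set.empty[Long]
    private var committed = false
    def put(key: Long): Unit = keys += key
    def commit(): Unit = { committed = true }
    def isCommitted: Boolean = committed
    def numKeys: Long = keys.size.toLong
  }

  // Runs `completion` exactly once, when the underlying iterator reports no more elements
  final class SketchCompletionIterator[A](underlying: Iterator[A], completion: () => Unit)
      extends Iterator[A] {
    private var completed = false
    override def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !completed) {
        completed = true
        completion()
      }
      more
    }
    override def next(): A = underlying.next()
  }

  def storeUpdate(
      rows: Iterator[(Long, Long)], // (key, eventTimeMs)
      watermarkMs: Option[Long],
      store: SketchStore,
      numTotalStateRows: AtomicLong): Iterator[Long] = {
    // Drop late rows only when a watermark is defined
    val onTime = watermarkMs match {
      case Some(wm) => rows.filter { case (_, eventTime) => eventTime > wm }
      case None     => rows
    }
    // Output for keys with data (the sketch simply records the key and emits it)
    val withData = onTime.map { case (key, _) => store.put(key); key }
    // Output for timed-out keys (none in this sketch)
    val timedOut = Iterator[Long]()
    // Commit and update the metric only after the caller has consumed every row
    new SketchCompletionIterator(withData ++ timedOut, () => {
      store.commit()
      numTotalStateRows.set(store.numKeys)
    })
  }
}
```

Because the commit is tied to the iterator being exhausted, a consumer that stops early never triggers the commit in this model, which mirrors why the text stresses "when a client has consumed all the rows".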
FlatMapGroupsWithStateExec takes the following when created:
- State function of type (Any, Iterator[Any], LogicalGroupState[Any]) ⇒ Iterator[Any]
- Grouping attributes (as used for grouping in KeyValueGroupedDataset for mapGroupsWithState or flatMapGroupsWithState operators)
- Optional StatefulOperatorStateInfo
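The state function in the list above is typed entirely with Any, while user code is written against concrete key, value, and state types. One way a typed function can be viewed through such a signature is an unchecked cast that relies on JVM type erasure. The sketch below illustrates only that idea: `SketchGroupState` is a hypothetical, much-reduced stand-in for LogicalGroupState, and whether Spark performs the conversion exactly this way is an assumption.

```scala
object ErasedStateFuncSketch {
  // Hypothetical, minimal stand-in for LogicalGroupState (only read and update are modelled)
  final class SketchGroupState[S](private var value: Option[S]) {
    def getOption: Option[S] = value
    def update(newValue: S): Unit = { value = Some(newValue) }
  }

  // Same shape as the state function type listed above, with the stand-in state class
  type ErasedFunc = (Any, Iterator[Any], SketchGroupState[Any]) => Iterator[Any]

  // A typed user function: adds the number of values seen for a key to the running total
  val typedFunc: (Long, Iterator[String], SketchGroupState[Long]) => Iterator[(Long, Long)] =
    (key, values, state) => {
      val total = state.getOption.getOrElse(0L) + values.size
      state.update(total)
      Iterator((key, total))
    }

  // Unchecked cast: type parameters are erased at runtime, so the same function object is reused
  val erasedFunc: ErasedFunc = typedFunc.asInstanceOf[ErasedFunc]
}
```

The cast is only safe as long as the caller passes values of the types the typed function expects; a mismatch would fail with a ClassCastException at call time rather than at compile time.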
FlatMapGroupsWithStateExec initializes the internal registries and counters.