Add any extra change notes here and we'll put them in the release notes on GitHub when we make a new release.
- **Breaking change** Recovery system re-worked. Kafka-based recovery removed. SQLite recovery file format changed; existing recovery DB files cannot be used. See the module docstring for `bytewax.recovery` for how to use the new recovery system.
- Dataflow execution supports rescaling over resumes. You can now change the number of workers and still get proper execution and recovery.
- Add support for Windows builds - thanks @zzl221000!
- Adds a `CSVInput` subclass of `FileInput`.
- Add a cooldown for activating workers to reduce CPU consumption.
- Add support for Python 3.11.
- **Breaking change** Reworked the execution model. `run_main` and `cluster_main` have been moved to `bytewax.testing` as they are only supposed to be used when testing or prototyping. Production dataflows should be run by calling the `bytewax.run` module with `python -m bytewax.run <dataflow-path>:<dataflow-name>`. See `python -m bytewax.run -h` for all the possible options. The functionality offered by `spawn_cluster` is now only offered by the `bytewax.run` script, so `spawn_cluster` was removed.
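  A minimal sketch of the new entry point (`my_flow.py` and `flow` are hypothetical names; `TestingInput` and `StdOutput` are the connectors shipped with this release):

  ```python
  # my_flow.py
  from bytewax.connectors.stdio import StdOutput
  from bytewax.dataflow import Dataflow
  from bytewax.testing import TestingInput

  flow = Dataflow()
  flow.input("inp", TestingInput(range(5)))
  flow.map(lambda x: x * 2)
  flow.output("out", StdOutput())

  # Run from a shell with:
  #   python -m bytewax.run my_flow:flow
  ```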
- **Breaking change** `{Sliding,Tumbling}Window.start_at` has been renamed to `align_to` and both now require that argument. It's not possible to recover windowing operators without it.
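  A minimal sketch of configuring the new required argument (assuming the `bytewax.window` module layout from this release):

  ```python
  from datetime import datetime, timedelta, timezone

  from bytewax.window import TumblingWindow

  # align_to anchors window boundaries to a fixed instant, so the same
  # windows are produced across restarts and resumes.
  windower = TumblingWindow(
      length=timedelta(minutes=5),
      align_to=datetime(2023, 1, 1, tzinfo=timezone.utc),
  )
  ```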
- Fixes bugs with windows not closing properly.
- Fixes an issue with SQLite-based recovery. Previously you'd always get an "interleaved executions" panic whenever you resumed a cluster after the first time.
- Add `SessionWindow` for windowing operators.
- Add `SlidingWindow` for windowing operators.
- **Breaking change** Rename `TumblingWindowConfig` to `TumblingWindow`.
- Add `filter_map` operator.
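  A small sketch of the new operator: `filter_map` keeps the mapped value when the callable returns something and drops the item when it returns `None`.

  ```python
  from bytewax.dataflow import Dataflow

  flow = Dataflow()

  def parse_int(s):
      try:
          return int(s)  # Keep the parsed value.
      except ValueError:
          return None  # Drop items that fail to parse.

  flow.filter_map(parse_int)
  ```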
- **Breaking change** New partition-based input and output API. This removes `ManualInputConfig` and `ManualOutputConfig`. See `bytewax.inputs` and `bytewax.outputs` for more info.
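  A hypothetical sketch of the shape of the new input API (class and method names follow the `bytewax.inputs` docstrings for this release; verify the exact signatures there):

  ```python
  from bytewax.inputs import PartitionedInput, StatefulSource

  class _CounterSource(StatefulSource):
      def __init__(self, resume_state):
          # Resume counting from the snapshotted position, if any.
          self._next = resume_state or 0

      def next(self):
          if self._next >= 10:
              raise StopIteration()  # This partition is exhausted.
          item = self._next
          self._next += 1
          return item

      def snapshot(self):
          return self._next

  class CounterInput(PartitionedInput):
      def list_parts(self):
          return {"single"}  # One partition for this toy source.

      def build_part(self, for_part, resume_state):
          return _CounterSource(resume_state)
  ```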
- **Breaking change** `Dataflow.capture` operator is renamed to `Dataflow.output`.
- **Breaking change** `KafkaInputConfig` and `KafkaOutputConfig` have been moved to `bytewax.connectors.kafka.KafkaInput` and `bytewax.connectors.kafka.KafkaOutput`.
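  A sketch of the new import paths (broker and topic values are placeholders; argument names are taken from the connector docs for this release):

  ```python
  from bytewax.connectors.kafka import KafkaInput, KafkaOutput
  from bytewax.dataflow import Dataflow

  flow = Dataflow()
  flow.input("kafka-in", KafkaInput(brokers=["localhost:9092"], topics=["in-topic"]))
  # ... transformation steps here ...
  flow.output("kafka-out", KafkaOutput(brokers=["localhost:9092"], topic="out-topic"))
  ```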
- **Deprecation warning** The `KafkaRecovery` store is being deprecated in favor of `SqliteRecoveryConfig` and will be removed in a future release.
- **Breaking change** Fixes issue with multi-worker recovery. If the cluster crashed before all workers had completed their first epoch, the cluster would resume from the incorrect position. This requires a change to the recovery store. You cannot resume from recovery data written with an older version.
- Dataflow continuation now works. If you run a dataflow over a finite input, all state will be persisted via recovery, so if you re-run the same dataflow pointing at the same input, but with more data appended at the end, it will correctly continue processing from the previous end-of-stream.
- Fixes issue with multi-worker recovery. Previously resume data was being routed to the wrong worker so state would be missing.
- **Breaking change** The above two changes require that the recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.
- Adds an introspection web server to dataflow workers.
- Adds `collect_window` operator.
- Added Google Colab support.
- Added tracing instrumentation and configurations for tracing backends.
- Fixes a bug where a window is never closed if recovery occurs after the last item but before the window closes.
- Recovery logging is reduced.
- **Breaking change** Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.
- Adds DynamoDB and BigQuery output connectors.
- Performance improvements.
- Support SASL and SSL for `bytewax.inputs.KafkaInputConfig`.
- `KafkaInputConfig` now accepts additional properties. See `bytewax.inputs.KafkaInputConfig`.
- Support for a pre-built Kafka output component. See `bytewax.outputs.KafkaOutputConfig`.
- Added the `fold_window` operator. It works like `reduce_window` but allows the user to build the initial accumulator for each key in a `builder` function.
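  A minimal sketch of the new operator (an assumption-laden one: it presumes the builder takes no arguments and that this era's clock and window configs live in `bytewax.window`; check the operator docstring for the exact signatures):

  ```python
  from datetime import timedelta

  from bytewax.dataflow import Dataflow
  from bytewax.window import SystemClockConfig, TumblingWindowConfig

  flow = Dataflow()

  def build():
      return []  # Fresh accumulator per key and window.

  def folder(acc, item):
      acc.append(item)
      return acc

  flow.fold_window(
      "collect",
      SystemClockConfig(),
      TumblingWindowConfig(length=timedelta(seconds=10)),
      build,
      folder,
  )
  ```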
- Output is no longer specified using an `output_builder` for the entire dataflow; instead you supply an "output config" per capture. See `bytewax.outputs` for more info.
- Input is no longer specified on the execution entry point (like `run_main`); it is instead specified using the `Dataflow.input` operator.
- Epochs are no longer user-facing as part of the input system. Any custom Python-based input components you write just need to be iterators that emit items. Recovery snapshots and backups now happen periodically, defaulting to every 10 seconds.
- Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.
- The `reduce_epoch` operator has been replaced with `reduce_window`. It takes a "clock" and a "windower" to define the kind of aggregation you want to do.
- `run` and `run_cluster` have been removed and the remaining execution entry points moved into `bytewax.execution`. You can now get similar prototyping functionality with `bytewax.execution.run_main` and `bytewax.execution.spawn_cluster` using `Testing{Input,Output}Config`s.
- `Dataflow` has been moved into `bytewax.dataflow.Dataflow`.
- Input is no longer specified using an `input_builder`, but now an `input_config` which allows you to use pre-built input components. See `bytewax.inputs` for more info.
- Preliminary support for a pre-built Kafka input component. See `bytewax.inputs.KafkaInputConfig`.
- Keys used in the `(key, value)` 2-tuples to route data for stateful operators (like `stateful_map` and `reduce_epoch`) must now be strings. Because of this, `bytewax.exhash` is no longer necessary and has been removed.
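  For example, casting a non-string key before a stateful step (a sketch assuming this era's top-level `bytewax` imports):

  ```python
  import operator

  from bytewax import Dataflow

  flow = Dataflow()
  # Keys routed to stateful operators must now be str, so cast
  # e.g. integer user IDs explicitly.
  flow.map(lambda kv: (str(kv[0]), kv[1]))
  flow.reduce_epoch(operator.add)
  ```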
- Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.
- Slight changes to `bytewax.recovery.RecoveryConfig` config options due to recovery system changes.
- `bytewax.run()` and `bytewax.run_cluster()` no longer take `recovery_config` as they don't support recovery.
- Adds `bytewax.AdvanceTo` and `bytewax.Emit` to control when processing happens.
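  A sketch of how an input builder might use these (the generator shape and builder arguments here are assumptions based on this era's input builders, not a confirmed signature):

  ```python
  from bytewax import AdvanceTo, Emit

  def input_builder(worker_index, worker_count, resume_epoch):
      for epoch, item in enumerate(["a", "b", "c"]):
          yield AdvanceTo(epoch)  # Advance the dataflow clock.
          yield Emit(item)  # Emit an item at the current epoch.
  ```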
- Adds `bytewax.run_main()` as a way to test input and output builders without starting a cluster.
- Adds a `bytewax.testing` module with helpers for testing.
- `bytewax.run_cluster()` and `bytewax.spawn_cluster()` now take a `mp_ctx` argument to allow you to change the multiprocessing behavior, e.g. from "fork" to "spawn". Defaults now to "spawn".
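  For example (a sketch where `flow` and `inp` are placeholders; only the `mp_ctx` kwarg is the point here):

  ```python
  import multiprocessing

  from bytewax import run_cluster

  # Opt back in to fork-based workers instead of the new spawn default.
  out = run_cluster(flow, inp, mp_ctx=multiprocessing.get_context("fork"))
  ```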
- Adds dataflow recovery capabilities. See `bytewax.recovery`.
- Stateful operators `bytewax.Dataflow.reduce()` and `bytewax.Dataflow.stateful_map()` now require a `step_id` argument to handle recovery.
- Execution entry points now take configuration arguments as kwargs.
- Capture operator no longer takes arguments. Items that flow through those points in the dataflow graph will be processed by the output handlers set up by each execution entry point. Every dataflow requires at least one capture.
- `Executor.build_and_run()` is replaced with four entry points for specific use cases:
  - `run()` for execution in the current process. It returns all captured items to the calling process for you. Use this for prototyping in notebooks and basic tests.
  - `run_cluster()` for execution on a temporary machine-local cluster that Bytewax coordinates for you. It returns all captured items to the calling process for you. Use this for notebook analysis where you need parallelism.
  - `spawn_cluster()` for starting a machine-local cluster with more control over input and output. Use this for standalone scripts where you might need partitioned input and output.
  - `cluster_main()` for starting a process that will participate in a cluster you are coordinating manually. Use this when starting a Kubernetes cluster.
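  A minimal sketch of the prototyping entry point (assuming this era's top-level `bytewax` imports and `(epoch, item)` input pairs):

  ```python
  from bytewax import Dataflow, run

  flow = Dataflow()
  flow.map(lambda x: x * x)
  flow.capture()  # Captured items are what run() returns.

  # enumerate() supplies (epoch, item) 2-tuples as input.
  for epoch, item in sorted(run(flow, enumerate(range(4)))):
      print(epoch, item)
  ```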
- Adds `bytewax.parse` module to help with reading command line arguments and environment variables for the above entry points.
- Renames `bytewax.inp` to `bytewax.inputs`.