What happened?

A bug in the collection encoder of the Spark dataset runner (aka the Spark structured streaming runner) may result in corrupted data after a groupByKey if the dataset is converted to an RDD and persisted.

The collection encoder merely wraps the underlying serialized UnsafeArrayData for primitive types to present a Collection as the result of deserialization. However, Spark may reuse the underlying unsafe object and thereby corrupt the values.
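The snippet below reproduces the issue. It assumes that dataset is the Dataset obtained after the groupByKey in a pipeline run with the structured streaming runner, that storageLevel is a Spark StorageLevel (e.g. MEMORY_ONLY), and that printValue simply prints each element. The content of the persisted dataset, of the derived RDD, and of the persisted RDD is printed twice each so that the two evaluations can be compared.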
dataset.persist(storageLevel);
System.out.println("\nContent of persisted dataset (1st eval)");
dataset.foreach(printValue);
System.out.println("\nContent of persisted dataset (2nd eval)");
dataset.foreach(printValue);
System.out.println("\nContent of rdd (1st eval)");
dataset.rdd().foreach(printValue);
System.out.println("\nContent of rdd (2nd eval)");
dataset.rdd().foreach(printValue);
System.out.println("\nContent of persisted rdd (1st eval)");
dataset.rdd().persist().foreach(printValue);
System.out.println("\nContent of persisted rdd (2nd eval)");
dataset.rdd().persist().foreach(printValue);
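To make the failure mode concrete, here is a minimal sketch (illustrative only; wrapView and copyValues are hypothetical names, not Beam's actual encoder code) contrasting a collection that merely wraps the UnsafeArrayData with one that copies the values out during deserialization:

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.catalyst.util.UnsafeArrayData;

class CollectionDecodeSketch {

  // Wrapping pattern: the List is only a view over the UnsafeArrayData, so its
  // contents change as soon as Spark reuses the underlying buffer.
  static List<Integer> wrapView(final UnsafeArrayData unsafe) {
    return new AbstractList<Integer>() {
      @Override public Integer get(int i) { return unsafe.getInt(i); }
      @Override public int size() { return unsafe.numElements(); }
    };
  }

  // Copying pattern: the primitive values are copied out of the unsafe buffer
  // at deserialization time, so the result is independent of buffer reuse.
  static List<Integer> copyValues(UnsafeArrayData unsafe) {
    List<Integer> out = new ArrayList<>(unsafe.numElements());
    for (int i = 0; i < unsafe.numElements(); i++) {
      out.add(unsafe.getInt(i));
    }
    return out;
  }
}
```

With the wrapping variant, any later reuse of the unsafe buffer by Spark silently changes the list's contents, which matches the corruption observed once the RDD is persisted and re-evaluated; the copying variant is unaffected by buffer reuse.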
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components