[Bug]: Usage of collection encoder in Spark Dataset runner may lead to corrupt data #25296

mosche · 2023-02-03T06:32:28Z

What happened?

A bug in the collection encoder of the Spark dataset runner (aka Spark structured streaming runner) may result in corrupted data after a groupByKey if the dataset is converted to an RDD and persisted.

For primitive types, the collection encoder merely wraps the underlying serialized UnsafeArrayData and presents it as a Collection on deserialization. However, Spark may reuse the underlying unsafe object, which corrupts values that were handed out earlier.
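The aliasing problem described above can be reproduced without Spark. The following is a minimal sketch, not the encoder's actual code and not the fix from the linked PR: a plain `int[]` stands in for the reused unsafe buffer, and `wrap` and `copy` are hypothetical helpers that contrast a wrapping view with a defensive copy.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

public class ReusedBufferDemo {

  /** Hypothetical stand-in for a deserialized collection that only wraps a shared buffer. */
  static List<Integer> wrap(int[] buffer) {
    return new AbstractList<Integer>() {
      @Override
      public Integer get(int i) {
        return buffer[i]; // reads through to the buffer on every access
      }

      @Override
      public int size() {
        return buffer.length;
      }
    };
  }

  /** Defensive variant: copies the values out of the buffer before handing them out. */
  static List<Integer> copy(int[] buffer) {
    List<Integer> out = new ArrayList<>(buffer.length);
    for (int v : buffer) {
      out.add(v);
    }
    return out;
  }

  public static void main(String[] args) {
    int[] buffer = {3, 4}; // the buffer currently holds the values of key k1
    List<Integer> wrapped = wrap(buffer);
    List<Integer> copied = copy(buffer);

    // Simulate the runtime reusing the same buffer for the values of key k2.
    buffer[0] = 66;
    buffer[1] = -33;

    System.out.println("wrapped k1 = " + wrapped); // prints "wrapped k1 = [66, -33]"
    System.out.println("copied  k1 = " + copied);  // prints "copied  k1 = [3, 4]"
  }
}
```

The wrapped view is only safe while nobody holds on to it after the buffer is reused; persisting it, as in the RDD case reported here, keeps such a reference alive.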

  dataset.persist(storageLevel);

  System.out.println("\nContent of persisted dataset (1st eval)");
  dataset.foreach(printValue);

  System.out.println("\nContent of persisted dataset (2nd eval)");
  dataset.foreach(printValue);

  System.out.println("\nContent of rdd (1st eval)");
  dataset.rdd().foreach(printValue);

  System.out.println("\nContent of rdd (2nd eval)");
  dataset.rdd().foreach(printValue);

  System.out.println("\nContent of persisted rdd (1st eval)");
  dataset.rdd().persist().foreach(printValue);

  System.out.println("\nContent of persisted rdd (2nd eval)");
  dataset.rdd().persist().foreach(printValue);

Content of persisted dataset (1st eval)
key=k3 {@1797166500}, value=[0] {@621459392}
key=k5 {@522613115}, value=[2147483647, -2147483648] {@50229760}
key=k1 {@935655811}, value=[3, 4] {@1059339408}
key=k2 {@1698407107}, value=[66, -33] {@1059339408}

Content of persisted dataset (2nd eval)
key=k3 {@491772294}, value=[0] {@1519705721}
key=k1 {@1801297116}, value=[3, 4] {@929237815}
key=k2 {@1278247606}, value=[66, -33] {@929237815}
key=k5 {@1642777670}, value=[2147483647, -2147483648] {@822592932}

Content of rdd (1st eval)
key=k5 {@1706950017}, value=[2147483647, -2147483648] {@935984436}
key=k1 {@830413362}, value=[3, 4] {@1431942185}
key=k3 {@1807383422}, value=[0] {@1220036903}
key=k2 {@1334173112}, value=[66, -33] {@1431942185}

Content of rdd (2nd eval)
key=k1 {@1977617218}, value=[3, 4] {@1966593054}
key=k2 {@131385719}, value=[66, -33] {@1966593054}
key=k3 {@966124643}, value=[0] {@4762028}
key=k5 {@1477114665}, value=[2147483647, -2147483648] {@942837752}

Content of persisted rdd (1st eval)
key=k3 {@2050847337}, value=[0] {@83038769}
key=k5 {@906982325}, value=[2147483647, -2147483648] {@714957170}
key=k1 {@654493203}, value=[66, -33] {@1747607045}
key=k2 {@287079803}, value=[66, -33] {@1747607045}

Content of persisted rdd (2nd eval)
key=k1 {@654493203}, value=[66, -33] {@1747607045}
key=k2 {@287079803}, value=[66, -33] {@1747607045}
key=k5 {@906982325}, value=[2147483647, -2147483648] {@714957170}
key=k3 {@2050847337}, value=[0] {@83038769}
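In the persisted RDD output, k1 and k2 report the same value contents ([66, -33]) and the same number in braces (@1747607045), while k1 held [3, 4] in the earlier evaluations. The `printValue` function used in the snippet is not shown in this report. The sketch below assumes the number in braces is an identity hash code, and `describe` is a hypothetical helper that produces the same line format.

```java
import java.util.Arrays;
import java.util.List;

public class IdentityPrinter {

  /** Formats a key/value pair together with the identity hash code of each object. */
  static String describe(String key, Object value) {
    return String.format(
        "key=%s {@%d}, value=%s {@%d}",
        key, System.identityHashCode(key), value, System.identityHashCode(value));
  }

  public static void main(String[] args) {
    // One list instance shared by two keys, mimicking the corrupted case.
    List<Integer> shared = Arrays.asList(66, -33);
    System.out.println(describe("k1", shared));
    System.out.println(describe("k2", shared));
  }
}
```

Both lines end with the same value and identity hash because they refer to one instance. Equal identity hashes do not strictly prove that two references are the same object, but together with identical contents they are a strong hint.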

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components


mosche added bug awaiting triage spark and removed awaiting triage labels Feb 3, 2023

github-actions bot added the P2 label Feb 3, 2023

mosche mentioned this issue Feb 3, 2023

[Spark Dataset runner] Break linage of dataset to reduce Spark planning overhead in case of large query plans #25187

Merged

3 tasks

mosche pushed a commit to mosche/beam that referenced this issue Feb 3, 2023

[Spark Dataset runner] Fix collection encoder bug that may lead to corrupted data due to naive usage of unsafe storage (fixes apache#25296).

d00ae7d

mosche mentioned this issue Feb 3, 2023

[Spark Dataset runner] Fix collection encoder bug leading to corrupt data #25297

Merged

3 tasks

mosche closed this as completed in #25297 Feb 3, 2023

mosche pushed a commit that referenced this issue Feb 3, 2023

[Spark Dataset runner] Fix collection encoder bug that may lead to corrupted data due to naive usage of unsafe storage (fixes #25296). (#25297)

d50924d

github-actions bot added this to the 2.46.0 Release milestone Feb 3, 2023
