
Persist intermediate data to avoid non-determinism caused by Spark lazy random evaluation #1676

Merged
simonzhaoms merged 7 commits into staging from simonz/spark_stratified_split-fix2 on Mar 18, 2022

Conversation

@simonzhaoms (Collaborator) commented Mar 17, 2022

Description

The bug was reported by Bhrigu. Due to Spark's lazy evaluation, the random filtering without replacement in spark_stratified_split() can be re-evaluated with different random values each time the split DataFrames are used, so the same rows may end up in both the training and the test data.

This PR persists the intermediate data so that the random values are computed only once, avoiding the non-determinism caused by Spark's lazy evaluation.
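
For context, here is a minimal sketch of the failure mode and the fix. The column names and toy data are hypothetical and this is not the exact diff in spark_splitters.py; it only illustrates the persist-before-filter pattern the PR applies:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data standing in for a user-item ratings DataFrame.
data = spark.range(1000).withColumn("user", F.col("id") % 10)

# Tag each row with a random number used to decide the split.
df_rand = data.withColumn("_rand", F.rand(seed=42))

# Without this persist, each filter below re-triggers the lineage and
# the random column can be re-evaluated with different values, so the
# two filters no longer form a clean partition of the rows.
df_rand = df_rand.persist(StorageLevel.MEMORY_AND_DISK)
df_rand.count()  # force materialization of the random column

train = df_rand.filter(F.col("_rand") >= 0.25).drop("_rand")
test = df_rand.filter(F.col("_rand") < 0.25).drop("_rand")
```

With the persist in place, both filters read the same materialized random values, so train and test are guaranteed to be disjoint and to cover all rows.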

See also:

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

@codecov-commenter commented
Codecov Report

Merging #1676 (4dfb783) into staging (17204b1) will increase coverage by 0.12%.
The diff coverage is 100.00%.

❗ Current head 4dfb783 differs from pull request most recent head de6cbf9. Consider uploading reports for the commit de6cbf9 to get more accurate results

@@             Coverage Diff             @@
##           staging    #1676      +/-   ##
===========================================
+ Coverage    23.09%   23.21%   +0.12%     
===========================================
  Files           88       88              
  Lines         9101     9101              
===========================================
+ Hits          2102     2113      +11     
+ Misses        6989     6988       -1     
Flag      Coverage Δ
nightly   ?
pr-gate   23.21% <100.00%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                 Coverage Δ
recommenders/datasets/spark_splitters.py      87.27% <100.00%> (ø)
recommenders/evaluation/spark_evaluation.py   87.05% <0.00%> (+0.44%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17204b1...de6cbf9.

@miguelgfierro (Collaborator) left a comment:

LGTM

@simonzhaoms merged commit d43e2c1 into staging on Mar 18, 2022
@simonzhaoms deleted the simonz/spark_stratified_split-fix2 branch on Mar 18, 2022, 05:26
Labels: None yet
Projects: None yet
3 participants