Restore original index with PandasParallelLFApplier #1589

henryre · 2020-05-17T18:36:52Z

Describe the solution you'd like

PandasParallelLFApplier sorts the original DataFrame by its index, which can result in an unexpected row order if the index is not already sorted when the DataFrame is passed in. This should be documented, or the original index order should be restored.
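A minimal sketch of one possible workaround, assuming the DataFrame index is unique: keep the original index and use it to put the rows of the label matrix back in the caller's order. The sort is simulated here with sort_index() rather than by running the parallel applier, and restore_row_order is a hypothetical helper, not part of Snorkel.

```python
import numpy as np
import pandas as pd


def restore_row_order(L_sorted, original_index):
    """Reorder rows of a label matrix computed on an index-sorted frame
    back to the original row order. Assumes the index is unique."""
    original_index = pd.Index(original_index)
    if not original_index.is_unique:
        raise ValueError("index must be unique to restore row order")
    sorted_index = original_index.sort_values()
    # Position of each original index label within the sorted index.
    positions = sorted_index.get_indexer(original_index)
    return np.asarray(L_sorted)[positions]


# Unsorted index, as in the situation described in this issue.
df = pd.DataFrame({"text": ["c", "a", "b"]}, index=[2, 0, 1])

# Stand-in for the applier: one "label" per row, computed on the sorted frame.
df_sorted = df.sort_index()
L_sorted = np.array([[ord(t)] for t in df_sorted["text"]])

# Rows of L now line up with the rows of df again.
L = restore_row_order(L_sorted, df.index)
```

If the original index values are not needed, calling df.reset_index(drop=True) before applying should also sidestep the problem, since an index that is already sorted is not reordered by the sort.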

Discussion: https://spectrum.chat/snorkel/help/how-to-use-the-pandasparallelapplier~cf50f563-28e6-418c-93a3-337384566c13

Additional context

Related issues: #1587 #1581 #1524


bendeaton · 2020-05-18T17:52:50Z

+1 to this. We just spent several cycles tracking down this issue in a model that uses Snorkel.

github-actions · 2020-08-17T12:19:51Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

BenjaminFraser · 2021-01-06T10:10:18Z

I'm not sure of the exact solution you were looking for here, but one possibility would be to add a keyword argument, with a default value, to PandasParallelLFApplier that controls whether the index is sorted.

For example, at line 97 within snorkel/snorkel/labeling/apply/dask.py, dd.from_pandas() is called:

df = dd.from_pandas(df, npartitions=n_parallel)

To leave the index unsorted and avoid the problem highlighted in this issue, we can simply call dd.from_pandas() like so:

df = dd.from_pandas(df, npartitions=n_parallel, sort=False)

We might not always want this to be False, however, since dask sorts by default for performance and to obtain meaningful divisions (Ref: dask/dask#1428).

Therefore, it could be worth letting users call PandasParallelLFApplier with sort=True / False, as required. Perhaps it could be made clear in the documentation that this sorting occurs by default, and if users do not want this to happen, they should provide sort=False.

Just a suggestion anyway!

henryre self-assigned this May 17, 2020

This was referenced May 17, 2020

labelmodel.fit on a superset of data changes predictions of subset #1581

Closed

Different results and accuracy down to 10% with PandasParallelLFApplier vs PandasLFApplier in Snorkel 0.9.5 #1587

Closed

github-actions bot added the no-issue-activity label Aug 17, 2020

github-actions bot closed this as completed Aug 24, 2020

henryre added the no-stale label (auto-stale bot skips this issue) and removed the no-issue-activity label Aug 26, 2020

henryre reopened this Aug 26, 2020
