You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Using PandasParallelLFApplier includes an index sort on the original DataFrame, which can result in unexpected row order if the index is not sorted when passed in. This should be documented, or the original index order should be restored.
Not sure on the exact solution you were looking for on this, but one possibility would be to add an additional default keyword arg to PandasParallelLFApplier on whether to index sort or not.
For example, at line 97 within snorkel/snorkel/labeling/apply/dask.py, dd.from_pandas() is called:
df = dd.from_pandas(df, npartitions=n_parallel)
If you want the index to remain unsorted, and prevent the problem highlighted in this issue, we can simply call dd.from_pandas() like so:
We might not always want this to be false however, since dask makes this by default for performance purposes and for obtaining meaningful divisions (Ref: dask/dask#1428).
Therefore, it could be worth letting users call PandasParallelLFApplier with sort=True / False, as required. Perhaps it could be made clear in the documentation that this sorting occurs by default, and if users do not want this to happen, they should provide sort=False.
Describe the solution you'd like
Using
PandasParallelLFApplier
includes an index sort on the original DataFrame, which can result in unexpected row order if the index is not sorted when passed in. This should be documented, or the original index order should be restored.Discussion: https://spectrum.chat/snorkel/help/how-to-use-the-pandasparallelapplier~cf50f563-28e6-418c-93a3-337384566c13
Additional context
Related issues: #1587 #1581 #1524
The text was updated successfully, but these errors were encountered: