[pyspark] SparkXGBRanker does not work on dataframe with multiple partitions #8491
Comments
@tracykyle93 For this, local is sufficient.
Hi @WeichenXu123 @trivialfis, thanks for the quick response! However, `spark_df.sortWithinPartitions` does not work. Code to reproduce this issue (some lines were cut off in the original post):

```python
from xgboost.spark import SparkXGBRanker

sparkSession = (SparkSession
df_train = sparkSession.createDataFrame(
print('partition number of df_train -- {}'.format(df_train.rdd.getNumPartitions()))
df_train = df_train.sortWithinPartitions("qid", ascending=True)
ranker = SparkXGBRanker(qid_col="qid")
```

error log:

Sorting with local DMatrix data (within partitions) is definitely efficient; looking forward to a fix for this issue.
Yeah, I can repro it according to #8491 (comment). I will provide a fix for it.
@wbo4958 Thanks! |
@trivialfis
No. We calculate the gradient based on query groups, and gradient calculation is not distributed; only the final histogram bins are synchronized. Is it possible that a better-organized group with a larger amount of data can bring better accuracy via better gradients? Maybe, but it's not required.
@tracykyle93 could you help verify #8497? BTW,
Thanks for the hotfix @wbo4958. I wanted to test your change, but I cannot build from source on the hotfix branch; I have posted the question in the discussion forum: https://discuss.xgboost.ai/t/fail-to-build-from-source-on-hotfix-branch/3015
@tracykyle93 I can repro it. Could you file an issue for XGBoost? BTW, to work around it, please use the newest PySpark, for example,
I have experienced the same issue mentioned in this discussion forum, https://discuss.xgboost.ai/t/sparkxgbranker-does-not-work-on-parallel-workers/2986, and did not see any issue raised on GitHub, so I am filing it here.
"SparkXGBRanker from 1.7.0 release requires data to be sorted by qid. It works fine if we have one worker and sorted dataframe. However with multiple workers data comes to them unordered and raises exception:
org.apache.spark.api.python.PythonException: 'xgboost.core.XGBoostError: [17:46:40] …/src/data/data.cc:486: Check failed: non_dec: qid must be sorted in non-decreasing order along with data.
...
Is there any way to prepare the df in sorted order for the workers? Or should sorting be done on each worker?"
One trivial solution, given a df with columns ['qid', 'label', 'features'], is:

```python
df = df.repartition(1)
df = df.sort(df.qid.asc())
```

Repartitioning into a single partition and sorting makes SparkXGBRanker run without any error as far as I have explored, but such an expensive operation really slows down the total processing time. Could you add support for this case? Thanks in advance!