[SPARK-14409][ML][WIP] Add RankingEvaluator #16618
Conversation
Can one of the admins verify this patch?
Please consider this PR WIP. Discussion in JIRA https://issues.apache.org/jira/browse/SPARK-14409
Could you add
I rewrote the ranking metrics from the mllib package as UDFs (as suggested here) with minimal changes to the logic.
The basic direction looks right - I won't have time to review immediately. The Spark 2.2 QA code freeze will happen shortly, so this will wait until the 2.3 dev cycle starts.
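As a hedged illustration (not the PR's actual code), the per-row logic behind one of these metrics, precision@k with binary relevance, can be sketched in plain Scala; the function name and signature are assumptions for the example:

```scala
// Sketch of precision@k with binary relevance, mirroring the logic in
// mllib's RankingMetrics; names here are illustrative, not from the PR.
def precisionAtK(predicted: Seq[Int], actual: Seq[Int], k: Int): Double = {
  val actualSet = actual.toSet // relevant items for this query
  val topK = predicted.take(k) // top-k ranked predictions
  if (actualSet.isEmpty) 0.0
  else topK.count(actualSet.contains).toDouble / k
}
```

In the evaluator this logic would run as a UDF over per-query arrays rather than as a plain function.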
I like where this is going, it could be quite useful for taking advantage of CrossValidation when doing pairwise and listwise ranking in xgboost4j-spark.
val predictionAndLabels: DataFrame = dataset
  .join(topAtk, Seq($(queryCol)), "outer")
  .withColumn("topAtk", coalesce(col("topAtk"), mapToEmptyArray_()))
  .select($(labelCol), "topAtk")
Don't we also need to run an aggregation on the label column, roughly the same as the previous aggregation but using labelCol as the sort key instead of predictionCol?
Currently this generates a row per prediction, whereas ranking tasks should have a row per query. I think the aggregation should be run twice, and the two aggregations should then be joined together on queryCol. That would result in a dataset containing (labels of top k predictions, top k actual labels).
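The suggestion above can be sketched with plain collections standing in for DataFrames; the Row fields and helper names are assumptions for illustration, and in the real evaluator these would be DataFrame aggregations joined on queryCol:

```scala
// Plain-collections sketch of "aggregate top-k twice, then join per query".
case class Row(query: Int, doc: Int, prediction: Double, label: Double)

// Top-k docs per query, ranked by the given score (descending).
def topKBy(rows: Seq[Row], k: Int, score: Row => Double): Map[Int, Seq[Int]] =
  rows.groupBy(_.query).map { case (q, rs) =>
    (q, rs.sortBy(r => -score(r)).take(k).map(_.doc))
  }

// Per query: (docs of top-k predictions, docs of top-k actual labels).
def predictionAndLabels(rows: Seq[Row], k: Int): Map[Int, (Seq[Int], Seq[Int])] = {
  val byPred  = topKBy(rows, k, _.prediction)
  val byLabel = topKBy(rows, k, _.label)
  byPred.map { case (q, p) => (q, (p, byLabel.getOrElse(q, Seq.empty[Int]))) }
}
```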
Yes, I agree. This is currently done in the previous step, when the topAtk DataFrame is calculated (line 101).
Unfortunately this is not compatible with RankingMetrics, which expects the predictionAndLabels format as input. I didn't want to change RankingMetrics in this same PR.
So the predictionAndLabels DataFrame is calculated to reuse the same RankingMetrics logic from mllib (it is now UDF-based, but I didn't touch its logic).
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (i < predicted.length && actualSet.contains(predicted(i))) {
This doesn't seem right: there is no overlap between the calculation of dcg and max_dcg. The question asked here should be whether the label at predicted(i) is "good". When treating the labels as binary relevant/not relevant, I suppose that might use a threshold, but it would be better to move away from a binary dcg and use the full equation from the docblock. I understand, though, that you are not looking to make major updates to the code from mllib, so it would probably be reasonable for someone to fix this in a follow-up.
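For reference, a graded (non-binary) DCG along the lines this comment suggests could look like the following sketch; the relevance map and function names are assumptions for illustration, not code from mllib:

```scala
// Sketch of graded DCG/NDCG: gain comes from each item's label
// (relevance grade) instead of binary set membership.
def dcgAtK(ranked: Seq[Int], relevance: Map[Int, Double], k: Int): Double =
  ranked.take(k).zipWithIndex.map { case (doc, i) =>
    // Discount by log2(i + 2); the mllib code uses natural log, which only
    // scales DCG by a constant factor and cancels out in NDCG.
    relevance.getOrElse(doc, 0.0) / (math.log(i + 2.0) / math.log(2.0))
  }.sum

def ndcgAtK(ranked: Seq[Int], relevance: Map[Int, Double], k: Int): Double = {
  val ideal  = relevance.toSeq.sortBy(-_._2).map(_._1) // best possible order
  val maxDcg = dcgAtK(ideal, relevance, k)
  if (maxDcg == 0.0) 0.0 else dcgAtK(ranked, relevance, k) / maxDcg
}
```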
Yes, this should be fixed in another PR to keep changes isolated. FYI, the original JIRA for this is here.
@daniloascione are you able to update this? I'd like to target this for the 2.3 dev cycle. But can we do the following:
Let's focus on porting things over and getting the API right. Then create follow-up tickets for additional metrics (MPR and any others), as well as for looking into correcting the logic and/or naming of the existing metrics. If you are unable to take it up again, I can help. Thanks!
@daniloascione any update?
@MLnick I'm wondering what the status of this issue is: it seems closed; do you have any plans to pick it up again? I might pick it up, but I'm not sure what's left: moving from the mllib package to ml, and maybe a Python API? Or fixes to ndcg as well?
  }
}, DoubleType)

val R_prime = predictionAndObservations.count()
Shouldn't this be a sum instead of a count?
(I know this is old/closed, but other people might be referring to this code.)
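To illustrate the point, a mean-percentile-rank-style metric normalizes by the sum of relevance weights rather than a row count. This is a hedged sketch with assumed names and input shape, not the PR's code:

```scala
// Sketch: the normalizer is the sum of relevance weights r_ui, not count().
// Each pair is (relevance weight r_ui, percentile rank in [0, 1]).
def meanPercentileRank(pairs: Seq[(Double, Double)]): Double = {
  val num   = pairs.map { case (r, pct) => r * pct }.sum
  val denom = pairs.map(_._1).sum // sum of weights, not pairs.length
  if (denom == 0.0) 0.0 else num / denom
}
```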
What changes were proposed in this pull request?
This patch adds the implementation of a DataFrame-API-based RankingEvaluator to ML (ml.evaluation).
How was this patch tested?
An additional test case has been added.