
[SPARK-19535][ML] RecommendForAllUsers RecommendForAllItems for ALS on Dataframe #17090

Closed
wants to merge 10 commits

Conversation

sueann
Contributor

@sueann sueann commented Feb 28, 2017

What changes were proposed in this pull request?

This is a simple implementation of RecommendForAllUsers & RecommendForAllItems for the Dataframe version of ALS. It uses Dataframe operations (not a wrapper on the RDD implementation). Haven't benchmarked against a wrapper, but unit test examples do work.
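
For context, a hedged sketch of how the new methods are meant to be called (column and variable names here are illustrative, not from the PR):

import org.apache.spark.ml.recommendation.ALS

val als = new ALS().setUserCol("user").setItemCol("item").setRatingCol("rating")
val model = als.fit(ratings)                          // ratings: DataFrame of (user, item, rating)
val topItemsPerUser = model.recommendForAllUsers(10)  // top 10 items for every user
val topUsersPerItem = model.recommendForAllItems(10)  // top 10 users for every item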

How was this patch tested?

Unit tests

$ build/sbt
> mllib/testOnly *ALSSuite -- -z "recommendFor"
> mllib/testOnly

* the top `num` K2 items based on the given Ordering.
*/

private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
Contributor Author

we may want to put this somewhere more general so it can be reused?

Contributor Author

(It'd need its own unit tests, though not sure if we'll get everything in for 2.2)

extends Aggregator[(K1, K2, V), BoundedPriorityQueue[(K2, V)], Array[(K2, V)]] {

override def zero: BoundedPriorityQueue[(K2, V)] = new BoundedPriorityQueue[(K2, V)](num)(ord)
override def reduce(
Member

I think you need to throw some spaces and braces in here to make it a bit more readable?

Contributor Author

👍
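
Stitching together the diff fragments quoted in this review, the aggregator reads roughly as the self-contained sketch below (reconstructed from the excerpts here, so treat it as indicative rather than the exact merged code):

import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.util.BoundedPriorityQueue

private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
    (num: Int, ord: Ordering[(K2, V)])
  extends Aggregator[(K1, K2, V), BoundedPriorityQueue[(K2, V)], Array[(K2, V)]] {

  // The aggregation buffer is a bounded heap holding at most `num` (K2, V) pairs under `ord`.
  override def zero: BoundedPriorityQueue[(K2, V)] =
    new BoundedPriorityQueue[(K2, V)](num)(ord)

  override def reduce(
      q: BoundedPriorityQueue[(K2, V)],
      a: (K1, K2, V)): BoundedPriorityQueue[(K2, V)] = {
    q += { (a._2, a._3) }
  }

  override def merge(
      q1: BoundedPriorityQueue[(K2, V)],
      q2: BoundedPriorityQueue[(K2, V)]): BoundedPriorityQueue[(K2, V)] = {
    q1 ++= q2
  }

  // Emit the retained pairs best-first.
  override def finish(r: BoundedPriorityQueue[(K2, V)]): Array[(K2, V)] =
    r.toArray.sorted(ord.reverse)

  override def bufferEncoder: Encoder[BoundedPriorityQueue[(K2, V)]] =
    Encoders.kryo[BoundedPriorityQueue[(K2, V)]]

  override def outputEncoder: Encoder[Array[(K2, V)]] = ExpressionEncoder[Array[(K2, V)]]()
}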

@@ -248,18 +248,18 @@ class ALSModel private[ml] (
@Since("1.3.0")
def setPredictionCol(value: String): this.type = set(predictionCol, value)

private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
if (userFeatures != null && itemFeatures != null) {
blas.sdot(rank, userFeatures.toArray, 1, itemFeatures.toArray, 1)
Member

I wonder how the overhead of converting to an array compares with the efficiency of calling sdot -- could be faster to just do the Seqs by hand? is it possible to operate on something besides Seq?

Contributor Author

Good point! But since I copy-pasted this block in this PR, maybe it's okay to try it out in another PR? At least with what we have here we know it's not a regression. Want to make sure we get some version of ALS recommendForAll* in 2.2.
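
A minimal sketch of the two alternatives under discussion (using the netlib-java wrapper mllib already depends on; the hand-rolled variant is hypothetical):

import com.github.fommil.netlib.BLAS.{getInstance => blas}

// As in the PR: copy each Seq to an Array, then call level-1 BLAS.
def dotViaBlas(rank: Int, u: Seq[Float], v: Seq[Float]): Float =
  blas.sdot(rank, u.toArray, 1, v.toArray, 1)

// Hand-rolled alternative: avoids the copies, but indexed access is only
// O(1) when the Seq is array-backed (as the WrappedArrays a UDF receives are).
def dotByHand(rank: Int, u: Seq[Float], v: Seq[Float]): Float = {
  var acc = 0f
  var i = 0
  while (i < rank) {
    acc += u(i) * v(i)
    i += 1
  }
  acc
}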

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73540 has started for PR 17090 at commit 707bc6b.

Member

@jkbradley jkbradley left a comment

Looks good, just small comments.
Could you please make a JIRA for the sdot vs toArray issue which Sean brought up?

@@ -285,6 +285,43 @@ class ALSModel private[ml] (

@Since("1.6.0")
override def write: MLWriter = new ALSModel.ALSModelWriter(this)

@Since("2.2.0")
def recommendForAllUsers(num: Int): DataFrame = {
Member

Maybe rename to numItems (and numUsers in the other method)

@@ -248,18 +248,18 @@ class ALSModel private[ml] (
@Since("1.3.0")
def setPredictionCol(value: String): this.type = set(predictionCol, value)

private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
Member

You could rename userFeatures, itemFeatures to be featuresA, featuresB or something to make it clear that there is no ordering here.

Contributor Author

I think it's actually clearer to keep the names as it "instantiates" the usage to the reader.

Member

Fair enough, though the inputs are switched in some uses

Contributor

They are switched in the case of recommendForAllItems so is that not less clear?

* @param srcOutputColumn name of the column for the source in the output DataFrame
* @param num number of recommendations for each record
* @return a DataFrame of (srcOutputColumn: Int, recommendations), where recommendations are
* stored as an array of (dstId: Int, ratingL: Double) tuples.
Member

ratingL -> rating

predict(srcFactors("features"), dstFactors("features")).as($(predictionCol)))
// We'll force the IDs to be Int. Unfortunately this converts IDs to Int in the output.
val topKAggregator = new TopByKeyAggregator[Int, Int, Float](num, Ordering.by(_._2))
ratings.as[(Int, Int, Float)].groupByKey(_._1).agg(topKAggregator.toColumn)
Member

It'd be nice to specify field names for dstId and rating and to document the schema in the recommend methods. That will help users extract recommendations.

Contributor Author

I'm not sure what a good way to do this is :-/ Ways I can think of but haven't succeeded with:
1/ change the schema of the entire DataFrame
2/ map over the rows in the DataFrame {
     map over the items in the array {
       convert from tuple (really a Row) to a Row with a different schema
     }
   }

I tried using RowEncoder in either case, but the types haven't quite worked out. Any ideas?
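
One approach that may sidestep RowEncoder entirely (a sketch, assuming recs is the aggregated DataFrame and srcOutputColumn / dstOutputColumn hold the desired names): Spark permits renaming nested struct fields by casting the array-of-structs column to an ArrayType with the target field names.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, FloatType, IntegerType, StructType}

// The cast renames the struct fields positionally: _1 -> dstOutputColumn, _2 -> "rating".
val arrayType = ArrayType(
  new StructType()
    .add(dstOutputColumn, IntegerType)
    .add("rating", FloatType))

recs.toDF("id", "recommendations")
  .select(col("id").as(srcOutputColumn), col("recommendations").cast(arrayType))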

r.toArray.sorted(ord.reverse)
override def bufferEncoder: Encoder[BoundedPriorityQueue[(K2, V)]] =
Encoders.kryo[BoundedPriorityQueue[(K2, V)]]
override def outputEncoder: Encoder[Array[(K2, V)]] = ExpressionEncoder[Array[(K2, V)]]
Member

IntelliJ style complaint: include "()" at end


/**
* Returns top `num` items recommended for each user, for all users.
* @param num number of recommendations for each user
Member

number -> max number

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73543 has finished for PR 17090 at commit 832b066.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73553 has finished for PR 17090 at commit ebd2604.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented Feb 28, 2017

cc @MLnick

@hhbyyh
Contributor

hhbyyh commented Feb 28, 2017

the same as #12574?

@jkbradley
Member

@hhbyyh This is different from #12574 since it sidesteps the ongoing design discussions about input and output schema. Eventually, I'd like us to proceed with #12574 but only after we've figured out the right workflow for using these methods in a Pipeline with tuning.

@MLnick
Contributor

MLnick commented Feb 28, 2017

#12574 is a comprehensive solution that also intends to support cross-validation as well as recommending for a subset (or any arbitrary set) of users/items. So it solves SPARK-13857 and SPARK-10802 at the same time.

That PR is in a fully working state. I'm a little surprised to see work done on this rather than deciding on the input/output schema stuff...

@MLnick
Contributor

MLnick commented Feb 28, 2017

By the way I have been doing performance testing in support of #12574 and results are pretty much ready.

@jkbradley
Member

jkbradley commented Feb 28, 2017

I'd been following the long discussions about a transform-based solution, but those did not seem to have converged on a clear design. If you feel they have in your PR, then I'll spend some time tomorrow going through your PR to catch up. (Keep in mind that the branch cut is scheduled to happen in a day or two: http://spark.apache.org/versioning-policy.html )

@MLnick
Contributor

MLnick commented Feb 28, 2017

For performance tests, I've been using the MovieLens ml-latest dataset here. It has 24,404,096 ratings with 259,137 users and 39,443 movies.

So it's not enormous, but "recommend all" does a lot of work - generating 1,631,206,099 raw predicted ratings before the top-k.

Some quick tests of the existing recommendProductsForUsers give 306 sec:

scala> spark.time { oldModel.recommendProductsForUsers(k).count }
Time taken: 306512 ms
res11: Long = 259137

As part of my performance testing I've tried a few approaches roughly similar to this PR, but using Window and filter rather than this top-k aggregator (which is a neat idea).

At first I thought this PR was really good:

scala> spark.time { newModel.recommendForAllUsers(k).count }
Time taken: 151504 ms
res3: Long = 259137

151 sec seems fast!

But then I tried this:

scala> spark.time { newModel.recommendForAllUsers(k).show }
+------+--------------------+
|userId|     recommendations|
+------+--------------------+
| 35982|[[131382,15.53116...|
| 67782|[[131382,29.72169...|
| 82672|[[132954,12.19152...|
|155042|[[148954,16.09084...|
|167532|[[118942,13.94282...|
|168802|[[27212,11.881494...|
|216112|[[109159,25.46359...|
|243392|[[153010,9.85302]...|
|255132|[[131382,15.50626...|
|255362|[[131382,10.08476...|
| 17389|[[152711,16.09958...|
|120899|[[156956,12.61003...|
|213089|[[82055,13.293286...|
|253769|[[152711,16.57459...|
|258129|[[152711,22.50499...|
| 24347|[[152711,12.31282...|
| 35947|[[153184,11.04110...|
|103357|[[132954,13.26898...|
|130557|[[118942,14.00168...|
|156017|[[153010,12.24449...|
+------+--------------------+
only showing top 20 rows

Time taken: 672524 ms

672 sec, over 2x slower than the mllib impl.

Not sure why count is fast relative to show (maybe Spark SQL is not doing all the actual compute, while for show it does need to?).
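
For reference, the Window-and-filter alternative mentioned above might look like the following sketch (hypothetical names: predictions is a DataFrame of (userId, itemId, prediction), k is the cutoff):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank each user's predictions descending, then keep the top k rows per user.
val w = Window.partitionBy("userId").orderBy(col("prediction").desc)
val topK = predictions
  .withColumn("rank", row_number().over(w))
  .where(col("rank") <= k)
  .drop("rank")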


private def checkRecommendationOrdering(topK: DataFrame, k: Int): Unit = {
assert(topK.columns.contains("recommendations"))
topK.select("recommendations").collect().foreach(
Member

Purely a style suggestion but you can get rid of one set of parens:

...foreach { row =>
  ...
}

@MLnick
Contributor

MLnick commented Feb 28, 2017

@jkbradley do we propose to add further methods to support recommending for all users (or items) in an input DF? like recommendForAllUsers(dataset: DataFrame, num: Int)?

@jkbradley
Member

@MLnick Thanks for showing those comparison numbers. If your implementation is faster, then I'm happy going with it. I do wonder if we might hit scalability issues with RDDs which we would not hit with DataFrames, so it'd be worth revisiting a DF-based implementation later on.

In terms of the API, my main worry about #12574 is that I haven't seen a full design of how ALS would be plugged into cross validator. I still don't see how CV could handle ALS unless we specialized it for recommendation. It was this uncertainty which made me comment on https://issues.apache.org/jira/browse/SPARK-13857 to recommend we go ahead and merge basic recommendAll methods, while continuing to figure out a good design for tuning.

Feel free to push back, but I would really like to see a sketch of how ALS could plug into tuning. I haven't spent the time to do a literature review on how tuning is generally done for recommendation, especially on the best ways to split the data into folds.

further methods to support recommending for all users (or items) in an input DF? like recommendForAllUsers(dataset: DataFrame, num: Int)

I do think this sounds useful, but I'm focused on feature parity w.r.t. the RDD-based API right now. It'd be nice to add later, though that could be via your proposed transform-based API.

@MLnick
Contributor

MLnick commented Feb 28, 2017

Fitting into the CV / evaluator is actually fairly straightforward. It's just that the semantics of transform for top-k recommendation must fit into whatever we decide on for RankingEvaluator, so they are closely linked. (In other words, they must be compatible). Once the semantics (basically output schema for transform) are decided it's quite simple.

It was discussed on the JIRA here and here.

I haven't had a chance to refine it yet but I have a view on the best approach now (basically to fit in with the design of SPARK-14409 and in particular the basic version of #16618). I think that design / schema is more "DataFrame-like".

In any event - I'm not against having the convenience methods for recommend-all here. I support it. Ultimately the transform approach is mostly for fitting into Pipelines & cross-validation. transform could call into these convenience methods (though it will need a DataFrame-based input version).

@MLnick
Contributor

MLnick commented Feb 28, 2017

The performance of #12574 is not better than the existing mllib recommend-all - since it wraps the functionality it's roughly on par.

@MLnick
Contributor

MLnick commented Feb 28, 2017

I should note that I've found the performance of "recommend all" to be very dependent on number of partitions since it controls the memory consumption per task (which can easily explode in the blocked mllib version) vs the CPU utilization & amount of shuffle data.

For example, the default mllib results above use 192*192 = 36,864 partitions (due to cross-join). So it does prevent dying due to exploding memory & GC but is slower than using fewer partitions. However, too few partitions and it dies.

I actually just realised that the default for mllib user/item blocks - which in turn controls the partitioning of the factors - is defaultParallelism (192 for my setup), while for ml it is 10. Hence we need to create a like-for-like comparison.

(Side note - it's not ideal actually that the num blocks drives the recommend-all partitions - because the optimal settings for training ALS are unlikely to be optimal for batch recommend-all prediction).

Anyway some results of quick tests on my setup:

Firstly, to match mllib defaults:

mllib with 192*192: 323 sec

scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
Time taken: 323367 ms

ml with 192*192: 427 sec

scala> val newModel = newAls.setNumUserBlocks(192).setNumItemBlocks(192).fit(ratings)
scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 427174 ms

So this PR is 30% slower - which is actually pretty decent given it's not using blocked BLAS operators.

Note I didn't use netlib native BLAS, which could make a large difference when using level 3 BLAS in the blocked mllib version.

Secondly, to match ml defaults:

mllib with 10*10: 1654 sec

scala> val oldModel = OldALS.train(oldRatings, rank, iter, lambda, 10)
scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
Time taken: 1654951 ms

ml with 10*10: 438 sec

scala> val newModel = newAls.fit(ratings)
scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 438328 ms

In this case, the mllib version blows up with memory & GC dominating runtime, and this PR is over 3x faster (though it varies a lot: 600 sec above, 438 sec here, etc).

Finally, middle of the road case:

mllib with 96*96: 175 sec

scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
Time taken: 175880 ms

ml with 96*96: 181 sec

scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 181494 ms

So a few % slower. Again pretty good actually considering it's not a blocked implementation. Still room to be optimized.

After running these I tested against a blocked version using DataFrame (to more or less match the current mllib version) and it's much faster in the 192*192 case, a bit slower in 96*96 case and also blows up in the 10*10 case. Again, really dependent on partitioning...

So the performance here is not too bad. The positive is it should avoid completely exploding. As I mentioned above I tried a similar DataFrame-based version using Window & filter and performance was terrible. It will be interesting to see if native BLAS adds anything.

@MLnick
Contributor

MLnick commented Feb 28, 2017

Finally, I've done some work related to SPARK-11968 and have a potential solution that seems to be pretty good. In this case it should be more than 3x faster than the best runtime here.


* Works on rows of the form (K1, K2, V) where K1 & K2 are IDs and V is the score value. Finds
* the top `num` K2 items based on the given Ordering.
*/
private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
Contributor

I'd think we should have at least some basic tests for this - see MLPairRDDFunctionsSuite for example

val numRecs = 5
val (training, test) = genExplicitTestData(numUsers, numItems, rank = 2, noiseStd = 0.01)
val topItems =
testALS(training, test, maxIter = 4, rank = 2, regParam = 0.01, targetRMSE = 0.03)
Contributor

Seems wasteful to compute and check a model here, and it doesn't really test that the predictions are what we expect them to be.

We can construct an ALSModel with known factors and check the predictions are as expected (it's just a matrix multiply). See MatrixFactorizationModelSuite and tests in #12574 (based on those) for example.

@jkbradley
Member

@MLnick Thanks a lot for the detailed tests! I really appreciate it. In this case, are you OK with the approach in the current PR (pending reviews)?

One thing we should confirm is what you think of the input/output schema for these methods. It'd be nice to have the schema match what you think best for the transform-based recommendAll.

Re: design: I saw the discussions in https://issues.apache.org/jira/browse/SPARK-13857 but I didn't see a clear statement of pros & cons leading to a design decision. I'm sorry about not having had time to get involved with the discussions, but I'd like to help out where useful in the coming months.

@sueann
Contributor Author

sueann commented Feb 28, 2017

The output in #12574 looks like a DataFrame with Row(srcCol: Int, "recommendations": Array[(Int, Float)]) so I think this PR as is matches the output type - @MLnick to confirm. Will make the suggested changes from everyone. Thanks a lot for the performance testing @MLnick! Really neat to see.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73623 has finished for PR 17090 at commit 41a11e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73624 has finished for PR 17090 at commit 4ac586a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73628 has finished for PR 17090 at commit ef93575.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Mar 1, 2017

Isn't deciding on the output schema for these methods essentially the same as deciding on transform semantics in #12574 (apart from the issue of how, or if, to have transform generate the "ground truth" items)?

So in a way I'm not certain this "circumvents" the discussion around transform semantics / fitting into cross-validation & pipelines but rather makes an implicit decision on it. If this is set here, it would be rather awkward to have any future transform based recommend all functionality with a different output schema.

My default choice in #12574 was to match the form of the existing mllib methods. That is the same as here, and matches the format for the existing RankingMetrics in mllib. So from that perspective it is the "easiest" choice.

It's not necessarily the "best" choice - I go into the other option in detail on #12574 and the related JIRA. Ideally it should match up with the expected input schema for a new RankingEvaluator (not strictly necessary but definitely preferable).

My concern here is that we make a quick decision implicitly due to time constraints and are stuck with it down the line as it is exposed in the public API.

@MLnick
Contributor

MLnick commented Mar 1, 2017

It could also be I'm overthinking things - and we can mould the RankingEvaluator to accept both types of input - the array version: (Array(predictions), Array(labels)) or the "exploded" version: (prediction, label).

In which case it doesn't really matter what call we make here, as long as we're consistent with it later.

@jkbradley
Member

It's a good point about making an implicit decision. We could deprecate these methods in favor of transform-based ones in the future---we have done this in the past---but it does push the long-term decision in a clear direction.

My hesitation about transform is not just about the schema. It's also because I'm still unclear how we could plug ALS into tuning without having tuning specifically understand ALS (knowing about users, items, etc.). I'll add my thoughts on the JIRA.

expected: Map[Int, Array[Row]],
dstColName: String): Unit = {
assert(topK.columns.contains("recommendations"))
topK.collect().foreach { row =>
Contributor

It's a little strange to have all the Row stuff in these tests.

You can do topK.as[(Int, Seq[(Int, Float)])].collect.foreach { case (id, recs) => ...

Then adjust expected accordingly
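
A hedged sketch of that shape (assuming spark.implicits._ is in scope in the suite and expected is keyed by user id):

import org.apache.spark.sql.DataFrame

private def checkRecommendations(
    topK: DataFrame,
    expected: Map[Int, Seq[(Int, Float)]]): Unit = {
  // Decode the array-of-structs column directly into tuples.
  topK.as[(Int, Seq[(Int, Float)])].collect().foreach { case (id, recs) =>
    assert(recs === expected(id))
  }
}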


val topKAggregator = new TopByKeyAggregator[Int, Int, Float](k, Ordering.by(_._2))
Seq(
(0, 3, 54f),
Contributor

maybe a good idea to have varying # of values - like maybe one with only 1, etc.

private def checkTopK(
topK: Dataset[(Int, Array[(Int, Float)])],
expected: Map[Int, Array[(Int, Float)]]): Unit = {
topK.collect().foreach { record =>
Contributor

slightly prefer foreach { case (id, recs) => ...

val recs = row.getAs[WrappedArray[Row]]("recommendations")
assert(recs === expected(id))
assert(recs(0).fieldIndex(dstColName) == 0)
assert(recs(0).fieldIndex("rating") == 1)
Contributor Author

Should we have this requirement in the output (vs. the unnamed tuples)? It does cost more to support named fields (we have to call .rdd on the resulting dataframe in the process of imposing a new schema) and if we decide to remove it later it'll break any usage of the named fields.

Contributor Author

Actually, never mind. Either way is committing to an incompatible API, so the named one seems preferable.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73787 has finished for PR 17090 at commit b0680db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73866 has finished for PR 17090 at commit 6a7e3d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM
Any other comments before we merge?

@jkbradley
Member

I'll merge this with master now
Thanks @sueann and @MLnick for feedback. I'll prioritize helping with your work on transform, metrics, and tuning for ALS next.

@asfgit asfgit closed this in 70f9d7f Mar 6, 2017
@MLnick
Contributor

MLnick commented Mar 6, 2017

@jkbradley I've put my updated proposal for ranking evaluation on SPARK-14409 comments.

It's different from this schema (and from the mllib recommend all return types). I've detailed the reasoning in the JIRA.

I would have preferred a more explicit discussion & decision on the schema before merging this, but now that it's merged we should at least consider how (or if) to deal with the schema difference assuming the above proposal proceeds.

@jkbradley
Member

@MLnick OK I think I misunderstood some of your comments above then. I see the proposal in SPARK-14409 differs from this PR, so I agree it'd be nice to resolve it. We can make changes to this PR's schema as long as it happens soon.

Here are the pros of each as I see them:

  1. Nested schema (as in this PR): [user, Array((item, rating))]
  • Easy to work with both the nested & flattened schema (df.select("recommendations.item")) (AFAIK there's no simple way to zip and nest the 2 columns when starting with the flattened schema.)
  2. Flattened schema (as in SPARK-14409): [user, Array(item), Array(rating)]
  • More efficient to store in Row-based formats like Avro

I'm not sure if there's a performance difference in the formats when stored in Tungsten Rows. I think not, but that'd be good to know.
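
For reference, a sketch of producing the flattened view from the nested schema (column names hypothetical; extracting a field from an array of structs yields an array of that field):

import org.apache.spark.sql.functions.col

// Nested [user, Array((item, rating))] -> flattened [user, Array(item), Array(rating)]:
val flattened = recs.select(
  col("userId"),
  col("recommendations.item").as("items"),
  col("recommendations.rating").as("ratings"))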

@MLnick
Contributor

MLnick commented Mar 7, 2017

I commented further on the JIRA.

Sorry if my other comments here and on JIRA were unclear. But the proposed schema for input to RankingEvaluator is:

Schema 1

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+

You will notice that rating and prediction columns can be null. This is by design. There are three cases shown above:

  1. 1st row indicates a (user-item) pair that occurs in both the ground-truth set and the top-k predictions;
  2. 2nd row indicates a (user-item) pair that occurs in the ground-truth set, but not in the top-k predictions;
  3. 3rd row indicates a (user-item) pair that occurs in the top-k predictions, but not in the ground-truth set.

Note for reference, the input to the current mllib RankingMetrics is:

Schema 2

RDD[(true labels array, predicted labels array)],
i.e.
RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])

(So actually neither of the above schemas is easily compatible with the return schema here - but I think it is not really necessary to match the mllib.RankingMetrics format.)

ALS cross-validation

My proposal for fitting ALS into cross-validation is that ALSModel.transform will output a DF of Schema 1 - only when the parameters k and recommendFor are appropriately set, and the input DF contains both user and item columns. In practice, this scenario will occur during cross-validation only.

So what I am saying is that ALS itself (not the evaluator) must know how to return the correct DataFrame output from transform such that it can be used in a cross-validation as input to the RankingEvaluator.

Concretely:

val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)

So while it is complex under the hood - to users it's simply a case of setting 2 params and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make recommendations using these convenience methods (I think it will need a cast):

val recommendations = bestModel.asInstanceOf[ALSModel].recommendForAllUsers(10)

Alternatively, the transform version looks like this:

val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+
val recommendations = bestModel.transform(usersDF)

So the questions:

  1. should we support the above transform-based recommendations? Or only support it for cross-validation purposes as a special case?
  2. if we do, what should the output schema of the above transform version look like? It must certainly match the output of recommendX methods.

The options are:

(1) The schema in this PR:
Pros: as you mention above - also more "compact"
Cons: doesn't match up so closely with the transform "cross-validation" schema above

(2) The schema below. It is basically an "exploded" version of option (1)

+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+

Pros: matches more closely with the cross-validation / evaluator input format. Perhaps slightly more "dataframe-like".
Cons: less compact; loses ordering?; may require more munging to save to external data stores, etc.

Anyway sorry for hijacking this PR discussion - but as I think you can see, the evaluator / ALS transform interplay is a bit subtle and requires some thought to get the right approach.

@jkbradley
Member

Thanks @MLnick for the explanation. This is what I'd understood from your similar description on the JIRA, but definitely more in-depth. (It might be good to copy to JIRA, or even a design doc at this point.)

As I'd said, I haven't done a literature review on this, so it's hard for me to judge what schema evaluators should be given. I see some implicit decisions, such as evaluators using implicit ratings (using rows missing either a label or a prediction) and us not computing predictions for all (user,item) pairs with labels.

However, assuming the schema you've selected is best for evaluation, then I think this highlights 2 distinct needs for top K: (a) a user-friendly API (this PR) and (b) an evaluator-friendly API (your design). For (a), many users have requested recommendForAll* methods matching the RDD-based equivalents, and this schema provides top K recommendations in an analogous and friendly schema. If evaluator needs a less user-friendly schema, that's OK, but then I think it should be considered an internal/dev schema which can differ from the user-friendly version.

What do you think?

@MLnick
Contributor

MLnick commented Mar 9, 2017

Good point for copying some detail to JIRA, will do that. Also I responded on JIRA to your comment above (the part about evaluation) - just to avoid confusing this PR with further stuff.

@MLnick
Contributor

MLnick commented Mar 9, 2017

As for the API - I'm ok with having the "user-facing" version differ from the transform version. Though it may lead to some confusion. In this case, it's probably best to have transform only work for the cross-validation / evaluator setting and to make it clear in future how it all works and fits together, with appropriate doc, user guide and examples.
