
[SPARK-19535][ML] RecommendForAllUsers RecommendForAllItems for ALS on Dataframe #17090

Closed
wants to merge 10 commits

Conversation

sueann
Contributor

@sueann sueann commented Feb 28, 2017

What changes were proposed in this pull request?

This is a simple implementation of RecommendForAllUsers & RecommendForAllItems for the Dataframe version of ALS. It uses Dataframe operations (not a wrapper on the RDD implementation). Haven't benchmarked against a wrapper, but unit test examples do work.
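
For context, a hedged sketch of how the new methods are meant to be called (column and variable names here are illustrative, not from the PR):

import org.apache.spark.ml.recommendation.ALS

val als = new ALS().setUserCol("user").setItemCol("item").setRatingCol("rating")
val model = als.fit(ratings)                          // ratings: DataFrame of (user, item, rating)
val topItemsPerUser = model.recommendForAllUsers(10)  // top 10 items for every user
val topUsersPerItem = model.recommendForAllItems(10)  // top 10 users for every item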

How was this patch tested?

Unit tests

$ build/sbt
> mllib/testOnly *ALSSuite -- -z "recommendFor"
> mllib/testOnly

* the top `num` K2 items based on the given Ordering.
*/

private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
Contributor Author

we may want to put this somewhere more general so it can be reused?

Contributor Author

(It'd need its own unit tests, though not sure if we'll get everything in for 2.2)

extends Aggregator[(K1, K2, V), BoundedPriorityQueue[(K2, V)], Array[(K2, V)]] {

override def zero: BoundedPriorityQueue[(K2, V)] = new BoundedPriorityQueue[(K2, V)](num)(ord)
override def reduce(
Member

I think you need to throw some spaces and braces in here to make it a bit more readable?

Contributor Author

👍
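
Stitching together the diff fragments quoted in this review, the aggregator reads roughly as the self-contained sketch below (reconstructed from the excerpts here, so treat it as indicative rather than the exact merged code):

import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.util.BoundedPriorityQueue

private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
    (num: Int, ord: Ordering[(K2, V)])
  extends Aggregator[(K1, K2, V), BoundedPriorityQueue[(K2, V)], Array[(K2, V)]] {

  // The aggregation buffer is a bounded heap holding at most `num` (K2, V) pairs under `ord`.
  override def zero: BoundedPriorityQueue[(K2, V)] =
    new BoundedPriorityQueue[(K2, V)](num)(ord)

  override def reduce(
      q: BoundedPriorityQueue[(K2, V)],
      a: (K1, K2, V)): BoundedPriorityQueue[(K2, V)] = {
    q += { (a._2, a._3) }
  }

  override def merge(
      q1: BoundedPriorityQueue[(K2, V)],
      q2: BoundedPriorityQueue[(K2, V)]): BoundedPriorityQueue[(K2, V)] = {
    q1 ++= q2
  }

  // Emit the retained pairs best-first.
  override def finish(r: BoundedPriorityQueue[(K2, V)]): Array[(K2, V)] =
    r.toArray.sorted(ord.reverse)

  override def bufferEncoder: Encoder[BoundedPriorityQueue[(K2, V)]] =
    Encoders.kryo[BoundedPriorityQueue[(K2, V)]]

  override def outputEncoder: Encoder[Array[(K2, V)]] = ExpressionEncoder[Array[(K2, V)]]()
}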

@@ -248,18 +248,18 @@ class ALSModel private[ml] (
@Since("1.3.0")
def setPredictionCol(value: String): this.type = set(predictionCol, value)

private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
if (userFeatures != null && itemFeatures != null) {
blas.sdot(rank, userFeatures.toArray, 1, itemFeatures.toArray, 1)
Member

I wonder how the overhead of converting to an array compares with the efficiency of calling sdot -- could be faster to just do the Seqs by hand? is it possible to operate on something besides Seq?

Contributor Author

Good point! But since I copy-pasted this block in this PR, maybe it's okay to try it out in another PR? At least with what we have here we know it's not a regression. Want to make sure we get some version of ALS recommendForAll* in 2.2.
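
A minimal sketch of the two alternatives under discussion (using the netlib-java wrapper mllib already depends on; the hand-rolled variant is hypothetical):

import com.github.fommil.netlib.BLAS.{getInstance => blas}

// As in the PR: copy each Seq to an Array, then call level-1 BLAS.
def dotViaBlas(rank: Int, u: Seq[Float], v: Seq[Float]): Float =
  blas.sdot(rank, u.toArray, 1, v.toArray, 1)

// Hand-rolled alternative: avoids the copies, but indexed access is only
// O(1) when the Seq is array-backed (as the WrappedArrays a UDF receives are).
def dotByHand(rank: Int, u: Seq[Float], v: Seq[Float]): Float = {
  var acc = 0f
  var i = 0
  while (i < rank) {
    acc += u(i) * v(i)
    i += 1
  }
  acc
}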

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73540 has started for PR 17090 at commit 707bc6b.

Member

@jkbradley jkbradley left a comment

Looks good, just small comments.
Could you please make a JIRA for the sdot vs toArray issue which Sean brought up?

@@ -285,6 +285,43 @@ class ALSModel private[ml] (

@Since("1.6.0")
override def write: MLWriter = new ALSModel.ALSModelWriter(this)

@Since("2.2.0")
def recommendForAllUsers(num: Int): DataFrame = {
Member

Maybe rename to numItems (and numUsers in the other method)

@@ -248,18 +248,18 @@ class ALSModel private[ml] (
@Since("1.3.0")
def setPredictionCol(value: String): this.type = set(predictionCol, value)

private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
Member

You could rename userFeatures, itemFeatures to be featuresA, featuresB or something to make it clear that there is no ordering here.

Contributor Author

I think it's actually clearer to keep the names as it "instantiates" the usage to the reader.

Member

Fair enough, though the inputs are switched in some uses

Contributor

They are switched in the case of recommendForAllItems so is that not less clear?

* @param srcOutputColumn name of the column for the source in the output DataFrame
* @param num number of recommendations for each record
* @return a DataFrame of (srcOutputColumn: Int, recommendations), where recommendations are
* stored as an array of (dstId: Int, ratingL: Double) tuples.
Member

ratingL -> rating

predict(srcFactors("features"), dstFactors("features")).as($(predictionCol)))
// We'll force the IDs to be Int. Unfortunately this converts IDs to Int in the output.
val topKAggregator = new TopByKeyAggregator[Int, Int, Float](num, Ordering.by(_._2))
ratings.as[(Int, Int, Float)].groupByKey(_._1).agg(topKAggregator.toColumn)
Member

It'd be nice to specify field names for dstId and rating and to document the schema in the recommend methods. That will help users extract recommendations.

Contributor Author

I'm not sure what a good way to do this is :-/ Ways I can think of but haven't succeeded with:
1/ change the schema of the entire DataFrame
2/ map over the rows in the DataFrame {
     map over the items in the array {
       convert from tuple (really a Row) to a Row with a different schema
     }
   }

I tried using RowEncoder in either case, but the types haven't quite worked out. Any ideas?
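
One approach that may sidestep RowEncoder entirely (a sketch, assuming recs is the aggregated DataFrame and srcOutputColumn / dstOutputColumn hold the desired names): Spark permits renaming nested struct fields by casting the array-of-structs column to an ArrayType with the target field names.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, FloatType, IntegerType, StructType}

// The cast renames the struct fields positionally: _1 -> dstOutputColumn, _2 -> "rating".
val arrayType = ArrayType(
  new StructType()
    .add(dstOutputColumn, IntegerType)
    .add("rating", FloatType))

recs.toDF("id", "recommendations")
  .select(col("id").as(srcOutputColumn), col("recommendations").cast(arrayType))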

r.toArray.sorted(ord.reverse)
override def bufferEncoder: Encoder[BoundedPriorityQueue[(K2, V)]] =
Encoders.kryo[BoundedPriorityQueue[(K2, V)]]
override def outputEncoder: Encoder[Array[(K2, V)]] = ExpressionEncoder[Array[(K2, V)]]
Member

IntelliJ style complaint: include "()" at end


/**
* Returns top `num` items recommended for each user, for all users.
* @param num number of recommendations for each user
Member

number -> max number

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73543 has finished for PR 17090 at commit 832b066.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73553 has finished for PR 17090 at commit ebd2604.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented Feb 28, 2017

cc @MLnick

@hhbyyh
Contributor

hhbyyh commented Feb 28, 2017

the same as #12574?

@jkbradley
Member

@hhbyyh This is different from #12574 since it sidesteps the ongoing design discussions about input and output schema. Eventually, I'd like us to proceed with #12574 but only after we've figured out the right workflow for using these methods in a Pipeline with tuning.

@MLnick
Contributor

MLnick commented Feb 28, 2017

#12574 is a comprehensive solution that also intends to support cross-validation as well as recommending for a subset (or any arbitrary set) of users/items. So it solves SPARK-13857 and SPARK-10802 at the same time.

That PR is in a fully working state. I'm a little surprised to see work done on this rather than deciding on the input/output schema stuff...

@MLnick
Contributor

MLnick commented Feb 28, 2017

By the way I have been doing performance testing in support of #12574 and results are pretty much ready.

@jkbradley
Member

jkbradley commented Feb 28, 2017

I'd been following the long discussions about a transform-based solution, but those did not seem to have converged on a clear design. If you feel they have in your PR, then I'll spend some time tomorrow going through your PR to catch up. (Keep in mind that the branch cut is scheduled to happen in a day or two: http://spark.apache.org/versioning-policy.html )

@MLnick
Contributor

MLnick commented Feb 28, 2017

For performance tests, I've been using the MovieLens ml-latest dataset here. It has 24,404,096 ratings with 259,137 users and 39,443 movies.

So it's not enormous, but "recommend all" does a lot of work - generating 1,631,206,099 raw predicted ratings before the top-k.

Some quick tests of the existing recommendProductsForUsers give 306 sec:

scala> spark.time { oldModel.recommendProductsForUsers(k).count }
Time taken: 306512 ms
res11: Long = 259137

As part of my performance testing I've tried a few approaches roughly similar to this PR, but using Window and filter rather than this top-k aggregator (which is a neat idea).

At first I thought this PR was really good:

scala> spark.time { newModel.recommendForAllUsers(k).count }
Time taken: 151504 ms
res3: Long = 259137

151 sec seems fast!

But then I tried this:

scala> spark.time { newModel.recommendForAllUsers(k).show }
+------+--------------------+
|userId|     recommendations|
+------+--------------------+
| 35982|[[131382,15.53116...|
| 67782|[[131382,29.72169...|
| 82672|[[132954,12.19152...|
|155042|[[148954,16.09084...|
|167532|[[118942,13.94282...|
|168802|[[27212,11.881494...|
|216112|[[109159,25.46359...|
|243392|[[153010,9.85302]...|
|255132|[[131382,15.50626...|
|255362|[[131382,10.08476...|
| 17389|[[152711,16.09958...|
|120899|[[156956,12.61003...|
|213089|[[82055,13.293286...|
|253769|[[152711,16.57459...|
|258129|[[152711,22.50499...|
| 24347|[[152711,12.31282...|
| 35947|[[153184,11.04110...|
|103357|[[132954,13.26898...|
|130557|[[118942,14.00168...|
|156017|[[153010,12.24449...|
+------+--------------------+
only showing top 20 rows

Time taken: 672524 ms

672 sec, over 2x slower than the mllib impl.

Not sure why count is fast relative to show (maybe Spark SQL is not doing all the actual compute, while for show it does need to?).
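
For reference, the Window-and-filter alternative mentioned above might look like the following sketch (hypothetical names: predictions is a DataFrame of (userId, itemId, prediction), k is the cutoff):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank each user's predictions descending, then keep the top k rows per user.
val w = Window.partitionBy("userId").orderBy(col("prediction").desc)
val topK = predictions
  .withColumn("rank", row_number().over(w))
  .where(col("rank") <= k)
  .drop("rank")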


private def checkRecommendationOrdering(topK: DataFrame, k: Int): Unit = {
assert(topK.columns.contains("recommendations"))
topK.select("recommendations").collect().foreach(
Member

Purely a style suggestion but you can get rid of one set of parens:

...foreach { row =>
  ...
}

@MLnick
Contributor

MLnick commented Feb 28, 2017

@jkbradley do we propose to add further methods to support recommending for all users (or items) in an input DF? like recommendForAllUsers(dataset: DataFrame, num: Int)?

@jkbradley
Member

@MLnick Thanks for showing those comparison numbers. If your implementation is faster, then I'm happy going with it. I do wonder if we might hit scalability issues with RDDs which we would not hit with DataFrames, so it'd be worth revisiting a DF-based implementation later on.

In terms of the API, my main worry about #12574 is that I haven't seen a full design of how ALS would be plugged into cross validator. I still don't see how CV could handle ALS unless we specialized it for recommendation. It was this uncertainty which made me comment on https://issues.apache.org/jira/browse/SPARK-13857 to recommend we go ahead and merge basic recommendAll methods, while continuing to figure out a good design for tuning.

Feel free to push back, but I would really like to see a sketch of how ALS could plug into tuning. I haven't spent the time to do a literature review on how tuning is generally done for recommendation, especially on the best ways to split the data into folds.

further methods to support recommending for all users (or items) in an input DF? like recommendForAllUsers(dataset: DataFrame, num: Int)

I do think this sounds useful, but I'm focused on feature parity w.r.t. the RDD-based API right now. It'd be nice to add later, though that could be via your proposed transform-based API.

@MLnick
Contributor

MLnick commented Feb 28, 2017

Fitting into the CV / evaluator is actually fairly straightforward. It's just that the semantics of transform for top-k recommendation must fit into whatever we decide on for RankingEvaluator, so they are closely linked. (In other words, they must be compatible). Once the semantics (basically output schema for transform) are decided it's quite simple.

It was discussed on the JIRA here and here.

I haven't had a chance to refine it yet but I have a view on the best approach now (basically to fit in with the design of SPARK-14409 and in particular the basic version of #16618). I think that design / schema is more "DataFrame-like".

In any event - I'm not against having the convenience methods for recommend-all here. I support it. Ultimately the transform approach is mostly for fitting into Pipelines & cross-validation. transform could call into these convenience methods (though it will need a DataFrame-based input version).

@MLnick
Contributor

MLnick commented Feb 28, 2017

The performance of #12574 is not better than the existing mllib recommend-all - since it wraps the functionality it's roughly on par.

@MLnick
Contributor

MLnick commented Feb 28, 2017

I should note that I've found the performance of "recommend all" to be very dependent on number of partitions since it controls the memory consumption per task (which can easily explode in the blocked mllib version) vs the CPU utilization & amount of shuffle data.

For example, the default mllib results above use 192*192 = 36,864 partitions (due to cross-join). So it does prevent dying due to exploding memory & GC but is slower than using fewer partitions. However, too few partitions and it dies.

I actually just realised that the default for mllib user/item blocks - which in turn controls the partitioning of the factors - is defaultParallelism (192 for my setup), while for ml it is 10. Hence we need to create a like-for-like comparison.

(Side note - it's not ideal actually that the num blocks drives the recommend-all partitions - because the optimal settings for training ALS are unlikely to be optimal for batch recommend-all prediction).

Anyway some results of quick tests on my setup:

Firstly, to match mllib defaults:

mllib with 192*192: 323 sec

scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
Time taken: 323367 ms

ml with 192*192: 427 sec

scala> val newModel = newAls.setNumUserBlocks(192).setNumItemBlocks(192).fit(ratings)
scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 427174 ms

So this PR is 30% slower - which is actually pretty decent given it's not using blocked BLAS operators.

Note I didn't use netlib native BLAS, which could make a large difference when using level 3 BLAS in the blocked mllib version.

Secondly, to match ml defaults:

mllib with 10*10: 1654 sec

scala> val oldModel = OldALS.train(oldRatings, rank, iter, lambda, 10)
scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
Time taken: 1654951 ms

ml with 10*10: 438 sec

scala> val newModel = newAls.fit(ratings)
scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 438328 ms

In this case, the mllib version blows up with memory & GC dominating runtime, and this PR is over 3x faster (though it varies a lot: 600 sec above, 438 sec here, etc).

Finally, middle of the road case:

mllib with 96*96: 175 sec

scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
Time taken: 175880 ms

ml with 96*96: 181 sec

scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 181494 ms

So a few % slower. Again pretty good actually considering it's not a blocked implementation. Still room to be optimized.

After running these I tested against a blocked version using DataFrame (to more or less match the current mllib version) and it's much faster in the 192*192 case, a bit slower in 96*96 case and also blows up in the 10*10 case. Again, really dependent on partitioning...

So the performance here is not too bad. The positive is it should avoid completely exploding. As I mentioned above I tried a similar DataFrame-based version using Window & filter and performance was terrible. It will be interesting to see if native BLAS adds anything.

@MLnick
Contributor

MLnick commented Feb 28, 2017

Finally, I've done some work related to SPARK-11968 and have a potential solution that seems to be pretty good. In this case it should be more than 3x faster than the best runtime here.


* Works on rows of the form (K1, K2, V) where K1 & K2 are IDs and V is the score value. Finds
* the top `num` K2 items based on the given Ordering.
*/
private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
Contributor

I'd think we should have at least some basic tests for this - see MLPairRDDFunctionsSuite for example

val numRecs = 5
val (training, test) = genExplicitTestData(numUsers, numItems, rank = 2, noiseStd = 0.01)
val topItems =
testALS(training, test, maxIter = 4, rank = 2, regParam = 0.01, targetRMSE = 0.03)
Contributor

Seems wasteful to compute and check a model here, and it doesn't really test that the predictions are what we expect them to be.

We can construct an ALSModel with known factors and check the predictions are as expected (it's just a matrix multiply). See MatrixFactorizationModelSuite and tests in #12574 (based on those) for example.

@jkbradley
Member

@MLnick Thanks a lot for the detailed tests! I really appreciate it. In this case, are you OK with the approach in the current PR (pending reviews)?

One thing we should confirm is what you think of the input/output schema for these methods. It'd be nice to have the schema match what you think best for the transform-based recommendAll.

Re: design: I saw the discussions in https://issues.apache.org/jira/browse/SPARK-13857 but I didn't see a clear statement of pros & cons leading to a design decision. I'm sorry about not having had time to get involved with the discussions, but I'd like to help out where useful in the coming months.

@sueann
Contributor Author

sueann commented Feb 28, 2017

The output in #12574 looks like a DataFrame with Row(srcCol: Int, "recommendations": Array[(Int, Float)]) so I think this PR as is matches the output type - @MLnick to confirm. Will make the suggested changes from everyone. Thanks a lot for the performance testing @MLnick! Really neat to see.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73623 has finished for PR 17090 at commit 41a11e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73624 has finished for PR 17090 at commit 4ac586a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73628 has finished for PR 17090 at commit ef93575.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Mar 1, 2017

Isn't deciding on the output schema for these methods essentially the same as deciding on transform semantics in #12574 (apart from the issue of how, or if, to have transform generate the "ground truth" items)?

So in a way I'm not certain this "circumvents" the discussion around transform semantics / fitting into cross-validation & pipelines but rather makes an implicit decision on it. If this is set here, it would be rather awkward to have any future transform based recommend all functionality with a different output schema.

My default choice in #12574 was to match the form of the existing mllib methods. That is the same as here, and matches the format for the existing RankingMetrics in mllib. So from that perspective it is the "easiest" choice.

It's not necessarily the "best" choice - I go into the other option in detail on #12574 and the related JIRA. Ideally it should match up with the expected input schema for a new RankingEvaluator (not strictly necessary but definitely preferable).

My concern here is that we make a quick decision implicitly due to time constraints and are stuck with it down the line as it is exposed in the public API.

@MLnick
Contributor

MLnick commented Mar 1, 2017

It could also be I'm overthinking things - and we can mould the RankingEvaluator to accept both types of input - the array version: (Array(predictions), Array(labels)) or the "exploded" version: (prediction, label).

In which case it doesn't really matter what call we make here, as long as we're consistent with it later.

@jkbradley
Member

It's a good point about making an implicit decision. We could deprecate these methods in favor of transform-based ones in the future---we have done this in the past---but it does push the long-term decision in a clear direction.

My hesitation about transform is not just about the schema. It's also because I'm still unclear how we could plug ALS into tuning without having tuning specifically understand ALS (knowing about users, items, etc.). I'll add my thoughts on the JIRA.

expected: Map[Int, Array[Row]],
dstColName: String): Unit = {
assert(topK.columns.contains("recommendations"))
topK.collect().foreach { row =>
Contributor

It's a little strange to have all the Row stuff in these tests.

You can do topK.as[(Int, Seq[(Int, Float)])].collect.foreach { case (id, recs) => ...

Then adjust expected accordingly
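
A hedged sketch of that shape (assuming spark.implicits._ is in scope in the suite and expected is keyed by user id):

import org.apache.spark.sql.DataFrame

private def checkRecommendations(
    topK: DataFrame,
    expected: Map[Int, Seq[(Int, Float)]]): Unit = {
  // Decode the array-of-structs column directly into tuples.
  topK.as[(Int, Seq[(Int, Float)])].collect().foreach { case (id, recs) =>
    assert(recs === expected(id))
  }
}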


val topKAggregator = new TopByKeyAggregator[Int, Int, Float](k, Ordering.by(_._2))
Seq(
(0, 3, 54f),
Contributor

maybe a good idea to have varying # of values - like maybe one with only 1, etc.

private def checkTopK(
topK: Dataset[(Int, Array[(Int, Float)])],
expected: Map[Int, Array[(Int, Float)]]): Unit = {
topK.collect().foreach { record =>
Contributor

slightly prefer foreach { case (id, recs) => ...

val recs = row.getAs[WrappedArray[Row]]("recommendations")
assert(recs === expected(id))
assert(recs(0).fieldIndex(dstColName) == 0)
assert(recs(0).fieldIndex("rating") == 1)
Contributor Author

Should we have this requirement in the output (vs. the unnamed tuples)? It does cost more to support named fields (we have to call .rdd on the resulting dataframe in the process of imposing a new schema) and if we decide to remove it later it'll break any usage of the named fields.

Contributor Author

Actually, never mind. Either way is committing to an incompatible API, so the named one seems preferable.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73787 has finished for PR 17090 at commit b0680db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73866 has finished for PR 17090 at commit 6a7e3d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM
Any other comments before we merge?

@jkbradley
Member

I'll merge this with master now
Thanks @sueann and @MLnick for feedback. I'll prioritize helping with your work on transform, metrics, and tuning for ALS next.

@asfgit asfgit closed this in 70f9d7f Mar 6, 2017
@MLnick
Contributor

MLnick commented Mar 6, 2017

@jkbradley I've put my updated proposal for ranking evaluation on SPARK-14409 comments.

It's different from this schema (and from the mllib recommend all return types). I've detailed the reasoning in the JIRA.

I would have preferred a more explicit discussion & decision on the schema before merging this, but now that it's merged we should at least consider how (or if) to deal with the schema difference assuming the above proposal proceeds.

@jkbradley
Member

@MLnick OK I think I misunderstood some of your comments above then. I see the proposal in SPARK-14409 differs from this PR, so I agree it'd be nice to resolve it. We can make changes to this PR's schema as long as it happens soon.

Here are the pros of each as I see them:

  1. Nested schema (as in this PR): [user, Array((item, rating))]
  • Easy to work with both the nested & flattened schema (df.select("recommendations.item")) (AFAIK there's no simple way to zip and nest the 2 columns when starting with the flattened schema.)
  2. Flattened schema (as in SPARK-14409): [user, Array(item), Array(rating)]
  • More efficient to store in Row-based formats like Avro

I'm not sure if there's a performance difference in the formats when stored in Tungsten Rows. I think not, but that'd be good to know.
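
For reference, a sketch of producing the flattened view from the nested schema (column names hypothetical; extracting a field from an array of structs yields an array of that field):

import org.apache.spark.sql.functions.col

// Nested [user, Array((item, rating))] -> flattened [user, Array(item), Array(rating)]:
val flattened = recs.select(
  col("userId"),
  col("recommendations.item").as("items"),
  col("recommendations.rating").as("ratings"))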

@MLnick
Contributor

MLnick commented Mar 7, 2017

I commented further on the JIRA.

Sorry if my other comments here and on JIRA were unclear. But the proposed schema for input to RankingEvaluator is:

Schema 1

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+

You will notice that rating and prediction columns can be null. This is by design. There are three cases shown above:

  1. 1st row indicates a (user-item) pair that occurs in both the ground-truth set and the top-k predictions;
  2. 2nd row indicates a (user-item) pair that occurs in the ground-truth set, but not in the top-k predictions;
  3. 3rd row indicates a (user-item) pair that occurs in the top-k predictions, but not in the ground-truth set.

Note for reference, the input to the current mllib RankingMetrics is:

Schema 2

RDD[(true labels array, predicted labels array)],
i.e.
RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])

(So actually neither of the above schemas is easily compatible with the return schema here - but I think it is not really necessary to match the mllib.RankingMetrics format.)

ALS cross-validation

My proposal for fitting ALS into cross-validation is that ALSModel.transform will output a DF of Schema 1 - only when the parameters k and recommendFor are appropriately set, and the input DF contains both user and item columns. In practice, this scenario will occur during cross-validation only.

So what I am saying is that ALS itself (not the evaluator) must know how to return the correct DataFrame output from transform such that it can be used in a cross-validation as input to the RankingEvaluator.

Concretely:

val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)

So while it is complex under the hood - to users it's simply a case of setting 2 params and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make recommendations using these convenience methods (I think it will need a cast):

val recommendations = bestModel.asInstanceOf[ALSModel].recommendForAllUsers(10)

Alternatively, the transform version looks like this:

val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+
val recommendations = bestModel.transform(usersDF)

So the questions:

  1. should we support the above transform-based recommendations? Or only support it for cross-validation purposes as a special case?
  2. if we do, what should the output schema of the above transform version look like? It must certainly match the output of recommendX methods.

The options are:

(1) The schema in this PR:
Pros: as you mention above - also more "compact"
Cons: doesn't match up so closely with the transform "cross-validation" schema above

(2) The schema below. It is basically an "exploded" version of option (1)

+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+

Pros: matches more closely with the cross-validation / evaluator input format. Perhaps slightly more "dataframe-like".
Cons: less compact; loses ordering?; may require more munging to save to external data stores, etc.

Anyway sorry for hijacking this PR discussion - but as I think you can see, the evaluator / ALS transform interplay is a bit subtle and requires some thought to get the right approach.

@jkbradley
Member

Thanks @MLnick for the explanation. This is what I'd understood from your similar description on the JIRA, but definitely more in-depth. (It might be good to copy to JIRA, or even a design doc at this point.)

As I'd said, I haven't done a literature review on this, so it's hard for me to judge what schema evaluators should be given. I see some implicit decisions, such as evaluators using implicit ratings (using rows missing either a label or a prediction) and us not computing predictions for all (user,item) pairs with labels.

However, assuming the schema you've selected is best for evaluation, then I think this highlights 2 distinct needs for top K: (a) a user-friendly API (this PR) and (b) an evaluator-friendly API (your design). For (a), many users have requested recommendForAll* methods matching the RDD-based equivalents, and this schema provides top K recommendations in an analogous and friendly schema. If evaluator needs a less user-friendly schema, that's OK, but then I think it should be considered an internal/dev schema which can differ from the user-friendly version.

What do you think?

@MLnick
Contributor

MLnick commented Mar 9, 2017

Good point for copying some detail to JIRA, will do that. Also I responded on JIRA to your comment above (the part about evaluation) - just to avoid confusing this PR with further stuff.

@MLnick
Contributor

MLnick commented Mar 9, 2017

As for the API - I'm ok with having the "user-facing" version differ from the transform version. Though it may lead to some confusion. In this case, it's probably best to have transform only work for the cross-validation / evaluator setting and to make it clear in future how it all works and fits together, with appropriate doc, user guide and examples.
