Use Spark job grouping to distinguish steps of the machine learning flow #467

nicodv · 2020-03-18T22:03:59Z

Related issues
N/A

Describe the proposed solution
Leverages Spark's ability to set a "job group" ID to distinguish the steps of the machine learning flow. Examples: data IO, model IO, feature engineering, cross-validation.

OpSparkListener is extended to capture which job group is currently active. Also, the new OpStep enum's entry names automatically show up in the Spark UI so that the function of stages can be more easily interpreted.

To this end:

An OpStep enum is introduced
A utility is introduced to set the job group for a code block, or indefinitely
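
As a rough illustration of the two items above, here is a minimal sketch built on `SparkContext.setJobGroup` and `SparkContext.clearJobGroup`. The names `Step`, `JobGroupSketch`, `setJobGroup` and `withJobGroup`, as well as the step list and descriptions, are assumptions made for this example; the OpStep enum and JobGroupUtil added by this PR may look different.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical steps, named after the examples in the description above.
sealed abstract class Step(val description: String)
object Step {
  case object DataIO extends Step("Reading or writing data")
  case object ModelIO extends Step("Loading or saving a model")
  case object FeatureEngineering extends Step("Feature engineering")
  case object CrossValidation extends Step("Cross-validation")
}

// Hypothetical utility, not the PR's JobGroupUtil.
object JobGroupSketch {
  // Tag all Spark jobs started from the current thread until the group is
  // changed or cleared ("indefinitely").
  def setJobGroup(step: Step)(implicit spark: SparkSession): Unit =
    spark.sparkContext.setJobGroup(step.toString, step.description)

  // Tag only the jobs triggered inside `block`, then clear the group again,
  // even if `block` throws.
  def withJobGroup[R](step: Step)(block: => R)(implicit spark: SparkSession): R = {
    setJobGroup(step)
    try block
    finally spark.sparkContext.clearJobGroup()
  }
}
```

With an implicit `SparkSession` in scope, a call such as `JobGroupSketch.withJobGroup(Step.FeatureEngineering) { spark.range(100).count() }` would run the count under the "FeatureEngineering" group, and the group ID and description should then appear next to the job in the Spark UI.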

Describe alternatives you've considered
Because a main goal is to get the current step into the real-time SparkListener framework, the latter's ability to get hold of the Spark job group was an easy way to accomplish this.
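
To make the listener side concrete, the following is a minimal sketch of a listener that records the job group whenever a job starts. It relies on `SparkContext.setJobGroup` storing the group ID under the `spark.jobGroup.id` key, which Spark passes along in `SparkListenerJobStart.properties`. The class name `JobGroupTrackingListener` is an assumption for this example; it is not the PR's OpSparkListener.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Hypothetical listener: remembers the job group of the most recently started job.
class JobGroupTrackingListener extends SparkListener {
  @volatile private var currentJobGroup: Option[String] = None

  // The job group of the last job that started, if it had one.
  def jobGroup: Option[String] = currentJobGroup

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // `properties` may be null, and the key is absent when no group is set.
    currentJobGroup = Option(jobStart.properties)
      .flatMap(props => Option(props.getProperty("spark.jobGroup.id")))
  }
}
```

Such a listener could be registered with `spark.sparkContext.addSparkListener(new JobGroupTrackingListener)`.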
Considered but not feasible:

Spark listeners per step by adding/removing Spark listeners to/from the listener bus. More complex than using the job grouping, without added benefit.
Spark's local properties, but it seems they are not shared with Spark listeners

Also considered was the addition of other steps, such as "sanity checker", "scoring" or "metrics". However, these are not included here, as this would:

Need to be done in places inconsistent with other steps (most of the grouping is contained in OpWorkflow/OpWorkflowModel)
Lead to additional overhead/complexity, e.g. because scoring and metrics calculation are part of a single Spark stage and therefore cannot be split up by Spark job group.

Additional context
The extension to OpSparkListener allows for more advanced handling of the metrics that are collected by it, e.g. the metrics can be grouped by OpStep.
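
As an example of such handling, here is a small sketch that sums executor run time per step. The `StageMetrics` record and the `MetricsByStep` object are assumptions made for illustration and do not reflect the actual metric types collected by OpSparkListener; stages without a job group are filed under "Other".

```scala
// Hypothetical record of what a listener might collect per completed stage.
final case class StageMetrics(
  jobGroup: Option[String],
  stageName: String,
  executorRunTimeMs: Long
)

object MetricsByStep {
  // Sum executor run time per job group; stages without a group go under "Other".
  def runTimeByStep(metrics: Seq[StageMetrics]): Map[String, Long] =
    metrics
      .groupBy(_.jobGroup.getOrElse("Other"))
      .map { case (step, stages) => step -> stages.map(_.executorRunTimeMs).sum }
}
```

The same grouping key could be used for other per-stage metrics, for instance to report how much of a run was spent in cross-validation versus data IO.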

codecov · 2020-03-18T22:23:10Z

Codecov Report

Merging #467 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #467      +/-   ##
=========================================
+ Coverage   86.99%     87%   +0.01%     
=========================================
  Files         344     345       +1     
  Lines       11576   11575       -1     
  Branches      370     593     +223     
=========================================
+ Hits        10070   10071       +1     
+ Misses       1506    1504       -2

Impacted Files	Coverage Δ
...main/scala/com/salesforce/op/OpWorkflowModel.scala	`93.9% <100%> (-0.15%)`	⬇️
.../src/main/scala/com/salesforce/op/OpWorkflow.scala	`88.11% <100%> (-0.85%)`	⬇️
...sforce/op/stages/impl/selector/ModelSelector.scala	`98.36% <100%> (+0.17%)`	⬆️
...a/com/salesforce/op/utils/spark/JobGroupUtil.scala	`100% <100%> (ø)`
.../main/scala/com/salesforce/op/OpWorkflowCore.scala	`95.45% <100%> (ø)`	⬆️
...om/salesforce/op/utils/spark/OpSparkListener.scala	`98.63% <100%> (+0.02%)`	⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala	`89.79% <0%> (+4.08%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8087834...818ce67.

tovbinm

Great addition!!

core/src/main/scala/com/salesforce/op/utils/spark/JobGroupUtil.scala

core/src/main/scala/com/salesforce/op/stages/impl/selector/ModelSelector.scala

core/src/main/scala/com/salesforce/op/OpWorkflowCore.scala

core/src/test/scala/com/salesforce/op/utils/spark/JobGroupUtilTest.scala

utils/src/main/scala/com/salesforce/op/utils/spark/OpSparkListener.scala

gerashegalov

LGTM

nicodv added 27 commits February 24, 2020 16:44

WIP adding a job group util

f2fad1c

complete JobGroupUtilTest

b0d89bc

introduce enum for job groups

babeaa3

add job group for reading/filtering

5fde259

add OpStep enum

5d90bf1

add scoring step

c02da84

cleanup

6c7a6c7

add some job groups to OpWorkflowModel

99ea18b

set job groups for feature engineering, sanity checker, cross-validation

3df56f3

catch job group on job start; print job group with stage metrics

8c3c388

fix test, rename test, docs

a265437

make jobgroup protected

03f2355

add tagging to .computeDataUpTo

79ad0d7

always log job groups too

9b3918a

refactor

673c3ca

move job group logic out of closure

483e62e

set OpStep.FeatureEngineering after data reading/filtering

8e1b434

add OpStep.Scoring for streaming scoring too

93821d6

remove superfluous withJobGroup

7c82091

tag saving model at lower level

39ac685

prune steps a bit

79bcdad

re-add removed variable

e01aaa8

remove superfluous job group set

da02414

set scoring job grouping in better location

9cd9598

remove problematic score job group

cd00b0d

move feature engineering job group to OpWorkflow, consistent with other job groups

d7f02f4

cleanup

e6bb4ae

nicodv added enhancement ready for review labels Mar 18, 2020

nicodv requested a review from gerashegalov as a code owner March 18, 2020 22:04

nicodv requested review from Jauntbox, leahmcguire, tovbinm and wsuchy as code owners March 18, 2020 22:04

Merge branch 'master' into ndv/jobgroups

65d371f

tovbinm reviewed Mar 19, 2020

nicodv added 5 commits March 18, 2020 21:44

add "other" to enum, moving OpStep to utils in the process; rename SavingScores

7bf0111

use Spec

27ee0c8

docs

9bb9bd0

Merge remote-tracking branch 'origin/ndv/jobgroups' into ndv/jobgroups

1b00c0b

fix test

818ce67

gerashegalov approved these changes Mar 19, 2020

nicodv merged commit ed4abfd into master Mar 24, 2020

nicodv deleted the ndv/jobgroups branch March 24, 2020 17:30

nicodv mentioned this pull request Mar 26, 2020

Avoid having to have an implicit SparkSession in OpWorkFlow(Model) #468

Merged

nicodv mentioned this pull request Jun 11, 2020

0.7.0 release #481

Merged
