Use Spark job grouping to distinguish steps of the machine learning flow #467

Merged
merged 33 commits into from
Mar 24, 2020

nicodv
Contributor

@nicodv nicodv commented Mar 18, 2020

Related issues
N/A

Describe the proposed solution
Leverages Spark's ability to set a "job group" ID to distinguish steps of the machine learning flow, for example data IO, model IO, feature engineering, and cross-validation.

OpSparkListener is extended to capture which job group is currently active. Also, the new OpStep enum's entry names automatically show up in the Spark UI so that the function of stages can be more easily interpreted.

To this end:

  • An OpStep enum is introduced
  • A utility is introduced to set the job group for a code block, or indefinitely
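
A minimal sketch of what these two pieces could look like, for illustration only. The `JobGroupUtil` name comes from the file list in this PR, but the entry names of `OpStep` and the method names `withJobGroup` and `setJobGroup` are assumptions and may differ from the merged code. Only `SparkContext.setJobGroup` and `SparkContext.clearJobGroup` are taken from Spark's public API.

```scala
import org.apache.spark.SparkContext

// Hypothetical enum of workflow steps. The entries mirror the examples named
// in the description above; the actual entries in the PR may differ.
sealed abstract class OpStep(val entryName: String, val description: String)

object OpStep {
  case object DataIO extends OpStep("DataIO", "Data IO")
  case object ModelIO extends OpStep("ModelIO", "Model IO")
  case object FeatureEngineering extends OpStep("FeatureEngineering", "Feature engineering")
  case object CrossValidation extends OpStep("CrossValidation", "Cross-validation")

  val values: Seq[OpStep] = Seq(DataIO, ModelIO, FeatureEngineering, CrossValidation)
}

// Assumed shape of the utility: method names are illustrative.
object JobGroupUtil {

  // Tags all Spark jobs started inside `block` with the step's name,
  // then clears the job group again, even if `block` throws.
  def withJobGroup[R](step: OpStep)(block: => R)(implicit sc: SparkContext): R = {
    sc.setJobGroup(step.entryName, step.description)
    try block finally sc.clearJobGroup()
  }

  // Tags all subsequent Spark jobs on this thread with the step's name,
  // until another group is set or the group is cleared.
  def setJobGroup(step: OpStep)(implicit sc: SparkContext): Unit =
    sc.setJobGroup(step.entryName, step.description)
}
```

With an implicit `SparkContext` in scope, a call such as `JobGroupUtil.withJobGroup(OpStep.DataIO) { reader.read() }` (where `reader` is a placeholder) would make the jobs triggered by the block show up under the "DataIO" group in the Spark UI.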

Describe alternatives you've considered
Because a main goal is to expose the current step to the real-time SparkListener framework, relying on a listener's ability to read the Spark job group was an easy way to accomplish this.
Considered but not feasible:

  • Spark listeners per step by adding/removing Spark listeners to/from the listener bus. More complex than using the job grouping, without added benefit.
  • Spark's local properties, but it seems they are not shared with Spark listeners

Also considered was the addition of other steps, such as "sanity checker", "scoring" or "metrics". However, these are not included here as this would:

  • Need to be done in places inconsistent with the other steps (most of the grouping is confined to OpWorkflow/OpWorkflowModel)
  • Lead to additional overhead/complexity, e.g. because scoring and metrics calculation are part of a single Spark stage and therefore cannot be split up by Spark job group.

Additional context
The extension to OpSparkListener allows for more advanced handling of the metrics that are collected by it, e.g. the metrics can be grouped by OpStep.
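
As a rough illustration of that idea, the sketch below shows a standalone listener that reads the job group from each job's properties and sums executor run time per group. It is not the actual `OpSparkListener` code: the class name `JobGroupListener`, the `"Other"` fallback label, and the choice of metric are assumptions. The property key `spark.jobGroup.id` is the one Spark uses when `setJobGroup` is called.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}
import scala.collection.mutable

// Hypothetical listener, for illustration only.
class JobGroupListener extends SparkListener {

  // Key under which SparkContext.setJobGroup stores the group ID
  private val JobGroupKey = "spark.jobGroup.id"

  // Label for jobs that were started without a job group (assumed name)
  private val NoGroup = "Other"

  private val stageToGroup = mutable.Map.empty[Int, String]
  private val runTimeByGroup = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Remember which job group each stage of the starting job belongs to
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = synchronized {
    val group = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty(JobGroupKey)))
      .getOrElse(NoGroup)
    jobStart.stageIds.foreach(id => stageToGroup(id) = group)
  }

  // Add the completed stage's executor run time to its group's total
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = synchronized {
    val info = stageCompleted.stageInfo
    val group = stageToGroup.getOrElse(info.stageId, NoGroup)
    val runTime = Option(info.taskMetrics).map(_.executorRunTime).getOrElse(0L)
    runTimeByGroup(group) += runTime
  }

  // Snapshot of the accumulated executor run time (ms) per job group
  def executorRunTimeByGroup: Map[String, Long] = synchronized(runTimeByGroup.toMap)
}
```

Such a listener would be registered with `sc.addSparkListener(new JobGroupListener)`. If the group IDs are the `OpStep` entry names, the resulting map is effectively keyed by step.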

@codecov

codecov bot commented Mar 18, 2020

Codecov Report

Merging #467 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #467      +/-   ##
=========================================
+ Coverage   86.99%     87%   +0.01%     
=========================================
  Files         344     345       +1     
  Lines       11576   11575       -1     
  Branches      370     593     +223     
=========================================
+ Hits        10070   10071       +1     
+ Misses       1506    1504       -2

| Impacted Files | Coverage Δ |
| --- | --- |
| ...main/scala/com/salesforce/op/OpWorkflowModel.scala | 93.9% <100%> (-0.15%) ⬇️ |
| .../src/main/scala/com/salesforce/op/OpWorkflow.scala | 88.11% <100%> (-0.85%) ⬇️ |
| ...sforce/op/stages/impl/selector/ModelSelector.scala | 98.36% <100%> (+0.17%) ⬆️ |
| ...a/com/salesforce/op/utils/spark/JobGroupUtil.scala | 100% <100%> (ø) |
| .../main/scala/com/salesforce/op/OpWorkflowCore.scala | 95.45% <100%> (ø) ⬆️ |
| ...om/salesforce/op/utils/spark/OpSparkListener.scala | 98.63% <100%> (+0.02%) ⬆️ |
| ...es/src/main/scala/com/salesforce/op/OpParams.scala | 89.79% <0%> (+4.08%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8087834...818ce67.

Collaborator

@tovbinm tovbinm left a comment

Great addition!!

Contributor

@gerashegalov gerashegalov left a comment

LGTM

@nicodv nicodv merged commit ed4abfd into master Mar 24, 2020
@nicodv nicodv deleted the ndv/jobgroups branch March 24, 2020 17:30
@nicodv nicodv mentioned this pull request Jun 11, 2020
3 participants