-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3224] FetchFailed reduce stages should only show up once in failed stages (in UI) #2127
Conversation
…e failed stages UI.
cc @kayousterhout & @pwendell can you take a look |
QA tests have started for PR 2127 at commit
|
import env.actorSystem.dispatcher | ||
env.actorSystem.scheduler.scheduleOnce( | ||
RESUBMIT_TIMEOUT, eventProcessActor, ResubmitFailedStages) | ||
// It is likely that we receive multiple FetchFailed for a single stage (because we have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see the comment here explaining the problem
QA tests have finished for PR 2127 at commit
|
Yeah - this makes sense. Just to be clear though, this doesn't change the existing logic except to surround it with an if-else, correct? |
LGTM |
Jenkibns, retest this please. |
Yea it doesn't. |
QA tests have started for PR 2127 at commit
|
val mapStage = shuffleToMapStage(shuffleId) | ||
if (mapId != -1) { | ||
mapStage.removeOutputLoc(mapId, bmAddress) | ||
mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want these two lines, even if the stage has already been marked as failed? It seems like the new failure could be telling us about a different dead map output
QA tests have finished for PR 2127 at commit
|
Also can you write a unit test for this? I think it should be pretty easy -- you can just check for doulby-receiving StageCompleted events. Our lack of unit tests in the scheduler code has historically led to many bugs for new patches... |
Jenkins, retest this please On Tue, Aug 26, 2014 at 10:31 AM, Reynold Xin [email protected]
|
Oh oops these emails arrived in the wrong order -- I thought your request On Tue, Aug 26, 2014 at 1:10 PM, Kay Ousterhout [email protected]
|
Pushed a new version. I agree we should add test for it, but that shouldn't block 1.1. |
QA tests have started for PR 2127 at commit
|
QA tests have finished for PR 2127 at commit
|
") for resubmision due to a fetch failure") | ||
|
||
logInfo("The failed fetch was from " + mapStage + " (" + mapStage.name + | ||
"); marking it for resubmission") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you combine this with the above now-very-redundant log msg?
Added unit test for SPARK-3224
@kayousterhout I looked into the racing condition we discussed offline. I think even in that case the scheduler is resilient. If you look at val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
for (i <- 0 until locs.size) {
stage.outputLocs(i) = Option(locs(i)).toList // locs(i) will be null if missing
} The Option there guards against null locations. |
QA tests have started for PR 2127 at commit
|
That code looks like it's just setting the output locations for the map stage... what about the following case: (1) map stage runs |
I think that's fine since that just fails the executor code, which will result in another retry? |
QA tests have finished for PR 2127 at commit
|
Ah cool you're right! |
markStageAsFinished(failedStage, Some("Fetch failure")) | ||
runningStages -= failedStage | ||
// TODO: Cancel running tasks in the stage | ||
logInfo(s"Marking $failedStage (${failedStage.name}) for resubmision " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be below? Other than this tiny thing this LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll tune up these log messages and merge this - thanks.
QA tests have started for PR 2127 at commit
|
…iled stages (in UI) This is a HOTFIX for 1.1. Author: Reynold Xin <[email protected]> Author: Kay Ousterhout <[email protected]> Closes #2127 from rxin/SPARK-3224 and squashes the following commits: effb1ce [Reynold Xin] Move log message. 49282b3 [Reynold Xin] Kay's feedback. 3f01847 [Reynold Xin] Merge pull request #2 from kayousterhout/SPARK-3224 796d282 [Kay Ousterhout] Added unit test for SPARK-3224 3d3d356 [Reynold Xin] Remove map output loc even for repeated FetchFaileds. 1dd3eb5 [Reynold Xin] [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI. (cherry picked from commit bf71905) Signed-off-by: Patrick Wendell <[email protected]>
QA tests have finished for PR 2127 at commit
|
…iled stages (in UI) This is a HOTFIX for 1.1. Author: Reynold Xin <[email protected]> Author: Kay Ousterhout <[email protected]> Closes apache#2127 from rxin/SPARK-3224 and squashes the following commits: effb1ce [Reynold Xin] Move log message. 49282b3 [Reynold Xin] Kay's feedback. 3f01847 [Reynold Xin] Merge pull request apache#2 from kayousterhout/SPARK-3224 796d282 [Kay Ousterhout] Added unit test for SPARK-3224 3d3d356 [Reynold Xin] Remove map output loc even for repeated FetchFaileds. 1dd3eb5 [Reynold Xin] [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI.
This is a HOTFIX for 1.1.