[SPARK-3224] FetchFailed reduce stages should only show up once in failed stages (in UI) #2127

Closed
wants to merge 6 commits into from

Conversation

rxin
Contributor

@rxin rxin commented Aug 26, 2014

This is a HOTFIX for 1.1.

@rxin
Contributor Author

rxin commented Aug 26, 2014

cc @kayousterhout & @pwendell can you take a look

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 2127 at commit 1dd3eb5.

  • This patch merges cleanly.

import env.actorSystem.dispatcher
env.actorSystem.scheduler.scheduleOnce(
  RESUBMIT_TIMEOUT, eventProcessActor, ResubmitFailedStages)
// It is likely that we receive multiple FetchFailed for a single stage (because we have
Contributor Author

see the comment here explaining the problem

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2127 at commit 1dd3eb5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

Yeah - this makes sense. Just to be clear though, this doesn't change the existing logic except to surround it with an if-else, correct?
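
A minimal sketch of the guard described in this comment, written as a toy model rather than the actual DAGScheduler code. `Stage`, `handleFetchFailed`, and `completedEvents` are hypothetical names introduced for illustration; the only point is that a second FetchFailed for the same stage no longer finds it in `runningStages`, so the stage is not reported as failed again.

```scala
import scala.collection.mutable

// Toy model, not Spark's DAGScheduler: every name here is hypothetical.
object FetchFailedDedupSketch {
  final case class Stage(id: Int, name: String)

  val runningStages = mutable.Set.empty[Stage]
  val failedStages = mutable.Set.empty[Stage]
  // Stands in for the stage-completion events that the UI would receive.
  val completedEvents = mutable.ArrayBuffer.empty[Int]

  def handleFetchFailed(failedStage: Stage): Unit = {
    if (runningStages.contains(failedStage)) {
      // First FetchFailed for this stage: record the failure exactly once.
      completedEvents += failedStage.id
      runningStages -= failedStage
      failedStages += failedStage
    }
    // A repeated FetchFailed skips the block above because the stage
    // has already been removed from runningStages.
  }
}
```

Calling `handleFetchFailed` twice with the same running stage leaves a single entry in `completedEvents`.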

@pwendell
Contributor

LGTM

@rxin
Contributor Author

rxin commented Aug 26, 2014

Jenkins, retest this please.

@rxin
Contributor Author

rxin commented Aug 26, 2014

Yea it doesn't.

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 2127 at commit 1dd3eb5.

  • This patch merges cleanly.

val mapStage = shuffleToMapStage(shuffleId)
if (mapId != -1) {
  mapStage.removeOutputLoc(mapId, bmAddress)
  mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
Contributor

Don't we want these two lines even if the stage has already been marked as failed? It seems like the new failure could be telling us about a different dead map output.
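
To make the concern concrete, here is a hedged sketch in which the stage is marked as failed only once, while the lost map output is dropped for every FetchFailed. It is a toy model with hypothetical names (`outputLocs`, `timesMarkedFailed`, `handleFetchFailed`) and plain strings for addresses; it only mirrors the ordering under discussion, not Spark's real data structures.

```scala
import scala.collection.mutable

// Toy model with hypothetical names; it is not Spark's scheduler.
object RepeatedFetchFailedSketch {
  val runningStages = mutable.Set(7) // ids of running stages
  // mapId -> locations that still hold that map output
  val outputLocs = mutable.Map(0 -> List("hostA"), 1 -> List("hostB"))
  var timesMarkedFailed = 0

  def handleFetchFailed(stageId: Int, mapId: Int, address: String): Unit = {
    if (runningStages.contains(stageId)) {
      // Stage-level bookkeeping happens once per failed stage.
      timesMarkedFailed += 1
      runningStages -= stageId
    }
    // Outside the guard: each failure may name a different lost output.
    if (mapId != -1) {
      outputLocs(mapId) = outputLocs.getOrElse(mapId, Nil).filterNot(_ == address)
    }
  }
}
```

With two failures for stage 7 that name map outputs 0 and 1, the stage is counted as failed once and both locations are removed.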

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2127 at commit 1dd3eb5.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Contributor

Also, can you write a unit test for this? I think it should be pretty easy: you can just check for doubly received StageCompleted events. Our lack of unit tests in the scheduler code has historically led to many bugs in new patches...
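
One possible shape for such a check, sketched under the assumption that a test can collect the completion events it observed into a sequence. `StageCompleted` here is a stand-in case class and `duplicated` is a hypothetical helper; neither is Spark's listener API.

```scala
// Hypothetical test helper; the event type is a stand-in, not Spark's API.
object DuplicateCompletionCheck {
  final case class StageCompleted(stageId: Int)

  // Returns the ids of stages that reported completion more than once.
  def duplicated(events: Seq[StageCompleted]): Set[Int] =
    events.groupBy(_.stageId).collect { case (id, evs) if evs.size > 1 => id }.toSet
}
```

A test would then assert that `duplicated(recordedEvents)` is empty after injecting several fetch failures for the same stage.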

@kayousterhout
Contributor

Jenkins, retest this please

@kayousterhout
Contributor

Oh oops these emails arrived in the wrong order -- I thought your request
for testing had not been satisfied. Sorry for the duplication!

@rxin
Contributor Author

rxin commented Aug 26, 2014

Pushed a new version. I agree we should add a test for it, but that shouldn't block 1.1.

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 2127 at commit 3d3d356.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2127 at commit 3d3d356.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

") for resubmision due to a fetch failure")

logInfo("The failed fetch was from " + mapStage + " (" + mapStage.name +
  "); marking it for resubmission")
Contributor

Can you combine this with the above now-very-redundant log msg?

@rxin
Contributor Author

rxin commented Aug 27, 2014

@kayousterhout I looked into the race condition we discussed offline. I think the scheduler is resilient even in that case.

If you look at newOrUsedStage, it gets the map output locs from MapOutputTracker this way:

      val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
      val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
      for (i <- 0 until locs.size) {
        stage.outputLocs(i) = Option(locs(i)).toList   // locs(i) will be null if missing
      }

The Option there guards against null locations.
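
A small illustration of that idiom on plain strings: `Option(x)` is `None` when `x` is null, so `.toList` yields an empty list instead of a list containing null. `toLocList` is a hypothetical helper name used only for this example.

```scala
object NullGuardSketch {
  // Mirrors the `Option(locs(i)).toList` pattern on plain strings.
  // Option(null) is None, and None.toList is the empty list.
  def toLocList(loc: String): List[String] = Option(loc).toList
}
```

So a missing location becomes an empty list of output locations rather than a null entry.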

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have started for PR 2127 at commit 49282b3.

  • This patch merges cleanly.

@kayousterhout
Contributor

That code looks like it's just setting the output locations for the map stage... what about the following case:

(1) map stage runs
(2) reduce stage starts
(3) reduce task fails because map output A is lost
(4) map stage is restarted, with a single task for output A
(5) scheduler gets another message that a second reduce task failed because output B was missing.
(6) map stage finishes, and new reduce stage is started
(7) when the reduce stage tries to get the output locations, won't it get an exception because there's no location for output B?

@rxin
Contributor Author

rxin commented Aug 27, 2014

I think that's fine since that just fails the executor code, which will result in another retry?
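
The retry argument can be sketched with a toy loop, assuming each failed reduce attempt causes exactly one missing map output to be recomputed before the next attempt. This is a deliberately simplified model with hypothetical names (`attemptsUntilSuccess`, the `"recomputed"` placeholder); it does not reproduce Spark's scheduling, only the idea that a missing output costs another attempt rather than breaking the job.

```scala
import scala.collection.mutable

// Toy model with hypothetical names; a missing key models a lost map output.
object RetrySketch {
  // Returns the attempt on which the reduce stage sees every map output,
  // or None if maxAttempts is exhausted first.
  def attemptsUntilSuccess(
      initial: Map[Int, String],
      numMaps: Int,
      maxAttempts: Int): Option[Int] = {
    val locs = mutable.Map.empty[Int, String] ++= initial
    var attempt = 1
    while (attempt <= maxAttempts) {
      val missing = (0 until numMaps).filterNot(locs.contains)
      if (missing.isEmpty) return Some(attempt)
      // Fetch failure: recompute the first missing output, then retry.
      locs(missing.head) = "recomputed"
      attempt += 1
    }
    None
  }
}
```

With outputs 0 and 1 lost out of three, the model needs one attempt per lost output plus a final successful one.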

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have finished for PR 2127 at commit 49282b3.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "$FWDIR"/bin/spark-submit --class $CLASS "$
    • class ExternalSorter(object):
    • "$FWDIR"/bin/spark-submit --class $CLASS "$
    • protected class AttributeEquals(val a: Attribute)

@kayousterhout
Contributor

Ah cool you're right!

markStageAsFinished(failedStage, Some("Fetch failure"))
runningStages -= failedStage
// TODO: Cancel running tasks in the stage
logInfo(s"Marking $failedStage (${failedStage.name}) for resubmision " +
Contributor

Shouldn't this be below? Other than this tiny thing this LGTM

Contributor

I'll tune up these log messages and merge this - thanks.

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have started for PR 2127 at commit effb1ce.

  • This patch merges cleanly.

@asfgit asfgit closed this in bf71905 Aug 27, 2014
asfgit pushed a commit that referenced this pull request Aug 27, 2014
…iled stages (in UI)

This is a HOTFIX for 1.1.

Author: Reynold Xin <[email protected]>
Author: Kay Ousterhout <[email protected]>

Closes #2127 from rxin/SPARK-3224 and squashes the following commits:

effb1ce [Reynold Xin] Move log message.
49282b3 [Reynold Xin] Kay's feedback.
3f01847 [Reynold Xin] Merge pull request #2 from kayousterhout/SPARK-3224
796d282 [Kay Ousterhout] Added unit test for SPARK-3224
3d3d356 [Reynold Xin] Remove map output loc even for repeated FetchFaileds.
1dd3eb5 [Reynold Xin] [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI.
(cherry picked from commit bf71905)

Signed-off-by: Patrick Wendell <[email protected]>
@SparkQA

SparkQA commented Aug 27, 2014

QA tests have finished for PR 2127 at commit effb1ce.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014