[SPARK-2947] DAGScheduler resubmit the stage into an infinite loop #1877
Conversation
QA tests have started for PR 1877. This patch merges cleanly.
QA results for PR 1877:
@witgo can you explain how this happens and why the fix works, and add a unit test for it? We can't really merge something like this without a test.
@@ -1024,31 +1024,33 @@ class DAGScheduler(
      case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
        // Mark the stage that the reducer was in as unrunnable
        val failedStage = stageIdToStage(task.stageId)
        runningStages -= failedStage
        // TODO: Cancel running tasks in the stage
@mateiz
Running tasks in the stage are not cancelled. When any one of the still-running tasks throws a fetch-failure exception, the following code runs again:
failedStages += failedStage
failedStages += mapStage
so the stage is unnecessarily resubmitted by resubmitFailedStages.
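A minimal sketch of the failure mode described above. All names here (`runningStages`, `failedStages`, `resubmitFailedStages`, the Int ids) are simplified stand-ins, not the real DAGScheduler internals: every duplicate FetchFailed re-adds the already-handled stage to the failed set, so the stage is resubmitted once per duplicate event instead of once in total.

```scala
object FetchFailedLoopSketch {
  // Simplified stand-ins for the DAGScheduler's bookkeeping, keyed by Int ids.
  val runningStages = scala.collection.mutable.Set.empty[Int]
  val failedStages  = scala.collection.mutable.Set.empty[Int]
  var resubmissions = 0

  // Unguarded handler: mirrors the snippet above, marking the stage as failed
  // on every FetchFailed even after the failure was already handled.
  def handleFetchFailed(stageId: Int): Unit = {
    runningStages -= stageId
    failedStages += stageId
  }

  // Stand-in for resubmitFailedStages: resubmits everything marked failed.
  def resubmitFailedStages(): Unit = {
    failedStages.foreach { s =>
      resubmissions += 1
      runningStages += s
    }
    failedStages.clear()
  }

  def main(args: Array[String]): Unit = {
    runningStages += 1
    // Three still-running tasks of stage 1 each report a FetchFailed, with
    // resubmission running between the events, as it can in the scheduler.
    (1 to 3).foreach { _ =>
      handleFetchFailed(1)
      resubmitFailedStages()
    }
    println(s"resubmissions = $resubmissions") // one per duplicate event
  }
}
```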
It will take me some time to add a test for this.
QA tests have started for PR 1877. This patch merges cleanly.
QA results for PR 1877:
@witgo Could you rebase this PR onto master? There are some conflicts right now.
QA tests have started for PR 1877 at commit
QA tests have finished for PR 1877 at commit
@rxin could you take a look at this PR? Thanks!
Can you explain what problem you are seeing?
QA tests have started for PR 1877 at commit
SPARK-3224 is the same problem.
@@ -472,6 +472,44 @@ class DAGSchedulerSuite extends TestKit(ActorSystem("DAGSchedulerSuite")) with F
    assert(sparkListener.failedStages.size == 1)
  }

  test("run trivial shuffle with repeated fetch failure") {
Can you change this and/or the name of the test at line 438? The two tests are currently almost identical, so it's unclear what the point of each one is.
QA tests have finished for PR 1877 at commit
Force-pushed from 089577f to bf6f81a.
QA tests have started for PR 1877 at commit
QA tests have finished for PR 1877 at commit
      // It is likely that we receive multiple FetchFailed for a single stage (because we have
      // multiple tasks running concurrently on different executors). In that case, it is possible
      // the fetch failure has already been handled by the scheduler.
-     if (runningStages.contains(failedStage)) {
+     if (runningStages.contains(failedStage) && stage.pendingTasks.contains(task)) {
@rxin Because running tasks in the stage are not cancelled, the stage.pendingTasks.contains(task) check is necessary.
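A sketch of what the guarded condition buys (again with simplified stand-in names, not the real scheduler state): a duplicate FetchFailed for a task that was already handled no longer marks the stage as failed, so the stage is resubmitted at most once.

```scala
object GuardedFetchFailedSketch {
  // Simplified stand-ins for the scheduler's bookkeeping, keyed by Int ids.
  val runningStages = scala.collection.mutable.Set.empty[Int]
  val failedStages  = scala.collection.mutable.Set.empty[Int]
  val pendingTasks =
    scala.collection.mutable.Map.empty[Int, scala.collection.mutable.Set[Int]]

  // Guarded handler: a FetchFailed is only acted on if the stage is still
  // running AND this particular task has not been handled yet.
  def handleFetchFailed(stageId: Int, taskId: Int): Unit = {
    if (runningStages.contains(stageId) && pendingTasks(stageId).contains(taskId)) {
      runningStages -= stageId
      pendingTasks(stageId) -= taskId
      failedStages += stageId
    }
  }

  def main(args: Array[String]): Unit = {
    runningStages += 1
    pendingTasks(1) = scala.collection.mutable.Set(10)

    handleFetchFailed(1, 10) // first failure: handled, stage marked failed
    handleFetchFailed(1, 10) // duplicate event: rejected by the guard

    println(s"stage marked failed ${failedStages.size} time(s)")
  }
}
```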
QA tests have started for PR 1877 at commit
-     if (bmAddress != null) {
        handleExecutorLost(bmAddress.executorId, Some(task.epoch))
        // TODO: mark the executor as failed only if there were lots of fetch failures on it
+     if (bmAddress != null) {
Once you put this inside the conditional statement, only one executor failure will be handled per stage. That means if two executors fail, the second one gets ignored by the DAGScheduler, doesn't it?
@rxin Yes, the processing logic was modified unnecessarily here; that was an oversight on my part.
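The issue @rxin raises can be sketched as follows (all names are hypothetical stand-ins): with handleExecutorLost moved inside the dedup guard, the first FetchFailed consumes the guard, and a later failure from a different executor is never reported as an executor loss.

```scala
object GuardedExecutorLostSketch {
  val runningStages = scala.collection.mutable.Set.empty[Int]
  val lostExecutors = scala.collection.mutable.ListBuffer.empty[String]

  // Stand-in for handleExecutorLost: just records the executor id.
  def handleExecutorLost(execId: String): Unit = lostExecutors += execId

  // FetchFailed handler with handleExecutorLost moved inside the dedup
  // guard, as in the version of the patch being discussed.
  def handleFetchFailed(stageId: Int, execId: String): Unit = {
    if (runningStages.contains(stageId)) {
      runningStages -= stageId
      handleExecutorLost(execId) // only reachable once per stage
    }
  }

  def main(args: Array[String]): Unit = {
    runningStages += 1
    handleFetchFailed(1, "exec-A") // exec-A's loss is recorded
    handleFetchFailed(1, "exec-B") // exec-B's loss is silently dropped
    println(lostExecutors.mkString(", "))
  }
}
```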
QA tests have finished for PR 1877 at commit
QA tests have started for PR 1877 at commit
QA tests have finished for PR 1877 at commit
Referenced commit: …der (apache#1877)

Parquet footer metadata is currently always read twice in the vectorized Parquet reader. When the NameNode is under high pressure, reading it twice costs time. We can avoid the second read by reading all row groups in advance and filtering row groups according to the filters that require pushdown (no need to read the footer metadata again the second time).

Reduce the reading of the footer in the vectorized Parquet reader. No existing tests.

Closes apache#39950 from yabola/skip_footer.
Authored-by: chenliang.lu <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
Co-authored-by: chenliang.lu <[email protected]>
No description provided.