JobManager related fix #4742
Conversation
@@ -81,6 +81,12 @@ ErrOrHosts StorageJobExecutor::getLeaderHost(GraphSpaceID space) {
      it->second.emplace_back(partId);
    }
  }
  // If storage has not reported the leader distribution to meta and we don't report an
  // error here, JobManager will think the job consists of 0 tasks, and no task will be
  // sent to any storage. The job will then stay in RUNNING forever.
Here I have an issue for @liangliangc. Explorer executes "SUBMIT JOB STATS" after creating a space and waiting one heartbeat, and the job is always running. What do you think? Do we need to handle this?
Yeah, that is what I told him. To be precise, one heartbeat is not enough; we must wait until the leader distribution in meta is not empty. Dashboard and Explorer don't want to handle this logic themselves (by checking show hosts or show parts), so I changed it in core. It is indeed a rare case: if storage reports even once, it won't reproduce.
BTW, I am not sure whether we should check that the leader count equals the partition count here... WDYT?
Maybe a flag to identify that a space is ready to serve.
Or use leader count == partition count; both are okay for me.
@@ -853,6 +871,15 @@ TEST_F(JobManagerTest, StoppableJob) {
  code = jobMgr->stopJob(spaceId, jobId);
  ASSERT_EQ(code, nebula::cpp2::ErrorCode::SUCCEEDED);

  // check job status again, it should still be running
Good job, the comment here should be "stopped"
Co-authored-by: Sophie <[email protected]>
* fix dep of loop in go planner (#4736)
* fix inappropriate error code from raft (#4737)
* Fix variable types collected and graph crash (#4724)
  * add test cases, small fixes
  * unskip some test cases related to multiple query parts
  * small delete, fmt
  * Fix ldbc BI R5 implementation
* stats handle the flag use_vertex_key (#4738)
* JobManager related fix (#4742)
* download job related fix (#4754)
* fixed case-when error (#4744)
  * fix tck
* Refine go planner (#4750)
  * refine go planner, update, fix ctest

Co-authored-by: Sophie <[email protected]>
Co-authored-by: Harris.Chu <[email protected]>
Co-authored-by: jie.wang <[email protected]>
Co-authored-by: Doodle <[email protected]>
Co-authored-by: kyle.cao <[email protected]>
Co-authored-by: canon <[email protected]>
What type of PR is this?
What problem(s) does this PR solve?
Issue(s) number:
Description:
Two bugs are fixed:
1. When stopping a job that does not support stopping, after `show job` the job is marked as STOPPED, which is not expected. After this PR, these unstoppable jobs will stay in the RUNNING state.
2. Some jobs need to run on the storage leader. There is a corner case: if we create a new space and start a job (e.g. STATS) before storage reports its leader distribution to meta, JobManager can't select a storage to run the job, and the task won't be sent to any storage for execution. Since no storage ever runs it, the job stays in RUNNING forever.
How do you solve it?
Special notes for your reviewer, ex. impact of this fix, design document, etc:
Checklist:
Tests:
Affects:
Release notes:
Please confirm whether to be reflected in release notes and how to describe: