JobManager related fix #4742
Conversation
@@ -81,6 +81,12 @@ ErrOrHosts StorageJobExecutor::getLeaderHost(GraphSpaceID space) {
      it->second.emplace_back(partId);
    }
  }
  // If storage has not reported the leader distribution to meta and we don't report an
  // error here, JobManager will think the job consists of 0 tasks, and no task will be
  // sent to any storage. The job will then stay in RUNNING forever.
Here I have an issue for @liangliangc. Explorer executes "SUBMIT JOB STATS" after creating a space and waiting one heartbeat, and the job is always running. What do you think? Do we need to handle this?
Yeah, that is what I told him. To be precise, one heartbeat is not enough; we must wait until the leader distribution in meta is not empty. Dashboard and Explorer don't want to handle this logic themselves (by checking show hosts or show parts), so I changed it in core. It is indeed a rare case: if storage reports even once, it won't reproduce.
BTW, I am not sure whether we should check that the leader count equals the partition count here... WDYT?
Maybe a flag to identify that a space is ready to serve.
Or use leader count == partition count; both are okay for me.
@@ -853,6 +871,15 @@ TEST_F(JobManagerTest, StoppableJob) {
  code = jobMgr->stopJob(spaceId, jobId);
  ASSERT_EQ(code, nebula::cpp2::ErrorCode::SUCCEEDED);

  // check job status again, it should still be running
Good job, the comment here should be "stopped"
Co-authored-by: Sophie <[email protected]>
* fix dep of loop in go planner (#4736)
* fix inappropriate error code from raft (#4737)
* Fix variable types collected and graph crash (#4724)
  * add test cases, small fixes
  * unskip some test cases related to multiple query parts
  * small delete, fmt
  * Fix ldbc BI R5 implementation
* stats handle the flag use_vertex_key (#4738)
* JobManager related fix (#4742)
* download job related fix (#4754)
* fixed case-when error (#4744)
  * fix tck
* Refine go planner (#4750)
  * refine go planner, update, fix ctest

Co-authored-by: Sophie <[email protected]>
Co-authored-by: Harris.Chu <[email protected]>
Co-authored-by: jie.wang <[email protected]>
Co-authored-by: Doodle <[email protected]>
Co-authored-by: kyle.cao <[email protected]>
Co-authored-by: canon <[email protected]>
What type of PR is this?
What problem(s) does this PR solve?
Issue(s) number:
Description:
Two bugs are fixed:
1. When stopping a job that does not support stopping, after `show job` the job is marked as STOPPED, which is not expected. After this PR, these unstoppable jobs will stay in the RUNNING state.
2. Some jobs need to run on the storage leader. There is a corner case: if we create a new space and start a job (e.g. STATS) before storage reports its leader distribution to meta, JobManager can't select a storage to run the job, and the task won't be sent to any storage for execution. Since no storage ever runs it, the job stays in RUNNING forever.
How do you solve it?
Special notes for your reviewer, ex. impact of this fix, design document, etc:
Checklist:
Tests:
Affects:
Release notes:
Please confirm whether to be reflected in release notes and how to describe: