Fix some routine load bugs #2093

morningman · 2019-10-29T11:31:25Z

This PR mainly fixes the following issues:

1. A null pointer exception is raised when a database or table is dropped. The expected behavior is that the routine load job is stopped.
2. Memory leaks. Routine load tasks are no longer submitted in batches; each task is now submitted separately.
3. Unreasonable task timeout. Routine load tasks should not be queued in the BE thread pool for execution. A task sent to the BE should be executed immediately; otherwise the task times out in the FE first, which eventually leads to constant timeouts for all subsequent tasks.
4. Every routine load job should be scheduled as soon as it is submitted, without waiting for an available BE slot. Otherwise, jobs submitted later may never be scheduled.
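
The point above about not queuing routine load tasks in the BE thread pool can be illustrated with a small Java analogy. This is only a sketch: the BE is not written in Java, and `NoQueuePoolSketch`, `newPool`, and `trySubmit` are hypothetical names, not code from this PR. It assumes a pool with no waiting queue, so a task either starts at once on a free thread or is rejected immediately, which lets the sender react instead of timing out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the BE implementation.
class NoQueuePoolSketch {
    // Builds a pool that never holds waiting tasks.
    static ThreadPoolExecutor newPool(int maxThreads) {
        return new ThreadPoolExecutor(
                0, maxThreads, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),               // hand-off only, no backlog
                new ThreadPoolExecutor.AbortPolicy());  // reject when all threads are busy
    }

    // Returns true if the task started immediately, false if the pool was full.
    static boolean trySubmit(ThreadPoolExecutor pool, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    // Submits two tasks to a one-thread pool while the first is still running.
    static boolean[] demo() {
        ThreadPoolExecutor pool = newPool(1);
        CountDownLatch release = new CountDownLatch(1);
        boolean first = trySubmit(pool, () -> {
            try {
                release.await(); // keep the only thread busy
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        boolean second = trySubmit(pool, () -> { });
        release.countDown();
        pool.shutdown();
        return new boolean[] {first, second};
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("first accepted: " + r[0] + ", second accepted: " + r[1]);
    }
}
```

With a queueing pool the second task would be accepted and wait silently; here the caller learns right away that no thread is free.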

ISSUE #2065

EmmyMiao87 · 2019-10-29T12:33:17Z

fe/src/main/java/org/apache/doris/load/routineload/RoutineLoadJob.java

@@ -757,7 +758,10 @@ public void afterAborted(TransactionState txnState, boolean txnOperated, String
                             .add("error_msg", "change job state to paused when task has been aborted with error " + e.getMessage())
                             .build(), e);
        } finally {
-            writeUnlock();
+            if (lock.isWriteLockedByCurrentThread()) {


The afterAborted function must not be called on its own; this is already explained in the comment at the top of the function.
So case A is illegal: a txn timeout should not call this function on its own.

Yes, it is weird. I will make this function more readable.
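
For context on the guard in the diff above, here is a minimal sketch of how `isWriteLockedByCurrentThread()` changes the behavior of a `finally` block. It assumes `lock` is a `java.util.concurrent.locks.ReentrantReadWriteLock`; the class `GuardedUnlockSketch` and its methods are hypothetical stand-ins, not the real `RoutineLoadJob`.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a job object; this is not the real RoutineLoadJob.
class GuardedUnlockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    // Hypothetical "before" hook: takes the write lock.
    void beforeAborted() {
        lock.writeLock().lock();
    }

    // "After" hook: normally runs with the write lock taken by beforeAborted().
    // The guard keeps it from throwing if it is reached without the lock.
    void afterAbortedGuarded() {
        try {
            // ... state changes would go here ...
        } finally {
            if (lock.isWriteLockedByCurrentThread()) {
                lock.writeLock().unlock();
            }
        }
    }

    // Same hook without the guard, for comparison.
    void afterAbortedUnguarded() {
        try {
            // ... state changes would go here ...
        } finally {
            // Throws IllegalMonitorStateException if this thread does not hold the lock.
            lock.writeLock().unlock();
        }
    }

    boolean heldByCurrentThread() {
        return lock.isWriteLockedByCurrentThread();
    }

    public static void main(String[] args) {
        GuardedUnlockSketch job = new GuardedUnlockSketch();
        job.beforeAborted();
        job.afterAbortedGuarded();       // paired call: lock is released
        job.afterAbortedGuarded();       // unpaired call: guard skips the unlock
        try {
            job.afterAbortedUnguarded(); // unpaired call without the guard
        } catch (IllegalMonitorStateException e) {
            System.out.println("unguarded unlock failed: " + e);
        }
    }
}
```

The guard only prevents the exception. As the review comment notes, calling the "after" hook without its matching "before" hook is still considered an illegal call path.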

EmmyMiao87 · 2019-10-29T12:58:26Z

be/src/common/config.h

@@ -434,6 +434,10 @@ namespace config {
    // max consumer num in one data consumer group, for routine load
    CONF_Int32(max_consumer_num_per_group, "3");

+    // the size of thread pool for routine load task.
+    // this should be slightly larger than FE config 'max_concurrent_task_num_per_be' (default 10)
+    CONF_Int32(routine_load_thread_pool_size, "12");


The per-BE task concurrency is already limited by the FE, so this config is not useful.

This is just self-protection for the BE, in case the FE has some bugs...
But I will simplify the FE configuration to keep things simple.

Yes, restricting the task number in both the FE and the BE makes things more difficult.

EmmyMiao87 · 2019-10-29T13:19:17Z

fe/src/main/java/org/apache/doris/load/routineload/RoutineLoadScheduler.java

@@ -86,18 +86,6 @@ private void process() throws UserException {
                                     .build());
                    continue;
                }
-                int currentTotalTaskNum = routineLoadManager.getSizeOfIdToRoutineLoadTask();


There is no limit on the number of routine load jobs. The task queue will keep growing until the FE runs out of memory.

We have a job number limit: the FE config desired_max_waiting_jobs.

Are you sure about that?

EmmyMiao87 · 2019-10-29T13:22:00Z

fe/src/main/java/org/apache/doris/load/routineload/RoutineLoadTaskScheduler.java

+    private boolean submitTask(long beId, TRoutineLoadTask tTask) {
+        Backend backend = Catalog.getCurrentSystemInfo().getBackend(beId);
+        if (backend == null) {
+            LOG.warn("failed to send tasks to backend {} because not exist", beId);


Please use the new LogBuilder

EmmyMiao87 · 2019-10-29T13:22:19Z

fe/src/main/java/org/apache/doris/load/routineload/RoutineLoadTaskScheduler.java

-                    ClientPool.backendPool.invalidateObject(address, client);
-                }
+        } catch (Exception e) {
+            LOG.warn("task send error. backend[{}]", beId, e);


Please use the new LogBuilder

EmmyMiao87 · 2019-10-31T09:03:11Z

fe/src/main/java/org/apache/doris/transaction/GlobalTransactionMgr.java

-                // just print a log, it does not matter.
-                LOG.warn("after abort timeout txn failed. txn id: {}", abortedTxn.getTransactionId(), e);
+                // abort may be failed. it is acceptable. just print a log
+                LOG.warn("abort timeout txn {} failed. msg: {}", txnId, e.getMessage());


Suggested change

LOG.warn("abort timeout txn {} failed. msg: {}", txnId, e.getMessage());

LOG.warn("abort timeout txn {} failed. msg: {}", txnId, e.getMessage(), e);

Printing the stack trace is unnecessary here and would make fe.log ugly.

EmmyMiao87 · 2019-10-31T09:13:37Z

fe/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java

@@ -254,6 +258,13 @@ public void resumeRoutineLoadJob(ResumeRoutineLoadStmt resumeRoutineLoadStmt) th
                                                ConnectContext.get().getRemoteIP(),
                                                tableName);
        }
+
+        if (getRoutineLoadJobByState(Sets.newHashSet(RoutineLoadJob.JobState.NEED_SCHEDULE,


Paused jobs need to be included.

EmmyMiao87 · 2019-10-31T09:15:23Z

fe/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java

@@ -254,6 +258,13 @@ public void resumeRoutineLoadJob(ResumeRoutineLoadStmt resumeRoutineLoadStmt) th
                                                ConnectContext.get().getRemoteIP(),
                                                tableName);
        }
+
+        if (getRoutineLoadJobByState(Sets.newHashSet(RoutineLoadJob.JobState.NEED_SCHEDULE,
+                RoutineLoadJob.JobState.RUNNING)).size() > Config.desired_max_waiting_jobs) {


Suggested change

RoutineLoadJob.JobState.RUNNING)).size() > Config.desired_max_waiting_jobs) {

RoutineLoadJob.JobState.RUNNING)).size() > Config.max_routine_load_jobs) {

EmmyMiao87 · 2019-10-31T09:19:42Z

fe/src/test/java/org/apache/doris/load/routineload/RoutineLoadManagerTest.java

@@ -282,7 +280,7 @@ public void testGetMinTaskBeId() throws LoadException {
        beIdToConcurrentTaskMap.put(1L, 1);

        new Expectations(routineLoadManager) {{
-            invoke(routineLoadManager, "getBeIdConcurrentTaskMaps");
+                invoke(routineLoadManager, "getBeCurrentTasksNumMap");


Suggested change

invoke(routineLoadManager, "getBeCurrentTasksNumMap");

invoke(routineLoadManager, "getBeCurrentTasksNumMap");

EmmyMiao87 · 2019-10-31T09:20:13Z

fe/src/test/java/org/apache/doris/load/routineload/RoutineLoadManagerTest.java

@@ -364,10 +363,11 @@ public void testGetTotalIdleTaskNum() {
        Map<Long, Integer> beIdToConcurrentTaskMap = Maps.newHashMap();
        beIdToConcurrentTaskMap.put(1L, 1);
        new Expectations(routineLoadManager) {{
-            invoke(routineLoadManager, "getBeIdConcurrentTaskMaps");
+                invoke(routineLoadManager, "getBeCurrentTasksNumMap");


Suggested change

invoke(routineLoadManager, "getBeCurrentTasksNumMap");

invoke(routineLoadManager, "getBeCurrentTasksNumMap");

EmmyMiao87 · 2019-10-31T09:24:35Z

fe/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java

@@ -496,12 +510,13 @@ public boolean checkTaskInJob(UUID taskId) {
        return false;
    }

-    public List<RoutineLoadJob> getRoutineLoadJobByState(RoutineLoadJob.JobState jobState) {
+    public List<RoutineLoadJob> getRoutineLoadJobByState(Set<RoutineLoadJob.JobState> desiredStates) {


Suggested change

public List<RoutineLoadJob> getRoutineLoadJobByState(Set<RoutineLoadJob.JobState> desiredStates) {

public List<RoutineLoadJob> getRoutineLoadJobByState(RoutineLoadJob.JobState ...states) {
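
To show what the varargs signature suggested above buys the caller, here is a minimal sketch. `JobStateFilterSketch`, its nested `Job` class, and the reduced `JobState` enum are hypothetical; only the state names NEED_SCHEDULE, RUNNING, and PAUSED come from this thread, and STOPPED is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the manager; not the real RoutineLoadManager.
class JobStateFilterSketch {
    // Reduced enum for illustration; STOPPED is an assumption.
    enum JobState { NEED_SCHEDULE, RUNNING, PAUSED, STOPPED }

    static final class Job {
        final String name;
        final JobState state;

        Job(String name, JobState state) {
            this.name = name;
            this.state = state;
        }
    }

    private final List<Job> jobs = new ArrayList<>();

    void add(String name, JobState state) {
        jobs.add(new Job(name, state));
    }

    // Varargs form: callers list the states directly, no Set construction needed.
    List<Job> getJobsByState(JobState... states) {
        Set<JobState> desired = EnumSet.noneOf(JobState.class);
        Collections.addAll(desired, states);
        List<Job> result = new ArrayList<>();
        for (Job job : jobs) {
            if (desired.contains(job.state)) {
                result.add(job);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        JobStateFilterSketch manager = new JobStateFilterSketch();
        manager.add("a", JobState.RUNNING);
        manager.add("b", JobState.PAUSED);
        manager.add("c", JobState.STOPPED);
        int active = manager.getJobsByState(
                JobState.NEED_SCHEDULE, JobState.RUNNING, JobState.PAUSED).size();
        System.out.println("jobs counted against the limit: " + active);
    }
}
```

With the `Set` parameter, the same call site has to wrap the states in something like `Sets.newHashSet(...)`, as the diff in this thread does.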

EmmyMiao87 · 2019-10-31T10:47:03Z

fe/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java

-                RoutineLoadJob.JobState.RUNNING)).size() > Config.desired_max_waiting_jobs) {
-            throw new DdlException("There are more then " + Config.desired_max_waiting_jobs
+                RoutineLoadJob.JobState.RUNNING, RoutineLoadJob.JobState.PAUSED)).size() > Config.max_routine_load_job_num) {
+            throw new DdlException("There are more then " + Config.max_routine_load_job_num


This restriction is not needed here.

EmmyMiao87 · 2019-10-31T10:57:31Z

LGTM

morningman added 4 commits October 29, 2019 17:14

first commit

adf86a6

fix timeout bug

9246806

3 commit

6a43f52

fix ut

35953ef

EmmyMiao87 reviewed Oct 29, 2019

morningman added 4 commits October 30, 2019 10:36

fix by review

8de74fd

fix by review

8a47e80

fix ut

4296f96

fix comment

322d2a2

EmmyMiao87 reviewed Oct 31, 2019

fix by review3

6ab42aa

EmmyMiao87 reviewed Oct 31, 2019

fix by review 4

7645768

EmmyMiao87 approved these changes Oct 31, 2019

morningman merged commit 45df6aa into apache:master Oct 31, 2019
