[Bug] [Worker] RemoteShell tasks may have memory leak issue #15851

Closed
3 tasks done
simmonn opened this issue Apr 15, 2024 · 14 comments
Labels: bug, priority:high, Stale

simmonn commented Apr 15, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

Version: 3.2.0
JVM options: -Xmx3g -Xms3g -Xmn1g
JDK: amazon-corretto-11.0.19.7.1-linux-x86_64
My project has 60 scheduled RemoteShell tasks (executing PHP commands). After running for a while, Full GCs occur frequently, causing all tasks to fail and making the worker nodes appear deadlocked. As a workaround, I had to change the RemoteShell tasks to Shell tasks whose command uses ssh -i id_rsa ''.
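One way to confirm the frequent-GC symptom from inside a JVM is to sample the standard `java.lang.management` beans. The sketch below is illustrative only: the class `GcSnapshot` and its method names are hypothetical and not part of DolphinScheduler, and it reports on whichever JVM it runs in, so it would have to be wired into the worker process (or replaced by `jstat` or GC logs) to be useful.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper; not part of DolphinScheduler.
final class GcSnapshot {

    /** Sums the collection counts reported by all garbage collectors of this JVM. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report a count
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Returns used heap as a fraction of the maximum (or committed, if no maximum is defined). */
    static double heapUsedRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        long limit = max > 0 ? max : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    public static void main(String[] args) {
        System.out.printf("collections=%d heapUsed=%.1f%%%n",
                totalCollections(), heapUsedRatio() * 100);
    }
}
```

Sampling these two values periodically shows whether the collection count climbs quickly while the heap ratio stays close to 1, which is the pattern described above.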

Apart from some error logs, I also noticed WARN logs with NPE (NullPointerException) occurring every time a task is executed.
[WARN] 2024-04-10 04:01:27.782 +0800 org.apache.sshd.client.session.ClientSessionImpl:[618] - [WorkflowInstance-0][TaskInstance-0] - exceptionCaught(ClientSessionImpl[root@/172.19.23.121:22])[state=Opened] NullPointerException: No customized heartbeat handler registered

Here is the error log:

[ERROR] 2024-04-10 04:01:01.146 +0800 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[181] - [WorkflowInstance-72475][TaskInstance-74145] - Task execute failed, due to meet an exception
org.apache.dolphinscheduler.plugin.task.api.TaskException: Execute shell task error
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.handle(RemoteShellTask.java:110)
        at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerDelayTaskExecuteRunnable.executeTask(DefaultWorkerDelayTaskExecuteRunnable.java:57)
        at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.run(WorkerTaskExecuteRunnable.java:175)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: Remote shell task error
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.run(RemoteExecutor.java:101)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.handle(RemoteShellTask.java:104)
        ... 9 common frames omitted
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: SSH connection failed
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:83)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.runRemote(RemoteExecutor.java:208)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getTaskPid(RemoteExecutor.java:184)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.run(RemoteExecutor.java:91)
        ... 10 common frames omitted
Caused by: org.apache.sshd.common.SshException: DefaultConnectFuture[root@/172.19.23.121:22]: Failed to get operation result within specified timeout: 5000
        at org.apache.sshd.common.future.AbstractSshFuture.formatExceptionMessage(AbstractSshFuture.java:185)
        at org.apache.sshd.common.future.AbstractSshFuture.verifyResult(AbstractSshFuture.java:111)
        at org.apache.sshd.client.future.DefaultConnectFuture.verify(DefaultConnectFuture.java:42)
        at org.apache.sshd.client.future.DefaultConnectFuture.verify(DefaultConnectFuture.java:34)
        at org.apache.dolphinscheduler.plugin.datasource.ssh.SSHUtils.getSession(SSHUtils.java:42)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:78)
        ... 13 common frames omitted
[INFO] 2024-04-10 04:01:02.874 +0800 org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask:[118] - [WorkflowInstance-72475][TaskInstance-74145] - kill remote task dolphinscheduler-remoteshell-74145
[ERROR] 2024-04-10 04:01:02.875 +0800 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[140] - [WorkflowInstance-72475][TaskInstance-74145] - Cancel task failed, this will not affect the taskInstance status, but you need to check manual
org.apache.dolphinscheduler.plugin.task.api.TaskException: cancel application error
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.cancel(RemoteShellTask.java:121)
        at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.cancelTask(WorkerTaskExecuteRunnable.java:136)
        at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.afterThrowing(WorkerTaskExecuteRunnable.java:118)
        at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerDelayTaskExecuteRunnable.afterThrowing(DefaultWorkerDelayTaskExecuteRunnable.java:67)
        at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable.run(WorkerTaskExecuteRunnable.java:182)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.dolphinscheduler.plugin.task.api.TaskException: SSH connection failed
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:83)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.runRemote(RemoteExecutor.java:208)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getTaskPid(RemoteExecutor.java:184)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.kill(RemoteExecutor.java:176)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteShellTask.cancel(RemoteShellTask.java:119)
        ... 11 common frames omitted
Caused by: java.lang.IllegalStateException: SshClient not started. Please call start() method before connecting to a server
        at org.apache.sshd.client.SshClient.doConnect(SshClient.java:627)
        at org.apache.sshd.client.SshClient.doConnect(SshClient.java:616)
        at org.apache.sshd.client.SshClient.connect(SshClient.java:547)
        at org.apache.sshd.client.SshClient.connect(SshClient.java:539)
        at org.apache.sshd.client.session.ClientSessionCreator.connect(ClientSessionCreator.java:74)
        at org.apache.sshd.client.session.ClientSessionCreator.connect(ClientSessionCreator.java:57)
        at org.apache.dolphinscheduler.plugin.datasource.ssh.SSHUtils.getSession(SSHUtils.java:41)
        at org.apache.dolphinscheduler.plugin.task.remoteshell.RemoteExecutor.getSession(RemoteExecutor.java:78)
        ... 15 common frames omitted

Here is a snapshot of the host's memory:

image

What you expected to happen

RemoteShell tasks execute without memory leaks.

How to reproduce

Create RemoteShell tasks and schedule them to run at short intervals.

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

simmonn added the bug and Waiting for reply labels Apr 15, 2024

ruanwenjun (Member) commented Apr 15, 2024

Should be fixed by #15348

ruanwenjun added the Waiting for user feedback label and removed the Waiting for reply label Apr 15, 2024
ruanwenjun self-assigned this Apr 15, 2024

simmonn (Author) commented Apr 16, 2024

> Should be fixed by #15348

I had already applied the fix you mentioned locally, but the memory leak still exists.
image
image
image

ruanwenjun (Member) commented

Have you closed the RemoteExecutor?

    // add task close method to release resource
    try (RemoteExecutor executor = remoteExecutor) {
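The fragment above relies on Java's try-with-resources, which calls `close()` on the resource even when the body throws. A minimal sketch of that guarantee follows. `FakeExecutor` is a hypothetical stand-in that only records lifecycle events. It is not the project's `RemoteExecutor`, and it does not open an SSH session.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo; FakeExecutor merely stands in for a resource-holding executor.
final class CloseDemo {

    static final class FakeExecutor implements AutoCloseable {
        private final List<String> events;

        FakeExecutor(List<String> events) {
            this.events = events;
            events.add("open");
        }

        void run(boolean fail) {
            events.add("run");
            if (fail) {
                throw new IllegalStateException("simulated task failure");
            }
        }

        @Override
        public void close() {
            // A real executor would release its session and client here.
            events.add("close");
        }
    }

    /** Runs one simulated task and returns the recorded lifecycle events. */
    static List<String> runOnce(boolean fail) {
        List<String> events = new ArrayList<>();
        try (FakeExecutor executor = new FakeExecutor(events)) {
            executor.run(fail);
        } catch (IllegalStateException e) {
            // The resource has already been closed by the time this block runs.
            events.add("caught");
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(runOnce(false)); // [open, run, close]
        System.out.println(runOnce(true));  // [open, run, close, caught]
    }
}
```

Closing the executor this way only helps if its `close()` releases everything it created. Threads started by an underlying client would still leak if `close()` does not stop them.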

simmonn (Author) commented Apr 16, 2024

> executor

Yes, I have. Here is the code:
image

ruanwenjun (Member) commented Apr 16, 2024

@simmonn Could you please provide the heap dump file or stack info? Maybe this is caused by a thread leak.
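A thread leak usually shows up as one family of thread names whose count keeps growing between samples. The sketch below groups the live threads of the current JVM by name prefix using only the standard library. The class `ThreadCensus` and the suffix-stripping rule are assumptions made for illustration. The same grouping can be done offline on `jstack` output.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of DolphinScheduler.
final class ThreadCensus {

    /** Strips a trailing numeric suffix such as "-3" so that pooled threads group together. */
    static String groupKey(String threadName) {
        return threadName.replaceAll("[-#]?\\d+$", "");
    }

    /** Counts the live threads of this JVM, grouped by name prefix. */
    static Map<String, Integer> countByPrefix() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(groupKey(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countByPrefix().forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

Taking this census twice, a few minutes apart, and comparing the counts points to the thread family that is not being shut down.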

ruanwenjun (Member) commented

> Should be fixed by #15348
>
> I had already applied the fix you mentioned locally, but the memory leak still exists.
> image image image

Please try changing the heartbeat type to IGNORE:

    session.setSessionHeartbeat(SessionHeartbeatController.HeartbeatType.IGNORE, Duration.ofSeconds(3));

simmonn (Author) commented Apr 17, 2024

> @simmonn Could you please provide the heap dump file or stack info? Maybe this is caused by a thread leak.

I simulated it in the test environment. Executing the SSH command via a Shell task works fine. However, when executing tasks via RemoteShell, the tasks also get stuck after a while. Here is the stack info:
stack_417.jstack.gz

peak-xu commented Apr 17, 2024

> @simmonn Could you please provide the heap dump file or stack info? Maybe this is caused by a thread leak.

I encountered the same problem: #15812

ruanwenjun (Member) commented Apr 17, 2024

Yes, I changed the heartbeat to IGNORE, and that resolves the thread leak. Please test. @simmonn @peak-xu

simmonn (Author) commented Apr 17, 2024

> Yes, I changed the heartbeat to IGNORE, and that resolves the thread leak. Please test. @simmonn @peak-xu

Thank you, I'll try this approach.

peak-xu commented Apr 17, 2024

> Yes, I changed the heartbeat to IGNORE, and that resolves the thread leak. Please test. @simmonn @peak-xu

Do I need to compile and package the dev branch code myself for testing? It may take some time.

ruanwenjun (Member) commented

> > Yes, I changed the heartbeat to IGNORE, and that resolves the thread leak. Please test. @simmonn @peak-xu
>
> Do I need to compile and package the dev branch code myself for testing? It may take some time.

Yes, or you can update your code directly.

ruanwenjun added the priority:high label and removed the Waiting for user feedback label Apr 17, 2024

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions bot added the Stale label May 25, 2024

github-actions bot commented Jun 1, 2024

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions bot closed this as completed Jun 1, 2024