Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build crashes due to timeout or deamon not responding in time #161

Open
deradam opened this issue Oct 27, 2020 · 30 comments
Open

Build crashes due to timeout or deamon not responding in time #161

deradam opened this issue Oct 27, 2020 · 30 comments
Labels
bug Something isn't working

Comments

@deradam
Copy link

deradam commented Oct 27, 2020

Hi there,

when building a rather large project (395 maven modules), mvnd runs for few minutes and then radomly "crashes" with the following error.

Exception in thread "main" org.jboss.fuse.mvnd.common.DaemonException$StaleAddressException: Could not receive a message from the daemon.
	at org.jboss.fuse.mvnd.client.DaemonClientConnection.receive(DaemonClientConnection.java:107)
	at org.jboss.fuse.mvnd.client.DefaultClient.execute(DefaultClient.java:198)
	at org.jboss.fuse.mvnd.client.DefaultClient.main(DefaultClient.java:72)
Caused by: java.io.IOException: No message received within 3000ms, daemon may have crashed
	at org.jboss.fuse.mvnd.client.DaemonClientConnection.receive(DaemonClientConnection.java:100)
	... 2 more

However if I check the running processes, I can still see mvnd building the project in the background. Is there a setting for the timeout to the deamon process?

Thanks for the nice tool and your support!

@ppalaga
Copy link
Contributor

ppalaga commented Oct 27, 2020

Could you please check the logs in $MVND_HOME/daemon?

@deradam
Copy link
Author

deradam commented Oct 27, 2020

@ppalaga thanks for the quick reply. I check the logs and see the following.

10:24:36.388 E Error dispatching events
org.jboss.fuse.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message BuildEvent{projectId='backend-core', type=MojoStarted, display=':backend-core:org.apache.maven.plugins:maven-resources-plugin:2.6:resources {execution: default-resources}'} to '/127.0.0.1:62278'.
	at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
	at org.jboss.fuse.mvnd.daemon.Server.lambda$handle$2(Server.java:428)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
	... 2 common frames omitted

@ppalaga
Copy link
Contributor

ppalaga commented Oct 27, 2020

What's your operating system? Is it always reproducible?

@deradam
Copy link
Author

deradam commented Oct 27, 2020

➜  ~ mvnd -version
Maven Daemon 0.0.10-darwin-amd64 (native)
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /usr/local/Cellar/mvnd/0.0.10/libexec/mvn
Java version: 1.8.0_265, vendor: AdoptOpenJDK, runtime: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "mac os x", version: "10.15.7", arch: "x86_64", family: "mac"

Funny enough, 10 minutes later ->

10:34:40.119 I Build finished, finishing message dispatch
10:34:40.120 I Daemon back to idle
10:34:40.120 I Updating state to: Idle
10:34:42.366 I Waiting for request
10:34:42.366 D daemon is running. Sleeping until state changes.

And yes, it is reproducible.

@ppalaga
Copy link
Contributor

ppalaga commented Oct 27, 2020

So the daemon finishes building the request, the problem is only that the client disconnects before it is finished?

@gnodet
Copy link
Contributor

gnodet commented Oct 27, 2020

So the daemon finishes building the request, the problem is only that the client disconnects before it is finished?

Yes, I think that's correct. The client disconnects because it has not received any message (including the keep-alive ones) for 3 seconds. That's a new behaviour for the keep-alive.
Two properties can be used to raise the keep alive timeout (daemon.keepAliveMs) and the multiplier after which the client disconnects (daemon.maxLostKeepAlive). You may want to raise them, but I'm not sure how this would change anything.
I suspect something wrong happened in the dispatch loop of the daemon.
Can you check if you have anything in the daemon log after the last "Dispatch message" event.

@deradam
Copy link
Author

deradam commented Oct 27, 2020

Okay, indeed it is running if I add the following parameter -Ddaemon.keepAliveMs=10000. I guess I can add it somewhere "globally" so I do not need to add this with each and every call?

@gnodet
Copy link
Contributor

gnodet commented Oct 28, 2020

I've been able to reproduce the problem, but in my case, it was a follow-up of #155 : when building the java doc, the daemon try to send a big message, fails with an exception, thus the client fail 3s later.

@gnodet
Copy link
Contributor

gnodet commented Oct 28, 2020

@deradam the daemon.keepAliveMs also effects the client refresh rate in case there are no other messages, so you may prefer keeping daemon.keepAliveMs=1000 and raise daemon.maxLostKeepAlive=10.

The global properties location is ~/.m2/mvnd.properties.

@gnodet
Copy link
Contributor

gnodet commented Oct 28, 2020

@deradam If that's easily reproductible, would you mind checking if you have an error in the daemon log prior to the one you pasted above (which is the result of the client dropping the connection) ? In my case, I had the one in #155, so I wonder if that's the same for you.

@deradam
Copy link
Author

deradam commented Oct 29, 2020

Yes, I'm on it. However I am fighting with the overall build now :/

@deradam
Copy link
Author

deradam commented Nov 2, 2020

Hi @gnodet,

so this is the error that I am always getting when building the project.

10:01:18.565 I Dispatch message: KeepAliveMessage{}
10:01:18.566 I Dispatch message: BuildEvent{projectId='backend-core', type=MojoStopped, display=':backend-core:org.apache.cxf:cxf-codegen-plugin:3.2.8:wsdl2java {execution: generate-sources-france-routing-ws}'}
10:01:18.567 E Error dispatching events
org.jboss.fuse.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message BuildEvent{projectId='backend-core', type=MojoStopped, display=':backend-core:org.apache.cxf:cxf-codegen-plugin:3.2.8:wsdl2java {execution: generate-sources-routing-ws}'} to '/127.0.0.1:59959'.
	at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
	at org.jboss.fuse.mvnd.daemon.Server.lambda$handle$2(Server.java:428)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
	... 2 common frames omitted

As you can see, it "breaks" when maven is calling wsdl2java which seem to run for quite some time. I don't know if this helps you in any way. At least I can overcome this with the properties proposed above.

Thanks for your support!

@DoPaNik
Copy link

DoPaNik commented Jan 26, 2021

hi together,

unfortunately I have the same problem.
I have already increased the values for the keepAliveMs and maxLostKeepAlive:

daemon.keepAliveMs=10000
daemon.maxLostKeepAlive=10

Problem occurs in multi-thread as well as with only one thread mode...
I have also already tried to increase the heap... without success so far!

Following system infos and the error log:

Terminal: org.jline.terminal.impl.PosixSysTerminal with pty org.jline.terminal.impl.jansi.linux.LinuxNativePty
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /home/6024/.sdkman/candidates/mvnd/0.4.1/mvn
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /home/ubuntu_entwshare/develop/java/jdk1.8.0_191/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-55-generic", arch: "amd64", family: "unix"
11:41:05.076 I Dispatch message: KeepAlive
11:41:05.176 I Dispatch message: KeepAlive
11:41:05.276 I Dispatch message: KeepAlive
11:41:05.376 I Dispatch message: KeepAlive
11:41:05.477 I Dispatch message: KeepAlive
11:41:05.577 I Dispatch message: KeepAlive
11:41:05.677 I Dispatch message: KeepAlive
11:41:05.777 I Dispatch message: KeepAlive
11:41:05.877 I Dispatch message: KeepAlive
11:41:05.978 I Dispatch message: KeepAlive
11:41:06.244 I Dispatch message: KeepAlive
11:41:06.345 I Dispatch message: KeepAlive
11:41:06.445 I Dispatch message: KeepAlive
11:41:06.545 I Dispatch message: KeepAlive
11:41:11.179 I Dispatch message: KeepAlive
11:41:11.280 I Dispatch message: KeepAlive
11:41:11.280 E Error dispatching events
org.mvndaemon.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message KeepAlive to '/127.0.0.1:59906'.
        at org.mvndaemon.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
        at org.mvndaemon.mvnd.daemon.Server.lambda$handle$3(Server.java:460)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Datenübergabe unterbrochen (broken pipe)
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
        at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
        at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.mvndaemon.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
        ... 2 common frames omitted

Do you have any ideas?

@DoPaNik
Copy link

DoPaNik commented Feb 22, 2021

Any updates?

@ppalaga
Copy link
Contributor

ppalaga commented Feb 22, 2021

@DoPaNik seems to be like #291 Tweaking the props should be right strategy, but the ones you mention are either not properly spelled or have weird values. As step one, I'd recommend increasing mvnd.maxLostKeepAlive from its default 30 to 60, 120, etc.

daemon.keepAliveMs does not exist in mvnd 0.4.1, the correct name is mvnd.keepAlive.

There is a pause of 3+seconds between the following two messages in your log:

11:41:06.545 I Dispatch message: KeepAlive
11:41:11.179 I Dispatch message: KeepAlive

Given your config, tt should be around 100ms.

It would be interesting to figure what's the cause. You may want to attach jconsole or similar to the daemon process to see whether it is GC or something else.

@gnodet
Copy link
Contributor

gnodet commented Feb 22, 2021

I've also hit the problem a few times under heavy cpu load. I wonder if we should slightly raise a bit the default values.

@ppalaga
Copy link
Contributor

ppalaga commented Feb 22, 2021

I think we should try to figure out first where the bottle neck is. Is that overall CPU contention? Or is that client threads blocking each other? Daemon threads blocking each other? Not all of those potential issues would necessarily be solved by increasing mvnd.maxLostKeepAlive

@gnodet
Copy link
Contributor

gnodet commented Feb 22, 2021

I think we should try to figure out first where the bottle neck is. Is that overall CPU contention? Or is that client threads blocking each other? Daemon threads blocking each other? Not all of those potential issues would necessarily be solved by increasing mvnd.maxLostKeepAlive

This is most probably something like a big GC pause under heavy CPU usage. I really don't think this comes from blocking threads. The threading model is not that complicated and this would happen more often I think.

In addition, if there's a deadlock, raising the timeout will definitely not help. So the opposite is true: raising the timeout will remove possible problems with GC/high CPU usage why not hiding other problems, so it looks quite safe to do it anyway...

@DoPaNik
Copy link

DoPaNik commented Feb 22, 2021

Thank you for the tips. Now I have set the mvnd.properties to the following values:

mvnd.keepAlive=10000
mvnd.maxLostKeepAlive=120

This looks much better at first glance. But the build breaks off after about 4 minutes with the following error:

[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  04:29 min (Wall Clock)
[INFO] Finished at: 2021-02-22T11:13:00+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project base: Compilation failure
[ERROR] An unknown compilation problem occurred
[ERROR] 
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project base: Compilation failure
An unknown compilation problem occurred

 

        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:215)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
        at org.mvndaemon.mvnd.builder.SmartBuilderImpl.buildProject(SmartBuilderImpl.java:178)
        at org.mvndaemon.mvnd.builder.SmartBuilderImpl$ProjectBuildTask.run(SmartBuilderImpl.java:198)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.maven.plugin.compiler.CompilationFailureException: Compilation failure
An unknown compilation problem occurred

 

        at org.apache.maven.plugin.compiler.AbstractCompilerMojo.execute(AbstractCompilerMojo.java:1220)
        at org.apache.maven.plugin.compiler.CompilerMojo.execute(CompilerMojo.java:187)
        at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
        ... 10 common frames omitted
[ERROR] 
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :base

Looks like a heap size issue to me. Unfortunately, I can't connect to the jvm via jconsole at the moment.
I will simply increase the heap size.

Is there a hack to start with the jconsole? I simply have tried it via jconsole UI to connect via the local connection dialog.

@ppalaga
Copy link
Contributor

ppalaga commented Feb 22, 2021

Looks like a heap size issue to me.

Have you tried increasing the heap via mvnd.maxHeapSize ?

I simply have tried it via jconsole UI to connect via the local connection dialog.

That's what I usually do. You should see the daemon process in the list. You may want to start the daemon eagerly via mvnd --version.

@ppalaga
Copy link
Contributor

ppalaga commented Feb 22, 2021

This is most probably something like a big GC pause under heavy CPU usage.

Yes, I also see it as the main suspect :) But it would still be nice to get a witness who sees a GC happening at that time via some monitoring tool.

By increasing the defaults we just move the threshold for the client disconnect, but we do not help much those poor devs whose build is still bigger than that new threshold. Maybe we should try harder on the client side to figure out what is happening with the daemon. If the process is still up, we could perhaps prolong the grace period, but output some sort of a warning that something unusual may be going on. Also we should give a hint which particular config options should be tweaked if the user thinks the situation is normal.

And finally, if somebody really sees GC as the root cause, it would be extremely interesting to see whether using shenandoah (or any other low pause gc) would help.

@DoPaNik
Copy link

DoPaNik commented Feb 22, 2021

Have you tried increasing the heap via mvnd.maxHeapSize ?

Yep. I have set the heap to 8G, Which should be enough. Unfortunately still the same error.
I just started again with 16 - that should be way too much, but then we can exclude this as problem

@DoPaNik
Copy link

DoPaNik commented Feb 22, 2021

Yes, I also see it as the main suspect :) But it would still be nice to get a witness who sees a GC happening at that time via some monitoring tool.

Of course. I try to collect data and then post them.

@DoPaNik
Copy link

DoPaNik commented Feb 22, 2021

Fun fact: With 16G started build is still running ...
daemon: 1b1999da threads used/hidden/max: 1/0/11 progress: 6/16 37% time: 82:02

A normal maven build takes about 30 min.

I am trying to provide data via jconsole as soon as possible

@gnodet
Copy link
Contributor

gnodet commented Feb 22, 2021

I also wonder if the messages ordering on the server side isn't problematic:
https://github.com/mvndaemon/mvnd/blob/5fa1f3bd2f4a80245e06562078946d2af8703c66/daemon/src/main/java/org/mvndaemon/mvnd/daemon/Server.java#L596-L597
The keep alive messages are low priority, so it could happen that they are almost never sent to the client. The ordering could also be the problem for #362

@gnodet gnodet added the bug Something isn't working label Feb 23, 2021
@DoPaNik
Copy link

DoPaNik commented Mar 16, 2021

@ppalaga as promised, albeit a bit late, here is the output from jconsole:
jconsole_cpu.txt
jconsole_heap.txt
jconsole_thread.txt
Bildschirmfoto vom 2021-03-16 10-49-00

After I set the heap to 8 gb, the build runs without errors. But apparently mvnd had not done anything for a long time and waited in the middle of the run....

@lanmaoxinqing
Copy link
Contributor

Raise the keep-alive timeout seems to no effects now.
https://github.com/lanmaoxinqing/mvnd/runs/2902075523?check_suite_focus=true#step:7:1100

@findepi
Copy link
Member

findepi commented Jan 18, 2022

Looks like a heap size issue to me.

Have you tried increasing the heap via mvnd.maxHeapSize ?

@ppalaga i run into a timeout issue upon today too.
In my case it was a memory issue. Would it be possible for mvnd to detect memory pressure and report it?
It took me some time to connect timeout reported by the CLI side with mvnd memory settings.

FWIW, in Trino, we use https://github.com/airlift/airlift/blob/7e987a2c7f55c1703b9ea383daa32c657d29b7dd/stats/src/main/java/io/airlift/stats/JmxGcMonitor.java#L77 to report GC pauses.

@gnodet
Copy link
Contributor

gnodet commented Jan 18, 2022

Looks like a heap size issue to me.

Have you tried increasing the heap via mvnd.maxHeapSize ?

@ppalaga i run into a timeout issue upon today too. In my case it was a memory issue. Would it be possible for mvnd to detect memory pressure and report it? It took me some time to connect timeout reported by the CLI side with mvnd memory settings.

It's already supposed to : https://github.com/apache/maven-mvnd/blob/master/daemon/src/main/java/org/mvndaemon/mvnd/daemon/DaemonExpiration.java
It may be broken somehow...

FWIW, in Trino, we use https://github.com/airlift/airlift/blob/7e987a2c7f55c1703b9ea383daa32c657d29b7dd/stats/src/main/java/io/airlift/stats/JmxGcMonitor.java#L77 to report GC pauses.

@delanym
Copy link
Contributor

delanym commented Feb 13, 2023

Is this still an issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants