
GelfLogHandler fails to recover if channel is closed in GelfUDPSender #92

Closed
cfraenkel opened this issue Jul 29, 2016 · 8 comments
Labels: type: bug A general bug

@cfraenkel

I'm using logstash-gelf-1.9.0 on JBoss EAP 6.4 deployed as a module and configured as a custom handler.

In my tests, the DatagramChannel sometimes gets closed. I have no reproducible test case, but every occurrence I've seen coincided with a thread cleanup by the Arjuna transaction reaper thread.

After the exception occurs, the GelfLogHandler does not recover and no further log messages are sent.

The following log message is printed once to stderr by the ErrorManager:

java.io.IOException: Cannot send data to hostname/10.10.200.6:12201
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendDatagrams(GelfUDPSender.java:59)
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendMessage(GelfUDPSender.java:49)
        at biz.paluch.logging.gelf.jul.GelfLogHandler.publish(GelfLogHandler.java:124)
        at biz.paluch.logging.gelf.jboss7.JBoss7GelfLogHandler.publish(JBoss7GelfLogHandler.java:85)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:302)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.Logger.logRaw(Logger.java:719)
        at org.slf4j.impl.Slf4jLogger.log(Slf4jLogger.java:326)
        at org.slf4j.impl.Slf4jLogger.log(Slf4jLogger.java:70)
        at org.apache.commons.logging.impl.SLF4JLocationAwareLog.info(SLF4JLocationAwareLog.java:159)
        at org.quartz.plugins.history.LoggingTriggerHistoryPlugin.triggerMisfired(LoggingTriggerHistoryPlugin.java:402)
        at org.quartz.core.QuartzScheduler.notifyTriggerListenersMisfired(QuartzScheduler.java:1822)
        at org.quartz.core.SchedulerSignalerImpl.notifyTriggerListenersMisfired(SchedulerSignalerImpl.java:76)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.recoverMisfiredJobs(JobStoreSupport.java:961)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.doRecoverMisfires(JobStoreSupport.java:3126)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.manage(JobStoreSupport.java:3887)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.run(JobStoreSupport.java:3907)
Caused by: java.nio.channels.ClosedChannelException
        at sun.nio.ch.DatagramChannelImpl.ensureOpen(DatagramChannelImpl.java:320)
        at sun.nio.ch.DatagramChannelImpl.write(DatagramChannelImpl.java:594)
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendDatagrams(GelfUDPSender.java:56)
        ... 20 more

I see you added code to handle this in the GelfTcpSender with #83. Would it be an option to clear the gelfSender in GelfLogHandler on a failed publish? That might solve these issues in general instead of handling them on a case-by-case basis.
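
To sketch the idea (with made-up names, since I haven't dug into the handler internals; this is not the library's actual code):

```java
import java.util.function.Supplier;
import java.util.logging.ErrorManager;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

// Hypothetical sketch of the proposed recovery, not the library's actual code:
// on a failed send, drop the sender so the next publish() recreates it.
public class RecoveringGelfHandlerSketch extends Handler {

    // Stand-in for the real GelfSender interface.
    interface GelfSender {
        void sendMessage(String message) throws Exception;
        void close();
    }

    private final Supplier<GelfSender> senderFactory;
    private GelfSender gelfSender;

    public RecoveringGelfHandlerSketch(Supplier<GelfSender> senderFactory) {
        this.senderFactory = senderFactory;
    }

    @Override
    public synchronized void publish(LogRecord record) {
        try {
            if (gelfSender == null) {
                gelfSender = senderFactory.get(); // lazily (re)create after a failure
            }
            gelfSender.sendMessage(record.getMessage());
        } catch (Exception e) {
            // Clear the broken sender instead of handling every transport
            // failure case by case; the next publish() reconnects.
            closeSender();
            reportError("Cannot send GELF message", e, ErrorManager.WRITE_FAILURE);
        }
    }

    private void closeSender() {
        if (gelfSender != null) {
            gelfSender.close();
            gelfSender = null;
        }
    }

    @Override
    public void flush() {
        // nothing buffered in this sketch
    }

    @Override
    public synchronized void close() {
        closeSender();
    }
}
```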

@mp911de
Owner

mp911de commented Jul 30, 2016

I started working on the issue a while ago on branch issue/88-disconnected-detect, but I was never able to reproduce socket failures with UDP. Could you give the code from issue/88-disconnected-detect a spin and test whether it solves your issue?

@mp911de mp911de added the type: bug A general bug label Jul 30, 2016
@cfraenkel
Author

I'll give it a try, but I suspect I'll have a hard time getting a reproducible test case.
Regarding your proposed solution: do you really have to do a read()? Calling java.nio.channels.Channel.isOpen() should return false, at least in the above case.
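
Something like this is what I had in mind; a minimal sketch with invented field names, assuming the sender keeps a connected DatagramChannel:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Sketch of the isOpen() approach (invented names, not the actual sender code):
// reopen the channel before writing if something, e.g. a thread interrupt,
// closed it in the meantime.
class UdpSenderSketch {
    private final InetSocketAddress target = new InetSocketAddress("localhost", 12201);
    private DatagramChannel channel;

    void send(ByteBuffer datagram) throws IOException {
        if (channel == null || !channel.isOpen()) {
            channel = DatagramChannel.open(); // recover from a closed channel
            channel.connect(target);
        }
        channel.write(datagram);
    }
}
```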

@cfraenkel
Author

I was able to reproduce the behaviour and the exception on 1.9 by causing transaction timeouts in the JBoss application server.
After that, I deployed a build from the https://github.com/mp911de/logstash-gelf/tree/issue/88-disconnected-detect branch. I caused further timeouts, from which the branch code successfully recovered. I also see the causing exceptions in my ELK stack now (they did not appear previously).
I did get a new exception on stderr from the LogManager:

LogManager error of type GENERIC_FAILURE: null
java.io.IOException: Cannot send data to hostname:12201
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendDatagrams(GelfUDPSender.java:45)
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendMessage(GelfUDPSender.java:28)
        at biz.paluch.logging.gelf.jul.GelfLogHandler.publish(GelfLogHandler.java:124)
        at biz.paluch.logging.gelf.jboss7.JBoss7GelfLogHandler.publish(JBoss7GelfLogHandler.java:85)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:302)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.Logger.logRaw(Logger.java:719)
        at org.jboss.logmanager.Logger.log(Logger.java:670)
        at org.jboss.logging.JBossLogManagerLogger.doLog(JBossLogManagerLogger.java:42)
        at org.jboss.logging.Logger.logv(Logger.java:1971)
        at com.arjuna.ats.arjuna.logging.arjunaI18NLogger_$logger.warn_coordinator_TransactionReaper_13(arjunaI18NLogger_$logger.java:3048)
        at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:406)
        at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)
Caused by: java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.DatagramChannelImpl.read(DatagramChannelImpl.java:554)
        at biz.paluch.logging.gelf.intern.sender.AbstractNioSender.isConnected(AbstractNioSender.java:46)
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendDatagrams(GelfUDPSender.java:35)
        ... 15 more

Logging to Logstash resumed after that exception, though, so I expect I only "lost" a single log message.
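
If even that one message mattered, a retry after the interrupt would probably cover it; a rough extension of the UdpSenderSketch from my earlier comment (again hypothetical, not the library's code):

```java
// Added to the UdpSenderSketch class from the earlier comment (hypothetical,
// not library code). Needs: import java.nio.channels.ClosedByInterruptException;
// Retries once when an interrupt closed the channel mid-send. The interrupt
// flag must be cleared for the retry, because NIO closes an interruptible
// channel again as soon as an interrupted thread touches it.
void sendWithRetry(ByteBuffer datagram) throws IOException {
    try {
        send(datagram);
    } catch (ClosedByInterruptException e) {
        boolean wasInterrupted = Thread.interrupted(); // clear, so the retry can do I/O
        try {
            datagram.rewind();
            send(datagram); // send() reopens the closed channel (see earlier sketch)
        } finally {
            if (wasInterrupted) {
                Thread.currentThread().interrupt(); // restore for the caller
            }
        }
    }
}
```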

@mp911de
Owner

mp911de commented Aug 1, 2016

The issue with isOpen() is that connections don't flip the bit immediately; it always requires I/O to happen. Previously, the senders wrote something, the data got lost, and only the next write ran into an IOException or isOpen() reporting false. I felt using read() should do no harm because:

  1. Graylog/Logstash do not respond anyway.
  2. It's a non-blocking operation, so it should not cause too much impact.

That sounds like your issue is solved.
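
For reference, the shape of that probe, as a simplified sketch (the actual AbstractNioSender code differs in the details):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Simplified sketch of the read-probe idea (the actual AbstractNioSender code
// differs in the details): a read() forces real I/O, so a dead channel throws
// (e.g. ClosedChannelException) instead of isOpen() lazily reporting true.
final class ConnectedProbe {

    // Assumes a connected channel opened with configureBlocking(false).
    static boolean isConnected(DatagramChannel channel, ByteBuffer scratch) {
        try {
            scratch.clear();
            // Graylog/Logstash never send anything back, so in non-blocking
            // mode this read normally just returns 0 immediately.
            channel.read(scratch);
            return true;
        } catch (IOException e) {
            return false; // channel is broken; the caller should reconnect
        }
    }
}
```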

@cfraenkel
Author

Yes, the issue was solved using the branch you provided. Do you plan to merge it for 1.11?

@mp911de
Owner

mp911de commented Aug 1, 2016

I totally missed that it's already merged with 90e9429. It will be part of the upcoming 1.11 release. I have one ticket left (reduce GC pressure) before I'll be able to release 1.11. Expect a release in the next 1-2 months.

@mp911de mp911de closed this as completed Aug 1, 2016
@mp911de mp911de added this to the logstash-gelf 1.11.0 milestone Aug 1, 2016
@cfraenkel
Author

Great to hear, thanks for making this library, and making it better.

@mp911de
Owner

mp911de commented Aug 1, 2016

Would not be possible without a great community.
