
GelfLogHandler fails to recover if channel is closed in GelfUDPSender #92

Closed
cfraenkel opened this issue Jul 29, 2016 · 8 comments
Labels: type: bug A general bug

@cfraenkel

I'm using logstash-gelf-1.9.0 on JBoss EAP 6.4 deployed as a module and configured as a custom handler.

In my tests, the DatagramChannel sometimes gets closed. I have no reproducible test case, but every occurrence I've seen coincided with a thread cleanup by the Arjuna transaction reaper thread.

After the exception occurs, the GelfLogHandler does not recover and no further log messages are sent.

The following log message is printed once to stderr by the ErrorManager:

java.io.IOException: Cannot send data to hostname/10.10.200.6:12201
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendDatagrams(GelfUDPSender.java:59)
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendMessage(GelfUDPSender.java:49)
        at biz.paluch.logging.gelf.jul.GelfLogHandler.publish(GelfLogHandler.java:124)
        at biz.paluch.logging.gelf.jboss7.JBoss7GelfLogHandler.publish(JBoss7GelfLogHandler.java:85)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:302)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.Logger.logRaw(Logger.java:719)
        at org.slf4j.impl.Slf4jLogger.log(Slf4jLogger.java:326)
        at org.slf4j.impl.Slf4jLogger.log(Slf4jLogger.java:70)
        at org.apache.commons.logging.impl.SLF4JLocationAwareLog.info(SLF4JLocationAwareLog.java:159)
        at org.quartz.plugins.history.LoggingTriggerHistoryPlugin.triggerMisfired(LoggingTriggerHistoryPlugin.java:402)
        at org.quartz.core.QuartzScheduler.notifyTriggerListenersMisfired(QuartzScheduler.java:1822)
        at org.quartz.core.SchedulerSignalerImpl.notifyTriggerListenersMisfired(SchedulerSignalerImpl.java:76)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.recoverMisfiredJobs(JobStoreSupport.java:961)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.doRecoverMisfires(JobStoreSupport.java:3126)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.manage(JobStoreSupport.java:3887)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.run(JobStoreSupport.java:3907)
Caused by: java.nio.channels.ClosedChannelException
        at sun.nio.ch.DatagramChannelImpl.ensureOpen(DatagramChannelImpl.java:320)
        at sun.nio.ch.DatagramChannelImpl.write(DatagramChannelImpl.java:594)
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendDatagrams(GelfUDPSender.java:56)
        ... 20 more

I see you added code to handle this in the GelfTcpSender with #83. Would it be an option to clear the gelfSender in GelfLogHandler on a failed publish? That might solve these issues in general instead of handling them on a case-by-case basis.
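
To sketch the idea (with made-up names, since I haven't dug into the handler internals; this is not the library's actual code):

```java
import java.util.function.Supplier;
import java.util.logging.ErrorManager;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

// Hypothetical sketch of the proposed recovery, not the library's actual code:
// on a failed send, drop the sender so the next publish() recreates it.
public class RecoveringGelfHandlerSketch extends Handler {

    // Stand-in for the real GelfSender interface.
    interface GelfSender {
        void sendMessage(String message) throws Exception;
        void close();
    }

    private final Supplier<GelfSender> senderFactory;
    private GelfSender gelfSender;

    public RecoveringGelfHandlerSketch(Supplier<GelfSender> senderFactory) {
        this.senderFactory = senderFactory;
    }

    @Override
    public synchronized void publish(LogRecord record) {
        try {
            if (gelfSender == null) {
                gelfSender = senderFactory.get(); // lazily (re)create after a failure
            }
            gelfSender.sendMessage(record.getMessage());
        } catch (Exception e) {
            // Clear the broken sender instead of handling every transport
            // failure case by case; the next publish() reconnects.
            closeSender();
            reportError("Cannot send GELF message", e, ErrorManager.WRITE_FAILURE);
        }
    }

    private void closeSender() {
        if (gelfSender != null) {
            gelfSender.close();
            gelfSender = null;
        }
    }

    @Override
    public void flush() {
        // nothing buffered in this sketch
    }

    @Override
    public synchronized void close() {
        closeSender();
    }
}
```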

@mp911de
Owner

mp911de commented Jul 30, 2016

I started working on the issue a while ago on branch issue/88-disconnected-detect, but I was never able to reproduce socket failures with UDP. Could you give the code from issue/88-disconnected-detect a spin and test whether it solves your issue?

@mp911de mp911de added the type: bug A general bug label Jul 30, 2016
@cfraenkel
Author

I'll give it a try, but I suspect I'll have a hard time getting a reproducible test case.
Regarding your proposed solution: do you really have to do a read()? Calling java.nio.channels.Channel.isOpen() should return false, at least in the above case.
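
Something like this is what I had in mind; a minimal sketch with invented field names, assuming the sender keeps a connected DatagramChannel:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Sketch of the isOpen() approach (invented names, not the actual sender code):
// reopen the channel before writing if something, e.g. a thread interrupt,
// closed it in the meantime.
class UdpSenderSketch {
    private final InetSocketAddress target = new InetSocketAddress("localhost", 12201);
    private DatagramChannel channel;

    void send(ByteBuffer datagram) throws IOException {
        if (channel == null || !channel.isOpen()) {
            channel = DatagramChannel.open(); // recover from a closed channel
            channel.connect(target);
        }
        channel.write(datagram);
    }
}
```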

@cfraenkel
Author

I was able to reproduce the behaviour and the exception on 1.9 by causing transaction timeouts in the JBoss application server.
After that, I deployed a build from the https://github.com/mp911de/logstash-gelf/tree/issue/88-disconnected-detect branch. I caused further timeouts, from which the branch code successfully recovered. I also see the causing exceptions in my ELK stack now (they did not appear previously).
I did get a new exception on stderr from the LogManager:

LogManager error of type GENERIC_FAILURE: null
java.io.IOException: Cannot send data to hostname:12201
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendDatagrams(GelfUDPSender.java:45)
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendMessage(GelfUDPSender.java:28)
        at biz.paluch.logging.gelf.jul.GelfLogHandler.publish(GelfLogHandler.java:124)
        at biz.paluch.logging.gelf.jboss7.JBoss7GelfLogHandler.publish(JBoss7GelfLogHandler.java:85)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:302)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.LoggerNode.publish(LoggerNode.java:310)
        at org.jboss.logmanager.Logger.logRaw(Logger.java:719)
        at org.jboss.logmanager.Logger.log(Logger.java:670)
        at org.jboss.logging.JBossLogManagerLogger.doLog(JBossLogManagerLogger.java:42)
        at org.jboss.logging.Logger.logv(Logger.java:1971)
        at com.arjuna.ats.arjuna.logging.arjunaI18NLogger_$logger.warn_coordinator_TransactionReaper_13(arjunaI18NLogger_$logger.java:3048)
        at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:406)
        at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)
Caused by: java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.DatagramChannelImpl.read(DatagramChannelImpl.java:554)
        at biz.paluch.logging.gelf.intern.sender.AbstractNioSender.isConnected(AbstractNioSender.java:46)
        at biz.paluch.logging.gelf.intern.sender.GelfUDPSender.sendDatagrams(GelfUDPSender.java:35)
        ... 15 more

Logging to Logstash resumed after that exception, though, so I expect I only "lost" a single log message.
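
If even that one message mattered, a retry after the interrupt would probably cover it; a rough extension of the UdpSenderSketch from my earlier comment (again hypothetical, not the library's code):

```java
// Added to the UdpSenderSketch class from the earlier comment (hypothetical,
// not library code). Needs: import java.nio.channels.ClosedByInterruptException;
// Retries once when an interrupt closed the channel mid-send. The interrupt
// flag must be cleared for the retry, because NIO closes an interruptible
// channel again as soon as an interrupted thread touches it.
void sendWithRetry(ByteBuffer datagram) throws IOException {
    try {
        send(datagram);
    } catch (ClosedByInterruptException e) {
        boolean wasInterrupted = Thread.interrupted(); // clear, so the retry can do I/O
        try {
            datagram.rewind();
            send(datagram); // send() reopens the closed channel (see earlier sketch)
        } finally {
            if (wasInterrupted) {
                Thread.currentThread().interrupt(); // restore for the caller
            }
        }
    }
}
```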

@mp911de
Owner

mp911de commented Aug 1, 2016

The issue with isOpen() is that connections don't flip the bit immediately; it always requires I/O to happen. Previously, the senders wrote something, the data got lost, and only the next write ran into an IOException or isOpen() reporting false. I felt using read() should do no harm because:

  1. Graylog/Logstash do not respond anyway.
  2. It's a non-blocking operation, so it should not cause too much impact.

That sounds like your issue is solved.
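
For reference, the shape of that probe, as a simplified sketch (the actual AbstractNioSender code differs in the details):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Simplified sketch of the read-probe idea (the actual AbstractNioSender code
// differs in the details): a read() forces real I/O, so a dead channel throws
// (e.g. ClosedChannelException) instead of isOpen() lazily reporting true.
final class ConnectedProbe {

    // Assumes a connected channel opened with configureBlocking(false).
    static boolean isConnected(DatagramChannel channel, ByteBuffer scratch) {
        try {
            scratch.clear();
            // Graylog/Logstash never send anything back, so in non-blocking
            // mode this read normally just returns 0 immediately.
            channel.read(scratch);
            return true;
        } catch (IOException e) {
            return false; // channel is broken; the caller should reconnect
        }
    }
}
```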

@cfraenkel
Author

Yes, the issue was solved using the branch you provided. Do you plan to merge it for 1.11?

@mp911de
Owner

mp911de commented Aug 1, 2016

I totally missed that it's already merged with 90e9429. It will be part of the upcoming 1.11 release. I have one ticket left (reduce GC pressure) before I'll be able to release 1.11. Expect a release in the next 1-2 months.

@mp911de mp911de closed this as completed Aug 1, 2016
@mp911de mp911de added this to the logstash-gelf 1.11.0 milestone Aug 1, 2016
@cfraenkel
Author

Great to hear, thanks for making this library, and making it better.

@mp911de
Owner

mp911de commented Aug 1, 2016

Would not be possible without a great community.
