How to handle a (possibly) misbehaving logger? #390

Closed
kace opened this issue Jan 27, 2021 · 11 comments
Labels
documentation (Improvements or additions to documentation), question (Further information is requested)

Comments

kace commented Jan 27, 2021

Hi Delgan,

In one of my multithreaded service daemons I'm seeing that the logger will occasionally stop altogether. By that I mean the application continues to run as if there are no errors, but nothing is written to the logs and the files themselves aren't rotated. The file descriptors for the log files are still open too. I suspect that I'm either blocking on a single logging call or that there could have been an exception in the logger itself. Unfortunately I don't have a record of stderr to check this.

In the interest of debugging my problem I have a couple of questions:

  1. Could you add an example of handling errors in the Loguru logger (with catch set to False when adding a sink) to the documentation? Would this basically mean that every logging call needs to be put in a try/except block?
  2. If there is an exception in the logger, can it be restarted to make it sane again? If so, how can I accomplish that?
  3. If, on the other hand, I'm being blocked by a single logging call (not sure why this would happen), do you think that setting enqueue to True when adding a sink might solve my problem? Do logging calls eventually time out?

I use Loguru extensively in my work. Thanks for your effort!

Delgan (Owner) commented Jan 27, 2021

Hi @kace. ;)

Is it possible that you inadvertently used the logger in one of your handlers (in the sink or filter attribute, for example)? In such a case, that could cause a deadlock. If you can share your logging configuration, maybe I can help.

Answering your questions:

  1. Well, I never thought of this use case. But instead of wrapping every logging call, maybe you can just use a try/except block inside a (custom) sink (see the sketch after this list)?
  2. Theoretically, the logger should still be usable even if an exception occurs during one logging call (assuming catch=True). As long as the sink is still in a valid state, it should be ok.
  3. You can give it a try, although it should not be needed. Do you use multiprocessing or some third-party library spawning subprocesses?
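Regarding point 1, here is a minimal sketch of such a custom function sink handling its own errors (the safe_file_sink name and the file path are just placeholders, not an existing Loguru helper):

import sys

from loguru import logger


def safe_file_sink(path):
    # Illustrative wrapper: write each formatted message to `path` and handle
    # write errors ourselves rather than relying on catch=True.
    file = open(path, "a")

    def write(message):
        try:
            file.write(message)
            file.flush()
        except Exception as error:
            # Report the failure without going through the logger itself.
            print(f"Logging sink failed: {error!r}", file=sys.stderr)

    return write


logger.add(safe_file_sink("app.log"), catch=False)

With catch=False, an exception escaping the sink propagates to the logging call itself, so you would only need a try/except around the calls you really care about.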

kace (Author) commented Jan 29, 2021

Thanks for the quick response @Delgan!

I don't think I do too much with configuring the logger aside from setting up an intercept handler (following the documentation), adding a field to the extra dict, and setting a custom log format (which uses that custom extra field). Below are some snippets from the convenience module where I wrap Loguru.

import logging
import sys

import loguru


class InterceptHandler(logging.Handler):
    '''https://github.com/Delgan/loguru#entirely-compatible-with-standard-logging'''
    def emit(self, record):
        # Get corresponding Loguru level if it exists
        try:
            level = logger.level(record.levelname).name
        except ValueError:
            level = record.levelno

        # Find caller from where originated the logged message
        frame, depth = logging.currentframe(), 2
        while frame.f_code.co_filename == logging.__file__:
            frame = frame.f_back
            depth += 1

        # HACK: must do a .bind() here instead of passing `id` as a kwarg to the call to .log()
        # https://github.com/Delgan/loguru/issues/318
        logger.opt(depth=depth, exception=record.exc_info).bind(id=record.name).log(
            level,
            record.getMessage(),
        )


def enable_intercept_logging():
    logging.basicConfig(handlers=[InterceptHandler()], level=0)
    logger.debug(f"Redirecting from stdlib logging")


def disable_intercept_logging():
    l = logging.getLogger()
    for handler in l.handlers[:]:
        if isinstance(handler, InterceptHandler):
            l.removeHandler(handler)
    logger.debug(f"Redirecting from stdlib logging disabled")


def set_intercept_level(level):
    l = logging.getLogger()
    l.setLevel(level)
    logger.debug(f"Redirect log level set to {level}")


FORMAT = "<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | <level>{level: <8}</level> | <cyan>{thread.name: <12}</cyan> | <m>{extra[id]}</m> | <level>{message}</level>"

logger = loguru.logger
logger.configure(
    handlers=[
        {"sink": sys.stderr, "format": FORMAT},
    ],
    extra={"id": "-"},
)

logger = logger.bind(id="logging")

Where I add sinks is pretty straightforward also. I'll be updating these to roll on either time or size, following one of your documented recipes (roughly like the sketch after this snippet).

logger.add(
    "agent_{time}.log",
    retention=10,
    rotation="100 MB",
    compression="tar.gz",
    level="INFO",
)
logger.add(
    "agent_debug_{time}.log",
    retention=2,
    rotation="100 MB",
    compression="tar.gz",
    level="DEBUG",
    backtrace=True,
    diagnose=True,
)
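For reference, the kind of rotation I have in mind is a callable passed to rotation that triggers on either size or a day change, roughly like this (the class name and thresholds are placeholders, loosely adapted from the documented recipe):

import datetime

from loguru import logger


class SizeOrTimeRotator:
    # Rotate when the file would exceed `size` bytes, or when a message
    # arrives on a later calendar day than the previous rotation.
    def __init__(self, size):
        self._size = size
        self._last_day = datetime.date.today()

    def __call__(self, message, file):
        file.seek(0, 2)  # seek to the end to measure the current file size
        message_day = message.record["time"].date()
        if file.tell() + len(message) > self._size or message_day > self._last_day:
            self._last_day = message_day
            return True
        return False


logger.add(
    "agent_{time}.log",
    rotation=SizeOrTimeRotator(100 * 1024 * 1024),
    retention=10,
    compression="tar.gz",
    level="INFO",
)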
  1. I see. I'll probably ask for a bit of direction on creating that custom sink if it comes down to that.
  2. So, following my description, it's probably some sort of deadlock, since the logs should have continued if there had only been an exception. To clarify, both of my sinks are standard files (as shown above) and writes to both stopped at the same time. Follow-up question: what's the likelihood that I have an issue in both of my sinks at the same time?
  3. I don't use multiprocessing. This is probably unrelated, but I occasionally use subprocess to interact with some system tools. The output of these calls is parsed and eventually logged, but I doubt that has anything to do with this.

I might try capturing stdout/stderr with catch=False set to see if I can get more insight into what's happening. Otherwise I'm open to ideas and suggestions on where to start investigating this.
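Concretely, I'm thinking of something like the following at daemon startup (the stderr file path is just an example), so that anything a sink raises with catch=False actually ends up somewhere I can read it:

import sys

from loguru import logger

# The daemon has no terminal, so keep stderr in a file for post-mortem reading.
sys.stderr = open("agent_stderr.log", "a", buffering=1)

# With catch=False, sink errors propagate to the logging call instead of being
# caught and printed by Loguru itself.
logger.add("agent_{time}.log", level="INFO", catch=False)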

Delgan (Owner) commented Feb 6, 2021

Hi @kace.

I made a small test. It seems the compression operation can take quite a while: probably half a minute to properly rotate the two files. During this time, the logger is frozen, but after that it resumes normally. Could that explain the problem you're facing?

Apart from that, I don't see any issues with your configuration.

kace (Author) commented Feb 10, 2021

Hi @Delgan.

I suspected that myself at first, but it seems my deadlock is completely unrelated to log file size or log rotation. In my case the whole application appears to be frozen, with all of its system resources held indefinitely; I think no work (and no logging) is being done. I'm having a difficult time confirming that because the process is daemonized and isn't logging.

Before asking you to spend more time on this, I'll run some more tests and do more to try to reproduce and resolve this on my side. To be honest I'm thinking this might not have anything to do with Loguru after all. I'll let you know what I come up with.

Thanks for confirming my configuration.

kace (Author) commented Feb 17, 2021

Hi @Delgan.

Quick update. I've had some luck recreating this issue since I last replied. It seems to be a race condition related to subprocess calls. What I've observed is that if one thread attempts to log at the same moment another thread is initiating a subprocess call, then I get an immediate deadlock. I suspect this has to do with the process forking and the copying of process memory/file descriptors, but I haven't gotten that far yet.

Here are some example logs:

2021-02-17 04:48:50.116 | DEBUG    | ExecThread_3 | virt | Using check_output to run subprocess
2021-02-17 04:48:50.116 | DEBUG    | ScheduleThread | schedule | Next run scheduled for 2021-02-17 04:48:50.121779
2021-02-17 04:53:07.324 | DEBUG    | ExecThread_1 | virt | Using check_output to run subprocess
2021-02-17 04:53:07.327 | DEBUG    | MainThread   | hello_world | Keep alive

To give you some context on the above, my ExecThread pool executes some subprocess call while, at the same time, another thread (ScheduleThread or MainThread above) in my application writes a log entry. The log entry signaling the start of the subprocess call is always the second-to-last log line, while the entry from the other thread (the one not performing the subprocess call) is always the last log line before the deadlock.

If I'm right, this may be an issue with the subprocess module rather than Loguru. I'm using Python 3.8, for your reference. It's also possible that setting the enqueue flag on my log sinks might prevent this, but I haven't tried it yet.
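For completeness, trying that would just mean adding the flag to the existing sinks, e.g.:

logger.add(
    "agent_{time}.log",
    retention=10,
    rotation="100 MB",
    compression="tar.gz",
    level="INFO",
    enqueue=True,  # messages go through an internal queue and a writer thread
)

Whether that actually avoids this particular deadlock is something I still need to test.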

Before digging deeper I'll work on creating a simple code snippet to reproduce the error. I'll share this with you once it's done.

Thanks!

Delgan (Owner) commented Feb 17, 2021

Thanks for the update! Surely, being able to reproduce the error is the hardest part. Then, we should be able to understand the problem much more easily.

Your description reminds me of a somewhat similar bug reported when mixing threading with multiprocessing: #231
It was a problem with an internal lock used while writing to stderr, which could be inherited in a locked state and cause a deadlock in the child process.
I was not able to reproduce it by replacing multiprocessing with subprocess, but maybe it's related.

kace (Author) commented Feb 24, 2021

Good news @Delgan, I managed to create a minimal example for reproducing this deadlock! I've successfully reproduced it on both Python 3.6.13 and 3.8.5, using Loguru 0.5.3 in both cases.

Some notes about the example:

  • The key to reproducing the deadlock was executing subprocess calls while triggering a signal handler (sigchld_handler in the example) that writes log entries.
  • Increasing the number of threads will produce the deadlock faster. In my testing 2 threads were sufficient, but using 5 threads guaranteed that it would happen immediately.
  • Which signal we catch isn't actually important. It's convenient to catch the SIGCHLD signals from the subprocess calls, but I was also able to reproduce the deadlock by adding a SIGUSR1 handler and sending those signals from another script.
import signal
import subprocess
import threading
import time

from loguru import logger


STOP_EVENT = threading.Event()

    
def sigchld_handler(signum, frame):
    logger.warning("SIGCHLD recieved - ignoring")


signal.signal(signal.SIGCHLD, sigchld_handler)


def stop(signum, frame):
    logger.warning(f"Signal {signum} recieved - stop requested")
    STOP_EVENT.set()


for sig in [signal.SIGTERM, signal.SIGINT]:
    signal.signal(sig, stop)


def run_process():
    while not STOP_EVENT.is_set():
        subprocess.run(["ls", "-la"], stdout=subprocess.DEVNULL)
        

if __name__ == "__main__":

    threads = []
    for i in range(2):
        threads.append(threading.Thread(target=run_process, name=f"ProcessThread-{i}"))

    for t in threads:
        t.start()
        logger.info(f"{t.name} started")

    # Keep the main thread alive
    while not STOP_EVENT.is_set():
        time.sleep(0.1)
        logger.info("tick")

    # Stop cleanly
    for t in threads:
        t.join()

    logger.info("Exiting.")

Output from hanging_threads

From what I can tell, the deadlock is caused by a lock in the Handler class (_handler.py:171). This would explain why my entire application deadlocks instead of just a few threads: all of my threads log, so they all eventually wait on acquiring that lock.

                -------Thread 140297772562176 "ProcessThread" hangs --------
        File "threading.py", line 884, in _bootstrap
                        File "threading.py", line 916, in _bootstrap_inner
                        File "threading.py", line 864, in run
                        File "loguru_test2.py", line 62, in run_process
                        File "loguru/_logger.py", line 1971, in info
                        File "loguru/_logger.py", line 1959, in _log
                        File "loguru/_handler.py", line 171, in emit
                ---------Thread 140297783052032 "LogThread" hangs ----------
        File "threading.py", line 884, in _bootstrap
                        File "threading.py", line 916, in _bootstrap_inner
                        File "threading.py", line 864, in run
                        File "loguru_test2.py", line 53, in log_indefinitely
                        File "loguru/_logger.py", line 1971, in info
                        File "loguru/_logger.py", line 1959, in _log
                        File "loguru/_handler.py", line 171, in emit
                ---------Thread 140298007164672 "MainThread" hangs ---------
        File "loguru_test2.py", line 87, in <module>
                        File "loguru_test2.py", line 70, in sigchld_handler
                        File "loguru/_logger.py", line 1979, in warning
                        File "loguru/_logger.py", line 1959, in _log
                        File "loguru/_handler.py", line 177, in emit
                        File "loguru/_file_sink.py", line 176, in write
                        File "loguru_test2.py", line 70, in sigchld_handler
                        File "loguru/_logger.py", line 1979, in warning
                        File "loguru/_logger.py", line 1959, in _log
                        File "loguru/_handler.py", line 171, in emit

Also note that during my tests with the full application I would often see two back-to-back signals of the same type right before the deadlock. As a temporary fix for this deadlock I've removed all logging calls from my signal handlers. This seems to have worked, as I haven't had any issues for the last 12 hours - for reference, I used to see problems within just a couple of hours.
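For illustration, the pattern I've switched to boils down to a handler that only sets a flag, with the actual logging deferred to normal application code; the flag-polling details below are just a sketch rather than my exact daemon code:

import signal

from loguru import logger

# A plain flag assignment is one of the few things that is safe to do inside a
# Python signal handler; the logging itself happens outside the handler.
_sigchld_pending = False


def sigchld_handler(signum, frame):
    global _sigchld_pending
    _sigchld_pending = True  # no logging in here


signal.signal(signal.SIGCHLD, sigchld_handler)


def poll_signals():
    # Called periodically from the daemon's main loop.
    global _sigchld_pending
    if _sigchld_pending:
        _sigchld_pending = False
        logger.warning("SIGCHLD received - ignoring")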

Let me know what you think!

Delgan (Owner) commented Mar 15, 2021

Wow! Congratulations on successfully tracking down the problem!

This is similar to this StackOverflow question. Basically, signal handlers need to be re-entrant. This means it must be possible to interrupt your function during its execution, run another one, and then finally resume the first.

However, logging functions are not re-entrant. The sinks are protected by a Lock. If a signal handler is interrupted while the lock is acquired, and another signal handler using the logger is called, it will face an already-acquired Lock and hang forever. Interrupting a logging call in the middle of its execution isn't allowed, because it could compromise the integrity of the sink.
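The same deadlock can be demonstrated with plain threading primitives and no Loguru at all (Unix-only sketch; the Lock here stands in for the lock protecting a sink):

import os
import signal
import threading

lock = threading.Lock()  # stands in for the per-sink lock


def handler(signum, frame):
    # If the signal arrives while the main code already holds `lock`, this
    # acquire waits forever: a deadlock within a single thread.
    with lock:
        print("handled signal", signum)


signal.signal(signal.SIGUSR1, handler)

with lock:
    # Simulate a signal arriving in the middle of a "logging call".
    os.kill(os.getpid(), signal.SIGUSR1)

print("never reached - the handler hangs on the lock above")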

Unfortunately, I don't think there is a possible solution. I should probably add a note to the documentation that the logger can't be used inside a signal handler.

kace (Author) commented Mar 16, 2021

I assumed this was the case, but I failed to find anything in the Python stdlib documentation listing this requirement (I'm not sure whether re-entrant signal handlers are the norm across all programming languages).

Your explanation makes sense. Don't worry about coming up with a fix. I've already removed the logging calls from my signal handlers, but a note in the documentation might save clueless people like me some time!

Thanks again Delgan and great work on Loguru!

kace closed this as completed on Mar 16, 2021
Delgan added the documentation and question labels on Mar 16, 2021
jacksmith15 (Contributor) commented Sep 26, 2022

Noting here for posterity, I have seen a very similar issue which relates not to threading, but to garbage collection.

Is it possible that you inadvertently used the logger in one of your handlers

In my case, during the process of emitting a log, an object with a __del__ implementation was garbage collected. This __del__ implementation involved a logging call, which was then diverted through an InterceptHandler similar to the one above. This resulted in a rare, intermittent deadlock.

As a (non-performant) workaround, I have added logic to the InterceptHandler which inspects the stack to ensure another log is not already in the process of being emitted, specifically:

import inspect
import logging


class LoguruInterceptHandler(logging.Handler):
    def emit(self, record):
        for frame_info in inspect.stack(0):
            if "loguru" in frame_info.filename and "emit" in frame_info.function:
                # Don't intercept `logging` calls whilst handling a `loguru` message, or this will create deadlock.
                return
        ... # Forward to loguru

I suspect a more performant solution could be achieved by using a context var in the loguru handler, which ignores logs emitted during the process of emitting another log.
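A rough sketch of that idea, guarded in the intercept handler itself rather than inside loguru (simplified: the level mapping and depth handling from the documented handler are omitted):

import contextvars
import logging

from loguru import logger

# Flags that a record is already being forwarded to loguru in this context.
_forwarding = contextvars.ContextVar("loguru_forwarding", default=False)


class LoguruInterceptHandler(logging.Handler):
    def emit(self, record):
        if _forwarding.get():
            # A logging call was triggered (e.g. by __del__ during garbage
            # collection) while another record is being handed to loguru;
            # drop it rather than deadlock on the sink lock.
            return
        token = _forwarding.set(True)
        try:
            logger.opt(exception=record.exc_info).log(record.levelname, record.getMessage())
        finally:
            _forwarding.reset(token)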

jacksmith15 (Contributor) commented Sep 26, 2022

Actually, I've managed to reproduce this and I think what I've found constitutes a separate issue. I will raise it and link it here. EDIT: #712
