Add logging to 4xx responses from API at reduced rate. #4770
Comments
Rekindling the discussion by adding to the above points.

Is your feature request related to a problem? Please describe.
At the moment, there is no way to rate limit the logging of exceptions at a particular log level. If we enable a more verbose logging level for a class to gain more insight into the exceptions occurring, we risk bloating the log files with excessive trivial logs. Rate limiting would let us add more logging and still debug the root cause of issues quickly.

Describe the solution you'd like
We will develop a central framework that plugs in a custom log4j filter. The filter will apply our own rate-limiting logic and, based on that algorithm, return one of the three states (ACCEPT, DENY, NEUTRAL) under different circumstances. This will also require dynamic cluster settings to tune the threshold at which exceptions are rate limited, the log level at which rate limiting kicks in, and the packages/classes it applies to (addressed by package name, the same way dynamic logging levels are configured today).
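To make this concrete, here is a minimal sketch of what such a custom Log4j2 filter could look like. The class name, keying by exception type, and the fixed-window counting are illustrative assumptions, not the final design; per-level and per-package checks are omitted for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.filter.AbstractFilter;

// Hypothetical rate-limiting filter: allows at most maxPerWindow events per
// exception type within a fixed time window; everything else is NEUTRAL so
// the normal level-based filtering still applies.
public class RateLimitingExceptionFilter extends AbstractFilter {

    private final long windowMillis;
    private final long maxPerWindow;
    private final Map<String, Window> counters = new ConcurrentHashMap<>();

    public RateLimitingExceptionFilter(long windowMillis, long maxPerWindow) {
        this.windowMillis = windowMillis;
        this.maxPerWindow = maxPerWindow;
    }

    @Override
    public Result filter(LogEvent event) {
        // Only rate limit events that carry an exception; let the rest pass untouched.
        if (event.getThrown() == null) {
            return Result.NEUTRAL;
        }
        String key = event.getThrown().getClass().getName();
        Window w = counters.computeIfAbsent(key, k -> new Window());
        long now = System.currentTimeMillis();
        synchronized (w) {
            if (now - w.start > windowMillis) {   // window expired, reset the counter
                w.start = now;
                w.count = 0;
            }
            w.count++;
            return w.count <= maxPerWindow ? Result.ACCEPT : Result.DENY;
        }
    }

    private static final class Window {
        long start = System.currentTimeMillis();
        long count = 0;
    }
}
```

Returning NEUTRAL for events without an exception keeps the filter non-intrusive for ordinary log statements; only excess exception logs are dropped via DENY.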
We can dynamically toggle these settings to achieve logging at the reduced rate (a rough sketch of how they might be registered follows). This approach is considered a good option for the following reasons.
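A possible shape for these settings, assuming the standard OpenSearch Setting API; the setting names, defaults, and scopes here are purely illustrative and not part of the proposal:

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;
import org.opensearch.common.unit.TimeValue;

public final class RateLimitedLoggingSettings {

    // Maximum number of exception logs allowed per window (hypothetical name).
    public static final Setting<Integer> MAX_LOGS_PER_WINDOW = Setting.intSetting(
        "logger.rate_limit.max_per_window", 10, 1, Property.Dynamic, Property.NodeScope);

    // Length of the rate-limiting window (hypothetical name).
    public static final Setting<TimeValue> WINDOW = Setting.timeSetting(
        "logger.rate_limit.window", TimeValue.timeValueMinutes(1), Property.Dynamic, Property.NodeScope);

    // Wiring the dynamic updates into the filter would presumably look like:
    // clusterSettings.addSettingsUpdateConsumer(MAX_LOGS_PER_WINDOW, filter::setMaxPerWindow);
}
```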
Describe alternatives you've considered
We considered an approach where we enable rate limiting for a few specific logs (as is done in NodeConnectionsService today) by keeping a counter per exception and promoting the exception's log to a higher level every nth occurrence. This approach was discarded for the following reasons.
Thanks @rayshrey for adding details.
Do you mean we prioritize the package over classes here? Should it not be the lowest common ancestor between a class and the available settings that gets the higher priority?
Are you proposing this as a common setting for all loggers? Do you think we can have this defined at the class/package level as well if needed?
How do we define the window for this threshold?
@shwetathareja Thoughts on this?
@mgodwan I think the log level settings prioritize the parent over the child when applying settings (@rayshrey can verify this once), but I agree with you, we should give higher priority to the lowest common ancestor.
@rayshrey three types of benchmarks will be helpful: one without the filter, one with the filter attached but not applied via cluster settings (since all logs will still pass through the filter even when it has no effect), and one with the filter applied. You can also look into microbenchmarking for this.
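As a rough illustration of the microbenchmarking suggestion, a JMH harness along these lines could be run once per configuration; the class and method names are hypothetical:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Hypothetical JMH benchmark: run it three times against three log4j2
// configurations -- (1) no filter attached, (2) filter attached but disabled
// via settings, (3) filter attached and enabled -- and compare throughput.
@Fork(1)
@State(Scope.Benchmark)
public class RateLimitedLoggingBenchmark {

    private static final Logger logger = LogManager.getLogger(RateLimitedLoggingBenchmark.class);
    private final Exception clientError = new IllegalArgumentException("simulated 4xx failure");

    @Benchmark
    public void logClientError() {
        logger.debug("request failed with a client-side error", clientError);
    }
}
```

Comparing throughput across the three runs would show the overhead of the filter both when it is effectively a no-op and when it is actively rate limiting.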
Thanks for the comments @mgodwan and @sarthakaggarwal97
For now we have set an expiry time of 1 hour for the cache, so the counter for an exception type resets 1 hour after it was last accessed.
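For illustration, a counter cache with a one-hour expire-after-access policy could look roughly like the following (a Guava-style cache is assumed here purely as an example; the actual implementation may use OpenSearch's own cache utilities):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public final class ExceptionCounters {

    // Counter entries are dropped (and therefore reset) one hour after the last access.
    private final Cache<String, AtomicLong> counters = CacheBuilder.newBuilder()
        .expireAfterAccess(1, TimeUnit.HOURS)
        .build();

    public long recordOccurrence(String exceptionType) {
        try {
            return counters.get(exceptionType, AtomicLong::new).incrementAndGet();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```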
Will benchmark the results as described by @sarthakaggarwal97 and share the findings once the implementation is complete.
@sarthakaggarwal97 I verified that the logger also follows lowest-common-ancestor settings, hence we should follow the same behaviour. Will update the implementation accordingly.
Yes, initially I thought of keeping it common for all loggers, but as suggested by @mgodwan we can make this setting configurable at a modular level. Will update the implementation accordingly.
Thanks @rayshrey for the proposal. There is already work in progress on the Request Tracing Framework. @Gaganjuneja has started a discussion around Sampling in #8918, and one of the options would be to enable Tail Sampling based on response status. Instead of adding specific support for sampling 4xx logs explicitly, it might be useful to onboard to the Tracing framework. Thoughts @mgodwan?
@shwetathareja, thanks for calling this out. Yes, these use cases will be easily served through sampled traces.
Is your feature request related to a problem? Please describe.
Today, a few errors returned by OpenSearch are not logged if they are customer-side errors. While we may provide the message as part of the response, it is not always possible to determine the cause of the errors from the response alone. We can explore whether it would be beneficial to log such errors at a reduced rate (e.g., the way connection errors between nodes are handled today). This would allow us to debug issues better without having to determine which debug/trace logs should be enabled.
Describe the solution you'd like
Sample 4xx errors and log them at a reduced rate.
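As a rough sketch of what this could look like (the helper name, keying by REST path, and every-Nth sampling are assumptions for illustration only):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical helper: logs only every Nth 4xx failure per REST path, so
// client-side errors remain visible without flooding the logs.
public final class SampledClientErrorLogger {

    private static final Logger logger = LogManager.getLogger(SampledClientErrorLogger.class);
    private static final int SAMPLE_EVERY_N = 100; // assumed sampling rate

    private final ConcurrentMap<String, AtomicLong> counts = new ConcurrentHashMap<>();

    public void onClientError(String restPath, int statusCode, Exception cause) {
        long seen = counts.computeIfAbsent(restPath, p -> new AtomicLong()).incrementAndGet();
        if (seen % SAMPLE_EVERY_N == 1) { // log the 1st, 101st, 201st, ... occurrence
            logger.info("{} returned {} ({} occurrences so far)", restPath, statusCode, seen, cause);
        }
    }
}
```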
Describe alternatives you've considered
Enable debug/trace logs in multiple places. This still requires users/operators to identify the classes for which logs should be enabled and may not be straightforward.