Add logging to 4xx responses from API at reduced rate. #4770

Open
mgodwan opened this issue Oct 13, 2022 · 8 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing

Comments

@mgodwan
Member

mgodwan commented Oct 13, 2022

Is your feature request related to a problem? Please describe.
Today, a few errors returned by OpenSearch are not logged because they are client-side errors. While we may provide a message as part of the response, it is not always possible to determine the cause of the error from the response alone. We can explore whether it would help to log such errors at a reduced rate (e.g. as is done today for connection errors between nodes). This would allow issues to be debugged without first having to determine which debug/trace logs should be enabled.

Describe the solution you'd like
Sample 4xx errors and log them at a reduced rate.

Describe alternatives you've considered
Enable debug/trace logs in multiple places. This still requires users/operators to identify the classes for which logs should be enabled, which may not be straightforward.

@mgodwan mgodwan added enhancement Enhancement or improvement to existing feature or request untriaged labels Oct 13, 2022
@rayshrey
Contributor

Rekindling the discussion by adding to the above points.
@mgodwan @sarthakaggarwal97 please take a look.

Is your feature request related to a problem? Please describe.

At the moment, there is no way to rate limit the logging of exceptions at a particular log level. If we enable a more verbose logging level for a class to gain more insight into the exceptions occurring, we risk bloating the log files with excessive trivial logs. Rate limiting would let us add more logging while keeping the log volume under control, and debug the root cause of issues quickly.

Describe the solution you'd like

We will develop a central framework that enables a custom Log4j filter. In this filter we will introduce our own filtering logic and, based on our algorithm, return one of the three filter results (ACCEPT, DENY, NEUTRAL) depending on the circumstances.
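
For illustration, a minimal sketch of such a custom Log4j filter is shown below. The class name, the per-exception key, and the counting strategy are assumptions made for the example, not the final implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.filter.AbstractFilter;

// Sketch only: rate limits repeated exception logging per logger + exception type.
public class LogLimitingFilter extends AbstractFilter {

    // Hypothetical threshold: the first and then every Nth occurrence is logged.
    private final int threshold;
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    public LogLimitingFilter(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public Result filter(LogEvent event) {
        Throwable thrown = event.getThrown();
        if (thrown == null) {
            return Result.NEUTRAL; // not an exception log line, let other filters decide
        }
        String key = event.getLoggerName() + ":" + thrown.getClass().getName();
        long count = counters.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
        return (count == 1 || count % threshold == 0) ? Result.ACCEPT : Result.DENY;
    }
}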

This will also require dynamic cluster settings, where we can tune the threshold at which we want to limit the exceptions, the log level at which rate limiting is enabled for a class, and the packages/classes for which it is enabled, addressed by package name as we do today for configuring dynamic logging. For example:

{
    "transient": {
        "log_limiting.filter.org.opensearch.index.engine.InternalEngine":"true",
        "log_limiting._settings.threshold":5,
        "log_limiting._settings.level": "DEBUG"
    }
}
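
For illustration, these keys could be registered as dynamic node settings roughly as follows (a hedged sketch; the class name and the default values are assumptions):

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Sketch only: the setting keys mirror the JSON above; defaults are assumed.
public final class LogLimitingSettings {

    // Per-class/package toggle, e.g. "log_limiting.filter.org.opensearch.index.engine.InternalEngine".
    public static final Setting.AffixSetting<Boolean> LOG_LIMITING_FILTER = Setting.prefixKeySetting(
        "log_limiting.filter.",
        key -> Setting.boolSetting(key, false, Property.Dynamic, Property.NodeScope)
    );

    // Global threshold and level, tunable at runtime via the cluster settings API.
    public static final Setting<Integer> LOG_LIMITING_THRESHOLD =
        Setting.intSetting("log_limiting._settings.threshold", 5, Property.Dynamic, Property.NodeScope);

    public static final Setting<String> LOG_LIMITING_LEVEL =
        Setting.simpleString("log_limiting._settings.level", "DEBUG", Property.Dynamic, Property.NodeScope);
}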

This is how we can dynamically toggle the settings to achieve logging at a reduced rate.
Where different settings have been enabled for parent and child classes/packages, we give priority to the more specific package setting.
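
A minimal sketch of that resolution, assuming a simple longest-matching-prefix rule (names here are illustrative):

import java.util.Comparator;
import java.util.Optional;
import java.util.Set;

// Sketch only: among the configured package/class prefixes, the longest one
// matching the logger name wins, i.e. the most specific setting applies.
final class LogLimitingSettingResolver {
    static Optional<String> mostSpecificMatch(String loggerName, Set<String> configuredPrefixes) {
        return configuredPrefixes.stream()
            .filter(prefix -> loggerName.equals(prefix) || loggerName.startsWith(prefix + "."))
            .max(Comparator.comparingInt(String::length));
    }
}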

This approach is considered a good option for the following reasons:

  • It would enable us to log the root cause of issues with more flexibility and without bloating the log files.
  • It is independent of, and extensible to, all flows in OpenSearch.

Describe alternatives you've considered

We considered an approach where we enable rate limiting for a few specific logs (as is done in NodeConnectionsService) by placing a counter next to the exception and raising the log level every nth occurrence; a rough sketch of this pattern follows the list below. This approach was discarded for the following reasons.

  • It does not give us the flexibility to tune the rate-limiting parameters without a code change.
  • We would have to make a code change every time we want to enable or disable rate limiting.
  • We would have to maintain a separate counter for each exception in each class, which is not a good design practice.
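
A rough sketch of this discarded per-call-site pattern, assuming a hypothetical call site and an arbitrary sampling interval:

import java.util.concurrent.atomic.AtomicLong;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Sketch only: one counter per call site, escalating the log level every Nth failure.
final class ThrottledErrorLogger {
    private static final Logger logger = LogManager.getLogger(ThrottledErrorLogger.class);
    private static final int LOG_EVERY_N = 100; // assumed sampling interval
    private final AtomicLong failureCount = new AtomicLong();

    void onClientError(Exception e) {
        long n = failureCount.incrementAndGet();
        if (n % LOG_EVERY_N == 1) {
            // Surface the first and then every Nth occurrence at a visible level ...
            logger.warn("client error (occurrence #{})", n, e);
        } else {
            // ... and keep the rest at debug so they can still be enabled on demand.
            logger.debug("client error", e);
        }
    }
}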

@gauravruhela gauravruhela added the Indexing Indexing, Bulk Indexing and anything related to indexing label Jul 27, 2023
@mgodwan
Member Author

mgodwan commented Aug 1, 2023

Thanks @rayshrey for adding details.

  • What criteria do we use to determine the common log within the window?
  • With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

Where different settings have been enabled for parent and child classes/packages, we give priority to the more specific package setting.

Do you mean we prioritize the package over the class here? Shouldn't it be the lowest common ancestor between a class and the available settings that gets the higher priority?

"log_limiting._settings.threshold":5,
"log_limiting._settings.level": "DEBUG"

Are you proposing this as a common setting for all loggers? Do you think we can have this defined on class/package level as well if needed?

where we can tune the threshold at which we want to limit the exceptions

How do we define the window for this threshold?

@mgodwan
Member Author

mgodwan commented Aug 1, 2023

@shwetathareja Thoughts on this?

@sarthakaggarwal97
Contributor

Do you mean we prioritize the package over the class here? Shouldn't it be the lowest common ancestor between a class and the available settings that gets the higher priority?

@mgodwan I think the log level settings prioritize the parent over the child when applying settings (@rayshrey can verify this), but I agree with you: we should give higher priority to the lowest common ancestor.

With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

@rayshrey three types of benchmarks will be helpful: one without the filter, one with the filter present but not enabled via cluster settings (since all logs will still pass through the filter even when it has no effect), and one with the filter applied. You can also look into microbenchmarking for this.
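
A minimal JMH sketch along those lines (class and method names are assumptions; the three configurations above would differ only in the Log4j configuration used for the run):

import java.util.concurrent.TimeUnit;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Sketch only: measure logging throughput; run once per configuration
// (no filter, filter present but disabled, filter enabled).
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class LogLimitingFilterBenchmark {

    private final Logger logger = LogManager.getLogger(LogLimitingFilterBenchmark.class);
    private final Exception exception = new IllegalArgumentException("simulated client error");

    @Benchmark
    public void logException() {
        logger.debug("client error", exception);
    }
}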

@rayshrey
Contributor

rayshrey commented Aug 1, 2023

Thanks for the comments @mgodwan and @sarthakaggarwal97
Please find below the replies to your queries.

What criteria do we use to determine the common log within the window?

For now we have set an expiry time of 1 hour for the cache, so the counter for an exception type resets 1 hour after its last access time.
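
A rough sketch of that reset behaviour, using a plain map with last-access timestamps purely for illustration (the actual implementation may use an existing cache utility):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

// Sketch only: per-exception-type counters that reset one hour after last access.
final class ExpiringExceptionCounters {
    private static final long EXPIRY_NANOS = TimeUnit.HOURS.toNanos(1);

    private static final class Entry {
        long count;
        long lastAccessNanos;
    }

    private final ConcurrentMap<String, Entry> counters = new ConcurrentHashMap<>();

    long increment(String exceptionKey) {
        Entry entry = counters.compute(exceptionKey, (key, existing) -> {
            long now = System.nanoTime();
            // Reset the counter if this entry has not been touched for an hour.
            if (existing == null || now - existing.lastAccessNanos > EXPIRY_NANOS) {
                existing = new Entry();
            }
            existing.count++;
            existing.lastAccessNanos = now;
            return existing;
        });
        return entry.count;
    }
}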

With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

I will benchmark as described by @sarthakaggarwal97 and share the findings once the implementation is complete.

Shouldn't it be the lowest common ancestor between a class and the available settings that gets the higher priority?

@sarthakaggarwal97 I verified that the logger also follows lowest-common-ancestor settings, so we should follow the same behaviour. I will update the implementation accordingly.

Are you proposing this as a common setting for all loggers? Do you think we can have this defined on class/package level as well if needed?

Yes, I initially thought of keeping it common for all loggers, but as suggested by @mgodwan we can define this setting at a class/package level as well. I will update the implementation accordingly.

@shwetathareja
Member

Thanks @rayshrey for the proposal. There is already work in progress on the Request Tracing Framework. @Gaganjuneja has started a discussion around sampling in #8918, and one of the options would be to enable tail sampling based on response status. Instead of adding specific support for sampling 4xx logs explicitly, it might be useful to onboard to the tracing framework. Thoughts, @mgodwan?

@Gaganjuneja
Contributor

@shwetathareja, Thanks for calling this out. Yes, these use cases will be easily served through sampled traces.

@mgodwan
Member Author

mgodwan commented Aug 4, 2023

@shwetathareja

  • Logging is still something most customers use today, and this sampling will benefit such customers out of the box. We already have a lot of error logs at debug/trace level, and allowing local sampling of them should provide debugging benefits without adding much cost for users interested in using only logging. Based on the infrastructure customers choose to set up, either logging or tracing can help with such debugging.

  • I see that [Tracing Framework] RFC - Sampling Strategy #8918 suggests using head-based sampling for now. This may mean that we miss infrequent errors which could otherwise be caught by sampling on exceptions at the node level, since logs are written discretely on each node.
