Add logging to 4xx responses from API at reduced rate. #4770

Open
mgodwan opened this issue Oct 13, 2022 · 8 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing

Comments

@mgodwan
Member

mgodwan commented Oct 13, 2022

Is your feature request related to a problem? Please describe.
Today, a few errors returned by OpenSearch are not logged because they are client-side errors. While we may provide a message as part of the response, it is not always possible to determine the cause of the error from the response alone. We can explore whether it would help to log such errors at a reduced rate (e.g. as is done today for connection errors between nodes). This would allow issues to be debugged without first having to determine which debug/trace logs should be enabled.

Describe the solution you'd like
Sample 4xx errors and log them at a reduced rate.

Describe alternatives you've considered
Enable debug/trace logs in multiple places. This still requires users/operators to identify the classes for which logs should be enabled, which may not be straightforward.

@mgodwan mgodwan added enhancement Enhancement or improvement to existing feature or request untriaged labels Oct 13, 2022
@rayshrey
Contributor

Rekindling the discussion by adding to the above points.
@mgodwan @sarthakaggarwal97 please take a look.

Is your feature request related to a problem? Please describe.

At the moment, there is no way to rate limit the logging of exceptions at a particular log level. If we enable a more verbose logging level for a class to gain more insight into the exceptions occurring, we risk bloating the log files with excessive trivial logs. Rate limiting would let us add more logging while keeping the log volume under control, and debug the root cause of issues quickly.

Describe the solution you'd like

We will develop a central framework that enables a custom Log4j filter. In this filter we will introduce our own filtering logic and, based on our algorithm, return one of the three filter results (ACCEPT, DENY, NEUTRAL) depending on the circumstances.
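
For illustration, a minimal sketch of such a custom Log4j filter is shown below. The class name, the per-exception key, and the counting strategy are assumptions made for the example, not the final implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.filter.AbstractFilter;

// Sketch only: rate limits repeated exception logging per logger + exception type.
public class LogLimitingFilter extends AbstractFilter {

    // Hypothetical threshold: the first and then every Nth occurrence is logged.
    private final int threshold;
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    public LogLimitingFilter(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public Result filter(LogEvent event) {
        Throwable thrown = event.getThrown();
        if (thrown == null) {
            return Result.NEUTRAL; // not an exception log line, let other filters decide
        }
        String key = event.getLoggerName() + ":" + thrown.getClass().getName();
        long count = counters.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
        return (count == 1 || count % threshold == 0) ? Result.ACCEPT : Result.DENY;
    }
}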

This will also require dynamic cluster settings, where we can tune the threshold at which we want to limit the exceptions, the log level at which rate limiting is enabled for a class, and the packages/classes for which it is enabled, addressed by package name as we do today for configuring dynamic logging. For example:

{
    "transient": {
        "log_limiting.filter.org.opensearch.index.engine.InternalEngine":"true",
        "log_limiting._settings.threshold":5,
        "log_limiting._settings.level": "DEBUG"
    }
}
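
For illustration, these keys could be registered as dynamic node settings roughly as follows (a hedged sketch; the class name and the default values are assumptions):

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Sketch only: the setting keys mirror the JSON above; defaults are assumed.
public final class LogLimitingSettings {

    // Per-class/package toggle, e.g. "log_limiting.filter.org.opensearch.index.engine.InternalEngine".
    public static final Setting.AffixSetting<Boolean> LOG_LIMITING_FILTER = Setting.prefixKeySetting(
        "log_limiting.filter.",
        key -> Setting.boolSetting(key, false, Property.Dynamic, Property.NodeScope)
    );

    // Global threshold and level, tunable at runtime via the cluster settings API.
    public static final Setting<Integer> LOG_LIMITING_THRESHOLD =
        Setting.intSetting("log_limiting._settings.threshold", 5, Property.Dynamic, Property.NodeScope);

    public static final Setting<String> LOG_LIMITING_LEVEL =
        Setting.simpleString("log_limiting._settings.level", "DEBUG", Property.Dynamic, Property.NodeScope);
}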

This is how we can dynamically toggle the settings to achieve logging at a reduced rate.
Where different settings have been enabled for parent and child classes/packages, we give priority to the more specific package setting.
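
A minimal sketch of that resolution, assuming a simple longest-matching-prefix rule (names here are illustrative):

import java.util.Comparator;
import java.util.Optional;
import java.util.Set;

// Sketch only: among the configured package/class prefixes, the longest one
// matching the logger name wins, i.e. the most specific setting applies.
final class LogLimitingSettingResolver {
    static Optional<String> mostSpecificMatch(String loggerName, Set<String> configuredPrefixes) {
        return configuredPrefixes.stream()
            .filter(prefix -> loggerName.equals(prefix) || loggerName.startsWith(prefix + "."))
            .max(Comparator.comparingInt(String::length));
    }
}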

This approach is considered a good option for the following reasons:

  • It would enable us to log the root cause of issues with more flexibility and without bloating the log files.
  • It is independent of, and extensible to, all flows in OpenSearch.

Describe alternatives you've considered

We considered an approach where we enable rate limiting for a few specific logs (as is done in NodeConnectionsService) by placing a counter next to the exception and raising the log level every nth occurrence; a rough sketch of this pattern follows the list below. This approach was discarded for the following reasons.

  • It does not give us the flexibility to tune the rate-limiting parameters without a code change.
  • We would have to make a code change every time we want to enable or disable rate limiting.
  • We would have to maintain a separate counter for each exception in each class, which is not a good design practice.
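
A rough sketch of this discarded per-call-site pattern, assuming a hypothetical call site and an arbitrary sampling interval:

import java.util.concurrent.atomic.AtomicLong;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Sketch only: one counter per call site, escalating the log level every Nth failure.
final class ThrottledErrorLogger {
    private static final Logger logger = LogManager.getLogger(ThrottledErrorLogger.class);
    private static final int LOG_EVERY_N = 100; // assumed sampling interval
    private final AtomicLong failureCount = new AtomicLong();

    void onClientError(Exception e) {
        long n = failureCount.incrementAndGet();
        if (n % LOG_EVERY_N == 1) {
            // Surface the first and then every Nth occurrence at a visible level ...
            logger.warn("client error (occurrence #{})", n, e);
        } else {
            // ... and keep the rest at debug so they can still be enabled on demand.
            logger.debug("client error", e);
        }
    }
}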

@gauravruhela gauravruhela added the Indexing Indexing, Bulk Indexing and anything related to indexing label Jul 27, 2023
@mgodwan
Member Author

mgodwan commented Aug 1, 2023

Thanks @rayshrey for adding details.

  • What criteria do we use to determine the common log within the window?
  • With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

Where different settings have been enabled for parent and child classes/packages, we give priority to the more specific package setting.

Do you mean we prioritize the package over the class here? Shouldn't it be the lowest common ancestor between a class and the available settings that gets the higher priority?

"log_limiting._settings.threshold":5,
"log_limiting._settings.level": "DEBUG"

Are you proposing this as a common setting for all loggers? Do you think we can have this defined on class/package level as well if needed?

where we can tune the threshold at which we want to limit the exceptions

How do we define the window for this threshold?

@mgodwan
Member Author

mgodwan commented Aug 1, 2023

@shwetathareja Thoughts on this?

@sarthakaggarwal97
Contributor

Do you mean we prioritize the package over the class here? Shouldn't it be the lowest common ancestor between a class and the available settings that gets the higher priority?

@mgodwan I think the log level settings prioritize the parent over the child when applying settings (@rayshrey can verify this), but I agree with you: we should give higher priority to the lowest common ancestor.

With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

@rayshrey three types of benchmarks will be helpful: one without the filter, one with the filter present but not enabled via cluster settings (since all logs will still pass through the filter even when it has no effect), and one with the filter applied. You can also look into microbenchmarking for this.
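
A minimal JMH sketch along those lines (class and method names are assumptions; the three configurations above would differ only in the Log4j configuration used for the run):

import java.util.concurrent.TimeUnit;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Sketch only: measure logging throughput; run once per configuration
// (no filter, filter present but disabled, filter enabled).
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class LogLimitingFilterBenchmark {

    private final Logger logger = LogManager.getLogger(LogLimitingFilterBenchmark.class);
    private final Exception exception = new IllegalArgumentException("simulated client error");

    @Benchmark
    public void logException() {
        logger.debug("client error", exception);
    }
}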

@rayshrey
Contributor

rayshrey commented Aug 1, 2023

Thanks for the comments @mgodwan and @sarthakaggarwal97
Please find below the replies to your queries.

What criteria do we use to determine the common log within the window?

For now we have set an expiry time of 1 hour for the cache, so the counter for an exception type resets 1 hour after its last access time.
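
A rough sketch of that reset behaviour, using a plain map with last-access timestamps purely for illustration (the actual implementation may use an existing cache utility):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

// Sketch only: per-exception-type counters that reset one hour after last access.
final class ExpiringExceptionCounters {
    private static final long EXPIRY_NANOS = TimeUnit.HOURS.toNanos(1);

    private static final class Entry {
        long count;
        long lastAccessNanos;
    }

    private final ConcurrentMap<String, Entry> counters = new ConcurrentHashMap<>();

    long increment(String exceptionKey) {
        Entry entry = counters.compute(exceptionKey, (key, existing) -> {
            long now = System.nanoTime();
            // Reset the counter if this entry has not been touched for an hour.
            if (existing == null || now - existing.lastAccessNanos > EXPIRY_NANOS) {
                existing = new Entry();
            }
            existing.count++;
            existing.lastAccessNanos = now;
            return existing;
        });
        return entry.count;
    }
}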

With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

I will benchmark as described by @sarthakaggarwal97 and share the findings once the implementation is complete.

Shouldn't it be the lowest common ancestor between a class and the available settings that gets the higher priority?

@sarthakaggarwal97 I verified that the logger also follows lowest-common-ancestor settings, so we should follow the same behaviour. I will update the implementation accordingly.

Are you proposing this as a common setting for all loggers? Do you think we can have this defined on class/package level as well if needed?

Yes, I initially thought of keeping it common for all loggers, but as suggested by @mgodwan we can define this setting at a class/package level as well. I will update the implementation accordingly.

@shwetathareja
Member

Thanks @rayshrey for the proposal. There is already work in progress on the Request Tracing Framework. @Gaganjuneja has started a discussion around sampling in #8918, and one of the options would be to enable tail sampling based on response status. Instead of adding specific support for sampling 4xx logs explicitly, it might be useful to onboard to the tracing framework. Thoughts, @mgodwan?

@Gaganjuneja
Contributor

@shwetathareja, Thanks for calling this out. Yes, these use cases will be easily served through sampled traces.

@mgodwan
Member Author

mgodwan commented Aug 4, 2023

@shwetathareja

  • Logging is still something most customers use today, and this sampling will benefit such customers out of the box. We already have a lot of error logs at debug/trace level, and allowing local sampling of them should provide debugging benefits without adding much cost for users interested in using only logging. Based on the infrastructure customers choose to set up, either logging or tracing can help with such debugging.

  • I see that [Tracing Framework] RFC - Sampling Strategy #8918 suggests using head-based sampling for now. This may mean that we miss infrequent errors which could otherwise be caught by sampling on exceptions at the node level, since logs are written discretely on each node.
