-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
access log: new 20x faster json formatter implementation #35545
Conversation
/wait |
exception... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did one quick pass. The big win will be to not use a protobuf as an intermediate format, but purely use the json streamer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should also say: this is great work! And even without the suggestions I have made it's a huge improvement, though I'm not entirely clear why the old one was so slow as you didn't really change it in this PR :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few more comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/wait
Fine, most comments are addressed. There are two problems:
cc @jmarantz |
/wait |
This PR involves performance, substitution format, Json serializing, etc. and this PR has lots of technical details and background context. It's not easy to review this PR. Thanks so much that @jmarantz has provided lots of great suggestions that make the PR better, on both the code quality and performance (now the new Because these are lots of historical comments and responses to the comments. And the PR also changed a lot compared to initial version. So, I cleaned all history context. And lets re-start from here. I will add some comment about:
Hope these could help to future reviewing and other reviewers. |
This comment was marked as duplicate.
This comment was marked as duplicate.
/wait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I wanted to finish this review off but I feel like we are still not aligned on converging JSON implementations.
struct JsonString { | ||
std::string value_; | ||
}; | ||
struct TmplString { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we pick a better name than TmplString? TemplateString? Add comments to both of these options indicating what they mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still i'ts not clear from the class names what they do. More descriptive names and comments would help.
struct TmplString { | ||
std::string value_; | ||
}; | ||
using FormatElement = absl::variant<JsonString, TmplString>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perf nit: I think absl::variant might be overkill since both options have the same type underneath, but maybe it optimizes well. But you could obviously do this with a string and bool, or string and enum.
Or if you were really after performance you'd make the first character of the string indicate which variant it is, and use the absl::string_view(str).substr(1,0)
to get the rest out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion. Sounds string + bool is better for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you evaluate this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for calling it out. There are too much historical context. And I want re-start it, so I close all comment directly without a careful checking. Sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to re-enter comments I previously entered.
* For example given the following proto struct format configuration: | ||
* | ||
* json_format: | ||
* text: "text" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use different words in the example here instead of 'text' and 'text' so the rendering is unambiguous? Say, name: "value"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still looks the same.
It might be good to leave comments unresolved and let the reviewer resolve them?
void formatValueToFormatElements(const ProtobufWkt::Value& value); | ||
void formatValueToFormatElements(const ProtoList& list_value); | ||
|
||
const bool keep_value_type_{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove the {}
-- this has to be initialized in the ctor based on the ctor args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not done
np. We all want a best solution. I will address your comments after we get a conclusion about the exception usage. |
/wait I think a separate PR should be cut to sort out the exception issue with the sanitizer. |
/retest |
// NOTE: This helper class has duplicated logic with the Json::Streamer class but | ||
// provides lower level of APIs to operate on the output buffer (like control the | ||
// delimiters). This is designed for special scenario of substitution formatter and | ||
// is not intended to be used by other parts of the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thanks.
void addNumber(uint64_t i) { output_buffer_.add(absl::StrCat(i)); } | ||
void addNumber(int64_t i) { output_buffer_.add(absl::StrCat(i)); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to worry about all JSON numbers being floats (and therefore int64 > ~2^53 are invalid numbers)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, but lots of popular JSON libs will support integer (like the nlohmann/json, rapidjson, etc), so I think this's ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://stackoverflow.com/questions/13502398/json-integers-limit-on-size is of interest; TL;DR the json spec does not describe this limit. It might be that precision is lost if this is read by JavaScript, but a C++ json reader is free to retain the precision. There is an interoperability note in RFC7159 referencing this, but my feeling is we should not clamp to that value, and leave it to the JSON reader to drop precision if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...although putting a comment in here referencing the json spec and RFC about numbers would make sense -- we should also add that to json_streamer.h where all this is sadly duplicated :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it necessary to add additional comments for this.
Although all json numbers are float in definition. But in the actual implementation, most json lib provide the integer support.
Maybe some lib may lost precision when loading the output, but we basically have no better choice anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your reasoning, and think it would be good to write a comment so that readers are on the same page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also add that to json_streamer.h where all this is sadly duplicated :)
I think we could let the Streamer inherit from here's serializer in the future, like my very initial suggestion. They basically are different levels of abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments are done for both streamer and this serializer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the serializer should definitely not inherit from the formatter. But a common base-class (or even just helper methods) could be factored out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the serializer should definitely not inherit from the formatter. But a common base-class (or even just helper methods) could be factored out.
we can move the new serializer here out from formatter. I think this doesn't matter. we can give a try at other PR to see if it make sense.
break; | ||
case ProtobufWkt::Value::kStringValue: { | ||
absl::string_view string_format = value.string_value(); | ||
if (!absl::StrContains(string_format, '%')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed an issue with malformed format specifiers: I think if you see a kStringValue here with value
holding a single "%" sign, you'll treat it as a format specifier. The parser will swallow the % and return an empty list of formatters (https://github.com/envoyproxy/envoy/blob/main/source/common/formatter/substitution_formatter.h#L192-L199) and then when generating the JSON output (https://github.com/envoyproxy/envoy/pull/35545/files#diff-55d03d93118484f891da85b3ae673a01955faa49745104921d5128f5031e906aR414) you index into the formatters without an index check.
I'd say this is a bug with SubstitutionFormatParser::parse -- it should probably never return an empty list of formatters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SubstitutionFormatParser::parse
will never return an empty list. There are possible cases for the parse method:
- empty format: return a list that contains a plain string provider which will output empty string.
- non-emtpy format:
2.1 return a list that contains plain string providers and other providers.
2.2 throw exception to reject the configuration if the format is invalid.
The single %
will hit the case 2.2.
But in case some one change the behavior of parse in the future, I can do a minor change to guard that. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the if check to determine if the formatter list size is 1 or not. Now, it should be safe anyway, even someone change the behavior of parse in the future (although I think this won't happen).
if (formatters.size() != 1) {
// 2. Handle the formatter element with multiple or zero providers.
stringValueToLogLine(formatters, context, info, serializer);
} else {
// 3. Handle the formatter element with a single provider and value
// type needs to be kept.
typedValueToLogLine(formatters, context, info, serializer);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this change addresses all of @zuercher 's comments, but I'll leave it to him.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misread the parsing code a bit, and it doesn't do what I thought. I think handling an empty formatter list is not a bad thing though. Thanks for adding the guard.
Signed-off-by: wangbaiping <[email protected]>
/retest |
Signed-off-by: wangbaiping <[email protected]>
/** | ||
* Serializes a integer number. | ||
* NOTE: All numbers in JSON is float. When loading output of this serializer, the parser's | ||
* implementation decides if the full precision of big integer could be preserved or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the ref to the stackoverflow article & rfc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it.
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@wbpcode am I correct that new implementation in 1.32 ignores |
it will ignore the sort_properties, but will obey the omit_empty_value And the new implementation only be enabled by default after 1.32. :) |
I have just tested it in our custom build of 1.32 and I see null-s with this new implementation, while we have |
I will check it this weekend. Maybe there is a problem we didn't notice. |
Thank you! |
@fxposter I think I know why. We do almost every thing at configuration loading time, and only need to do the substitution command evaluation and string joining at runtime. This way bring great performance for the Json formatter. But also make it impossible to eliminate the null entry. Because the key has been serialized at configuration loading time. The only way to resolve it is keep the structural tree of configuration and iterate the tree at runtime. It's no doubt that will penalize performance. |
Is this a block point for you? I can try to find some time to test how much performance degradation that the fix will bring. If the degradation is acceptable, we can try to fix it. |
I don't think it's problematic for us in a way that we won't be able to adjust the log collector code, it just no longer matches the documentation. |
Also, probably not adding unnecessary strings and writing them to file can have benefits by itself. |
It's more complex because the serialization of the keys is done when we loading the configuration, and we can never know if the value will be null or not at that time. I will create a bug first to record this problem. |
@wbpcode Did you create a bug? |
nope. I record it in my todo but forget to create a bug |
@wbpcode Would you mind creating a bug? |
ofcourse not. I will create one when I could access my computer. But would you mind give more context about the parsing problem? The reason why the bug has no priority is because I think it's more like a minor behavior change and won't actually effect the users. |
Commit Message: access log: new 20x faster json formatter implementation
Additional Description:
The core idea of the new
JsonFormatter
.The core idea of the new
JsonFormatter
is to do as much as possible work when loading configuration. Given a specificjson_format
:When loading configuration, the keys, raw values, delimiters (all parts don't contains the substitution format commands) will be sanitized and serialized as JSON string pieces (raw JSON strings) directly.
The strings that contain substitution commands will be extracted as template string and parsed as substitution formatters (objects that could extract dynamic values from request/response/stream info, etc.).
Finally, the whole proto struct configuration we be parsed to an array of raw JSON piece strings and formatters.
Basically, above configuration finally will be parsed and serialized to following pieces and formatters:
Then, at runtime, when a request is complete and the caller want to construct a log line. The above element array will be iterated in order.
The raw JSON piece string will be put into output buffer directly. And the formatters will be evaluated. The outputs of formatters will be sanitized/serialized to legal JSON values and then be put into output buffer.
At the runtime (core data paths), the logic of new JsonFormatter is almost same with the text formatter. By this way, the JsonFormatter could get comparable/same performance with the text formatter.
The performance result of the formatter.
The previous json formatter is too too slow and the new one perform much better (16x-26x faster).
Risk Level: low. guard by the runtime flag.
Testing: unit.
Docs Changes: n/a.
Release Notes: added.
Platform Specific Features: n/a.