Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

access log: new 20x faster json formatter implementation #35545

Merged
merged 13 commits into from
Sep 26, 2024

Conversation

wbpcode
Copy link
Member

@wbpcode wbpcode commented Aug 1, 2024

Commit Message: access log: new 20x faster json formatter implementation
Additional Description:

The core idea of the new JsonFormatter.

The core idea of the new JsonFormatter is to do as much as possible work when loading configuration. Given a specific json_format:

   json_format:
     text: "text"
     template: "%START_TIME%"
     number: 2
     bool: true
     list:
       - "list_raw_value"
       - false
       - "%EMIT_TIME%"
     nested:
       text: "nested_text"

When loading configuration, the keys, raw values, delimiters (all parts don't contains the substitution format commands) will be sanitized and serialized as JSON string pieces (raw JSON strings) directly.
The strings that contain substitution commands will be extracted as template string and parsed as substitution formatters (objects that could extract dynamic values from request/response/stream info, etc.).
Finally, the whole proto struct configuration we be parsed to an array of raw JSON piece strings and formatters.

Basically, above configuration finally will be parsed and serialized to following pieces and formatters:

- '{"text":"text","tmpl":'                                       # Element 0, raw JSON piece.
- '%START_TIME% %FLAG%'                                          # Element 1, formatter string, must be independent element.
- ',"number":2,"bool":true,"list":["list_raw_value",false,'      # Element 2, raw JSON piece.
- '%EMIT_TIME%'                                                  # Element 3, formatter string, must be independent element.
- '],"nested":{"text":"nested_text"}}'                           # Element 4, raw JSON piece.

Then, at runtime, when a request is complete and the caller want to construct a log line. The above element array will be iterated in order.
The raw JSON piece string will be put into output buffer directly. And the formatters will be evaluated. The outputs of formatters will be sanitized/serialized to legal JSON values and then be put into output buffer.

At the runtime (core data paths), the logic of new JsonFormatter is almost same with the text formatter. By this way, the JsonFormatter could get comparable/same performance with the text formatter.

The performance result of the formatter.

The previous json formatter is too too slow and the new one perform much better (16x-26x faster).

-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
BM_AccessLogFormatterSetup                 7020 ns         7020 ns        96714
BM_AccessLogFormatter                       349 ns          349 ns      2008804
BM_StructAccessLogFormatter                5837 ns         5837 ns       117163
BM_TypedStructAccessLogFormatter           4883 ns         4883 ns       148206
BM_LegacyJsonAccessLogFormatter           10460 ns        10460 ns        68180
BM_LegacyTypedJsonAccessLogFormatter       9541 ns         9541 ns        73877
BM_JsonAccessLogFormatter                   460 ns          460 ns      1521227.  <-- here
BM_FormatterCommandParsing                 35.6 ns         35.6 ns     18557843


Risk Level: low. guard by the runtime flag.
Testing: unit.
Docs Changes: n/a.
Release Notes: added.
Platform Specific Features: n/a.

Copy link

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

🐱

Caused by: #35545 was opened by wbpcode.

see: more, trace.

@wbpcode
Copy link
Member Author

wbpcode commented Aug 1, 2024

/wait

@wbpcode
Copy link
Member Author

wbpcode commented Aug 1, 2024

exception...

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did one quick pass. The big win will be to not use a protobuf as an intermediate format, but purely use the json streamer.

changelogs/current.yaml Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.cc Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.cc Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.cc Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.cc Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should also say: this is great work! And even without the suggestions I have made it's a huge improvement, though I'm not entirely clear why the old one was so slow as you didn't really change it in this PR :)

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more comments.

source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

changelogs/current.yaml Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.cc Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
source/common/formatter/substitution_formatter.h Outdated Show resolved Hide resolved
@wbpcode
Copy link
Member Author

wbpcode commented Aug 5, 2024

Fine, most comments are addressed.

There are two problems:

  • the substitution_formatter.h is depended by the envoy-mobile. But the sanitize introduce a exception on the core data path.
  • serializeDubbo method take the Buffer::Instance as output parameter. Maybe we need to change it if we want to use it here.

cc @jmarantz

@jmarantz
Copy link
Contributor

jmarantz commented Aug 5, 2024

/wait

@wbpcode wbpcode linked an issue Aug 7, 2024 that may be closed by this pull request
@wbpcode
Copy link
Member Author

wbpcode commented Aug 7, 2024

This PR involves performance, substitution format, Json serializing, etc. and this PR has lots of technical details and background context.

It's not easy to review this PR. Thanks so much that @jmarantz has provided lots of great suggestions that make the PR better, on both the code quality and performance (now the new JsonFormatter get almost same performance with the text formatter, not just comparable, but same.)
Of course, @jmarantz and me don't get agreement on every point. But this's usual thing in OSS community.

Because these are lots of historical comments and responses to the comments. And the PR also changed a lot compared to initial version. So, I cleaned all history context. And lets re-start from here.

I will add some comment about:

  1. The core idea of the new JsonFormatter.
  2. What I did to the exist code and why?
  3. What I added in the PR and why?

Hope these could help to future reviewing and other reviewers.

@wbpcode

This comment was marked as duplicate.

changelogs/current.yaml Show resolved Hide resolved
source/common/json/json_serializer.h Outdated Show resolved Hide resolved
@jmarantz
Copy link
Contributor

jmarantz commented Aug 8, 2024

/wait

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wanted to finish this review off but I feel like we are still not aligned on converging JSON implementations.

struct JsonString {
std::string value_;
};
struct TmplString {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we pick a better name than TmplString? TemplateString? Add comments to both of these options indicating what they mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still i'ts not clear from the class names what they do. More descriptive names and comments would help.

struct TmplString {
std::string value_;
};
using FormatElement = absl::variant<JsonString, TmplString>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perf nit: I think absl::variant might be overkill since both options have the same type underneath, but maybe it optimizes well. But you could obviously do this with a string and bool, or string and enum.

Or if you were really after performance you'd make the first character of the string indicate which variant it is, and use the absl::string_view(str).substr(1,0) to get the rest out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion. Sounds string + bool is better for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you evaluate this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for calling it out. There are too much historical context. And I want re-start it, so I close all comment directly without a careful checking. Sorry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to re-enter comments I previously entered.

* For example given the following proto struct format configuration:
*
* json_format:
* text: "text"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use different words in the example here instead of 'text' and 'text' so the rendering is unambiguous? Say, name: "value" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still looks the same.

It might be good to leave comments unresolved and let the reviewer resolve them?

void formatValueToFormatElements(const ProtobufWkt::Value& value);
void formatValueToFormatElements(const ProtoList& list_value);

const bool keep_value_type_{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the {} -- this has to be initialized in the ctor based on the ctor args.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not done

source/common/json/json_serializer.h Outdated Show resolved Hide resolved
source/common/json/json_serializer.h Outdated Show resolved Hide resolved
@wbpcode
Copy link
Member Author

wbpcode commented Aug 13, 2024

Sorry I wanted to finish this review off but I feel like we are still not aligned on converging JSON implementations.

np. We all want a best solution. I will address your comments after we get a conclusion about the exception usage.

@jmarantz
Copy link
Contributor

/wait

I think a separate PR should be cut to sort out the exception issue with the sanitizer.

@wbpcode
Copy link
Member Author

wbpcode commented Sep 21, 2024

/retest

// NOTE: This helper class has duplicated logic with the Json::Streamer class but
// provides lower level of APIs to operate on the output buffer (like control the
// delimiters). This is designed for special scenario of substitution formatter and
// is not intended to be used by other parts of the code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks.

Comment on lines +274 to +275
void addNumber(uint64_t i) { output_buffer_.add(absl::StrCat(i)); }
void addNumber(int64_t i) { output_buffer_.add(absl::StrCat(i)); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry about all JSON numbers being floats (and therefore int64 > ~2^53 are invalid numbers)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but lots of popular JSON libs will support integer (like the nlohmann/json, rapidjson, etc), so I think this's ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://stackoverflow.com/questions/13502398/json-integers-limit-on-size is of interest; TL;DR the json spec does not describe this limit. It might be that precision is lost if this is read by JavaScript, but a C++ json reader is free to retain the precision. There is an interoperability note in RFC7159 referencing this, but my feeling is we should not clamp to that value, and leave it to the JSON reader to drop precision if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...although putting a comment in here referencing the json spec and RFC about numbers would make sense -- we should also add that to json_streamer.h where all this is sadly duplicated :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it necessary to add additional comments for this.

Although all json numbers are float in definition. But in the actual implementation, most json lib provide the integer support.

Maybe some lib may lost precision when loading the output, but we basically have no better choice anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your reasoning, and think it would be good to write a comment so that readers are on the same page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also add that to json_streamer.h where all this is sadly duplicated :)

I think we could let the Streamer inherit from here's serializer in the future, like my very initial suggestion. They basically are different levels of abstraction.

Copy link
Member Author

@wbpcode wbpcode Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments are done for both streamer and this serializer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the serializer should definitely not inherit from the formatter. But a common base-class (or even just helper methods) could be factored out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the serializer should definitely not inherit from the formatter. But a common base-class (or even just helper methods) could be factored out.

we can move the new serializer here out from formatter. I think this doesn't matter. we can give a try at other PR to see if it make sense.

break;
case ProtobufWkt::Value::kStringValue: {
absl::string_view string_format = value.string_value();
if (!absl::StrContains(string_format, '%')) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed an issue with malformed format specifiers: I think if you see a kStringValue here with value holding a single "%" sign, you'll treat it as a format specifier. The parser will swallow the % and return an empty list of formatters (https://github.com/envoyproxy/envoy/blob/main/source/common/formatter/substitution_formatter.h#L192-L199) and then when generating the JSON output (https://github.com/envoyproxy/envoy/pull/35545/files#diff-55d03d93118484f891da85b3ae673a01955faa49745104921d5128f5031e906aR414) you index into the formatters without an index check.

I'd say this is a bug with SubstitutionFormatParser::parse -- it should probably never return an empty list of formatters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@wbpcode wbpcode Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SubstitutionFormatParser::parse will never return an empty list. There are possible cases for the parse method:

  1. empty format: return a list that contains a plain string provider which will output empty string.
  2. non-emtpy format:
    2.1 return a list that contains plain string providers and other providers.
    2.2 throw exception to reject the configuration if the format is invalid.

The single % will hit the case 2.2.

But in case some one change the behavior of parse in the future, I can do a minor change to guard that. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the if check to determine if the formatter list size is 1 or not. Now, it should be safe anyway, even someone change the behavior of parse in the future (although I think this won't happen).

      if (formatters.size() != 1) {
        // 2. Handle the formatter element with multiple or zero providers.
        stringValueToLogLine(formatters, context, info, serializer);
      } else {
        // 3. Handle the formatter element with a single provider and value
        //    type needs to be kept.
        typedValueToLogLine(formatters, context, info, serializer);
      }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this change addresses all of @zuercher 's comments, but I'll leave it to him.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misread the parsing code a bit, and it doesn't do what I thought. I think handling an empty formatter list is not a bad thing though. Thanks for adding the guard.

Signed-off-by: wangbaiping <[email protected]>
@wbpcode
Copy link
Member Author

wbpcode commented Sep 24, 2024

/retest

/**
* Serializes a integer number.
* NOTE: All numbers in JSON is float. When loading output of this serializer, the parser's
* implementation decides if the full precision of big integer could be preserved or not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the ref to the stackoverflow article & rfc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.

@wbpcode
Copy link
Member Author

wbpcode commented Sep 26, 2024

/retest

Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@wbpcode wbpcode merged commit b7cad39 into envoyproxy:main Sep 26, 2024
21 checks passed
@wbpcode wbpcode deleted the dev-opt-json-formatter branch September 26, 2024 23:43
@fxposter
Copy link

@wbpcode am I correct that new implementation in 1.32 ignores omit_empty_values: true?

@wbpcode
Copy link
Member Author

wbpcode commented Oct 25, 2024

@wbpcode am I correct that new implementation in 1.32 ignores omit_empty_values: true?

it will ignore the sort_properties, but will obey the omit_empty_value

And the new implementation only be enabled by default after 1.32. :)

@fxposter
Copy link

I have just tested it in our custom build of 1.32 and I see null-s with this new implementation, while we have omit_empty_values: true set. That's why I asked...

@wbpcode
Copy link
Member Author

wbpcode commented Oct 25, 2024

I have just tested it in our custom build of 1.32 and I see null-s with this new implementation, while we have omit_empty_values: true set. That's why I asked...

I will check it this weekend. Maybe there is a problem we didn't notice.

@fxposter
Copy link

Thank you!

@wbpcode
Copy link
Member Author

wbpcode commented Oct 25, 2024

Thank you!

@fxposter I think I know why. We do almost every thing at configuration loading time, and only need to do the substitution command evaluation and string joining at runtime.

This way bring great performance for the Json formatter. But also make it impossible to eliminate the null entry. Because the key has been serialized at configuration loading time.

The only way to resolve it is keep the structural tree of configuration and iterate the tree at runtime.

It's no doubt that will penalize performance.

@wbpcode
Copy link
Member Author

wbpcode commented Oct 25, 2024

Is this a block point for you? I can try to find some time to test how much performance degradation that the fix will bring.

If the degradation is acceptable, we can try to fix it.

@fxposter
Copy link

I don't think it's problematic for us in a way that we won't be able to adjust the log collector code, it just no longer matches the documentation.
Also, in reality keys are only added when you serialize a struct, so if you do one additional check for the value to be null and if that's the case and omit_empty_values is enabled - I think you simply can skip adding that key, separator and format value in the first place. So it will require single additional if, no? (I just skimmed over the code, so I may be wrong here, but it looks that it should work.)

@fxposter
Copy link

Also, probably not adding unnecessary strings and writing them to file can have benefits by itself.

@wbpcode
Copy link
Member Author

wbpcode commented Oct 25, 2024

It's more complex because the serialization of the keys is done when we loading the configuration, and we can never know if the value will be null or not at that time.

I will create a bug first to record this problem.

@arazdolski
Copy link

@wbpcode Did you create a bug?

@wbpcode
Copy link
Member Author

wbpcode commented Jan 7, 2025

@wbpcode Did you create a bug?

nope. I record it in my todo but forget to create a bug

@Hexta
Copy link
Contributor

Hexta commented Jan 8, 2025

@wbpcode Would you mind creating a bug?
The next Envoy release will have this formatter enabled by default, and it could cause logs parsing problems for users.

@wbpcode
Copy link
Member Author

wbpcode commented Jan 9, 2025

@wbpcode Would you mind creating a bug?

The next Envoy release will have this formatter enabled by default, and it could cause logs parsing problems for users.

ofcourse not. I will create one when I could access my computer.

But would you mind give more context about the parsing problem? The reason why the bug has no priority is because I think it's more like a minor behavior change and won't actually effect the users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

improve performance of json formatter
8 participants