Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[META] Leverage protobuf for serializing select node-to-node objects #15308

Closed
finnegancarroll opened this issue Aug 19, 2024 · 22 comments
Closed
Assignees
Labels
experimental Marks issues as experimental Meta Meta issue, not directly linked to a PR Roadmap:Cost/Performance/Scale Project-wide roadmap label Roadmap:Modular Architecture Project-wide roadmap label Search:Performance v2.17.0 v2.18.0 Issues and PRs related to version 2.18.0

Comments

@finnegancarroll
Copy link
Contributor

finnegancarroll commented Aug 19, 2024

Please describe the end goal of this project

Provide a framework and migrate some key node-to-node requests to serialize via protobuf rather than the current native stream writing (i.e. writeTo(StreamOutput out)). Starting with node-to-node transport messages maintains api compatibility while introducing protobuf implementations which have several benefits:

  • Out of the box backwards compatibility/versioning
  • Speculative performance enhancements in some areas
  • Supports future gRPC work

Initial protobuf implementation for types - FetchSearchResult, SearchHits, SearchHit

Edit: Closing this issue as we see only marginal benefits to utilizing protobuf on the transport layer. The existing writeTo() implementation is as performant as protobuf, if not more so. Adding a second implementation for objects sent over the transport layer would create a second parallel object which needs to be maintained and would only provide the backwards compatibility benefit of protobuf for that specific transport request.

Supporting References

Related component

Search:Performance

@dbwiddis
Copy link
Member

dbwiddis commented Aug 24, 2024

Having spent way too much time diving into the byte-level (or bit-level if you consider 7-bit Vints) details of these protocols, I want to make sure we're focusing on the correct things here.

Protobuf isn't some magical thing that brings about 30% improvement the way the POC implemented it, and I don't think the existing POCs tested it fully vs. the existing capabilities of the transport protocol if used properly.

In summary:

  • I think most of the performance gain in the POC was based on reducing the number of bytes used to transmit information
    • Existing StreamInput/StreamOutput already contains the capability to use variable-length ints/longs, they're just underused
    • The existing Transport Protocol using Vint/Vlong is marginally (but likely not significant) more efficient (in bandwidth terms) than protobuf when dealing with unsigned values, and Zint/Zlong is equivalent to protobuf signed values.
  • Broadly speaking, protobuf can be marginally more efficient in the case where there are a lot more "null" or "empty" values (you gain a byte in each case of a missing value, but lose a byte for every present value).
    • Protobuf needs to identify every field which takes a byte; StreamInput/StreamOutput need an boolean to implement optional.

Protobuf's primary benefit is in backwards compatibility.

There are probably additional significant benefits of gRPC but this meta issue doesn't seem to be about changing the entire protocol, so I don't think those should be considered as part of the benefits proposed.

The performance impacts are very situation specific and shouldn't be assumed; it's possible a change in streaming to variable-length integers/longs (VInt/ZInt) can achieve similar gains.

@finnegancarroll
Copy link
Contributor Author

Hi @dbwiddis, thanks for the feedback!

  • I think most of the performance gain in the POC was based on reducing the number of bytes used to transmit information

I am noticing protobuf objects I implement write more bytes to stream than the native "writeTo" serialization. Small protobuf objects are about ~20% larger but the difference shrinks as request content grows. So far I've been focusing on FetchSearchResult and it's members. If i'm understanding correctly the best use case for protobuf (only considering performance) would be something like a stats endpoint which contains a lot of ints/longs, particularly those which are not UUIDs/hashes and could be squished into a single byte representation.

As I learn more about protobuf I'm understanding any potential performance improvements just from making a change to serialization is gong to be speculative and I'll update this issue with benchmarks as I run them. Currently the concrete expected benefits would be out of the box backwards compatibility as well as providing a stepping stone for gRPC support.

@finnegancarroll
Copy link
Contributor Author

finnegancarroll commented Sep 3, 2024

Initial numbers for protobuf vs native serializers.
Run with 5 data nodes (r5.xlarge) per cluster against OS 2.16.
So far there is no apparent speedup and protobuf is as performant as the native protocol in this case. Vector workload in progress.

OSB Big 5

2.16 min distribution

|                                        90th percentile latency |                                  desc_sort_timestamp |     76.5407 |      ms |
|                                        90th percentile latency |                                                 term |     69.2465 |      ms |
|                                        90th percentile latency |                                  multi_terms-keyword |     71.7096 |      ms |
|                                        90th percentile latency |                                               scroll |      503024 |      ms |
|                                        90th percentile latency |                              query-string-on-message |     76.0508 |      ms |

FetchSearchResults as protobuf - Branch

|                                        90th percentile latency |                                  desc_sort_timestamp |     78.7184 |      ms |
|                                        90th percentile latency |                                                 term |     69.6506 |      ms |
|                                        90th percentile latency |                                  multi_terms-keyword |     69.0668 |      ms |
|                                        90th percentile latency |                                               scroll |      560605 |      ms |
|                                        90th percentile latency |                              query-string-on-message |     74.4521 |      ms |

Notes

  • Several FetchSearchResults fields could be further decomposed into protobuf. For example SearchHits.collapseFields is still encoded with it’s native writeTo implementation.
  • Initial POC work (diff viewable here) created protobuf exclusive copies of structures within the search path. This may have benefited performance by allowing inlining or reducing vtable lookups when handling transport layer responses. An interesting read on inlining potentially relevant to our use of writeTo in our native serialization.
  • Search results payload may not be large enough to see any performance difference in the above workloads. Workloads with larger search results may show a more substantial performance difference.
  • As noted above a key area where we expect benefit from protobuf is automatic variable length encoding for numeric values. FetchSearchResult is not particularly suited to this case and the same functionality exists in the native protocol (although underused in some cases).
  • This implementation deserializes the message immediately. With a slightly different implementation some message types could benefit from protobuf's lazy loading functioncality.

@dblock
Copy link
Member

dblock commented Sep 4, 2024

Let's keep in mind that while node-to-node we may not see as much benefit, client-to-server switching from JSON to protobuf should. There's also a significant improvement in developer experience if we can replace the native implementation with protobuf.

@finnegancarroll
Copy link
Contributor Author

Some additional benchmarks for a vector search workload.
Again no noticeable difference in performance.

5 data nodes (r5.xlarge) per cluster. 10000 queries against sift-128-euclidean.hdf5 data set.

FetchSearchResults as Protobuf

|                                                 Min Throughput |         prod-queries |        1.82 |  ops/s |
|                                                Mean Throughput |         prod-queries |       13.69 |  ops/s |
|                                              Median Throughput |         prod-queries |       13.84 |  ops/s |
|                                                 Max Throughput |         prod-queries |       13.93 |  ops/s |
|                                        50th percentile latency |         prod-queries |     70.1863 |     ms |
|                                        90th percentile latency |         prod-queries |     71.6104 |     ms |
|                                        99th percentile latency |         prod-queries |     73.9335 |     ms |
|                                      99.9th percentile latency |         prod-queries |     79.8549 |     ms |
|                                     99.99th percentile latency |         prod-queries |      273.26 |     ms |
|                                       100th percentile latency |         prod-queries |      546.72 |     ms |
|                                   50th percentile service time |         prod-queries |     70.1863 |     ms |
|                                   90th percentile service time |         prod-queries |     71.6104 |     ms |
|                                   99th percentile service time |         prod-queries |     73.9335 |     ms |
|                                 99.9th percentile service time |         prod-queries |     79.8549 |     ms |
|                                99.99th percentile service time |         prod-queries |      273.26 |     ms |
|                                  100th percentile service time |         prod-queries |      546.72 |     ms |
|                                                     error rate |         prod-queries |           0 |      % |
|                                                  Mean recall@k |         prod-queries |        0.97 |        |
|                                                  Mean recall@1 |         prod-queries |        0.99 |        |

2.16 Native serialization

|                                                 Min Throughput |         prod-queries |        1.63 |  ops/s |
|                                                Mean Throughput |         prod-queries |       13.49 |  ops/s |
|                                              Median Throughput |         prod-queries |       13.73 |  ops/s |
|                                                 Max Throughput |         prod-queries |       13.84 |  ops/s |
|                                        50th percentile latency |         prod-queries |      70.525 |     ms |
|                                        90th percentile latency |         prod-queries |     71.8059 |     ms |
|                                        99th percentile latency |         prod-queries |     77.1621 |     ms |
|                                      99.9th percentile latency |         prod-queries |      83.521 |     ms |
|                                     99.99th percentile latency |         prod-queries |     446.145 |     ms |
|                                       100th percentile latency |         prod-queries |     610.186 |     ms |
|                                   50th percentile service time |         prod-queries |      70.525 |     ms |
|                                   90th percentile service time |         prod-queries |     71.8059 |     ms |
|                                   99th percentile service time |         prod-queries |     77.1621 |     ms |
|                                 99.9th percentile service time |         prod-queries |      83.521 |     ms |
|                                99.99th percentile service time |         prod-queries |     446.145 |     ms |
|                                  100th percentile service time |         prod-queries |     610.186 |     ms |
|                                                     error rate |         prod-queries |           0 |      % |
|                                                  Mean recall@k |         prod-queries |        0.97 |        |
|                                                  Mean recall@1 |         prod-queries |        0.99 |        |

@getsaurabh02 getsaurabh02 added Roadmap:Cost/Performance/Scale Project-wide roadmap label Roadmap:Modular Architecture Project-wide roadmap label v2.18.0 Issues and PRs related to version 2.18.0 labels Sep 9, 2024
@Pallavi-AWS
Copy link
Member

@finnegancarroll can we run vector search workload with client-to-server switching? Thanks.

@finnegancarroll finnegancarroll changed the title [META] <Leverage protobuf for serializing select node-to-node objects> [META] Leverage protobuf for serializing select node-to-node objects Oct 1, 2024
@finnegancarroll
Copy link
Contributor Author

Experimenting with a previous POC with diff here: https://github.com/finnegancarroll/OpenSearch/pull/2/files

Benchmarking locally with 5 nodes in separate docker containers with 20% big5 workload.

POC diff:

|                                                Mean Throughput | default |        2.01 |  ops/s |
|                                              Median Throughput | default |        2.01 |  ops/s |
|                                                 Max Throughput | default |        2.01 |  ops/s |
|                                        50th percentile latency | default |      5.2044 |     ms |
|                                        90th percentile latency | default |     5.78371 |     ms |
|                                        99th percentile latency | default |     6.11664 |     ms |
|                                       100th percentile latency | default |     6.24282 |     ms |
|                                   50th percentile service time | default |     3.73635 |     ms |
|                                   90th percentile service time | default |     4.09466 |     ms |
|                                   99th percentile service time | default |     4.55763 |     ms |
|                                  100th percentile service time | default |     4.74975 |     ms |
|                                                     error rate | default |           0 |      % |
|                                                 Min Throughput |   range |        2.01 |  ops/s |
|                                                Mean Throughput |   range |        2.01 |  ops/s |
|                                              Median Throughput |   range |        2.01 |  ops/s |
|                                                 Max Throughput |   range |        2.01 |  ops/s |
|                                        50th percentile latency |   range |     33.3813 |     ms |
|                                        90th percentile latency |   range |     35.1394 |     ms |
|                                        99th percentile latency |   range |     38.2579 |     ms |
|                                       100th percentile latency |   range |     39.8031 |     ms |
|                                   50th percentile service time |   range |     31.8209 |     ms |
|                                   90th percentile service time |   range |     33.8545 |     ms |
|                                   99th percentile service time |   range |     36.5908 |     ms |
|                                  100th percentile service time |   range |     38.0003 |     ms |

OpenSearch 2.17:

|                                                 Min Throughput | default |           2 |  ops/s |
|                                                Mean Throughput | default |           2 |  ops/s |
|                                              Median Throughput | default |           2 |  ops/s |
|                                                 Max Throughput | default |           2 |  ops/s |
|                                        50th percentile latency | default |     7.04535 |     ms |
|                                        90th percentile latency | default |     7.87684 |     ms |
|                                        99th percentile latency | default |     11.5237 |     ms |
|                                       100th percentile latency | default |     17.7137 |     ms |
|                                   50th percentile service time | default |     5.49358 |     ms |
|                                   90th percentile service time | default |     6.13559 |     ms |
|                                   99th percentile service time | default |     10.3286 |     ms |
|                                  100th percentile service time | default |     16.1667 |     ms |
|                                                     error rate | default |           0 |      % |
|                                                 Min Throughput |   range |           2 |  ops/s |
|                                                Mean Throughput |   range |           2 |  ops/s |
|                                              Median Throughput |   range |           2 |  ops/s |
|                                                 Max Throughput |   range |        2.01 |  ops/s |
|                                        50th percentile latency |   range |     15.2952 |     ms |
|                                        90th percentile latency |   range |     15.9011 |     ms |
|                                        99th percentile latency |   range |     20.6263 |     ms |
|                                       100th percentile latency |   range |     24.5943 |     ms |
|                                   50th percentile service time |   range |      13.646 |     ms |
|                                   90th percentile service time |   range |     14.3388 |     ms |
|                                   99th percentile service time |   range |     19.2499 |     ms |
|                                  100th percentile service time |   range |     23.1727 |     ms |
|                                                     error rate |   range |           0 |      % |

@finnegancarroll
Copy link
Contributor Author

Protobuf increases our payload size around 5% based on term queries run on the big5 dataset. On average the following query run on the big5 dataset produces a ~9.5mb response with a native serializer and ~10mb with protobuf.

curl -X POST "http://localhost:9200/_search" \
-H "Content-Type: application/json" \
-d '{
  "size": 10000,
  "query": {
    "term": {
      "log.file.path": {
        "value": "/var/log/messages/birdknight"
      }
    }
  }
}'

Without trying to deconstruct the protobuf binary format one explanation could be protobuf losses some space efficiency by requiring each field in a protobuf object have its position marked with a "field number". In contrast the OpenSearch native node-to-node protocol simply expects to read fields in the exact pre-determined order in which they are written.

@getsaurabh02 getsaurabh02 added the experimental Marks issues as experimental label Oct 7, 2024
@finnegancarroll
Copy link
Contributor Author

Some micro benchmarks with flame graphs comparing protobuf serialization to native transport & client/server JSON serialization. Test branch can be found here: https://github.com/finnegancarroll/OpenSearch/tree/serde-micro-bench

Note: Test data is generated with SearchHitsTests::createTestItem so the content is randomized and may not accurately reflect real use cases.

While microbenchmarks are not particularly reliable this does seem to suggest protobuf is more performant in both cases. Much more so on client/server than transport layer. Serialization is likely not a large enough slice of the total latency to make an impact in OSB benchmarks.

protoWriteTo    - SearchHitsProtobufBenchmark.writeToBench  avgt    4  3.460  ms/op
nativeWriteTo   - SearchHitsProtobufBenchmark.writeToBench  avgt    2  5.991  ms/op
protoXCont      - SearchHitsProtobufBenchmark.toXContBench  avgt    2  19.393 ms/op
nativeXCont     - SearchHitsProtobufBenchmark.toXContBench  avgt    2  46.680 ms/op

flame.zip

@reta
Copy link
Collaborator

reta commented Oct 9, 2024

Thanks a lot, @finnegancarroll , I am a bit surprised (since "in general" the specialized impl is expected to be a bit faster than generalized one). I am eager to run the SearchHitsProtobufBenchmark out of curiosity, the writeToBench is commented out in your branch, does it correspond to SearchHitsProtobufBenchmark.writeToBench? (if I uncomment it). Thank you.

@finnegancarroll
Copy link
Contributor Author

finnegancarroll commented Oct 9, 2024

End to end benchmarks with protobuf on transport layer as well as the REST response for a vector workload. Environment is identical to the previous tests here (with 2.17 as baseline): #15308 (comment)

|                                                 Min Throughput |         prod-queries |        1.48 |  ops/s |
|                                                Mean Throughput |         prod-queries |       13.57 |  ops/s |
|                                              Median Throughput |         prod-queries |       13.76 |  ops/s |
|                                                 Max Throughput |         prod-queries |       13.87 |  ops/s |
|                                        50th percentile latency |         prod-queries |     70.0255 |     ms |
|                                        90th percentile latency |         prod-queries |      70.999 |     ms |
|                                        99th percentile latency |         prod-queries |     73.3903 |     ms |
|                                      99.9th percentile latency |         prod-queries |     79.4551 |     ms |
|                                     99.99th percentile latency |         prod-queries |     457.676 |     ms |
|                                       100th percentile latency |         prod-queries |     673.036 |     ms |
|                                   50th percentile service time |         prod-queries |     70.0255 |     ms |
|                                   90th percentile service time |         prod-queries |      70.999 |     ms |
|                                   99th percentile service time |         prod-queries |     73.3903 |     ms |
|                                 99.9th percentile service time |         prod-queries |     79.4551 |     ms |
|                                99.99th percentile service time |         prod-queries |     457.676 |     ms |
|                                  100th percentile service time |         prod-queries |     673.036 |     ms |

The change in serialization does not appear to impact latencies in this case. Generated some flame graphs using OS 2.17 on a local 3 node cluster for this workload to estimate how much serialization impacts latency.

Examining the coordinator node for the cluster:

FetchSearchResults.<init> - 1.27 % of cpu w/ 61 samples
SearchHits.toXContent - 1.27% w/ 61 samples

On average REST response from the coordinator is a relatively small ~13kb.

VecWorkload-2.17.zip

@finnegancarroll
Copy link
Contributor Author

Hi @reta, i've cleaned up SearchHitsProtobufBenchmark to make it easier to reproduce. To provide the micro benchmarks with test items i've added a small unit test which generates and writes instances of 'SearchHits' to a temporary directory. The whole micro benchmark test suite can be run with the following two commands:

./gradlew server:test --tests "org.opensearch.search.SearchHitsTests.testMicroBenchmarkHackGenerateTestFiles" -Dtests.security.manager=false
./gradlew -p benchmarks run --args 'SearchHitsProtobufBenchmark'

I agree it is surprising that SearchHitsProtobufBenchmark.writeToNativeBench is no faster than writing the protobuf object in this case, although they are very close. Messing with the test params and changing machines I see similar results. With more test files and iterations:

SearchHitsProtobufBenchmark.toXContNativeBench  avgt       26.244          ms/op
SearchHitsProtobufBenchmark.toXContProtoBench   avgt       13.193          ms/op
SearchHitsProtobufBenchmark.writeToNativeBench  avgt       16.033          ms/op
SearchHitsProtobufBenchmark.writeToProtoBench   avgt       12.170          ms/op

Looking through the flame graphs of SearchHitsProtobufBenchmark.writeToProtoBench and SearchHitsProtobufBenchmark.writeToNativeBench i'm wondering if the difference comes down to how SearchHit.documentFields and SearchHit.metaFields are handled. SearchHitsProtobufBenchmark.writeToNativeBench spends almost 40% of its time in StreamOutput::writeGenericValue but I don't see the same kind of time spent writing to stream in the protobuf implementation.

image
image

@reta
Copy link
Collaborator

reta commented Oct 9, 2024

Thanks @finnegancarroll

SearchHitsProtobufBenchmark.writeToNativeBench spends almost 40% of its time in StreamOutput::writeGenericValue but I don't see the same kind of time spent writing to stream in the protobuf implementation.

Very true, I came to the same conclusions, I will try to briefly look what could be the issue there (without spending too much time). Thanks a lot for cleaning up the benchmarks!

@finnegancarroll
Copy link
Contributor Author

finnegancarroll commented Oct 10, 2024

While investigating areas we might expect to benefit from protobuf I spent some time reproducing this previous POC: #10684 (comment)

Below benchmarks were run locally with 5 nodes, gradle run, and the nyc_taxis workload which is as close an environment as I can manage to the previous POC benchmarks. There is a noticeable difference in latency between these results and the original OSB numbers which could be attributed to using more of the nyc_taxis dataset or a differences in hardware.

Edit: Previous POC gives Store size |   | 0.000252848 | GB while the below is using all 24GB which likely accounts for the difference in latency.

Protobuf Proof of Concept Branch
POC branch
OSB-poc-search-protobuf.zip

90th percentile latency |  range |      240.02 |     ms |
90th percentile latency |  range |     236.208 |     ms |
90th percentile latency |  range |     240.254 |     ms |
90th percentile latency |  range |     231.929 |     ms |
90th percentile latency | default |     12.0153 |     ms |
90th percentile latency | default |     10.2515 |     ms |
90th percentile latency | default |     9.61354 |     ms |
90th percentile latency | default |     9.62332 |     ms |

Native Branch
Head of main for the above POC branch (commit d916f9c1027f5b2ccff971a66921b7c26db1688f)
OSB-d916f9c.zip

90th percentile latency |  range |     236.776 |     ms |
90th percentile latency |  range |     243.967 |     ms |
90th percentile latency | default |     9.56527 |     ms |
90th percentile latency | default |     7.67412 |     ms |

Inspecting flame graphs for the poc-search-protobuf I see nearly all cpu cycles are consumed by lucene processing the search and little time is spent on serialization.
poc-search-protobuf5NodeClusterDefaultFlameGraphs.zip
poc-search-protobuf5NodeClusterRangeFlameGraphs.zip

@dbwiddis
Copy link
Member

@finnegancarroll This is great seeing benchmarking and deep diving into the details here!

Without trying to deconstruct the protobuf binary format one explanation could be protobuf losses some space efficiency by requiring each field in a protobuf object have its position marked with a "field number". In contrast the OpenSearch native node-to-node protocol simply expects to read fields in the exact pre-determined order in which they are written.

Having deconstructed the protobuf binary format, I'll confirm this observation. :)

In short: protobuf uses a byte for each field that exists (type and index info) and 0 bytes for nonexistent fields. The binary transport protocol doesn't mark each field but assumes all fields are present in a specific order, and requires the use of a byte for a boolean "optional". So the takeaway from a "bandwidth" perspective is that if you tend to have a lot of optional information, you can save space with protobuf.

But as your tests seem to show, this marginal difference in bandwidth seems to be less significant, and maybe not the thing to focus on....

@reta
Copy link
Collaborator

reta commented Oct 11, 2024

Thanks @finnegancarroll , my apologies for the delay, just completed may part. I didn't spend too much time on native serialization but it looks like (with the benchmarks) StreamOutput::getGenericType (that is being used by metaFields and documentFields) is the bottleneck. I think we could use some clever tricks to make it faster for collections (one such example would be to capture the type of the first element in the collection since I suspect those are homogeneous but have different types). I could spend more time on that if you think it is beneficial.

But as your tests seem to show, this marginal difference in bandwidth seems to be less significant, and maybe not the thing to focus on....

This is a very valid point @dbwiddis but we also should not forget the maintenance costs. Changes in native protocol are difficult now and incur a burden on everyone (update main, backport + change version, forwardport to main, ...). Protobuf takes care of that (as far as schema is properly evolved), I think that would bring more benefits than marginal latency improvements.

@finnegancarroll
Copy link
Contributor Author

finnegancarroll commented Oct 17, 2024

Hi @reta, I spent some time exploring your above suggestion and I'm seeing benefits but only in some limited artificial cases for the moment. StreamOutput::getGenericType looks to inline getGenericType and getWriter but running benchmarks with the below gives about ~20% of our total time is spent in StreamOutput::getGenericType.

./gradlew -p benchmarks run --args ' SearchHitsProtobufBenchmark.writeToNativeBench -jvmArgs "-XX:CompileCommand=dontinline,org.opensearch.core.common.io.stream.StreamOutput::getGenericType -XX:CompileCommand=dontinline,org.opensearch.core.common.io.stream.StreamOutput::getWriter"'

(StreamOutput::getGenericType in purple)
writeGenericNoInline.zip

Stepping through with a debugger it looks like we do save on some calls to StreamOutput::getGenericType but in the case of test items taken from the unit test suite and OSB datasets i'm having difficulty finding DocumentFields that are specifically large collections.

To quickly hack in a "best case" scenario dataset I modified the unit tests such that DocumentFields could only be 100 element long integer lists. With the optimized writing implemented like this.

This gives pretty substantial speedup in microbenchmarks, especially considering the cpu now spends nearly all of it's time writing specifically DocumentFields.

// Control
SearchHitsProtobufBenchmark.writeToNativeBench  avgt    2  2205.636          ms/op
// Skip type checking for homogeneous collections
SearchHitsProtobufBenchmark.writeToNativeBench  avgt    4  658.330 ± 255.170  ms/op

Control
image

Skip type checking for homogeneous collections
image

100IntListDocFieldsFlameGraphs.zip

@reta
Copy link
Collaborator

reta commented Oct 17, 2024

Thanks a lot @finnegancarroll , it looks like we are at crossroads now: should we move forward with protobuf (since the latency wins are moderate) or (may be) move forward with optimizing the native protocol?

@finnegancarroll
Copy link
Contributor Author

Additional benchmarks comparing protobuf to native protocol with the same inflated DocumentFields size as used here.

SearchHitsProtobufBenchmark.toXContNativeBench  avgt    2  1787.097          ms/op
SearchHitsProtobufBenchmark.toXContProtoBench   avgt    2   832.177          ms/op
SearchHitsProtobufBenchmark.writeToNativeBench  avgt    2   783.957          ms/op
SearchHitsProtobufBenchmark.writeToProtoBench   avgt    2   807.185          ms/op

@reta
Copy link
Collaborator

reta commented Oct 18, 2024

Additional benchmarks comparing protobuf to native protocol with the same inflated DocumentFields size as used here.

Thanks @finnegancarroll , writeToNativeBench & writeToProtoBench are really close, I think XContent ones we could discard (at the moment) since node-to-node does not use this serialization format (only native serialization).

@dblock
Copy link
Member

dblock commented Oct 18, 2024

Thanks a lot @finnegancarroll , it looks like we are at crossroads now: should we move forward with protobuf (since the latency wins are moderate) or (may be) move forward with optimizing the native protocol?

I think we should do both. Switching to an industry standard is a long term bet on improving developer experience and an entire community optimizing the binary protocol for us.

@finnegancarroll
Copy link
Contributor Author

Closing this issue to move forward with client facing gRPC server - #16556.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
experimental Marks issues as experimental Meta Meta issue, not directly linked to a PR Roadmap:Cost/Performance/Scale Project-wide roadmap label Roadmap:Modular Architecture Project-wide roadmap label Search:Performance v2.17.0 v2.18.0 Issues and PRs related to version 2.18.0
Projects
Status: New
Status: Done
Archived in project
Development

No branches or pull requests

7 participants