Cat Nodes API with Protobuf #9097

VachaShah · 2023-08-03T19:04:58Z

Description

The purpose of this draft PR is to create a new cat nodes API with protobuf as a serialization/de-serialization mechanism for node-to-node communication. We also benchmark the performance results in comparison to the original cat nodes API. This PR creates a separate path for the new API but when these changes are to be merged, a lot of code in the original API can be replaced with the newer approach. For the purpose of this POC, as of now the files are separate.

This PR represents the work done for the POC and is not be merged. The changes will be raised, reviewed and merged in incremental PRs.

Related Issues

#6844
#1287

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Vacha Shah <[email protected]>

…ream related functionality Signed-off-by: Vacha Shah <[email protected]>

…tegration Signed-off-by: Vacha Shah <[email protected]>

Signed-off-by: Vacha Shah <[email protected]>

…sAction Signed-off-by: Vacha Shah <[email protected]>

Signed-off-by: Vacha Shah <[email protected]>

…- mainly TransportService to try to fix multi node problem Signed-off-by: Vacha Shah <[email protected]>

…s commit Signed-off-by: Vacha Shah <[email protected]>

Signed-off-by: Vacha Shah <[email protected]>

…rch-project#9073) This commit refactors the following network and transport libraries to the opensearch common and core libraries respectively: * o.o.common.network.Cidrs -> :libs:opensearch-common * o.o.common.network.InetAddresses -> :libs:opensearch-common * o.o.common.network.NetworkAddress -> :libs:opensearch-common * o.o.common.transport.NetworkExceptionHelper -> :libs:opensearch-common * o.o.common.transport.PortsRange -> :libs:opensearch-common * o.o.common.transport.TransportAddress -> :libs:opensearch-core * o.o.common.transport.BoundTransportAddress -> :libs:opensearch-core * o.o.transport.TransportMessage -> :libs:opensearch-core * o.o.transport.TransportResponse -> :libs:opensearch-core The purpose is to reduce the change surface area of the core APIs to minimize impact to downstream consumers while moving toward establishing a formal API for cloud native or serverless implementations. Signed-off-by: Nicholas Walter Knize <[email protected]>

…st index deletion. (opensearch-project#8472) Signed-off-by: Harish Bhakuni <[email protected]>

…replay (opensearch-project#8578) Signed-off-by: Gaurav Bafna <[email protected]>

…search-project#9057) --------- Signed-off-by: Ashish Singh <[email protected]>

…ckpoint validation (opensearch-project#8889) * Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation. This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways: 1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode. If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date. - This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode. 2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump. However this cancellation is async and immediately starting a new replication event could fail as its still replicating. - This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind. This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas. - Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed. This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm. - Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed. - Removes unnecessary onCheckpointPublished method used to start replication timers manually. This will happen automatically on primaries once its local cp is updated. Signed-off-by: Marc Handalian <[email protected]> * Handle NoSuchFileException when attempting to delete decref'd files. To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file. Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete. Signed-off-by: Marc Handalian <[email protected]> * Add more unit tests. Signed-off-by: Marc Handalian <[email protected]> * Clean up IndexShardTests.testCheckpointReffreshListenerWithNull Signed-off-by: Marc Handalian <[email protected]> * Remove unnecessary catch for NoSuchFileException. Signed-off-by: Marc Handalian <[email protected]> * Add another test for non segrep. Signed-off-by: Marc Handalian <[email protected]> * PR Feedback. Signed-off-by: Marc Handalian <[email protected]> * re-compute replication checkpoint on primary promotion. Signed-off-by: Marc Handalian <[email protected]> --------- Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Vacha Shah <[email protected]>

github-actions · 2023-08-03T19:27:32Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/21815/
CommitID: c05f5aa
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

opensearch-trigger-bot · 2023-08-03T19:38:39Z

Compatibility status:



> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git]
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git]

BUILD SUCCESSFUL in 33m 18s

gashutos · 2023-08-14T09:16:00Z

Strictly speaking from performance dimension,
I was just benchmarking netty http layer for search path. We hardly spend 1 or 2 ms in netty layer per request.
Any benchmark available for protobuf ?

opensearch-trigger-bot · 2023-09-13T15:21:32Z

This PR is stalled because it has been open for 30 days with no activity. Remove stalled label or comment or this will be closed in 7 days.

dblock · 2023-10-02T20:55:32Z

@VachaShah Given #6844 (comment), do you have any ideas about how to bring this into OpenSearch as maybe an experimental feature? What are next steps?

VachaShah · 2023-10-02T23:12:16Z

@VachaShah Given #6844 (comment), do you have any ideas about how to bring this into OpenSearch as maybe an experimental feature? What are next steps?

I am working on getting the changes from the POC in #9097 to merge in the repo. We are also working on getting the numbers for APIs like search.

peternied · 2023-10-12T17:33:33Z

libs/core/src/main/java/org/opensearch/core/transport/TransportMessage.java

+    * Constructs a new transport message with the data from the {@link byte[]}. This is
+    * currently a no-op
+    */
+    public TransportMessage(byte[] in) {}


Could we avoid adding this change by reusing the existing StreamInput?

peternied

This is really exciting work - thanks for getting this out there.

For the purpose of this POC, as of now the files are separate.

What our your thoughts on the end state for this change? I'm resistant to merge so much duplicate code.

peternied · 2023-10-12T17:35:41Z

libs/core/src/main/java/org/opensearch/core/common/io/stream/ProtobufWriteable.java

+        * @param out Output to write the {@code value} too
+        * @param value The value to add
+        */
+        void write(OutputStream out, V value) throws IOException;


Its strange than the writer is via an OutputStream, but the reader isn't a symetrical version that uses InputStream. Can we align the types to be consistant, be it bytes[] or *Stream?

peternied · 2023-10-12T17:53:24Z

server/src/main/java/org/opensearch/tasks/ProtobufTask.java

+*
+* @opensearch.internal
+*/
+public class ProtobufTask {


This class is nearly identical to the existing Task, why do we need a largely identical protobuf version? This seems to imply there is coupling between the existing OpenSearch data models and their serialized form. Can we decompose these relationships more?

peternied

~ Duplicate comment ~

VachaShah · 2023-10-12T18:42:31Z

This is really exciting work - thanks for getting this out there.

For the purpose of this POC, as of now the files are separate.

What our your thoughts on the end state for this change? I'm resistant to merge so much duplicate code.

Hi @peternied, thank you for your comments! This PR is not be merged, its a draft PR for the work done to get numbers for which I created a parallel API _cat/nodes_protobuf to compare with the original version. The changes to be merged will go in incremental PRs, which I will raise later. This PR is just out here in the draft state for the work done for the POC.

VachaShah · 2023-10-12T18:44:43Z

This is really exciting work - thanks for getting this out there.

For the purpose of this POC, as of now the files are separate.

What our your thoughts on the end state for this change? I'm resistant to merge so much duplicate code.

Hi @peternied, thank you for your comments! This PR is not be merged, its a draft PR for the work done to get numbers for which I created a parallel API _cat/nodes_protobuf to compare with the original version. The changes to be merged will go in incremental PRs, which I will raise later. This PR is just out here in the draft state for the work done for the POC.

I think I was not clear in the description of the PR, so I added a comment about this as well.

dblock · 2023-10-13T19:10:08Z

Any parts of this PR that can be merged? Otherwise I think we can document this in #6844 and close it? (And nice work!)

VachaShah · 2023-10-17T18:09:54Z

Any parts of this PR that can be merged? Otherwise I think we can document this in #6844 and close it? (And nice work!)

Thank you @dblock! I will raise incremental PRs for this, so closing this one now. I will make sure to update #6844 to document this.

VachaShah added 30 commits August 2, 2023 17:13

Adding BaseWriteable and ProtobufWriteable

9be33d7

Signed-off-by: Vacha Shah <[email protected]>

Adding proto messages for TaskId and TaskResourceStats

0034b0a

Signed-off-by: Vacha Shah <[email protected]>

Adding Task related classes with protobuf integration

cdb6806

Signed-off-by: Vacha Shah <[email protected]>

Adding ProtobufStreamInput and ProtobufStreamOutput for additional st…

c8c2f53

…ream related functionality Signed-off-by: Vacha Shah <[email protected]>

Adding TransportMessage and TransportRequest classes with protobuf in…

a2be2fa

…tegration Signed-off-by: Vacha Shah <[email protected]>

Fixing build and precommit

c0bbf92

Signed-off-by: Vacha Shah <[email protected]>

Adding protobuf integrations for client, transport, request

551fe82

Signed-off-by: Vacha Shah <[email protected]>

Fixing build and integrating protobuf for classes related to RestNode…

a1ab589

…sAction Signed-off-by: Vacha Shah <[email protected]>

Fixing node crashes

8e57034

Signed-off-by: Vacha Shah <[email protected]>

Fixes

a5ce258

Signed-off-by: Vacha Shah <[email protected]>

Fixing nodes api response for protobuf

0d76b3b

Signed-off-by: Vacha Shah <[email protected]>

ProtobufClusterState to ClusterState

e6f6190

Signed-off-by: Vacha Shah <[email protected]>

Fixing cluster manager x to *

b94a49e

Signed-off-by: Vacha Shah <[email protected]>

Eliminating a lot of protobuf classes to merge with original classes …

cb320c7

…- mainly TransportService to try to fix multi node problem Signed-off-by: Vacha Shah <[email protected]>

This fixes the single node calls after the refactoring in the previou…

13c4568

…s commit Signed-off-by: Vacha Shah <[email protected]>

Trying serialization and deserialization across nodes

b659260

Signed-off-by: Vacha Shah <[email protected]>

Debugging and fixing compile errors

e96f3e0

Signed-off-by: Vacha Shah <[email protected]>

Trying proto message serde across nodes with tests and example request

496cdbb

Signed-off-by: Vacha Shah <[email protected]>

Changes for multi node - working on single and multi node

8aab04b

Signed-off-by: Vacha Shah <[email protected]>

Calculating time

b583179

Signed-off-by: Vacha Shah <[email protected]>

Removing sysouts

515110a

Signed-off-by: Vacha Shah <[email protected]>

Cleaning up code

f94f165

Signed-off-by: Vacha Shah <[email protected]>

Ignoring proto generated classes

35d8906

Signed-off-by: Vacha Shah <[email protected]>

Cleaning up more code

d511930

Signed-off-by: Vacha Shah <[email protected]>

Cleaning up CodedInputStream, CodedOutputStream and TryWriteable

562e6c3

Signed-off-by: Vacha Shah <[email protected]>

Performance improvements

d5f3a0e

Signed-off-by: Vacha Shah <[email protected]>

Fixing compile errors after merging with main

2f423b9

Signed-off-by: Vacha Shah <[email protected]>

Renaming proto messages and related classes

eeeb40f

Signed-off-by: Vacha Shah <[email protected]>

Fixing precommit failures

8f5a327

Signed-off-by: Vacha Shah <[email protected]>

Cleaning up code

d409e71

Signed-off-by: Vacha Shah <[email protected]>

VachaShah and others added 8 commits August 3, 2023 07:07

Improvements

967d26b

Signed-off-by: Vacha Shah <[email protected]>

[Snapshot Interop] Add Logic in Lock Manager to cleanup stale data po…

d73fd6a

…st index deletion. (opensearch-project#8472) Signed-off-by: Harish Bhakuni <[email protected]>

Avoid duplicate indexing in case of SegRep enabled indices' translog …

f3a17fc

…replay (opensearch-project#8578) Signed-off-by: Gaurav Bafna <[email protected]>

Fix flaky test testStatsOnShardUnassigned in RemoteStoreStatsIT (open…

64a7457

…search-project#9057) --------- Signed-off-by: Ashish Singh <[email protected]>

Getting latest changes from main

46aaea2

Signed-off-by: Vacha Shah <[email protected]>

Merge branch 'main' into poc-cat-nodes-protobuf

c05f5aa

VachaShah mentioned this pull request Sep 4, 2023

Introduce protobuf serialization/deserialization support for transport requests/responses #9737

Closed

6 tasks

opensearch-trigger-bot bot added the stalled Issues that have stalled label Sep 13, 2023

VachaShah mentioned this pull request Oct 2, 2023

[RFC] Protobuf in OpenSearch #6844

Open

opensearch-trigger-bot bot removed the stalled Issues that have stalled label Oct 3, 2023

peternied reviewed Oct 12, 2023

View reviewed changes

peternied requested changes Oct 12, 2023

View reviewed changes

peternied reviewed Oct 12, 2023

View reviewed changes

VachaShah closed this Oct 17, 2023

VachaShah mentioned this pull request Oct 18, 2023

[META] Protobuf for Search API #10684

Open

4 tasks

finnegancarroll mentioned this pull request Aug 19, 2024

[META] Leverage protobuf for serializing select node-to-node objects #15308

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cat Nodes API with Protobuf #9097

Cat Nodes API with Protobuf #9097

VachaShah commented Aug 3, 2023 •

edited

Loading

github-actions bot commented Aug 3, 2023

opensearch-trigger-bot bot commented Aug 3, 2023

gashutos commented Aug 14, 2023 •

edited

Loading

opensearch-trigger-bot bot commented Sep 13, 2023

dblock commented Oct 2, 2023

VachaShah commented Oct 2, 2023

peternied Oct 12, 2023

peternied left a comment

peternied Oct 12, 2023

peternied Oct 12, 2023

peternied left a comment •

edited

Loading

VachaShah commented Oct 12, 2023

VachaShah commented Oct 12, 2023

dblock commented Oct 13, 2023 •

edited

Loading

VachaShah commented Oct 17, 2023

Cat Nodes API with Protobuf #9097

Cat Nodes API with Protobuf #9097

Conversation

VachaShah commented Aug 3, 2023 • edited Loading

Description

Related Issues

Check List

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

opensearch-trigger-bot bot commented Aug 3, 2023

gashutos commented Aug 14, 2023 • edited Loading

opensearch-trigger-bot bot commented Sep 13, 2023

dblock commented Oct 2, 2023

VachaShah commented Oct 2, 2023

peternied Oct 12, 2023

Choose a reason for hiding this comment

peternied left a comment

Choose a reason for hiding this comment

peternied Oct 12, 2023

Choose a reason for hiding this comment

peternied Oct 12, 2023

Choose a reason for hiding this comment

peternied left a comment • edited Loading

Choose a reason for hiding this comment

VachaShah commented Oct 12, 2023

VachaShah commented Oct 12, 2023

dblock commented Oct 13, 2023 • edited Loading

VachaShah commented Oct 17, 2023

VachaShah commented Aug 3, 2023 •

edited

Loading

gashutos commented Aug 14, 2023 •

edited

Loading

peternied left a comment •

edited

Loading

dblock commented Oct 13, 2023 •

edited

Loading