
[RFS] Refactored DocumentReindexer, HTTP client usage, added an OpenSearchClient #623

Merged
merged 4 commits into opensearch-project:main from MIGRATIONS-1691
May 2, 2024

Conversation

@chelma (Member) commented May 1, 2024

Description

  • Checkpoint commit. Refactored the RFS code:
    • Cleaned up DocumentReindexer error handling and logging
    • Made reactor-netty the sole HTTP client provider
    • Created an OpenSearchClient layer above the RestClient
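The client layering described above (an OpenSearchClient sitting on top of the RestClient) can be sketched roughly as follows; every class and method name here is an illustrative stand-in, not the PR's actual signatures:

```java
import java.net.HttpURLConnection;

// Hypothetical sketch of the layering: OpenSearchClient builds
// OpenSearch-aware operations on top of a thin RestClient that only
// knows how to issue HTTP calls. All names are illustrative.
public class LayeringSketch {
    static class Response {
        final int code;
        final String body;
        Response(int code, String body) { this.code = code; this.body = body; }
    }

    interface RestClient {
        Response get(String path);
        Response put(String path, String body);
    }

    static class OpenSearchClient {
        private final RestClient rest;
        OpenSearchClient(RestClient rest) { this.rest = rest; }

        // A higher-level, idempotent operation composed from raw HTTP
        // calls: create the index only if a GET says it is absent.
        boolean createIndexIdempotent(String indexName, String settings) {
            Response r = rest.get("/" + indexName);
            if (r.code == HttpURLConnection.HTTP_NOT_FOUND) {
                rest.put("/" + indexName, settings);
                return true;
            }
            return false; // already present, nothing to do
        }
    }
}
```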

Issues Resolved

Testing

  • Updated existing unit tests
  • Ran locally

Check List

  • New functionality includes testing
    • All tests pass, including unit tests, integration tests, and doctests
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented May 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.90%. Comparing base (0c6d19d) to head (c3e13a1).
Report is 24 commits behind head on main.

❗ Current head c3e13a1 differs from pull request most recent head 75691b9. Consider uploading reports for the commit 75691b9 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #623      +/-   ##
============================================
- Coverage     75.93%   75.90%   -0.04%     
- Complexity     1496     1536      +40     
============================================
  Files           165      168       +3     
  Lines          6362     6395      +33     
  Branches        573      570       -3     
============================================
+ Hits           4831     4854      +23     
- Misses         1150     1167      +17     
+ Partials        381      374       -7     
Flag: unittests | Coverage: 75.90% <ø> (-0.04% ⬇️)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


logger.info("Shard reindexing completed");
int targetShardId = shardId; // Define in local context for the lambda
Collaborator:

Super minor style thing: you can make this explicitly final and keep the same name, apart from some convention (like prepending the word 'final') to show that you're only doing it for the lambda.
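The suggested pattern, an explicitly final copy whose name signals it exists only for lambda capture, looks something like this (names illustrative):

```java
import java.util.function.IntSupplier;

public class FinalCaptureDemo {
    public static int capture(int shardId) {
        // Java lambdas can only capture effectively-final locals. Here
        // shardId could be captured directly, but when the variable is
        // reassigned (e.g. a loop counter), an explicitly final copy is
        // required; the 'final' name prefix flags why the copy exists.
        final int finalShardId = shardId;
        IntSupplier inLambda = () -> finalShardId;
        return inLambda.getAsInt();
    }
}
```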

Member Author:

Thanks - done.

h.set("Authorization", "Basic " + encodedCredentials);
}
});
public static Mono<Void> reindex(String indexName, Flux<Document> documentStream, ConnectionDetails targetConnection) throws Exception {
Collaborator:

Returning something for every future operation makes a ton of sense. Take a look at my PR to overhaul our async promises; returning these so that they can be chained, even if just for monitoring, is a good move.
I'm not familiar with the reactor client, but there may be a longer-term question about whether you want/need to embrace reactor primitives vs. more universal ones (like CompletableFuture), which we have significant diagnostic support for at the moment. Longer term that switch would be straightforward, if borderline arduous (PR-614 already swapped the netty future machinery out for CompletableFutures).

Member Author:

I'm not sure myself, yet. I think there is a limit to how far up the call stack this stuff will propagate; I've hit that limit in some local changes I have that aren't in this PR. Definitely something to keep an eye on, agree!

Took a look at the referenced PR; don't think the RFS behavior is complex enough to need that level of abstraction. I could be proven wrong, but it feels so far like we can get by with just using the Reactor primitives?

Collaborator:

Having a chain of 2-3 dependencies may be complex enough if you want to make sure that things aren't wedging forever or unintentionally holding up other calls. I'm not sure where you're at, but if you have a web-client, you'll have more than that. The reason to use something like TrackedFutures is that if you have > 1K activities per second, you're not really going to want to log enough details that if one goes wrong, you can find out what the issue is. That would be slow and expensive (in CPU/storage + user time). The approach that the replayer now takes is to keep those graphs around, then render only the ones that are becoming a problem.

It would be good to check the impact of TrackedFutures before putting it into more parts of the solution (it hasn't gone into the proxy yet).

This comment thread came up because of the return type. If you're worried about how high up the stack your reactor types go, you could convert a Mono to a CompletableFuture (like what is done in NettyFutureBinders).
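Reactor does expose a direct bridge to the more universal primitive: Mono.toFuture() returns a CompletableFuture. A minimal sketch, assuming reactor-core is on the classpath (the pipeline contents are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import reactor.core.publisher.Mono;

public class MonoBridgeDemo {
    // Convert a Mono pipeline into a CompletableFuture at the boundary
    // where reactor types should stop propagating up the call stack.
    public static CompletableFuture<String> demo() {
        return Mono.just("bulk request sent")
                .map(String::toUpperCase)
                .toFuture();
    }
}
```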

Member Author:

I've actually found how high up the Reactor types go at this point; it's not as high as I feared and actually makes sense (to me, at least). You'll get a good look when I put out the next PR soon w/ the RfsWorker coded up to perform the Snapshot Phase.

- .flatMap(bulkJson -> sendBulkRequest(client, targetUrl, bulkJson)) // Send the request
+ .flatMap(bulkJson -> client.sendBulkRequest(indexName, bulkJson) // Send the request
.doOnSuccess(unused -> logger.debug("Batch succeeded"))
.doOnError(error -> logger.error("Batch failed", error))
Collaborator:

You'll want to include some identifier so that you can determine the cause and impact of the failure.

Member Author:

In this case, the error message includes the response body of the failed POST. I think that should be sufficient?

Collaborator:

Oh, I missed the error, sorry. If error is an exception, you could also do logger.atError(()->"Label").setCause(error).log

Member Author:

I'll take another pass at all the logging stuff a bit later, if you don't mind; want to get the core functionality in place right now.


- documentStream
+ return documentStream
Collaborator:

reactor does look like a pretty nice and powerful library. I'm wondering if we should think about using it for the replayer, especially as we get more serious about retries. I'm wondering if it supports scheduling.

Member Author:

Yeah, I think it does support scheduling. It's a pub-sub model, and you can write your own subscriber model and pull from the publisher however you want. At least, that's my understanding from the docs.
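Reactor's scheduling lives in its Schedulers class, and work can be shifted between thread pools with publishOn/subscribeOn. A small sketch, assuming reactor-core is available (the pipeline itself is illustrative):

```java
import java.util.List;
import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

public class SchedulingDemo {
    public static List<Integer> demo() {
        return Flux.range(1, 3)
                // Shift downstream work onto a bounded elastic pool,
                // e.g. for blocking-ish per-element processing.
                .publishOn(Schedulers.boundedElastic())
                .map(i -> i * 2)
                .collectList()
                .block(); // acceptable here: we're outside any event loop
    }
}
```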

@@ -9,14 +9,14 @@ public class FileSystemSnapshotCreator extends SnapshotCreator {
private static final Logger logger = LogManager.getLogger(FileSystemSnapshotCreator.class);
private static final ObjectMapper mapper = new ObjectMapper();

private final ConnectionDetails connectionDetails;
private final OpenSearchClient client;
Collaborator:

nit: what kind of client is this for - it's to upload source cluster data, source cluster metadata, or RFS metadata?

Member Author:

Quite literally, a client to interact w/ OpenSearch as a layer above the RestClient. So, yes. We may have another client layer abstraction above it (such as the OpenSearchCmsClient class I have in local changes that uses an OpenSearchClient to interact w/ the Coordinating Metadata Store).

Collaborator:

That tells me what the class is. I was asking about what this particular instance was being used for within this class. I was trying to understand how this class worked.

Member Author:

Well, it's used to take a snapshot from the source cluster to the local filesystem. Should be self-evident from the class/method names? If not, open to suggestions for ways to make it more obvious.


// Report the results
int responseCode = conn.getResponseCode();
public Response get(String path, boolean quietLogging) {
Collaborator:

where do you use the blocking calls? Given that it looks like everything is async, why not go all the way :)

Member Author:

We use the blocking calls where performance isn't important but ordering is - like whether to do a PUT based on the response from a GET.

Collaborator (@gregschohn) commented May 2, 2024:

Hmmm. I was expecting to see some of those be chained.

I'm not sure how the threading model works with reactor, but with a netty client, all IO for a given channel runs on one thread. Blocking on that thread can have profound, negative consequences (deadlock being one of them). It looks like that isn't the case here, but since netty is being used under the hood, I'm curious how its interacting with this level.
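The deadlock risk described above is easy to reproduce with any single-threaded event loop, reactor or not. Here is a JDK-only sketch (not RFS code) where blocking the lone worker thread on work queued behind it wedges forever:

```java
import java.util.concurrent.*;

public class EventLoopBlockDemo {
    // Blocking the only event-loop thread while it waits on a task queued
    // behind it deadlocks: the inner task can never start. Returns false
    // when the outer task fails to finish within the timeout.
    public static boolean completesInTime() {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        try {
            Future<?> outer = loop.submit(() -> {
                Future<?> inner = loop.submit(() -> { });
                try {
                    inner.get(); // blocks the loop's only thread on itself
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                }
            });
            outer.get(500, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            return false; // wedged: a classic self-deadlock
        } finally {
            loop.shutdownNow();
        }
    }
}
```

Chaining the dependent work (flatMap in reactor, thenCompose on CompletableFuture) instead of blocking avoids this entirely.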

byte[] encodedAuth = Base64.getEncoder().encode(auth.getBytes(StandardCharsets.UTF_8));
String authHeaderValue = "Basic " + new String(encodedAuth);
conn.setRequestProperty("Authorization", authHeaderValue);
private void logResponse(Response response, boolean quietLogging) {
Collaborator:

Should this be protected so that others can leverage the class with different policies on logging?

Member Author:

I ripped it out in my local changes, consider this function gone.

}

public boolean hasFailedOperations() {
return body.contains("\"errors\":true");
Collaborator:

Leave a TODO to strengthen this so that you're looking in the right place or a comment as to why it works? You might already be immune to checking within a document if the documents are already json encoded/escaped.

Member Author:

Left a comment; hope it's satisfactory.
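One way to strengthen the check would be to parse the top-level errors flag instead of substring-matching the raw body. A sketch assuming Jackson (which this codebase already uses for its ObjectMapper); the method name mirrors the one under review but this is not the PR's code:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class BulkErrorCheck {
    private static final ObjectMapper mapper = new ObjectMapper();

    // Only the top-level "errors" flag counts; an escaped
    // "\"errors\":true" inside a document body cannot trip this.
    public static boolean hasFailedOperations(String body) {
        try {
            return mapper.readTree(body).path("errors").asBoolean(false);
        } catch (java.io.IOException e) {
            return true; // unparseable response: treat as failure
        }
    }
}
```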

client.put(objectPath, settings.toString(), false);
return true;
} else if (response.code == HttpURLConnection.HTTP_OK) {
logger.warn(objectPath + " already exists. Skipping creation.");
Collaborator:

if this was an idempotent operation, should this be a warning log message?

Member Author:

Yeah, you're right, probably shouldn't be a WARN. Will update.

if (response.code == HttpURLConnection.HTTP_NOT_FOUND) {
client.put(objectPath, settings.toString(), false);
return true;
} else if (response.code == HttpURLConnection.HTTP_OK) {
Collaborator:

Is this the right status code? It doesn't seem like it from the log line below

Member Author:

It should be the correct status code, as it's for the GET not the PUT.

Signed-off-by: Chris Helma <[email protected]>
/*
* Create a legacy template if it does not already exist; return true if created, false otherwise.
*/
public boolean createLegacyTemplateIdempotent(String templateName, ObjectNode settings){
Member (@peternied) commented May 2, 2024:

Design Nit-pick: Separate use of idempotent and non-idempotent functionality with intermediate classes so the call looks like client.idempotent().createLegacyTemplate(...) vs client.stateful().createSnapshot(...)

Along these lines, expose the client.idempotent() interface at certain levels of the project to be really sure [developers] aren't doing the wrong thing.

Member Author:

Good suggestion; we'll see how the code develops. This isn't the only thing in the class I'm not feeling confident in so I expect there to be continuing changes to it.
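The suggested split might look like this; every name here is hypothetical, sketching the client.idempotent()/client.stateful() shape from the review rather than anything in the PR:

```java
// Hypothetical facade separating idempotent operations (safe to retry
// blindly) from stateful ones, per the review suggestion. All names
// are illustrative, not actual RFS classes; the bodies are stubs.
public class ClientFacadeSketch {
    public static class IdempotentOps {
        public boolean createLegacyTemplate(String name) {
            return true; // would do a GET-then-PUT under the hood
        }
    }
    public static class StatefulOps {
        public boolean createSnapshot(String name) {
            return true; // would mutate cluster state
        }
    }
    public static class OpenSearchClient {
        public IdempotentOps idempotent() { return new IdempotentOps(); }
        public StatefulOps stateful() { return new StatefulOps(); }
    }
}
```

Exposing only the idempotent() interface to certain layers then enforces the "can't do the wrong thing" property at compile time.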

.baseUrl(connectionDetails.url)
.headers(h -> {
h.add("Content-Type", "application/json");
if (connectionDetails.authType == ConnectionDetails.AuthType.BASIC) {
Member:

Design Nitpick: We are going to need a better way to handle authentication in the future such as sigv4 and other methods. IMO this is the case for using some kind of preconstructed client for interacting with the cluster.

Collaborator (@gregschohn) commented May 2, 2024:

It would be good for the sake of our users to try to keep as much unity in our command line options as possible, but I haven't thought of a good way to do that. I'm wary to try to mix and match different params into one object for a single Beust class.

Member Author:

Probably; I think @gregschohn has some tasks created to look into auth. Not required for the current iteration so I didn't invest the time in it.

@@ -36,6 +39,7 @@ public Flux<Document> readDocuments(Path luceneFilesBasePath, String indexName,
reader.close();
} catch (IOException e) {
logger.error("Failed to close IndexReader", e);
Collaborator:

you can also write this as logger.atError().setCause(e).setMessage(()->"...").log().
From my read, when I've looked at this in the past, the multi-arg error function is a pretty loose convention that just happens to work with log4j2, but it isn't really an slf4j standard.

Member Author:

I'll do logging pass a bit later, want to get the core functionality in place first.

Member @peternied left a comment:

Looks like a good starting point, happy to see this merged

Collaborator @sumobrian left a comment:

Approving on @peternied's behalf since he was not on codeowners/maintainers list at the time he approved.

@chelma chelma merged commit 4bfdb0f into opensearch-project:main May 2, 2024
4 of 5 checks passed
@chelma chelma deleted the MIGRATIONS-1691 branch May 2, 2024 17:47