Refactor lookups behavior while loading/dropping the containers #14806

pranavbhole · 2023-08-14T00:25:33Z

Description

This PR is a major change to lookup container loading and dropping behavior, and it avoids lookup inconsistencies in the update container call. It also introduces a user-defined SLA for loading lookups; the loading and dropping paths honor this SLA and wait for the new container to load.

Load Container call:
Starts the new container. If the cache is loaded, drops the old container (if one exists) and starts serving from the new container.
If the cache is not loaded, waits for the container to come up, retrying up to startRetries times. If it still has not loaded, declares the new container failed and kills it, while continuing to serve from the old container.
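The load flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `SketchContainer`, `LoadContainerSketch`, and `FakeContainer` are hypothetical names, and the method names `isInitialized()` and `awaitInitialization()` only mirror the names discussed in the review.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a lookup container; not the Druid class.
interface SketchContainer {
  boolean start();
  boolean isInitialized();
  void awaitInitialization() throws InterruptedException, TimeoutException;
  void destroy();
}

// Hypothetical manager holding the container that currently serves queries.
final class LoadContainerSketch {
  private final AtomicReference<SketchContainer> serving = new AtomicReference<>();
  private final int startRetries;

  LoadContainerSketch(int startRetries) {
    this.startRetries = startRetries;
  }

  SketchContainer serving() {
    return serving.get();
  }

  // Returns true if the candidate replaced the serving container.
  boolean load(SketchContainer candidate) {
    boolean started = candidate.start();
    for (int i = 0; started && i < startRetries && !candidate.isInitialized(); i++) {
      try {
        candidate.awaitInitialization();
      } catch (TimeoutException e) {
        // Cache not loaded within this attempt; loop and try again.
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    if (!started || !candidate.isInitialized()) {
      candidate.destroy(); // failed: kill the new container, keep serving the old one
      return false;
    }
    SketchContainer old = serving.getAndSet(candidate);
    if (old != null) {
      old.destroy(); // the new container is loaded, so the old one can go
    }
    return true;
  }
}

// Test double: reports initialized after `awaitsNeeded` calls to awaitInitialization().
final class FakeContainer implements SketchContainer {
  private int awaitsLeft;
  boolean destroyed;

  FakeContainer(int awaitsNeeded) { this.awaitsLeft = awaitsNeeded; }
  @Override public boolean start() { return true; }
  @Override public boolean isInitialized() { return awaitsLeft <= 0; }
  @Override public void awaitInitialization() throws TimeoutException {
    if (--awaitsLeft > 0) throw new TimeoutException("cache not loaded yet");
  }
  @Override public void destroy() { destroyed = true; }
}
```

With `startRetries` set to 3, a `FakeContainer(2)` loads on the second attempt and replaces the old container, while a `FakeContainer(10)` is destroyed and the old container keeps serving.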

Drop Container call:
The drop call arrives from two flows: the update and delete calls for lookups. The drop notice also accepts an optional loadedContainer.
loadedContainer is only passed in the update flow. We make sure that loadedContainer has its cache loaded before dropping the existing container. If the cache is not loaded, we wait up to the given SLA for loadedContainer to load.
If the cache is loaded in loadedContainer, we go ahead and drop the existing container.
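As a rough illustration of the drop flow, the sketch below waits on an optional replacement container before dropping the existing one. The class names are hypothetical, and the description does not say what happens when the SLA expires before the replacement loads; the sketch assumes the existing container is kept in that case.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical container: the latch opens once its cache has finished loading.
final class DropSketchContainer {
  private final CountDownLatch loaded = new CountDownLatch(1);
  volatile boolean destroyed;

  void markLoaded() {
    loaded.countDown();
  }

  // Waits up to timeoutMillis; returns true if the cache loaded in time.
  boolean awaitLoaded(long timeoutMillis) {
    try {
      return loaded.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  void destroy() {
    destroyed = true;
  }
}

final class DropNoticeSketch {
  // loadedContainer is null in the delete flow and non-null in the update flow.
  // Returns true if the existing container was dropped.
  static boolean drop(DropSketchContainer existing, DropSketchContainer loadedContainer, long slaMillis) {
    if (loadedContainer != null && !loadedContainer.awaitLoaded(slaMillis)) {
      // Assumption: if the replacement is not ready within the SLA, keep the existing one.
      return false;
    }
    existing.destroy();
    return true;
  }
}
```

In the delete flow, `drop(existing, null, sla)` drops immediately; in the update flow the call blocks for at most `slaMillis` milliseconds.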

Release note

This PR has:

server/src/test/java/org/apache/druid/query/lookup/LookupReferencesManagerTest.java

@@ -256,11 +259,12 @@
    ).addChunk(strResult);
    EasyMock.expect(druidLeaderClient.go(request)).andReturn(responseHolder);
    EasyMock.replay(druidLeaderClient);
+    LookupExtractorFactoryContainer container = new LookupExtractorFactoryContainer("0", lookupExtractorFactory);


clintropolis · 2023-08-15T20:52:31Z

processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactory.java

+  /**
+   * awaitToInitialise blocks and wait for the cache to initialize fully.
+   */
+  void awaitToInitialise() throws InterruptedException, TimeoutException;


nit: I think "initialize" is the more standard US English spelling over "initialise"; I find hundreds of matches in the Java code for "initialize" and none for "initialise". Additionally, we have some other methods named awaitInitialization() on things like the broker server view, so maybe that is a good name.

clintropolis · 2023-08-15T20:54:56Z

processing/src/main/java/org/apache/druid/query/lookup/LookupExtractorFactory.java

+  /**
+   * @return true if cache is loaded and lookup is queryable else returns false
+   */
+  boolean isCacheLoaded();


to be more consistent with the name of the other method, which blocks waiting for this to be true, i recommend isInitialized or something similar

clintropolis · 2023-08-15T21:01:26Z

server/src/main/java/org/apache/druid/query/lookup/LookupReferencesManager.java

+          RetryUtils.retry(() -> {
+                             lookupExtractorFactoryContainer.getLookupExtractorFactory().awaitToInitialise();
+                             return null;
+                           }, e -> true,


i don't think this needs to be a blocker, but it would be nice if we could distinguish errors which are retryable from errors which are not. i imagine there is too much variety in the errors which could be thrown for us to do this, though, and we would probably need to standardize the error states for lookup implementations to do this effectively.
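To make the idea concrete, here is a small self-contained sketch of a retry loop whose predicate separates retryable from non-retryable errors. `RetrySketch` is a hypothetical helper, not Druid's `RetryUtils`, and treating only `TimeoutException` as retryable is an assumption chosen for illustration; real lookup implementations may throw many other error types.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

// Hypothetical helper for illustration only.
final class RetrySketch {
  // Assumed classification: a timeout is worth retrying, anything else is not.
  static final Predicate<Throwable> RETRYABLE = e -> e instanceof TimeoutException;

  // Calls the task up to maxTries times, rethrowing as soon as an error is not retryable.
  static <T> T retry(Callable<T> task, Predicate<Throwable> shouldRetry, int maxTries) throws Exception {
    if (maxTries < 1) {
      throw new IllegalArgumentException("maxTries must be at least 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return task.call();
      } catch (Exception e) {
        if (attempt >= maxTries || !shouldRetry.test(e)) {
          throw e; // out of attempts, or the error is not retryable
        }
      }
    }
  }
}
```

Compared with a predicate of `e -> true`, a non-retryable failure such as a bad configuration surfaces after one attempt instead of consuming every retry.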

clintropolis · 2023-08-15T21:05:34Z

server/src/test/java/org/apache/druid/query/lookup/LookupReferencesManagerTest.java

@@ -526,6 +530,7 @@ public void testRealModeWithMainThread() throws Exception
    LookupExtractorFactory lookupExtractorFactory = EasyMock.createMock(LookupExtractorFactory.class);
    EasyMock.expect(lookupExtractorFactory.start()).andReturn(true).once();
    EasyMock.expect(lookupExtractorFactory.destroy()).andReturn(true).once();
+    EasyMock.expect(lookupExtractorFactory.isCacheLoaded()).andReturn(true).anyTimes();


it would be nice to add some tests that simulate sad paths where the new lookup does not load to ensure proper behavior where the new container is destroyed and old container is still operational, and also tests where new doesn't initially load and so goes through the retry path but then successfully loads and the old is destroyed

clintropolis · 2023-08-24T23:59:18Z

...ed-global/src/main/java/org/apache/druid/query/lookup/namespace/JdbcExtractionNamespace.java

@@ -73,6 +75,7 @@ public JdbcExtractionNamespace(
      @JsonProperty(value = "filter") @Nullable final String filter,
      @Min(0) @JsonProperty(value = "pollPeriod") @Nullable final Period pollPeriod,
      @JsonProperty(value = "maxHeapPercentage") @Nullable final Long maxHeapPercentage,
+      @JsonProperty(value = "loadTimeout") @Nullable final Long loadTimeout,


we should update the docs if we are adding a new parameter

clintropolis · 2023-08-30T05:56:54Z

...ction-namespace/src/main/java/org/apache/druid/query/lookup/KafkaLookupExtractorFactory.java

+  @Override
+  public void awaitInitialization()
+  {
+


i guess this doesn't need await because it's continuously updated instead of swapped? if that is the case could you leave a comment? or if not the case, explain why it doesn't need to wait on stuff?

clintropolis · 2023-08-30T05:58:23Z

...lookups-cached-single/src/main/java/org/apache/druid/server/lookup/LoadingLookupFactory.java

+  @Override
+  public void awaitInitialization()
+  {
+


is there a reason the lookups-cached-single implementations don't implement these methods? I don't think it necessarily needs to be a blocker for them to be implemented for this PR if they should someday implement it, but a comment on why or why not would be nice to leave here in the code

soumyava · 2023-11-06T22:51:33Z

...cached-global/src/main/java/org/apache/druid/query/lookup/namespace/ExtractionNamespace.java

+
+  default long getLoadTimeoutMills()
+  {
+    return 60 * 1000;


QQ. Is there any reason we have kept this to be 60 sec but the one for default on Jdbc extraction namespace is 120 sec ?

Yeah, the default load time for lookups is 60 secs, which should be enough to load all non-JDBC lookups such as URI, map, and other loaders. This is overridden to 2 mins for JDBC, as it can take a little longer to load entries if they number in the millions. It is also documented so that users can customize it for their use cases with big lookups.
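The relationship between the two defaults can be sketched as below. Only `getLoadTimeoutMills()` and the 60 s and 120 s values come from this thread; the class names are hypothetical, and the sketch assumes the nullable `loadTimeout` parameter is expressed in milliseconds.

```java
// Hypothetical mirror of the namespace interface: 60 seconds unless overridden.
interface NamespaceSketch {
  default long getLoadTimeoutMills() {
    return 60 * 1000;
  }
}

// Hypothetical non-JDBC namespace that simply inherits the 60 second default.
final class UriNamespaceSketch implements NamespaceSketch {
}

// Hypothetical JDBC namespace: a null loadTimeout falls back to 120 seconds.
final class JdbcNamespaceSketch implements NamespaceSketch {
  private static final long DEFAULT_JDBC_LOAD_TIMEOUT_MILLIS = 120 * 1000;
  private final long loadTimeout;

  // Assumption: loadTimeout is in milliseconds and may be null when not configured.
  JdbcNamespaceSketch(Long loadTimeout) {
    this.loadTimeout = loadTimeout == null ? DEFAULT_JDBC_LOAD_TIMEOUT_MILLIS : loadTimeout;
  }

  @Override
  public long getLoadTimeoutMills() {
    return loadTimeout;
  }
}
```

A user with a very large JDBC lookup would pass a larger value, for example `new JdbcNamespaceSketch(300_000L)` for five minutes.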

clintropolis · 2023-11-06T23:10:15Z

...ction-namespace/src/main/java/org/apache/druid/query/lookup/KafkaLookupExtractorFactory.java

+  @Override
+  public void awaitInitialization()
+  {
+    // Kafka lookup do not need await on initialization as it is realtime kafka lookups.


would updating the topic or some other config of the kafka lookup also result in a new cache being initialized, but because we just return true here we potentially don't wait on it? I guess i'm wanting to make sure that it isn't a problem that this one wouldn't wait until it has read the topic and populated the cache for the first time, or is it not a problem for some other reason (i tried to remember how all this stuff works, but i'm still not certain 🙃 )

The first time, KafkaLookupExtractorFactory has a lifecycle, and the process waits on the started latch in the start method.
Today, on a config update, it deletes the old lookup immediately and kicks off the start lifecycle again. I wanted to keep the same behavior, as this change was mainly meant to cover the JDBC use case, hence the overridden await methods.

But you have a great point. I think we should await for Kafka too: reading from Kafka and populating the cache is asynchronous, and ideally it should await initialization as well to avoid disrupting lookup serving requests.
The new behavior will be:
The Kafka lookup will wait forever (as there is currently no definite loadTimeout for Kafka lookups) to drain all messages from the Kafka topic. Once it is fully initialized, it will drop the old container and start serving from the new one; otherwise it will continue serving from the old lookup.

The question is whether we should make this change in the same PR or not. I would prefer to do it in a follow-up PR, as it needs a bunch of unit test cases to cover the wait scenario until the topic is drained.

i think it's fine to do as a follow-up, thanks for looking into it
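For reference, the follow-up behavior proposed in this thread (block with no timeout until the topic has been drained) might look roughly like the sketch below. `KafkaAwaitSketch` and `markTopicDrained()` are hypothetical names; this is not the code in KafkaLookupExtractorFactory, and how "drained" would actually be detected is left open.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a Kafka-backed lookup; not the Druid class.
final class KafkaAwaitSketch {
  private final CountDownLatch drained = new CountDownLatch(1);

  // Assumed hook: the consumer thread calls this once it has caught up with the topic.
  void markTopicDrained() {
    drained.countDown();
  }

  boolean isInitialized() {
    return drained.getCount() == 0;
  }

  // Blocks with no timeout, matching the "wait forever" behavior proposed above.
  void awaitInitialization() throws InterruptedException {
    drained.await();
  }
}
```

Because the wait is unbounded, the old container would keep serving for as long as the new one is still reading the topic.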

pranavbhole force-pushed the refactorLookupsBehavior branch from 329e3a7 to 4b42abb Compare August 14, 2023 00:35

github-advanced-security bot found potential problems Aug 14, 2023

clintropolis reviewed Aug 15, 2023

clintropolis added the Area - Lookups label Aug 15, 2023

pranavbhole force-pushed the refactorLookupsBehavior branch 3 times, most recently from 23c2c61 to dfa7e3b Compare August 24, 2023 19:08

pranavbhole added 2 commits August 24, 2023 12:09

Refactor lookups behavior while loading/dropping the containers

383eb3d

Adding tests and addressing comments

913768a

pranavbhole force-pushed the refactorLookupsBehavior branch from dfa7e3b to 913768a Compare August 24, 2023 19:09

clintropolis reviewed Aug 30, 2023

pranavbhole added 2 commits October 31, 2023 14:46

Merge branch 'master' into refactorLookupsBehavior

57d2a19

Renaming config params and addressing comments

e6b1ce1

github-actions bot added the Area - Documentation label Nov 2, 2023

pranavbhole added 2 commits November 2, 2023 17:02

Merge branch 'master' into refactorLookupsBehavior

7a8a26b

Fixing log

b30022f

soumyava reviewed Nov 6, 2023

clintropolis reviewed Nov 7, 2023

clintropolis approved these changes Nov 7, 2023

clintropolis merged commit e2fde8c into apache:master Nov 7, 2023
82 checks passed

pranavbhole deleted the refactorLookupsBehavior branch November 7, 2023 18:11

CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023

Refactor lookups behavior while loading/dropping the containers (apac…

578c9fa

…he#14806)

LakshSingla added this to the 29.0.0 milestone Jan 29, 2024

LakshSingla mentioned this pull request Feb 13, 2024

[DRAFT] 29.0.0 release notes #15896

Closed
