Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure cosmos diagnostics memory leak fix #28343

Conversation

kushagraThapar
Copy link
Member

@kushagraThapar kushagraThapar commented Apr 15, 2022

Problems ->

  • Current version of CosmosDiagnostics contains reference to storeResponseStatistics which contains reference to StoreResponse which contains CosmosDiagnostics hence creating a circular dependency in CosmosDiagnostics
  • Current version of CosmosDiagnostics contains reference to StoreResponse - with full byte[] content, never freeing up those byte[] content even after the response has been received from the service.
  • This increases memory consumption in case of retries from SDK, since the request is maintained throughout retries, containing all the storeResponse with full byte[] of contents.
  • Current version of CosmosDiagnostics contains reference to GlobalEndpointManager which contains reference to some caches, schedulers, ExecutorService, and various other things, which makes it difficult for JVM to collect garbage diagnostics or responses.
  • Current version of CosmosException refers to CosmosDiagnostics, which refers to DiagnosticsClientContext which is actually a reference to RxDocumentClientImpl - which exposes all information include caches, Store Models, and what not.

Solution ->

  • Removed CosmosDiagnostics from StoreResponse as it is not required there.
  • For CosmosResponse types, moved CosmosDiagnostics to RxDocumentServiceResponse initializing during creation of response.
  • Created serializable versions of StoreResponse, StoreResult, and CosmosException, and performing a deep copy with only instance members required for diagnostics.
  • Removed GlobalEndpointManager reference from CosmosDiagnostics and just passed it around for capturing regionsContacted
  • Only exposed DiagnosticsClientConfig in CosmosDiagnostics - which is only for configs serialization.

@azure-sdk
Copy link
Collaborator

API change check for com.azure:azure-cosmos

API changes have been detected in com.azure:azure-cosmos. You can review API changes here

API changes

+         @Warning public static CosmosDiagnostics createCosmosDiagnostics(DiagnosticsClientContext diagnosticsClientContext) 
-         @Warning public static CosmosDiagnostics createCosmosDiagnostics(DiagnosticsClientContext diagnosticsClientContext, GlobalEndpointManager globalEndpointManager) 
-         @Warning public static void recordGatewayResponse(CosmosDiagnostics cosmosDiagnostics, RxDocumentServiceRequest rxDocumentServiceRequest, StoreResponse storeResponse, CosmosException exception) 
-         @Warning public static void recordResponse(CosmosDiagnostics cosmosDiagnostics, RxDocumentServiceRequest request, StoreResult storeResult) 
+         @Warning public static void recordGatewayResponse(CosmosDiagnostics cosmosDiagnostics, RxDocumentServiceRequest rxDocumentServiceRequest, StoreResponse storeResponse, CosmosException exception, GlobalEndpointManager globalEndpointManager) 
+         @Warning public static void recordResponse(CosmosDiagnostics cosmosDiagnostics, RxDocumentServiceRequest request, StoreResult storeResult, GlobalEndpointManager globalEndpointManager) 

@kushagraThapar
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kushagraThapar
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kushagraThapar
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Netyyyy
Copy link
Member

Netyyyy commented Apr 20, 2022

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Pull request contains merge conflicts.

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -15,4 +16,6 @@ Mono<AddressInformation[]> resolveAsync(
boolean forceRefreshPartitionAddresses);

int updateAddresses(URI serverKey);

GlobalEndpointManager getGlobalEndpointManager();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss when you are in the office - I don't quite get why we would need to expose the GlobalEndpointManager here? There should only be one instance of it - why keep the reference in the AddressResolver?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, worth discussing.

@@ -33,7 +32,7 @@ public static HttpClient getOrCreateInstance(HttpClientConfig httpClientConfig,
}

counter.incrementAndGet();
diagnosticsClientConfig.withGatewayHttpClientConfig(sharedGatewayHttpClient.effectiveHttpClientConfig);
diagnosticsClientConfig.withGatewayHttpClientConfig(httpClientConfig.toDiagnosticsString());
Copy link
Member

@xinlian12 xinlian12 Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use sharedGatewayHttpClient.effectiveHttpClientConfig.toDiagnosticsString()

Any possibility that we are using a different httpClientConfig through out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the effectiveHttpClientConfig from SharedGatewayHttpClient class as it was referencing the same thing. Worth taking another look.

jsonGenerator.flush();
ObjectNode objectNode = (ObjectNode) objectMapper.readTree(jsonWriter.toString());

assertThat(objectNode.get("id").asInt()).isEqualTo(1);
assertThat(objectNode.get("machineId").asText()).isEqualTo(machineId);
assertThat(objectNode.get("numberOfClients").asInt()).isEqualTo(2);
assertThat(objectNode.get("consistencyCfg").asText()).isEqualTo("(consistency: null, mm: false, prgns: [])");
assertThat(objectNode.get("consistencyCfg").asText()).isEqualTo("(consistency: null, mm: false, prgns: [null])");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the change here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will discuss offline, had to change the default preferredRegions to null string if empty.

Copy link
Member

@xinlian12 xinlian12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@Netyyyy
Copy link
Member

Netyyyy commented Apr 21, 2022

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@aayush3011
Copy link
Member

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog entries for azure-cosmos and both spark versions are missing - otherwise LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants