Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Cosmos] split proof queries sync client #22237

Merged
merged 84 commits into from
Jan 25, 2022
Merged

[Cosmos] split proof queries sync client #22237

merged 84 commits into from
Jan 25, 2022

Conversation

simorenoh
Copy link
Member

@simorenoh simorenoh commented Dec 20, 2021

Addressing the current gap in the cosmos sdk not being split proof

The gone_retry_policy is used exclusively to refresh the partition key range cache and raise the exception immediately in order to repair the document producers and not waste time retrying several times.

One note on these changes is that I still need to remove certain print statements and comments but want to get input first so I can keep logging the right things in case some further iteration is needed.

I also left the testing file that I was using (test_xq.py) in order to walk through the process for triggering a partition split that I was using, but I will remove this file once the changes are approved.

simorenoh and others added 30 commits August 13, 2021 13:14
* Removed some stuff

* Looking at constructors

* Updated request

* Added client close

* working client creation

Co-authored-by: simorenoh <[email protected]>
database read works, but ignored exception is returned:
Fatal error on SSL transport
NoneType has no attribute 'send' (_loop._proactor.send)
RuntimeError: Event loop is closed
Unclosed connector/ connection
Added methods needed to use async with when initializing client, but logs output "Exception ignored... Runtime Error: Event loop is closed"
missing upsert and other resources
missing read_all_items and both query methods for container class
Complete item CRUD functionality - only missing queries
Query functionality and container query methods
also fixed README and added examples_async
# return False to raise error to multi_execution_aggregator and repair document producer
self.client.refresh_routing_map_provider()
return False
if self.current_retry_attempt_count < self._max_retry_attempt_count:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we only checking the current retry attempt count in this if condition, and not when updating the partition key range cache.
Let's say if we keep getting gone exception, this if condition will never hit, and we will never get out of this retry policy.

Copy link
Member Author

@simorenoh simorenoh Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was returning a False immediately for sub status 1002 in order to further handle the exception on a first-try basis, since we need the exception to bubble up in order to repair the execution context.

I think this is the way to go, but I see what you mean. I'll add a flag that only returns False pre-emptively if it's the first time we run into a 1002 with that instance of the gone retry policy, and if it's not the first time then continues with the normal attempts.

@@ -129,6 +136,32 @@ def fetch_next_block(self):

raise NotImplementedError("You should use pipeline's fetch_next_block.")

def _repair_document_producer(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add some documentation to this function, like what it does and how it repairs the document producers.

@@ -0,0 +1,82 @@
import time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have we added this test in samples ?
Also, I see license header is missing for this one and some other samples, we should add it.

Copy link
Member Author

@simorenoh simorenoh Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this in the PR description, but I'm just keeping this here for now while I get approvals and will remove afterwards.
Edit: I have now added a test to follow the way the Java SDK tests for a partition split.

@check-enforcer
Copy link

This pull request is protected by Check Enforcer.

What is Check Enforcer?

Check Enforcer helps ensure all pull requests are covered by at least one check-run (typically an Azure Pipeline). When all check-runs associated with this pull request pass then Check Enforcer itself will pass.

Why am I getting this message?

You are getting this message because Check Enforcer did not detect any check-runs being associated with this pull request within five minutes. This may indicate that your pull request is not covered by any pipelines and so Check Enforcer is correctly blocking the pull request being merged.

What should I do now?

If the check-enforcer check-run is not passing and all other check-runs associated with this PR are passing (excluding license-cla) then you could try telling Check Enforcer to evaluate your pull request again. You can do this by adding a comment to this pull request as follows:
/check-enforcer evaluate
Typically evaulation only takes a few seconds. If you know that your pull request is not covered by a pipeline and this is expected you can override Check Enforcer using the following command:
/check-enforcer override
Note that using the override command triggers alerts so that follow-up investigations can occur (PRs still need to be approved as normal).

What if I am onboarding a new service?

Often, new services do not have validation pipelines associated with them, in order to bootstrap pipelines for a new service, you can issue the following command as a pull request comment:
/azp run prepare-pipelines
This will run a pipeline that analyzes the source tree and creates the pipelines necessary to build and validate your pull request. Once the pipeline has been created you can trigger the pipeline using the following comment:
/azp run python - [service] - ci

@simorenoh simorenoh requested review from kushagraThapar and removed request for xinlian12 January 19, 2022 18:10
Copy link
Member

@simplynaveen20 simplynaveen20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please have a look to the comments I added , apart from that LGTM

self.client.refresh_routing_map_provider()
self.refresh_partition_key_range_cache = False
return False
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we want to retry after one attempt without refreshing the cache . PartitionKeyRangeGoneRetryPolicy in V4 Java SDK only retrying one time https://github.com/Azure/azure-sdk-for-java/blob/c4c068cc0d2061df9a4fcc49419acc10be223d4a/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/PartitionKeyRangeGoneRetryPolicy.java#L60

print("created items, waiting 4m before queries")
print("change offer to 11k manually")
print("--------------------------------")
time.sleep(240)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from .http_constants import HttpHeaders, StatusCodes, SubStatusCodes


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove extra line

@@ -32,8 +32,10 @@
from . import _resource_throttle_retry_policy
from . import _default_retry_policy
from . import _session_retry_policy
from . import _gone_retry_policy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: usually the import statements should be sorted in java & SDK, do we need to sort in Python as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but this is something that I need to do probably in every file, will take care of all of it in one PR later

Copy link
Member

@xinlian12 xinlian12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix 👍

@simorenoh simorenoh merged commit 18981ce into Azure:main Jan 25, 2022
@simorenoh simorenoh deleted the split-proof-query branch January 25, 2022 18:21
rakshith91 pushed a commit to rakshith91/azure-sdk-for-python that referenced this pull request Apr 10, 2022
* initial commit

* Client Constructor (Azure#20310)

* Removed some stuff

* Looking at constructors

* Updated request

* Added client close

* working client creation

Co-authored-by: simorenoh <[email protected]>

* read database

database read works, but ignored exception is returned:
Fatal error on SSL transport
NoneType has no attribute 'send' (_loop._proactor.send)
RuntimeError: Event loop is closed
Unclosed connector/ connection

* Update simon_testfile.py

* with coroutine

Added methods needed to use async with when initializing client, but logs output "Exception ignored... Runtime Error: Event loop is closed"

* Update simon_testfile.py

* small changes

* async with returns no exceptions

* async read container

* async item read

* cleaning up

* create item/ database methods

* item delete working

* docs replace functionality

missing upsert and other resources

* upsert functionality

missing read_all_items and both query methods for container class

* missing query methods

* CRUD for udf, sproc, triggers

* initial query logic + container methods

* missing some execution logic and tests

* oops

* fully working queries

* small fix to query_items()

also fixed README and added examples_async

* Update _cosmos_client_connection_async.py

* Update _cosmos_client_connection.py

* documentation update

* updated MIT dates and get_user_client() description

* Update CHANGELOG.md

* Delete simon_testfile.py

* leftover retry utility

* Update README.md

* docs and removed six package

* changes based on comments

still missing discussion resolution on SSL verification and tests for async functionality under test module (apart from samples which are basically end to end tests)

* small change in type hints

* updated readme

* fixes based on conversations

* added missing type comments

* update changelog for ci pipeline

* added typehints, moved params into keywords, added decorators, made _connection_policy private

* changes based on sync with central sdk

* remove is_system_key from scripts (only used in execute_sproc)

is_system_key verifies that an empty partition key is properly dealt with if ['partitionKey']['systemKey'] exists in the container options - however, we do not allow containers to be created with empty partition key values in the python sdk, so the functionality is needless

* Revert "remove is_system_key from scripts (only used in execute_sproc)"

Reverting last commit, will find way to init is_system_key for now

* async script proxy using composition

* pylint

* capitalized constants

* Apply suggestions from code review

Clarifying comments for README

Co-authored-by: Gahl Levy <[email protected]>

* closing python code snippet

* last doc updates

* Update sdk/cosmos/azure-cosmos/CHANGELOG.md

Co-authored-by: Simon Moreno <[email protected]>

* version update

* cosmos updates for release

* current state

gone_retry_policy might end up being unneccesary, based on what we feel is best from an arch standpoint

* working split proof, need to remove prints

* improving comments and removing print statements

* removed last prints and used constants

* Update CHANGELOG.md

* small fixes based on comments

* addressed more comments

* added test, made slight changes

* rename test and small changes

* pylint

* pylintpylintpylint

* moved partition_range_gone check to exceptions since makes more sense

* re use code

Co-authored-by: annatisch <[email protected]>
Co-authored-by: Gahl Levy <[email protected]>
Co-authored-by: Travis Prescott <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants