Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lin reads timeouts and AssignUid recursion in Zero #3203

Merged
merged 18 commits into from
Mar 27, 2019

Conversation

manishrjain
Copy link
Contributor

@manishrjain manishrjain commented Mar 25, 2019

  • The way we were doing linerizable reads was creating a new request context for every try. If the request times out, a new one was being used. When things got slower, this caused a traffic jam of new requests compounding because earlier ones timed out and retried.

  • This PR fixes that by reusing the same requestCtx, so on a retry it would still be OK to receive response from the previous tries -- this allows wrapper requests to be considered done and things to move forward.

  • AssignUids was causing a recursion where a Zero follower (A) forwarded the request to who it thought was the leader (B). But, if B was not (or no longer) the leader, it would try to forward it as well -- maybe to A or others. This caused a lot of recursive requests. This PR fixes that so a forwarded request is not forwarded any further, but outright rejected if received by a Zero follower.


This change is Reviewable

… processing the request much more smoothly because we are done as soon as we find the first activeRctx come back, instead of flooding Raft with unique requests which created a traffic jam.
@@ -217,6 +223,7 @@ func (w *RaftServer) RaftMessage(server pb.Raft_RaftMessageServer) error {
}
data := batch.Payload.Data

ctx, cancel := context.WithTimeout(ctx, time.Minute)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cancel function is not used on all paths (possible context leak) (from govet)

Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 6 files at r1, 5 of 5 files at r2.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @golangcibot and @manishrjain)


conn/raft_server.go, line 226 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

the cancel function is not used on all paths (possible context leak) (from govet)

Can you check if this warning is actionable?


dgraph/cmd/zero/assign.go, line 187 at r2 (raw file):

Forward

Maybe "Forwarded" is a better name?

Copy link
Contributor Author

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @golangcibot and @manishrjain)


conn/raft_server.go, line 226 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

the cancel function is not used on all paths (possible context leak) (from govet)

Done.

Copy link
Contributor Author

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 7 files reviewed, 2 unresolved discussions (waiting on @golangcibot and @martinmr)


dgraph/cmd/zero/assign.go, line 187 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…
Forward

Maybe "Forwarded" is a better name?

Done.

@manishrjain manishrjain merged commit 9c6c7b8 into master Mar 27, 2019
@manishrjain manishrjain deleted the mrjn/improving-zero branch March 27, 2019 02:52
manishrjain added a commit that referenced this pull request Mar 27, 2019
- The way we were doing linerizable reads was creating a new request context for every try. If the request times out, a new one was being used. When things got slower, this caused a traffic jam of new requests compounding because earlier ones timed out and retried.

- This PR fixes that by reusing the same `requestCtx`, so on a retry it would still be OK to receive response from the previous tries -- this allows wrapper requests to be considered done and things to move forward.

- AssignUids was causing a recursion where a Zero follower (A) forwarded the request to who it thought was the leader (B). But, if B was not (or no longer) the leader, it would try to forward it as well -- maybe to A or others. This caused a lot of recursive requests. This PR fixes that so a forwarded request is not forwarded any further, but outright rejected if received by a Zero follower.

Changes:
* Avoid a recursive AssignUids call where one Zero forwards to another, forwards back to first and so on.
* Reuse the activeRctx when asking for ReadIndex. This allows delays in processing the request much more smoothly because we are done as soon as we find the first activeRctx come back, instead of flooding Raft with unique requests which created a traffic jam.
* Move requestCtx to verbosity 3. Add a way to close RaftMessage stream if nothing has been sent for a while.
* Add a timeout so raft.Step does not block indefinitely.
* Add raft.Ready warning in Alpha as well. Make raft.Step error obvious.
* Add warnings in both Zero and Alpha about slow disk.
* No need to do select case on pushing to readStateCh. It is typically empty.
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
- The way we were doing linerizable reads was creating a new request context for every try. If the request times out, a new one was being used. When things got slower, this caused a traffic jam of new requests compounding because earlier ones timed out and retried.

- This PR fixes that by reusing the same `requestCtx`, so on a retry it would still be OK to receive response from the previous tries -- this allows wrapper requests to be considered done and things to move forward.

- AssignUids was causing a recursion where a Zero follower (A) forwarded the request to who it thought was the leader (B). But, if B was not (or no longer) the leader, it would try to forward it as well -- maybe to A or others. This caused a lot of recursive requests. This PR fixes that so a forwarded request is not forwarded any further, but outright rejected if received by a Zero follower.

Changes:
* Avoid a recursive AssignUids call where one Zero forwards to another, forwards back to first and so on.
* Reuse the activeRctx when asking for ReadIndex. This allows delays in processing the request much more smoothly because we are done as soon as we find the first activeRctx come back, instead of flooding Raft with unique requests which created a traffic jam.
* Move requestCtx to verbosity 3. Add a way to close RaftMessage stream if nothing has been sent for a while.
* Add a timeout so raft.Step does not block indefinitely.
* Add raft.Ready warning in Alpha as well. Make raft.Step error obvious.
* Add warnings in both Zero and Alpha about slow disk.
* No need to do select case on pushing to readStateCh. It is typically empty.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants