
THRIFT-5233: Handle I/O timeouts in go library #2181

Merged (1 commit, Jun 18, 2020)

Conversation

@fishy (Member) commented Jun 13, 2020

Client: go

As discussed in the JIRA ticket, this commit changes how we handle I/O
timeouts in the go library.

This is a breaking change that adds context to all Read*, Write*, and
Skip functions of TProtocol, along with the compiler change to support
that, and also adds context to TStandardClient.Recv, TDeserializer,
TStruct, and a few others.

Along with the function signature changes, this commit also implements
a context cancellation check in the ReadMessageBegin implementations of
the following TProtocols:

  • TBinaryProtocol
  • TCompactProtocol
  • THeaderProtocol

In those ReadMessageBegin implementations, if the context passed in has
a deadline attached, they will keep retrying on I/O timeout errors until
the deadline passes. If the context has no deadline attached, they won't
retry I/O timeout errors and will still return on the first error.
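To illustrate the behavior described above, here is a minimal sketch of the retry pattern; readOnce and isTimeoutError are assumed helpers, and this is a simplification rather than the library's literal code:

```go
package sketch

import "context"

// readWithDeadlineRetry retries readOnce on I/O timeouts only while
// ctx has a deadline attached and that deadline has not passed yet.
func readWithDeadlineRetry(
	ctx context.Context,
	readOnce func() error, // one blocking read attempt (assumed helper)
	isTimeoutError func(error) bool, // reports I/O timeouts (assumed helper)
) error {
	for {
		err := readOnce()
		if err == nil || !isTimeoutError(err) {
			// Success, or a non-timeout error: never retry.
			return err
		}
		if _, hasDeadline := ctx.Deadline(); !hasDeadline {
			// No deadline attached: return on the first timeout.
			return err
		}
		if ctx.Err() != nil {
			// Deadline passed (or context canceled): give up.
			return err
		}
		// Deadline not reached yet: retry the timed-out read.
	}
}
```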

  • Did you create an Apache Jira ticket? (not required for trivial changes)
  • If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
  • Did you squash your changes to a single commit? (not required, but preferred)
  • Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
  • If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

@fishy fishy force-pushed the go-protocol-context branch 2 times, most recently from 6ed2cfc to e598828 on June 13, 2020 23:52
}
// For anything else, don't retry
break
}
fishy (Member Author) commented:

This for loop is how we implement the context deadline check for TBinaryProtocol, as the first read in ReadMessageBegin is ReadI32, which calls readAll.

}
// For anything else, don't retry
break
}
fishy (Member Author) commented:

This for loop is how we implement the context deadline check in TCompactProtocol.

}
// For anything else, do not retry
break
}
fishy (Member Author) commented:

This for loop is how we implement the context deadline check in THeaderProtocol, as this is the first read in ReadMessageBegin (if t.needReadFrame returned false above, then the actual ReadMessageBegin will call the underlying TBinaryProtocol.ReadMessageBegin or TCompactProtocol.ReadMessageBegin, which already handle it).

@@ -64,6 +64,10 @@ func (p *tTransportException) Unwrap() error {
return p.err
}

func (p *tTransportException) Timeout() bool {
return p.typeId == TIMED_OUT
}
fishy (Member Author) commented:

This is also newly added to make the isTimeoutError implementation easier (so it does not need to check for TTransportException and unwrap it).
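For illustration, a Timeout() method like the one above lets a helper detect timeouts through a small interface check; a minimal sketch of that idea, not necessarily the library's literal isTimeoutError code:

```go
package sketch

import "errors"

// timeoutable is satisfied by net.Error, and by the exception type
// above once it gains a Timeout() bool method.
type timeoutable interface {
	Timeout() bool
}

// isTimeoutError reports whether err is (or wraps) a timeout error,
// with no TTransportException-specific unwrapping needed.
func isTimeoutError(err error) bool {
	var t timeoutable
	if errors.As(err, &t) {
		return t.Timeout()
	}
	return false
}
```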

@fishy fishy marked this pull request as ready for review June 14, 2020 01:33
@fishy (Member Author) commented Jun 14, 2020

@dcelasun OK, the CI is "passing" now, so it's ready for review. I marked the actual logical changes above; everything else is basically just propagating the context.

@dcelasun (Member) commented:

I think you missed the client side :)

# servicestest/container_test-remote
gopath/src/servicestest/container_test-remote/container_test-remote.go:164:42: not enough arguments in call to containerStruct0.ReadField1
	have (thrift.TProtocol)
	want (context.Context, thrift.TProtocol)
gopath/src/servicestest/container_test-remote/container_test-remote.go:190:42: not enough arguments in call to containerStruct0.ReadField1
	have (thrift.TProtocol)
	want (context.Context, thrift.TProtocol)
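(These errors come from the new signatures: every generated Read*/Write* helper now takes a context as its first argument. A hedged sketch of the fix at such a call site; containerStruct0 and ReadField1 are taken from the error output, while ctx and prot are assumed local variables:

```go
// Before this change (no longer compiles):
//   err := containerStruct0.ReadField1(prot)
// After it, a context is threaded through as the first argument:
//   err := containerStruct0.ReadField1(ctx, prot)
```
)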

@fishy fishy force-pushed the go-protocol-context branch from e598828 to 4dca38e on June 15, 2020 15:48
@fishy (Member Author) commented Jun 15, 2020
fishy commented Jun 15, 2020

> I think you missed the client side :)
>
> # servicestest/container_test-remote
> gopath/src/servicestest/container_test-remote/container_test-remote.go:164:42: not enough arguments in call to containerStruct0.ReadField1
> 	have (thrift.TProtocol)
> 	want (context.Context, thrift.TProtocol)
> gopath/src/servicestest/container_test-remote/container_test-remote.go:190:42: not enough arguments in call to containerStruct0.ReadField1
> 	have (thrift.TProtocol)
> 	want (context.Context, thrift.TProtocol)

Fixed. Sorry I missed that one. We totally ignored those -remote directories in our build system as we don't use them.

@fishy fishy force-pushed the go-protocol-context branch from 4dca38e to 7e91865 on June 15, 2020 17:16
@fishy (Member Author) commented Jun 15, 2020

Just realized that I also need to fix lib/go/test/tests/ 🤦 . On it.

@fishy fishy force-pushed the go-protocol-context branch from 7e91865 to a19c240 on June 15, 2020 19:18
@fishy fishy force-pushed the go-protocol-context branch from a19c240 to e79f764 on June 15, 2020 20:28
@fishy (Member Author) commented Jun 15, 2020

> Just realized that I also need to fix lib/go/test/tests/ 🤦 . On it.

Fixed. I think that fixed all the issues, but please do let me know if I missed anything else.

@zerosnake0 commented:
From my point of view, this context argument should not be added, at least not the way it currently is.

  1. It's a breaking change.

  2. The context argument is not really being used in the current implementation, and the golang io interface does not use context either (proposal: io/v2: add Context parameter to Reader, etc. golang/go#20280).

  3. The context in the Process method represents only the context of the current processing, which should be canceled right away after processing, before continuing (which it is not).

  4. Aborting the connection should be handled correctly by the transport internally.

  5. Imagine that your server has a global context (which exists in the current implementation, but is not really being used) representing the lifecycle of the server. When it's shut down, the I/O operations should be finished gracefully instead of being interrupted instantly. The only thing that should affect the I/O operations is the I/O timeout, which should be configured inside the server or transport.

@fishy fishy deleted the go-protocol-context branch November 6, 2020 05:57
@fishy (Member Author) commented Nov 6, 2020

@zerosnake0 I replied to some of your points below. I would also suggest you read the associated JIRA ticket, which provides slightly more background about the problem this change solves.

> 1. It's a breaking change

Unlike Go 1, there's no backward compatibility guarantee in the thrift library. If you take a look at https://github.com/apache/thrift/blob/master/CHANGES.md, breaking changes are added in every release.

> 2. The context argument is not really being used in the current implementation, and the golang io interface does not use context either (golang/go#20280)

It's used here, here, and here.

@zerosnake0 commented Nov 6, 2020

> @zerosnake0 I replied to some of your points below. I would also suggest you read the associated JIRA ticket, which provides slightly more background about the problem this change solves.
>
> > 1. It's a breaking change
>
> Unlike Go 1, there's no backward compatibility guarantee in the thrift library. If you take a look at https://github.com/apache/thrift/blob/master/CHANGES.md, breaking changes are added in every release.
>
> > 2. The context argument is not really being used in the current implementation, and the golang io interface does not use context either (golang/go#20280)
>
> It's used here, here, and here.

  1. Do breaking changes only when necessary.

  2. I had already checked all usages of the context before I posted this; that's why I said "not really". Furthermore, for the latter two usages you cited, I do not think this kind of loop handling should live inside the protocol implementation.

@fishy (Member Author) commented Nov 6, 2020

Resolving a real issue sounds like a fitting case of "necessary", don't you think?

In the JIRA ticket, Can and I discussed how to implement it, including the alternative of doing it on the transport level (TTransport). Doing it on the transport level would also be a breaking change, and we chose to make the breaking change on TProtocol over TTransport.

In hindsight, yes, the retry loop fits more naturally in TTransport, but some problems with that are:

  1. As Can described in the JIRA ticket discussion, TProtocol implementations in the wild are much rarer than TTransport implementations.
  2. Adding context to TTransport.Read would make it no longer an implementation of io.Reader, which breaks a lot of other things.
  3. In the vast majority of cases this is really only needed for the first read (vs. every read).

@zerosnake0 commented:

> Resolving a real issue sounds like a fitting case of "necessary", don't you think?
>
> In the JIRA ticket, Can and I discussed how to implement it, including the alternative of doing it on the transport level (TTransport). Doing it on the transport level would also be a breaking change, and we chose to make the breaking change on TProtocol over TTransport.
>
> In hindsight, yes, the retry loop fits more naturally in TTransport, but some problems with that are:
>
> 1. As Can described in the JIRA ticket discussion, TProtocol implementations in the wild are much rarer than TTransport implementations.
> 2. Adding context to TTransport.Read would make it no longer an implementation of io.Reader, which breaks a lot of other things.
> 3. In the vast majority of cases this is really only needed for the first read (vs. every read).

When I say put the timeout in the transport, I do not mean putting a context on it.

I suggest you take a look at the golang official net/http implementation before any further discussion.

@dcelasun (Member) commented Nov 6, 2020

> I suggest you take a look at the golang official net/http implementation before any further discussion.

If I may, "go read net/http" is not a productive attitude. You are arguing against a change that was already discussed, agreed to, and merged. That's fine, it's even welcome. But the onus is on you to convince us there is a better way and that we should revert this. You didn't even address @fishy's first and second points.

@zerosnake0 commented Nov 6, 2020

> > I suggest you take a look at the golang official net/http implementation before any further discussion.
>
> If I may, "go read net/http" is not a productive attitude. You are arguing against a change that was already discussed, agreed to, and merged. That's fine, it's even welcome. But the onus is on you to convince us there is a better way and that we should revert this. You didn't even address @fishy's first and second points.

I did read the first point, and I agree with the "rarer implementations" point, so I said nothing. Same for the 3rd point.

The reason I mentioned net/http is that it handles a similar kind of issue, and I was not sure whether you knew of it. You reduced my sentence to just three words, "go read net/http", when I typed a much longer phrase. I apologize if any of my wording was off.

What's more, this commit is related to many PRs done by @fishy (for example, the one for the connectivity check), which is why I wanted to post here and discuss it with him before doing anything.

@fishy (Member Author) commented Nov 6, 2020

@zerosnake0 so my guess is that by "net/http" you mean client.go and transport.go. Yes, there are things we can learn from there, but there are also a lot of key differences between thrift's transport and the http transport:

  1. In http, a transport is a connection pool. It's not only a connection pool to a single server; it's a connection pool to all servers. It's safe for concurrent use and is multiplexed, so it can use a lot of channel selections to handle timeouts. In thrift, a TSocket is a single connection. It's not safe for concurrent use, and every I/O operation is blocking. You can't just create a new goroutine for every blocking I/O operation so that you can throw it away when a timeout happens; without centralized pool management (like in http), that would cause a lot of goroutine leaks if the socket timeout is configured too high (or not configured at all).

  2. In thrift, a TTransport is an abstraction layer. It could be remote I/O (TSocket), it could be an additional feature wrapping remote I/O (TFramedTransport + TSocket), or it could be something completely in memory (TMemoryBuffer). TProtocol implementations, in general (with certain exceptions), make no assumptions about the underlying TTransport and work in all of those situations. The actual client/server code also almost never interacts with TTransport directly; it only calls the TProtocol code. To do something similar to what the http library does, TProtocol would need to pass an additional deadline/timeout into every call to the TTransport. This can be done by either:

    a. adding ctx to every TTransport read/write function, or
    b. adding a function to set the deadline/timeout before every TTransport read/write call (see the sketch after this comment).

Both options are breaking changes to TTransport, both options were already discussed in the JIRA ticket, and the reasons we picked the currently implemented option over them can be found there as well.

If we were completely rewriting the whole thrift go library now, we would probably choose an implementation similar to the http library's over what we have right now. But we are not doing that complete rewrite, and the current implementation is the least breaking one, as far as we could tell.
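To make option (b) concrete, a hypothetical sketch follows; none of these names exist in the actual library, and this illustrates the shape of such an API rather than anything that was implemented:

```go
package sketch

import (
	"context"
	"io"
	"time"
)

// deadlineTransport is a hypothetical variant of TTransport that
// exposes a deadline setter, similar to net.Conn.SetReadDeadline.
type deadlineTransport interface {
	io.ReadWriter
	SetReadDeadline(t time.Time) error
}

// readWithCtxDeadline shows how a protocol could forward a context
// deadline to the transport before each blocking read.
func readWithCtxDeadline(ctx context.Context, trans deadlineTransport, buf []byte) (int, error) {
	if deadline, ok := ctx.Deadline(); ok {
		if err := trans.SetReadDeadline(deadline); err != nil {
			return 0, err
		}
	}
	return trans.Read(buf)
}
```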

If you have a better approach, I would love to see something more concrete (e.g. a code/PR, or a more detailed design for how to do it).

By the way, the connectivity check change happened before this change and does not depend on it.

@zerosnake0 commented Nov 7, 2020

> @zerosnake0 so my guess is that by "net/http" you mean client.go and transport.go. Yes, there are things we can learn from there, but there are also a lot of key differences between thrift's transport and the http transport:
>
> 1. In http, a transport is a connection pool. It's not only a connection pool to a single server; it's a connection pool to all servers. It's safe for concurrent use and is multiplexed, so it can use a lot of channel selections to handle timeouts. In thrift, a TSocket is a single connection. It's not safe for concurrent use, and every I/O operation is blocking. You can't just create a new goroutine for every blocking I/O operation so that you can throw it away when a timeout happens; without centralized pool management (like in http), that would cause a lot of goroutine leaks if the socket timeout is configured too high (or not configured at all).
> 2. In thrift, a TTransport is an abstraction layer. It could be remote I/O (TSocket), it could be an additional feature wrapping remote I/O (TFramedTransport + TSocket), or it could be something completely in memory (TMemoryBuffer). TProtocol implementations, in general (with certain exceptions), make no assumptions about the underlying TTransport and work in all of those situations. The actual client/server code also almost never interacts with TTransport directly; it only calls the TProtocol code. To do something similar to what the http library does, TProtocol would need to pass an additional deadline/timeout into every call to the TTransport. This can be done by either:
>     a. adding ctx to every TTransport read/write function, or
>     b. adding a function to set the deadline/timeout before every TTransport read/write call.
>
> Both options are breaking changes to TTransport, both options were already discussed in the JIRA ticket, and the reasons we picked the currently implemented option over them can be found there as well.
>
> If we were completely rewriting the whole thrift go library now, we would probably choose an implementation similar to the http library's over what we have right now. But we are not doing that complete rewrite, and the current implementation is the least breaking one, as far as we could tell.
>
> If you have a better approach, I would love to see something more concrete (e.g. a code/PR, or a more detailed design for how to do it).
>
> By the way, the connectivity check change happened before this change and does not depend on it.

Sorry I didn't specify which file to look at yesterday; I meant server.go. In the readRequest method, the approach is more like the 2.b you described. Personally I think 2.b is more reasonable, because we need to set the timeout specifically every time the I/O reads/writes the request/response; a sketch of that style follows below.

Another reason is that currently the only thing that cancels the context is a disconnection, which will surely affect all your I/O operations afterwards.

I will try to work something out and let you know.
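(For reference, the net/http server style being described arms a deadline on the connection before reading each request. A minimal sketch of that pattern, using only the standard library and assuming readTimeout is server configuration; the line-read stands in for real request parsing:

```go
package sketch

import (
	"bufio"
	"net"
	"time"
)

// readRequestWithTimeout mimics the net/http readRequest pattern:
// set a read deadline on the raw connection before the blocking
// read, so a stuck peer can't hold the read forever.
func readRequestWithTimeout(conn net.Conn, readTimeout time.Duration) (string, error) {
	if readTimeout > 0 {
		conn.SetReadDeadline(time.Now().Add(readTimeout))
	}
	r := bufio.NewReader(conn)
	// Read one line as a stand-in for parsing a request.
	return r.ReadString('\n')
}
```
)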

@fishy (Member Author) commented Nov 7, 2020

> Another reason is that currently the only thing that cancels the context is a disconnection

This is not true. The feature this change implements is mainly for clients to use (servers don't care about read timeouts unless they don't intend to support long connections/client pools). It's very common for clients to set a timeout on the context object, as in the sketch below.
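(A typical client-side pattern; client.Ping here is a hypothetical generated client method, but context.WithTimeout is the standard mechanism:

```go
package sketch

import (
	"context"
	"time"
)

// callWithTimeout shows the common client idiom: attach a deadline
// to the context, so the protocol's ReadMessageBegin can retry
// I/O timeouts until that deadline passes.
func callWithTimeout(client interface {
	Ping(ctx context.Context) error // hypothetical generated method
}) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel() // always release the timer
	return client.Ping(ctx)
}
```
)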

@taraspos commented Mar 4, 2021

@fishy I have a couple of clarifying questions about the expected behavior of the timeouts. Sorry if I misunderstood something.

  1. The logic handles only the retries? What about canceling reads that take longer than the deadline?
    For example:
    • If I set a context deadline of 1 second and an i/o timeout appears before that, the read will be retried. Correct?
    • If the context deadline is set to 1 second, but the read takes 5 seconds, it will not be canceled after 1 second and will not be retried later. Correct?
  2. Is the logic applied only to ReadMessageBegin? What about Read and ReadMessageEnd?
    For example:
    • The context deadline is set to 5 seconds and an i/o timeout happens after 1 second of ReadMessageBegin, so it will be retried. Right?
    • The context deadline is set to 5 seconds, and ReadMessageBegin succeeds in 1 second, but a further Read fails with an i/o timeout 1 more second later (so 3 seconds left until the deadline). Will it be retried?
  3. This logic was added to TBinaryProtocol, TCompactProtocol, and THeaderProtocol, but not to TJSONProtocol, right?

Thanks!

@fishy (Member Author) commented Mar 4, 2021

@trane9991

  1. Yes for both.

  2. This is because of the way the library/framework works: you would only get i/o timeouts on the first read (ReadMessageBegin). For client-side writes (server-side reads), if you look at TStandardClient's implementation, we always call WriteMessageBegin, Write, WriteMessageEnd, and Flush without delays between them:

    ```go
    if err := oprot.WriteMessageBegin(ctx, method, CALL, seqId); err != nil {
        return err
    }
    if err := args.Write(ctx, oprot); err != nil {
        return err
    }
    if err := oprot.WriteMessageEnd(ctx); err != nil {
        return err
    }
    return oprot.Flush(ctx)
    ```

    For server-side writes (client-side reads), it's the same in the compiler-generated Process functions. The only "delay" that could cause i/o timeouts is the delay before calling WriteMessageBegin; for server writes, that delay is just the actual handler handling the request and generating the response. I described that in the jira ticket.

  3. Right.
