
THRIFT-5233: Handle I/O timeouts in go library #2181

Merged (1 commit, Jun 18, 2020)

Conversation

@fishy (Member) commented Jun 13, 2020

Client: go

As discussed in the JIRA ticket, this commit changes how we handle I/O
timeouts in the go library.

This is a breaking change that adds context to all Read*, Write*, and
Skip functions of TProtocol, along with the compiler change to support
that, and also adds context to TStandardClient.Recv, TDeserializer,
TStruct, and a few others.

Along with the function signature changes, this commit also implements
a context cancellation check in the ReadMessageBegin implementations of
the following TProtocols:

  • TBinaryProtocol
  • TCompactProtocol
  • THeaderProtocol

In those ReadMessageBegin implementations, if the context passed in has
a deadline attached, they will keep retrying on I/O timeout errors until
the deadline passes. If the context has no deadline attached, they won't
retry I/O timeout errors and will still return on the first error.
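To illustrate the behavior described above, here is a minimal sketch of the retry pattern; readOnce and isTimeoutError are assumed helpers, and this is a simplification rather than the library's literal code:

```go
package sketch

import "context"

// readWithDeadlineRetry retries readOnce on I/O timeouts only while
// ctx has a deadline attached and that deadline has not passed yet.
func readWithDeadlineRetry(
	ctx context.Context,
	readOnce func() error, // one blocking read attempt (assumed helper)
	isTimeoutError func(error) bool, // reports I/O timeouts (assumed helper)
) error {
	for {
		err := readOnce()
		if err == nil || !isTimeoutError(err) {
			// Success, or a non-timeout error: never retry.
			return err
		}
		if _, hasDeadline := ctx.Deadline(); !hasDeadline {
			// No deadline attached: return on the first timeout.
			return err
		}
		if ctx.Err() != nil {
			// Deadline passed (or context canceled): give up.
			return err
		}
		// Deadline not reached yet: retry the timed-out read.
	}
}
```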

  • Did you create an Apache Jira ticket? (not required for trivial changes)
  • If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
  • Did you squash your changes to a single commit? (not required, but preferred)
  • Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
  • If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

@fishy fishy force-pushed the go-protocol-context branch 2 times, most recently from 6ed2cfc to e598828 on June 13, 2020 23:52
}
// For anything else, don't retry
break
}
fishy (Member Author) commented:

This for loop is how we implement the context deadline check for TBinaryProtocol, as the first read in ReadMessageBegin is ReadI32, which calls readAll.

}
// For anything else, don't retry
break
}
fishy (Member Author) commented:

This for loop is how we implement the context deadline check in TCompactProtocol.

}
// For anything else, do not retry
break
}
fishy (Member Author) commented:

This for loop is how we implement the context deadline check in THeaderProtocol, as this is the first read in ReadMessageBegin (if t.needReadFrame returned false above, then the actual ReadMessageBegin will call the underlying TBinaryProtocol.ReadMessageBegin or TCompactProtocol.ReadMessageBegin, which already handle it).

@@ -64,6 +64,10 @@ func (p *tTransportException) Unwrap() error {
return p.err
}

func (p *tTransportException) Timeout() bool {
return p.typeId == TIMED_OUT
}
fishy (Member Author) commented:

This is also newly added to make the isTimeoutError implementation easier (so it does not need to check for TTransportException and unwrap it).
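For illustration, a Timeout() method like the one above lets a helper detect timeouts through a small interface check; a minimal sketch of that idea, not necessarily the library's literal isTimeoutError code:

```go
package sketch

import "errors"

// timeoutable is satisfied by net.Error, and by the exception type
// above once it gains a Timeout() bool method.
type timeoutable interface {
	Timeout() bool
}

// isTimeoutError reports whether err is (or wraps) a timeout error,
// with no TTransportException-specific unwrapping needed.
func isTimeoutError(err error) bool {
	var t timeoutable
	if errors.As(err, &t) {
		return t.Timeout()
	}
	return false
}
```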

@fishy fishy marked this pull request as ready for review June 14, 2020 01:33
@fishy (Member Author) commented Jun 14, 2020

@dcelasun OK, the CI is "passing" now, so it's ready for review. I marked the actual logical changes above; everything else is basically just propagating the context.

@dcelasun (Member) commented:

I think you missed the client side :)

# servicestest/container_test-remote
gopath/src/servicestest/container_test-remote/container_test-remote.go:164:42: not enough arguments in call to containerStruct0.ReadField1
	have (thrift.TProtocol)
	want (context.Context, thrift.TProtocol)
gopath/src/servicestest/container_test-remote/container_test-remote.go:190:42: not enough arguments in call to containerStruct0.ReadField1
	have (thrift.TProtocol)
	want (context.Context, thrift.TProtocol)
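(These errors come from the new signatures: every generated Read*/Write* helper now takes a context as its first argument. A hedged sketch of the fix at such a call site; containerStruct0 and ReadField1 are taken from the error output, while ctx and prot are assumed local variables:

```go
// Before this change (no longer compiles):
//   err := containerStruct0.ReadField1(prot)
// After it, a context is threaded through as the first argument:
//   err := containerStruct0.ReadField1(ctx, prot)
```
)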

@fishy fishy force-pushed the go-protocol-context branch from e598828 to 4dca38e on June 15, 2020 15:48
@fishy (Member Author) commented Jun 15, 2020
fishy commented Jun 15, 2020

> I think you missed the client side :)
>
> # servicestest/container_test-remote
> gopath/src/servicestest/container_test-remote/container_test-remote.go:164:42: not enough arguments in call to containerStruct0.ReadField1
> 	have (thrift.TProtocol)
> 	want (context.Context, thrift.TProtocol)
> gopath/src/servicestest/container_test-remote/container_test-remote.go:190:42: not enough arguments in call to containerStruct0.ReadField1
> 	have (thrift.TProtocol)
> 	want (context.Context, thrift.TProtocol)

Fixed. Sorry I missed that one. We totally ignored those -remote directories in our build system as we don't use them.

@fishy fishy force-pushed the go-protocol-context branch from 4dca38e to 7e91865 on June 15, 2020 17:16
@fishy (Member Author) commented Jun 15, 2020

Just realized that I also need to fix lib/go/test/tests/ 🤦 . On it.

@fishy fishy force-pushed the go-protocol-context branch from 7e91865 to a19c240 on June 15, 2020 19:18
@fishy fishy force-pushed the go-protocol-context branch from a19c240 to e79f764 on June 15, 2020 20:28
@fishy (Member Author) commented Jun 15, 2020

> Just realized that I also need to fix lib/go/test/tests/ 🤦 . On it.

Fixed. I think that fixed all the issues, but please do let me know if I missed anything else.

@zerosnake0 commented:
From my point of view, this context argument should not be added, at least not the way it currently is.

  1. It's a breaking change.

  2. The context argument is not really being used in the current implementation, and the golang io interface does not use context either (proposal: io/v2: add Context parameter to Reader, etc. golang/go#20280).

  3. The context in the Process method represents only the context of the current processing, which should be canceled right away after processing, before continuing (which it is not).

  4. Aborting the connection should be handled correctly by the transport internally.

  5. Imagine that your server has a global context (which exists in the current implementation, but is not really being used) representing the lifecycle of the server. When it's shut down, the I/O operations should be finished gracefully instead of being interrupted instantly. The only thing that should affect the I/O operations is the I/O timeout, which should be configured inside the server or transport.

@fishy fishy deleted the go-protocol-context branch November 6, 2020 05:57
@fishy (Member Author) commented Nov 6, 2020

@zerosnake0 I replied to some of your points below. I would also suggest you read the associated JIRA ticket, which provides slightly more background about the problem this change solves.

> 1. It's a breaking change

Unlike Go 1, there's no backward compatibility guarantee in the thrift library. If you take a look at https://github.com/apache/thrift/blob/master/CHANGES.md, breaking changes are added in every release.

> 2. The context argument is not really being used in the current implementation, and the golang io interface does not use context either (golang/go#20280)

It's used here, here, and here.

@zerosnake0 commented Nov 6, 2020

> @zerosnake0 I replied to some of your points below. I would also suggest you read the associated JIRA ticket, which provides slightly more background about the problem this change solves.
>
> > 1. It's a breaking change
>
> Unlike Go 1, there's no backward compatibility guarantee in the thrift library. If you take a look at https://github.com/apache/thrift/blob/master/CHANGES.md, breaking changes are added in every release.
>
> > 2. The context argument is not really being used in the current implementation, and the golang io interface does not use context either (golang/go#20280)
>
> It's used here, here, and here.

  1. Do breaking changes only when necessary.

  2. I had already checked all usages of the context before I posted this; that's why I said "not really". Furthermore, for the latter two usages you cited, I do not think this kind of loop handling should live inside the protocol implementation.

@fishy (Member Author) commented Nov 6, 2020

Resolving a real issue sounds like a fitting case of "necessary", don't you think?

In the JIRA ticket, Can and I discussed how to implement it, including the alternative of doing it on the transport level (TTransport). Doing it on the transport level would also be a breaking change, and we chose to make the breaking change on TProtocol over TTransport.

In hindsight, yes, the retry loop fits more naturally in TTransport, but some problems with that are:

  1. As Can described in the JIRA ticket discussion, TProtocol implementations in the wild are much rarer than TTransport implementations.
  2. Adding context to TTransport.Read would make it no longer an implementation of io.Reader, which breaks a lot of other things.
  3. In the vast majority of cases this is really only needed for the first read (vs. every read).

@zerosnake0 commented:

> Resolving a real issue sounds like a fitting case of "necessary", don't you think?
>
> In the JIRA ticket, Can and I discussed how to implement it, including the alternative of doing it on the transport level (TTransport). Doing it on the transport level would also be a breaking change, and we chose to make the breaking change on TProtocol over TTransport.
>
> In hindsight, yes, the retry loop fits more naturally in TTransport, but some problems with that are:
>
> 1. As Can described in the JIRA ticket discussion, TProtocol implementations in the wild are much rarer than TTransport implementations.
> 2. Adding context to TTransport.Read would make it no longer an implementation of io.Reader, which breaks a lot of other things.
> 3. In the vast majority of cases this is really only needed for the first read (vs. every read).

When I say put the timeout in the transport, I do not mean putting a context on it.

I suggest you take a look at the golang official net/http implementation before any further discussion.

@dcelasun (Member) commented Nov 6, 2020

> I suggest you take a look at the golang official net/http implementation before any further discussion.

If I may, "go read net/http" is not a productive attitude. You are arguing against a change that was already discussed, agreed to, and merged. That's fine, it's even welcome. But the onus is on you to convince us there is a better way and that we should revert this. You didn't even address @fishy's first and second points.

@zerosnake0 commented Nov 6, 2020

> > I suggest you take a look at the golang official net/http implementation before any further discussion.
>
> If I may, "go read net/http" is not a productive attitude. You are arguing against a change that was already discussed, agreed to, and merged. That's fine, it's even welcome. But the onus is on you to convince us there is a better way and that we should revert this. You didn't even address @fishy's first and second points.

I did read the first point, and I agree with the "rarer implementations" point, so I said nothing. Same for the 3rd point.

The reason I mentioned net/http is that it handles a similar kind of issue, and I was not sure whether you knew of it. You reduced my sentence to just three words, "go read net/http", when I typed a much longer phrase. I apologize if any of my wording was off.

What's more, this commit is related to many PRs done by @fishy (for example, the one for the connectivity check), which is why I wanted to post here and discuss it with him before doing anything.

@fishy (Member Author) commented Nov 6, 2020

@zerosnake0 so my guess is that by "net/http" you mean client.go and transport.go. Yes, there are things we can learn from there, but there are also a lot of key differences between thrift's transport and the http transport:

  1. In http, a transport is a connection pool. It's not only a connection pool to a single server; it's a connection pool to all servers. It's safe for concurrent use and is multiplexed, so it can use a lot of channel selections to handle timeouts. In thrift, a TSocket is a single connection. It's not safe for concurrent use, and every I/O operation is blocking. You can't just create a new goroutine for every blocking I/O operation so that you can throw it away when a timeout happens; without centralized pool management (like in http), that would cause a lot of goroutine leaks if the socket timeout is configured too high (or not configured at all).

  2. In thrift, a TTransport is an abstraction layer. It could be remote I/O (TSocket), it could be an additional feature wrapping remote I/O (TFramedTransport + TSocket), or it could be something completely in memory (TMemoryBuffer). TProtocol implementations, in general (with certain exceptions), make no assumptions about the underlying TTransport and work in all of those situations. The actual client/server code also almost never interacts with TTransport directly; it only calls the TProtocol code. To do something similar to what the http library does, TProtocol would need to pass an additional deadline/timeout into every call to the TTransport. This can be done by either:

    a. adding ctx to every TTransport read/write function, or
    b. adding a function to set the deadline/timeout before every TTransport read/write call (see the sketch after this comment).

Both options are breaking changes to TTransport, both options were already discussed in the JIRA ticket, and the reasons we picked the currently implemented option over them can be found there as well.

If we were completely rewriting the whole thrift go library now, we would probably choose an implementation similar to the http library's over what we have right now. But we are not doing that complete rewrite, and the current implementation is the least breaking one, as far as we could tell.
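To make option (b) concrete, a hypothetical sketch follows; none of these names exist in the actual library, and this illustrates the shape of such an API rather than anything that was implemented:

```go
package sketch

import (
	"context"
	"io"
	"time"
)

// deadlineTransport is a hypothetical variant of TTransport that
// exposes a deadline setter, similar to net.Conn.SetReadDeadline.
type deadlineTransport interface {
	io.ReadWriter
	SetReadDeadline(t time.Time) error
}

// readWithCtxDeadline shows how a protocol could forward a context
// deadline to the transport before each blocking read.
func readWithCtxDeadline(ctx context.Context, trans deadlineTransport, buf []byte) (int, error) {
	if deadline, ok := ctx.Deadline(); ok {
		if err := trans.SetReadDeadline(deadline); err != nil {
			return 0, err
		}
	}
	return trans.Read(buf)
}
```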

If you have a better approach, I would love to see something more concrete (e.g. a code/PR, or a more detailed design for how to do it).

By the way, the connectivity check change happened before this change and does not depend on it.

@zerosnake0 commented Nov 7, 2020

> @zerosnake0 so my guess is that by "net/http" you mean client.go and transport.go. Yes, there are things we can learn from there, but there are also a lot of key differences between thrift's transport and the http transport:
>
> 1. In http, a transport is a connection pool. It's not only a connection pool to a single server; it's a connection pool to all servers. It's safe for concurrent use and is multiplexed, so it can use a lot of channel selections to handle timeouts. In thrift, a TSocket is a single connection. It's not safe for concurrent use, and every I/O operation is blocking. You can't just create a new goroutine for every blocking I/O operation so that you can throw it away when a timeout happens; without centralized pool management (like in http), that would cause a lot of goroutine leaks if the socket timeout is configured too high (or not configured at all).
> 2. In thrift, a TTransport is an abstraction layer. It could be remote I/O (TSocket), it could be an additional feature wrapping remote I/O (TFramedTransport + TSocket), or it could be something completely in memory (TMemoryBuffer). TProtocol implementations, in general (with certain exceptions), make no assumptions about the underlying TTransport and work in all of those situations. The actual client/server code also almost never interacts with TTransport directly; it only calls the TProtocol code. To do something similar to what the http library does, TProtocol would need to pass an additional deadline/timeout into every call to the TTransport. This can be done by either:
>     a. adding ctx to every TTransport read/write function, or
>     b. adding a function to set the deadline/timeout before every TTransport read/write call.
>
> Both options are breaking changes to TTransport, both options were already discussed in the JIRA ticket, and the reasons we picked the currently implemented option over them can be found there as well.
>
> If we were completely rewriting the whole thrift go library now, we would probably choose an implementation similar to the http library's over what we have right now. But we are not doing that complete rewrite, and the current implementation is the least breaking one, as far as we could tell.
>
> If you have a better approach, I would love to see something more concrete (e.g. a code/PR, or a more detailed design for how to do it).
>
> By the way, the connectivity check change happened before this change and does not depend on it.

Sorry I didn't specify which file to look at yesterday; I meant server.go. In the readRequest method, the approach is more like the 2.b you described. Personally I think 2.b is more reasonable, because we need to set the timeout specifically every time the I/O reads/writes the request/response; a sketch of that style follows below.

Another reason is that currently the only thing that cancels the context is a disconnection, which will surely affect all your I/O operations afterwards.

I will try to work something out and let you know.
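(For reference, the net/http server style being described arms a deadline on the connection before reading each request. A minimal sketch of that pattern, using only the standard library and assuming readTimeout is server configuration; the line-read stands in for real request parsing:

```go
package sketch

import (
	"bufio"
	"net"
	"time"
)

// readRequestWithTimeout mimics the net/http readRequest pattern:
// set a read deadline on the raw connection before the blocking
// read, so a stuck peer can't hold the read forever.
func readRequestWithTimeout(conn net.Conn, readTimeout time.Duration) (string, error) {
	if readTimeout > 0 {
		conn.SetReadDeadline(time.Now().Add(readTimeout))
	}
	r := bufio.NewReader(conn)
	// Read one line as a stand-in for parsing a request.
	return r.ReadString('\n')
}
```
)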

@fishy (Member Author) commented Nov 7, 2020

> Another reason is that currently the only thing that cancels the context is a disconnection

This is not true. The feature this change implements is mainly for clients to use (servers don't care about read timeouts unless they don't intend to support long connections/client pools). It's very common for clients to set a timeout on the context object, as in the sketch below.
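(A typical client-side pattern; client.Ping here is a hypothetical generated client method, but context.WithTimeout is the standard mechanism:

```go
package sketch

import (
	"context"
	"time"
)

// callWithTimeout shows the common client idiom: attach a deadline
// to the context, so the protocol's ReadMessageBegin can retry
// I/O timeouts until that deadline passes.
func callWithTimeout(client interface {
	Ping(ctx context.Context) error // hypothetical generated method
}) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel() // always release the timer
	return client.Ping(ctx)
}
```
)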

@taraspos commented Mar 4, 2021

@fishy I have a couple of clarifying questions about the expected behavior of the timeouts. Sorry if I misunderstood something.

  1. The logic handles only the retries? What about canceling reads that take longer than the deadline?
    For example:
    • If I set a context deadline of 1 second and an i/o timeout appears before that, the read will be retried. Correct?
    • If the context deadline is set to 1 second, but the read takes 5 seconds, it will not be canceled after 1 second and will not be retried later. Correct?
  2. Is the logic applied only to ReadMessageBegin? What about Read and ReadMessageEnd?
    For example:
    • The context deadline is set to 5 seconds and an i/o timeout happens after 1 second of ReadMessageBegin, so it will be retried. Right?
    • The context deadline is set to 5 seconds, and ReadMessageBegin succeeds in 1 second, but a further Read fails with an i/o timeout 1 more second later (so 3 seconds left until the deadline). Will it be retried?
  3. This logic was added to TBinaryProtocol, TCompactProtocol, and THeaderProtocol, but not to TJSONProtocol, right?

Thanks!

@fishy (Member Author) commented Mar 4, 2021

@trane9991

  1. Yes for both.

  2. This is because of the way the library/framework works: you would only get i/o timeouts on the first read (ReadMessageBegin). For client-side writes (server-side reads), if you look at TStandardClient's implementation, we always call WriteMessageBegin, Write, WriteMessageEnd, and Flush without delays between them:

    ```go
    if err := oprot.WriteMessageBegin(ctx, method, CALL, seqId); err != nil {
        return err
    }
    if err := args.Write(ctx, oprot); err != nil {
        return err
    }
    if err := oprot.WriteMessageEnd(ctx); err != nil {
        return err
    }
    return oprot.Flush(ctx)
    ```

    For server-side writes (client-side reads), it's the same in the compiler-generated Process functions. The only "delay" that could cause i/o timeouts is the delay before calling WriteMessageBegin; for server writes, that delay is just the actual handler handling the request and generating the response. I described that in the jira ticket.

  3. Right.
