ProduceSync returning "context canceled" error when parent context hasn't been cancelled #756

saholman · 2024-06-05T15:46:31Z

Many times throughout the day the following sequence of logs occur:

msg: "random error while producing, requeueing unattempted request", err: "ILLEGAL_SASL_STATE: Request is not valid given the current SASL state."
Msg: "produce request failed triggering metadata update", err: "ILLEGAL_SASL_STATE: Request is not valid given the current SASL state."
Msg: "metadata update triggered" why: "failed produce request triggering metadata update"
A few seconds later we get 10 or so instances of the ProduceSync method returning a "context cancelled" error. The parent context we are passing in isn't getting cancelled.

These ProduceSync errors are impacting our system. Is there an internal context that is being cancelled? Is there a reason these particular errors aren't being retried?

twmb · 2024-06-07T16:35:09Z

Do you have debug logs of this happening?

saholman · 2024-06-18T22:14:43Z

"2024-06-18T20:37:48.051Z DEBUG [email protected]/kzap.go:105 sasl expiry limit reached, reauthenticating {""broker"": ""1""}"
"2024-06-18T20:37:48.051Z DEBUG [email protected]/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""1""}"
"2024-06-18T20:37:48.051Z DEBUG [email protected]/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""1"", ""bytes_written"": 30, ""write_wait"": ""12.03µs"", ""time_to_write"": ""50.431µs"", ""err"": null}"
"2024-06-18T20:37:48.052Z DEBUG [email protected]/kzap.go:105 read SASLHandshake v1 {""broker"": ""1"", ""bytes_read"": 40, ""read_wait"": ""42.972µs"", ""time_to_read"": ""1.091305ms"", ""err"": null}"
"2024-06-18T20:37:48.052Z DEBUG [email protected]/kzap.go:105 beginning sasl authentication {""broker"": ""1"", ""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""mechanism"": ""AWS_MSK_IAM"", ""authenticate"": true}"
"2024-06-18T20:37:48.052Z DEBUG [email protected]/kzap.go:105 issuing SASLAuthenticate {""broker"": ""1"", ""version"": 2, ""step"": 0}"
"2024-06-18T20:37:48.053Z DEBUG [email protected]/kzap.go:105 wrote SASLAuthenticate v2 {""broker"": ""1"", ""bytes_written"": 1755, ""write_wait"": ""8.298µs"", ""time_to_write"": ""45.558µs"", ""err"": null}"
"2024-06-18T20:37:48.065Z DEBUG [email protected]/kzap.go:105 read SASLAuthenticate v2 {""broker"": ""1"", ""bytes_read"": 98, ""read_wait"": ""31.063µs"", ""time_to_read"": ""12.384996ms"", ""err"": null}"
"2024-06-18T20:37:48.065Z DEBUG [email protected]/kzap.go:105 sasl has a limited lifetime {""broker"": ""1"", ""session_lifetime"": ""936ms"", ""lifetime_pessimism"": ""1s"", ""reauthenticate_in"": ""-64ms""}"
"2024-06-18T20:37:48.065Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 643, ""write_wait"": ""14.047337ms"", ""time_to_write"": ""47.9µs"", ""err"": null}"
"2024-06-18T20:37:48.069Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""48.54µs"", ""time_to_read"": ""4.039864ms"", ""err"": null}"
"2024-06-18T20:37:48.069Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778238=>778239}]""}"
"2024-06-18T20:37:48.088Z DEBUG [email protected]/kzap.go:105 sasl expiry limit reached, reauthenticating {""broker"": ""1""}"
"2024-06-18T20:37:48.088Z DEBUG [email protected]/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""1""}"
"2024-06-18T20:37:48.088Z DEBUG [email protected]/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""1"", ""bytes_written"": 30, ""write_wait"": ""12.261µs"", ""time_to_write"": ""67.692µs"", ""err"": null}"
"2024-06-18T20:37:48.106Z DEBUG [email protected]/kzap.go:105 read SASLHandshake v1 {""broker"": ""1"", ""bytes_read"": 14, ""read_wait"": ""57.922µs"", ""time_to_read"": ""17.648534ms"", ""err"": null}"
"2024-06-18T20:37:48.106Z WARN [email protected]/kzap.go:111 random error while producing, requeueing unattempted request {""broker"": ""1"", ""err"": ""ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.""}"
github.com/twmb/franz-go/plugin/kzap.(*Logger).Log
/go/pkg/mod/github.com/twmb/franz-go/plugin/[email protected]/kzap.go:111
github.com/twmb/franz-go/pkg/kgo.(*wrappedLogger).Log
/go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/logger.go:123
github.com/twmb/franz-go/pkg/kgo.(*sink).handleReqClientErr
/go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/sink.go:531
github.com/twmb/franz-go/pkg/kgo.(*sink).handleReqResp
/go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/sink.go:581
github.com/twmb/franz-go/pkg/kgo.(*sink).produce.func3
/go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/sink.go:369
github.com/twmb/franz-go/pkg/kgo.(*sink).handleSeqResps
/go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/sink.go:411
"2024-06-18T20:37:48.106Z INFO [email protected]/kzap.go:109 produce request failed, triggering metadata update {""broker"": ""1"", ""err"": ""ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.""}"
"2024-06-18T20:37:48.106Z DEBUG [email protected]/kzap.go:105 retry batches processed {""wanted_metadata_update"": true, ""triggering_metadata_update"": true, ""should_backoff"": false}"
"2024-06-18T20:37:48.106Z INFO [email protected]/kzap.go:109 metadata update triggered {""why"": ""failed produce request triggered metadata update""}"
"2024-06-18T20:37:48.191Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 777, ""write_wait"": ""33.683µs"", ""time_to_write"": ""61.676µs"", ""err"": null}"
"2024-06-18T20:37:48.192Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""95.167µs"", ""time_to_read"": ""1.037898ms"", ""err"": null}"
"2024-06-18T20:37:48.192Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778240=>778241}]""}"
"2024-06-18T20:37:48.389Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""36.824µs"", ""time_to_write"": ""45.222µs"", ""err"": null}"
"2024-06-18T20:37:48.390Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""69.801µs"", ""time_to_read"": ""768.197µs"", ""err"": null}"
"2024-06-18T20:37:48.390Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778243=>778244}]""}"
"2024-06-18T20:37:48.591Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""23.634µs"", ""time_to_write"": ""46.818µs"", ""err"": null}"
"2024-06-18T20:37:48.592Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""61.724µs"", ""time_to_read"": ""786.672µs"", ""err"": null}"
"2024-06-18T20:37:48.592Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778246=>778247}]""}"
"2024-06-18T20:37:48.650Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 644, ""write_wait"": ""27.785µs"", ""time_to_write"": ""46.696µs"", ""err"": null}"
"2024-06-18T20:37:48.650Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""86.61µs"", ""time_to_read"": ""694.89µs"", ""err"": null}"
"2024-06-18T20:37:48.650Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778248=>778249}]""}"
"2024-06-18T20:37:48.893Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""30.491µs"", ""time_to_write"": ""53.437µs"", ""err"": null}"
"2024-06-18T20:37:48.894Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""80.824µs"", ""time_to_read"": ""780.472µs"", ""err"": null}"
"2024-06-18T20:37:48.894Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778251=>778252}]""}"
"2024-06-18T20:37:48.953Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 644, ""write_wait"": ""24.65µs"", ""time_to_write"": ""44.14µs"", ""err"": null}"
"2024-06-18T20:37:48.954Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""69.11µs"", ""time_to_read"": ""775.095µs"", ""err"": null}"
"2024-06-18T20:37:48.954Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778253=>778254}]""}"
"2024-06-18T20:37:49.203Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 777, ""write_wait"": ""32.71µs"", ""time_to_write"": ""48.865µs"", ""err"": null}"
"2024-06-18T20:37:49.204Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""83.708µs"", ""time_to_read"": ""851.99µs"", ""err"": null}"
"2024-06-18T20:37:49.204Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778256=>778257}]""}"
"2024-06-18T20:37:49.267Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 641, ""write_wait"": ""25.185µs"", ""time_to_write"": ""45.943µs"", ""err"": null}"
"2024-06-18T20:37:49.268Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""75.01µs"", ""time_to_read"": ""752.085µs"", ""err"": null}"
"2024-06-18T20:37:49.268Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778258=>778259}]""}"
"2024-06-18T20:37:49.490Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""25.02µs"", ""time_to_write"": ""42.43µs"", ""err"": null}"
"2024-06-18T20:37:49.491Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""70.766µs"", ""time_to_read"": ""848.823µs"", ""err"": null}"
"2024-06-18T20:37:49.491Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778260=>778261}]""}"
"2024-06-18T20:37:49.549Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 644, ""write_wait"": ""25.806µs"", ""time_to_write"": ""45.988µs"", ""err"": null}"
"2024-06-18T20:37:49.549Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""86.452µs"", ""time_to_read"": ""694.985µs"", ""err"": null}"
"2024-06-18T20:37:49.549Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778262=>778263}]""}"
"2024-06-18T20:37:49.800Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""24.503µs"", ""time_to_write"": ""44.782µs"", ""err"": null}"
"2024-06-18T20:37:49.801Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""76.126µs"", ""time_to_read"": ""818.682µs"", ""err"": null}"
"2024-06-18T20:37:49.801Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778266=>778267}]""}"
"2024-06-18T20:37:49.859Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 644, ""write_wait"": ""32.997µs"", ""time_to_write"": ""47.074µs"", ""err"": null}"
"2024-06-18T20:37:49.860Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""85.297µs"", ""time_to_read"": ""915.296µs"", ""err"": null}"
""2024-06-18T20:37:50.052Z DEBUG [email protected]/kzap.go:105 sasl expiry limit reached, reauthenticating {""broker"": ""3""}"
"2024-06-18T20:37:50.052Z DEBUG [email protected]/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""3""}"
"2024-06-18T20:37:50.052Z DEBUG [email protected]/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""3"", ""bytes_written"": 30, ""write_wait"": ""11.888µs"", ""time_to_write"": ""54.789µs"", ""err"": null}"
"2024-06-18T20:37:50.052Z DEBUG [email protected]/kzap.go:105 read SASLHandshake v1 {""broker"": ""3"", ""bytes_read"": 40, ""read_wait"": ""96.019µs"", ""time_to_read"": ""568.061µs"", ""err"": null}"
"2024-06-18T20:37:50.052Z DEBUG [email protected]/kzap.go:105 beginning sasl authentication {""broker"": ""3"", ""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""mechanism"": ""AWS_MSK_IAM"", ""authenticate"": true}"
"2024-06-18T20:37:50.092Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 777, ""write_wait"": ""22.265µs"", ""time_to_write"": ""44.407µs"", ""err"": null}"
"2024-06-18T20:37:50.092Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""58.667µs"", ""time_to_read"": ""877.818µs"", ""err"": null}"
"2024-06-18T20:37:50.093Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778271=>778272}]""}"
"2024-06-18T20:37:50.133Z DEBUG [email protected]/kzap.go:105 issuing SASLAuthenticate {""broker"": ""3"", ""version"": 2, ""step"": 0}"
"2024-06-18T20:37:50.134Z DEBUG [email protected]/kzap.go:105 wrote SASLAuthenticate v2 {""broker"": ""3"", ""bytes_written"": 1755, ""write_wait"": ""9.053µs"", ""time_to_write"": ""95.662µs"", ""err"": null}"
"2024-06-18T20:37:50.151Z DEBUG [email protected]/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 641, ""write_wait"": ""26.78µs"", ""time_to_write"": ""41.639µs"", ""err"": null}"
"2024-06-18T20:37:50.152Z DEBUG [email protected]/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""61.357µs"", ""time_to_read"": ""733.487µs"", ""err"": null}"
"2024-06-18T20:37:50.152Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778273=>778274}]""}"
"2024-06-18T20:37:50.258Z DEBUG [email protected]/kzap.go:105 read SASLAuthenticate v2 {""broker"": ""3"", ""bytes_read"": 294, ""read_wait"": ""50.968µs"", ""time_to_read"": ""124.369308ms"", ""err"": null}"
"2024-06-18T20:37:50.258Z DEBUG [email protected]/kzap.go:105 sasl reauth failed, retrying once on new connection {""broker"": ""3"", ""err"": ""Cannot change principals during re-authentication from IAM.arn:aws:sts::REDACTED:assumed-role/REDACTED/REDACTED: IAM.arn:aws:sts::REDACTED:assumed-role/REDACTED/REDACTED: SASL_AUTHENTICATION_FAILED: SASL Authentication failed.""}"
"2024-06-18T20:37:50.258Z DEBUG [email protected]/kzap.go:105 opening connection to broker {""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""broker"": ""3""}"
"2024-06-18T20:37:50.267Z DEBUG [email protected]/kzap.go:105 connection opened to broker {""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""broker"": ""3""}"
"2024-06-18T20:37:50.267Z DEBUG [email protected]/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""3""}"
"2024-06-18T20:37:50.267Z DEBUG [email protected]/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""3"", ""bytes_written"": 30, ""write_wait"": ""8.466µs"", ""time_to_write"": ""32.194µs"", ""err"": null}"
"2024-06-18T20:37:50.267Z DEBUG [email protected]/kzap.go:105 read SASLHandshake v1 {""broker"": ""3"", ""bytes_read"": 40, ""read_wait"": ""36.113µs"", ""time_to_read"": ""458.292µs"", ""err"": null}"
"2024-06-18T20:37:50.267Z DEBUG [email protected]/kzap.go:105 beginning sasl authentication {""broker"": ""3"", ""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""mechanism"": ""AWS_MSK_IAM"", ""authenticate"": true}"
"2024-06-18T20:37:50.267Z DEBUG [email protected]/kzap.go:105 issuing SASLAuthenticate {""broker"": ""3"", ""version"": 2, ""step"": 0}"
"2024-06-18T20:37:50.267Z DEBUG [email protected]/kzap.go:105 wrote SASLAuthenticate v2 {""broker"": ""3"", ""bytes_written"": 1755, ""write_wait"": ""6.579µs"", ""time_to_write"": ""42.996µs"", ""err"": null}"
"2024-06-18T20:37:50.279Z DEBUG [email protected]/kzap.go:105 read SASLAuthenticate v2 {""broker"": ""3"", ""bytes_read"": 98, ""read_wait"": ""30.854µs"", ""time_to_read"": ""11.105484ms"", ""err"": null}"
"2024-06-18T20:37:50.279Z DEBUG [email protected]/kzap.go:105 sasl has a limited lifetime {""broker"": ""3"", ""session_lifetime"": ""59m59.722s"", ""lifetime_pessimism"": ""1m23.054s"", ""reauthenticate_in"": ""58m36.668s""}"
"2024-06-18T20:37:50.279Z DEBUG [email protected]/kzap.go:105 connection initialized successfully {""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""broker"": ""3""}"
"2024-06-18T20:37:50.279Z DEBUG [email protected]/kzap.go:105 wrote Metadata v12 {""broker"": ""3"", ""bytes_written"": 83, ""write_wait"": ""226.997748ms"", ""time_to_write"": ""47.72µs"", ""err"": null}"
"2024-06-18T20:37:50.291Z DEBUG [email protected]/kzap.go:105 read Metadata v12 {""broker"": ""3"", ""bytes_read"": 418, ""read_wait"": ""41.567µs"", ""time_to_read"": ""12.729919ms"", ""err"": null}"
"2024-06-18T20:37:50.292Z DEBUG [email protected]/kzap.go:105 metadata refresh has identical topic partition data {""topic"": ""REDACTED"", ""partition"": 0, ""leader"": 1, ""leader_epoch"": 150}"
"2024-06-18T20:37:50.292Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": null}"
"2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": null}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": null}"
"2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": null}"
"2024-06-18T20:37:50.303Z DEBUG [email protected]/kzap.go:105 opening connection to broker {""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""broker"": ""1""}"
"2024-06-18T20:37:50.353Z DEBUG [email protected]/kzap.go:105 connection opened to broker {""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""broker"": ""1""}"
"2024-06-18T20:37:50.353Z DEBUG [email protected]/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""1""}"
"2024-06-18T20:37:50.353Z DEBUG [email protected]/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""1"", ""bytes_written"": 30, ""write_wait"": ""10.279µs"", ""time_to_write"": ""33.919µs"", ""err"": null}"
"2024-06-18T20:37:50.354Z DEBUG [email protected]/kzap.go:105 read SASLHandshake v1 {""broker"": ""1"", ""bytes_read"": 40, ""read_wait"": ""36.204µs"", ""time_to_read"": ""953.121µs"", ""err"": null}"
"2024-06-18T20:37:50.354Z DEBUG [email protected]/kzap.go:105 beginning sasl authentication {""broker"": ""1"", ""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""mechanism"": ""AWS_MSK_IAM"", ""authenticate"": true}"
"2024-06-18T20:37:50.354Z DEBUG [email protected]/kzap.go:105 issuing SASLAuthenticate {""broker"": ""1"", ""version"": 2, ""step"": 0}"
"2024-06-18T20:37:50.354Z DEBUG [email protected]/kzap.go:105 wrote SASLAuthenticate v2 {""broker"": ""1"", ""bytes_written"": 1755, ""write_wait"": ""7.747µs"", ""time_to_write"": ""47.471µs"", ""err"": null}"
"2024-06-18T20:37:50.381Z DEBUG [email protected]/kzap.go:105 read SASLAuthenticate v2 {""broker"": ""1"", ""bytes_read"": 98, ""read_wait"": ""28.946µs"", ""time_to_read"": ""26.771741ms"", ""err"": null}"
"2024-06-18T20:37:50.381Z DEBUG [email protected]/kzap.go:105 sasl has a limited lifetime {""broker"": ""1"", ""session_lifetime"": ""59m59.62s"", ""lifetime_pessimism"": ""2m44.771s"", ""reauthenticate_in"": ""57m14.849s""}"
"2024-06-18T20:37:50.381Z DEBUG [email protected]/kzap.go:105 connection initialized successfully {""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""broker"": ""1""}"
"2024-06-18T20:37:49.860Z DEBUG [email protected]/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778268=>778269}]""}"

You can see that they are now returning a "context deadline exceeded" error instead of a "context canceled" error because we added a parent context with a timeout of 500ms. It was taking several seconds to get the "context canceled" so we added this to fail faster so we can retry.

But the flow is the same. I can't tell in this scenario if the produce request that are failing were going to broker 1 or broker 3. If it's broker 3, then we appear to cancel and not retry produce requests for the original broker connection. See #249 for the change that creates a new connection when SASL authentication fails. If it's broker 1, then it seems more related to the ILLEGAL_SASL_STATE due to the short session.

twmb · 2024-06-23T03:41:50Z

There is no canceling in the produce path of the codebase:

@m4x3r:~/src/twmb/franz-go/pkg/kgo
$ rg 'ancel\('
client.go
457:	ctx, cancel := context.WithCancel(context.Background())
1031:	sessCloseCancel()
1036:	cl.ctxCancel()
1318:	ctx, cancel := context.WithCancel(ctx)
1322:		defer cancel()
2031:	ctx, cancel := context.WithCancel(ctx)
2032:	defer cancel()

group_test.go
184:		cancel()

txn_test.go
173:		cancel()

consumer.go
1448:	ctx, cancel := context.WithCancel(c.cl.ctx)
1614:	session.cancel()

source.go
788:		ctx, cancel = context.WithCancel(consumerSession.ctx)
790:	defer cancel()

consumer_group.go
240:		ctx, cancel = context.WithCancel(context.Background())
241:		cancel()
276:	ctx, cancel := context.WithCancel(c.cl.ctx)
497:	g.cancel()
846:	ctx, cancel := context.WithCancel(g.ctx)
848:		defer cancel() // potentially kill offset fetching
2812:	commitCtx, commitCancel := context.WithCancel(ctx) // enable ours to be canceled and waited for
2829:				commitCancel()
2837:		defer commitCancel()
2843:				priorCancel()

consumer_direct_test.go
34:	defer cancel()
108:	defer cancel()
156:	cancel()
163:	defer cancel()
232:		cancel()
328:	ctx, cancel := context.WithCancel(context.Background())
344:	defer cancel()
374:			cancel()
406:	ctx, cancel := context.WithCancel(context.Background())
422:	defer cancel()
452:			cancel()
499:		cancel()
523:	cancel()

txn.go
1163:		g.commitCancel() // cancel any prior commit
1173:	commitCtx, commitCancel := context.WithCancel(g.ctx) // enable ours to be canceled and waited for
1183:				commitCancel()
1191:		defer commitCancel()
1197:				priorCancel()

The cancel calls in client.go are unrelated or irrelevant here. Everything else is related to consuming.

Without looking at your code, my guess is that you are using the same context across multiple Produce calls and are canceling it after some messages return, which causes buffered records to be canceled too.

The logs above look normal, minus the SASL problems that we've discussed in a few issues. If you really want to see where the cancel is coming from, I think you could write a wrapper context.Context interface that calls debug.PrintStack on cancel before calling the inner (original) cancel func. I'd be pretty curious to know what the debug logs say.

I'm not really sure what to do in the client at this point for the SASL problems, considering this is almost the same behavior as the Kafka Java client itself. IMO, AWS needs to have some answer to the SASL problems -- like, propose a solution, or change what they allow. Given what I saw when I implemented the AWS_MSK_IAM auth 3yr ago, the broker side seemed a bit odd at the time.

twmb · 2024-07-10T01:51:15Z

Let me know if you have a bit more of an idea on this one -- from my audit of the code, nothing should be causing other records to return context.Canceled.

saholman · 2024-07-15T15:34:54Z

The underlying cause of this was the SASL "session too short" errors (#731). The solution posted there has also resolved this problem. Thanks!

twmb added the waiting label Jun 7, 2024

twmb closed this as completed Jul 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ProduceSync returning "context canceled" error when parent context hasn't been cancelled #756

ProduceSync returning "context canceled" error when parent context hasn't been cancelled #756

saholman commented Jun 5, 2024 •

edited

Loading

twmb commented Jun 7, 2024

saholman commented Jun 18, 2024 •

edited

Loading

twmb commented Jun 23, 2024

twmb commented Jul 10, 2024

saholman commented Jul 15, 2024

ProduceSync returning "context canceled" error when parent context hasn't been cancelled #756

ProduceSync returning "context canceled" error when parent context hasn't been cancelled #756

Comments

saholman commented Jun 5, 2024 • edited Loading

twmb commented Jun 7, 2024

saholman commented Jun 18, 2024 • edited Loading

twmb commented Jun 23, 2024

twmb commented Jul 10, 2024

saholman commented Jul 15, 2024

saholman commented Jun 5, 2024 •

edited

Loading

saholman commented Jun 18, 2024 •

edited

Loading