
GetRecords hangs #301

Closed
fantyz opened this issue Jun 30, 2015 · 29 comments
Labels
service-api This issue is due to a problem in a service API, not the SDK implementation.

Comments

@fantyz

fantyz commented Jun 30, 2015

Hi,

I'm having some problems with my application that reads records from Kinesis. It seems to hang randomly on the GetRecords request until either Kinesis kills the connection or the subsequent GetRecords fails due to an expired iterator.

I have a stream with four shards. Total incoming data is around 500 kb/s. I have an application that starts a worker per shard; each worker is split into 5 parts, each communicating with the next over its own channel.

The KinesisReader is very straightforward:

...
for {
    // get records
    ts_start = time.Now()
    records, err := GetKinesisRecordsAndUpdateIterator(kinesisClient, iterator)
    if err != nil {
        log.Error("[WORKER #%v | KINESIS READER] Failed to get records (took=%v): %v", r.workerId, time.Since(ts_start), err)
        return
    }
    log.Info("[WORKER #%v | KINESIS READER] Read %v records from Kinesis (took=%v)", r.workerId, len(records), time.Since(ts_start))
    ...
}
...

func GetKinesisRecordsAndUpdateIterator(client *kinesis.Kinesis, iterator *string) ([]*kinesis.Record, error) {
    getRecordsInput := &kinesis.GetRecordsInput{
        ShardIterator: iterator,
    }
    resp, err := client.GetRecords(getRecordsInput)
    if err != nil {
        return nil, err
    }
    *iterator = *resp.NextShardIterator

    return resp.Records, nil
}

If it does not read any records, the reader sleeps for 10 seconds. On any error it kills the worker and starts a new one (hence it has failed 165 times in the last 12 hours).
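The restart and sleep behaviour is roughly the following (a simplified sketch, not the actual application code; getIterator is a hypothetical helper):

for {
    iterator, err := getIterator(kinesisClient, streamName, shardId) // hypothetical helper
    if err != nil {
        log.Error("[WORKER] Failed to get shard iterator: %v", err)
        time.Sleep(10 * time.Second)
        continue
    }
    for {
        records, err := GetKinesisRecordsAndUpdateIterator(kinesisClient, &iterator)
        if err != nil {
            break // this worker dies; the outer loop starts a new one with a fresh iterator
        }
        // ... hand the records to the next pipeline stage over its channel ...
        if len(records) == 0 {
            time.Sleep(10 * time.Second)
        }
    }
}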

The three other workers happily continue reading records while this is happening.

This results in the following log examples:

2015/06/30 06:25:07.073380 INFO [WORKER #157 | KINESIS READER] Read 1 records from Kinesis (took=30.845742ms)
2015/06/30 06:25:07.089282 INFO [WORKER #157 | KINESIS READER] Read 0 records from Kinesis (took=15.8282ms)
2015/06/30 06:31:50.780351 INFO [WORKER #157 | KINESIS READER] Read 21 records from Kinesis (took=6m33.690849627s)
2015/06/30 06:31:51.185088 ERROR [WORKER #157 | KINESIS READER] Failed to get records (took=73.276055ms): ExpiredIteratorException: Iterator expired. The iterator was created at time Tue Jun 30 06:25:17 UTC 2015 while right now it is Tue Jun 30 06:31:51 UTC 2015 which is further in the future than the tolerated delay of 300000 milliseconds.
status code: 400, request id: []
2015/06/30 06:37:24.780944 INFO [WORKER #165 | KINESIS READER] Read 0 records from Kinesis (took=20.868442ms)
06:37:34.977293 INFO [WORKER #165 | KINESIS READER] Read 25 records from Kinesis (took=196.090744ms)
2015/06/30 06:37:35.228885 INFO [WORKER #165 | KINESIS READER] Read 2 records from Kinesis (took=33.813113ms)
2015/06/30 06:37:35.258778 INFO [WORKER #165 | KINESIS READER] Read 0 records from Kinesis (took=24.589261ms)
2015/06/30 06:37:45.439487 INFO [WORKER #165 | KINESIS READER] Read 33 records from Kinesis (took=180.496173ms)
2015/06/30 06:37:45.650751 INFO [WORKER #165 | KINESIS READER] Read 1 records from Kinesis (took=35.829663ms)
2015/06/30 06:37:45.674542 INFO [WORKER #165 | KINESIS READER] Read 0 records from Kinesis (took=23.662061ms)
2015/06/30 06:42:58.103110 ERROR [WORKER #165 | KINESIS READER] Failed to get records (took=5m2.42835127s): SerializationError: failed decoding JSON RPC response
caused by: read tcp 54.239.19.115:443: connection reset by peer

The second one is by far the most frequent example.

Am I doing something wrong here? Any suggestions as to what might be wrong?

@jasdel
Contributor

jasdel commented Jul 1, 2015

At first glance I think there might be an issue with the iterator *string being given to GetKinesisRecordsAndUpdateIterator. Could the iterator be set by another goroutine, or set in a loop?

I know this can sometimes cause unexpected results and might be the cause of the expired iterators:

for i := 0; i < 3; i++ {
    go func() {
        fmt.Println("Got", i)
    }()
}

// outputs
// Got 3
// Got 3
// Got 3

http://play.golang.org/p/eqW58QXz5x

In a situation like this a new variable needs to be used within the loop.

for i := 0; i < 3; i++ {
    myIdx := i
    go func() {
        fmt.Println("Got", myIdx)
    }()
}

http://play.golang.org/p/E9e7g6NxMy

I'm not sure about the connection reset by peer issue. Is this running on a local system or an EC2 instance? I ask because GetKinesisRecordsAndUpdateIterator took over 5 minutes to return, though the other requests completed very quickly.

@jasdel jasdel added guidance Question that needs advice or information. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Jul 1, 2015
@fantyz
Author

fantyz commented Jul 20, 2015

Sorry for taking a while to get back to you.

Each worker "owns" a shard, so they aren't sharing anything like iterators. I'm fairly confident the two issues are one and the same (e.g. a GetRecords request taking very long to finish, causing either the HTTP connection to be reset by peer, or the request actually finishing but taking longer than the 5 minutes that is the maximum allowed age of an iterator).

It is running on an m3.medium instance in EC2. It has no problems keeping up with the load (<20% CPU usage, plenty of memory left). I've tried a few different instance types (including much larger ones), but the problem persists.

@jasdel
Contributor

jasdel commented Jul 24, 2015

Hi @fantyz, have you been able to reproduce this issue with a simple single-goroutine test app? I'd also be curious whether your SDK client is retrying the failed connection multiple times with exponential backoff before exhausting all retry attempts; this might shed some light on what is going on. I would suggest enabling SDK logging. Setting the aws.Config's LogLevel value above 0 will enable request logging.

svc := kinesis.New(&aws.Config{LogLevel: 1})
//... use the service.

@jasdel
Contributor

jasdel commented Aug 5, 2015

Hi @fantyz, with our latest changes to aws.Config, the best way to see the request and response wire data is:

svc := kinesis.New(aws.NewConfig().WithLogLevel(aws.LogDebug))
// ... use the service.

@fantyz
Author

fantyz commented Aug 6, 2015

Thanks- I hope to find time to investigate soon. I'll be back! :)


@jasdel
Contributor

jasdel commented Oct 8, 2015

Hi @fantyz did you get a chance to gather log outputs when you were encountering this issue?

I'm going to close this issue, please reopen if you are still encountering this problem.

@jasdel jasdel closed this as completed Oct 8, 2015
@fantyz
Author

fantyz commented Oct 8, 2015

Yea, we still have it. I've just not had the chance yet to dig further into it. I'll be back :)

@fantyz
Author

fantyz commented Jul 5, 2016

@jasdel I don't have permissions to reopen the issue.

I'm using aws-sdk-go v1.0.11.

I've had a chance to spend some time on this now.

I've made a very simple Kinesis reader and had it running for a short while. As it is a transient problem, it's not easy to catch it happening; the closest has been a 30-second delay. As shown below, the actual delay happens after the response from the Kinesis API has been received?

Any ideas how to proceed from here?

...
Jul 05 13:17:06 pipeline.log:  2016/07/05 11:17:06 DEBUG: Response kinesis/GetRecords Details:
Jul 05 13:17:06 pipeline.log:  ---[ RESPONSE ]--------------------------------------
Jul 05 13:17:06 pipeline.log:  <skipping body>
Jul 05 13:17:06 pipeline.log:  -----------------------------------------------------
Jul 05 13:17:06 pipeline.log:  2016/07/05 11:17:06 [DEBUGGER] Read 9 records from Kinesis (took=26.05296ms, lag=2016-07-05 11:17:06 +0000 UTC)
Jul 05 13:17:06 pipeline.log:  2016/07/05 11:17:06 [DEBUGGER] Sleeping for 973.94704ms
Jul 05 13:17:07 pipeline.log:  2016/07/05 11:17:07 DEBUG: Request kinesis/GetRecords Details:
Jul 05 13:17:07 pipeline.log:  ---[ REQUEST POST-SIGN ]-----------------------------
Jul 05 13:17:07 pipeline.log:  POST / HTTP/1.1 
Jul 05 13:17:07 pipeline.log:  Host: kinesis.us-east-1.amazonaws.com 
Jul 05 13:17:07 pipeline.log:  User-Agent: aws-sdk-go/1.0.11 (go1.6; linux; amd64) 
Jul 05 13:17:07 pipeline.log:  Content-Length: 268 
Jul 05 13:17:07 pipeline.log:  Authorization: AWS4-HMAC-SHA256 Credential=ASIAJ3ZTEIMWNPP6ZTRQ/20160705/us-east-1/kinesis/aws4_request, SignedHeaders=host;x-amz-date;x-amz-security-token;x-amz-target, Signature=7e8845e185a8604c23c68d653595c9da9fdaee252d4afe0d1140ad3efa120b13 
Jul 05 13:17:07 pipeline.log:  Content-Type: application/x-amz-json-1.1 
Jul 05 13:17:07 pipeline.log:  X-Amz-Content-Sha256: 8f03fc179d188c84ff902eac26fd43a717507ee881ea42b9d245cb5467bad255 
Jul 05 13:17:07 pipeline.log:  X-Amz-Date: 20160705T111707Z 
Jul 05 13:17:07 pipeline.log:  X-Amz-Security-Token: FQoDYXdzEIv//////////wEaDOa/V5X5iX9RQ/lqNSKcAyfpYz91GDFn5mDadwdxseZ6faPdxyljA9lvieptBicUM7Pw91WAbC1Aa+yqXUmjA90KaOUst8zhYmY6KZOgJ+hbI1OJMDJcmOY0kRbjca/Xf7B8n3Mh4ukTKTsymidrn0FfkKctvzv/VBoFYBM7cpdAon2HKSOqk7AeeCPfAn9D+BiGtvQUd0gW4h/k5ZU0qTLoeN1d/fnMcGu+PsxosVVfgSyoNKOMI45A77KsazjFbym3fqFb4UVPoL+1UEMcGJzwtIPozYKTPPISsBUEYoH1SxN++XTynJvjAhlZsuVb4GuKMMwB51NS1CsRcghYTenIqHMwOXe7lSzY4HX3zvLJRs2mD+4vVvoqDuaBaiM5Fx08Pa1IXkhU80lf68/rxR/uvYtGKzzkW323rpDlg9toSckd0OGRwHMiFCZfR3s9u563EJx2E4CRTry7uRzFmW5FWFON4l4PKkAsLYSO14lgbDT/ol0HiFL/Nmo5HpECX8aONIvgMhGyJf5xQjEroRXYfPJOWPFv4QpIb381nrW0J9t/ipMirgcuovoor4XuuwU= 
Jul 05 13:17:07 pipeline.log:  X-Amz-Target: Kinesis_20131202.GetRecords 
Jul 05 13:17:07 pipeline.log:  Accept-Encoding: gzip 
Jul 05 13:17:07 pipeline.log: {"ShardIterator":"AAAAAAAAAAHHHk8OCHycfvVztfWnpG0G23yT/4e0q+ULejP0Y2PPKdTw4CJAFZbHpWYfmtfpkLjnCJDyKK1rWGR3GpGaguggr5Y1Mzgb3PfQ3boK7LWYT4Zztn+OJLXWKD5s/4wkWClyKPihUiGmXYTwoTi0zkpcXzwkIM4ofpbkYsBFSZDbymyvZ/fawst1PnpgfOCeH15KwSHqFGrGT5qU38VwVefy3fXkKClKl+vbyHJpC7BIFQ=="}
Jul 05 13:17:07 pipeline.log:  -----------------------------------------------------
Jul 05 13:17:07 pipeline.log:  2016/07/05 11:17:07 DEBUG: Response kinesis/GetRecords Details:
Jul 05 13:17:07 pipeline.log:  ---[ RESPONSE ]--------------------------------------
Jul 05 13:17:07 pipeline.log:  <skipping body>
Jul 05 13:17:07 pipeline.log:  -----------------------------------------------------
Jul 05 13:17:38 pipeline.log:  2016/07/05 11:17:38 [DEBUGGER] Read 2 records from Kinesis (took=30.79373588s, lag=2016-07-05 11:17:07 +0000 UTC)
...

I modified aws-sdk-go/aws/client/client.go to allow using aws.LogDebugWithHTTPBody without actually printing the successful responses, as aws.LogDebugWithRequestErrors is outputting a lot of UnknownError: unknown rather than the actual error.

func logResponse(r *request.Request) {
    var msg = "no response data"

    if r.HTTPResponse != nil && r.HTTPResponse.StatusCode == 200 {
        // Temporary hack to avoid outputting the body of a successful GetRecords
        msg = "<skipping body>"
    } else if r.HTTPResponse != nil {
        logBody := r.Config.LogLevel.Matches(aws.LogDebugWithHTTPBody)
        dumpedBody, _ := httputil.DumpResponse(r.HTTPResponse, logBody)
        msg = string(dumpedBody)
    } else if r.Error != nil {
        msg = r.Error.Error()
    }
    r.Config.Logger.Log(fmt.Sprintf(logRespMsg, r.ClientInfo.ServiceName, r.Operation.Name, msg))
}

My kinesis reader application looks like this:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    //  "github.com/aws/aws-sdk-go/aws/credentials"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/kinesis"
)

/*
var kc = kinesis.New(session.New(aws.NewConfig().
    WithRegion("us-east-1").
    WithLogLevel(aws.LogDebugWithRequestErrors | aws.LogDebugWithHTTPBody).
    WithCredentials(credentials.NewSharedCredentials("", "prod")),
))
*/

var kc = kinesis.New(session.New(aws.NewConfig().WithRegion("us-east-1").WithLogLevel(aws.LogDebugWithRequestErrors | aws.LogDebugWithHTTPBody)))

func GetKinesisTrimHorizonIterator(streamName, shardId string) (string, error) {
    getShardIteratorInput := &kinesis.GetShardIteratorInput{
        StreamName:        &streamName,
        ShardId:           &shardId,
        ShardIteratorType: aws.String("TRIM_HORIZON"),
        //ShardIteratorType: aws.String("LATEST"),
    }
    resp, err := kc.GetShardIterator(getShardIteratorInput)
    if err != nil {
        return "", fmt.Errorf("Failed to get kinesis shard iterator (TRIM_HORIZON) (streamName=%v, shardId=%v, err=%v)", streamName, shardId, err)
    }
    return *resp.ShardIterator, nil
}

func GetKinesisRecords(iterator string) ([]*kinesis.Record, string, error) {
    getRecordsInput := &kinesis.GetRecordsInput{
        ShardIterator: &iterator,
    }
    resp, err := kc.GetRecords(getRecordsInput)
    if err != nil {
        return nil, "", fmt.Errorf("Failed to get records from kinesis (iterator=%v, err=%v)", iterator, err)
    }

    return resp.Records, *resp.NextShardIterator, nil
}

func main() {
    minReadDuration := 1 * time.Second
    streamName := "<stream>"
    shardId := "<shard>"

outer:
    for {
        log.Printf("[DEBUGGER] Getting kinesis iterator")
        iterator, err := GetKinesisTrimHorizonIterator(streamName, shardId)
        if err != nil {
            log.Printf("ERROR [DEBUGGER] Unable to get kinesis shard iterator: %v", err)
            return
        }
        log.Printf("[DEBUGGER] Got iterator %v", iterator)

        var ts_start, ts_last, lag time.Time
        var records []*kinesis.Record
        ts_last = time.Now()

        for {
            // get records
            ts_start = time.Now()
            records, iterator, err = GetKinesisRecords(iterator)
            if err != nil {
                log.Printf("[DEBUGGER] Unable to get data (requestTime=%v, lastIterationRequestTime=%v, err=%v)", time.Since(ts_start), ts_start.Sub(ts_last), err)
                continue outer
            }
            ts_last = ts_start

            if len(records) > 0 {
                lag = *(records[len(records)-1]).ApproximateArrivalTimestamp
            }

            took := time.Since(ts_start)
            log.Printf("[DEBUGGER] Read %v records from Kinesis (took=%v, lag=%v)", len(records), took, lag)
            if took > 1*time.Minute {
                log.Printf("[DEBUGGER] Request took more than 1 minute!")
            }
            sleepDuration := minReadDuration - took
            if sleepDuration > 0 {
                log.Printf("[DEBUGGER] Sleeping for %v", sleepDuration)
                time.Sleep(sleepDuration)
            }
        }
    }
}

@fantyz
Author

fantyz commented Jul 6, 2016

I left the debug application running overnight. The log contains a handful of examples of GetRecords taking more than 1 minute to respond. The worst one caught here is more than 12 minutes.

...
Jul 05 17:03:40 pipeline.log:  2016/07/05 15:03:40 [DEBUGGER] Sleeping for 974.869046ms
Jul 05 17:03:41 pipeline.log:  2016/07/05 15:03:41 DEBUG: Request kinesis/GetRecords Details:
Jul 05 17:03:41 pipeline.log:  ---[ REQUEST POST-SIGN ]-----------------------------
Jul 05 17:03:41 pipeline.log:  POST / HTTP/1.1 
Jul 05 17:03:41 pipeline.log:  Host: kinesis.us-east-1.amazonaws.com 
Jul 05 17:03:41 pipeline.log:  User-Agent: aws-sdk-go/1.0.11 (go1.6; linux; amd64) 
Jul 05 17:03:41 pipeline.log:  Content-Length: 268 
Jul 05 17:03:41 pipeline.log:  Authorization: AWS4-HMAC-SHA256 Credential=ASIAJ3ZTEIMWNPP6ZTRQ/20160705/us-east-1/kinesis/aws4_request, SignedHeaders=host;x-amz-date;x-amz-security-token;x-amz-target, Signature=319589d86e3dd9dc3ecc015078bb68f403782f5b6158d7c2e73ce1455d0812ea 
Jul 05 17:03:41 pipeline.log:  Content-Type: application/x-amz-json-1.1 
Jul 05 17:03:41 pipeline.log:  X-Amz-Content-Sha256: 2830397eb712ad24bf5685487bb6f2c873c355217dc42ed9029c380f826a00dc 
Jul 05 17:03:41 pipeline.log:  X-Amz-Date: 20160705T150341Z 
Jul 05 17:03:41 pipeline.log:  X-Amz-Security-Token: FQoDYXdzEIv//////////wEaDOa/V5X5iX9RQ/lqNSKcAyfpYz91GDFn5mDadwdxseZ6faPdxyljA9lvieptBicUM7Pw91WAbC1Aa+yqXUmjA90KaOUst8zhYmY6KZOgJ+hbI1OJMDJcmOY0kRbjca/Xf7B8n3Mh4ukTKTsymidrn0FfkKctvzv/VBoFYBM7cpdAon2HKSOqk7AeeCPfAn9D+BiGtvQUd0gW4h/k5ZU0qTLoeN1d/fnMcGu+PsxosVVfgSyoNKOMI45A77KsazjFbym3fqFb4UVPoL+1UEMcGJzwtIPozYKTPPISsBUEYoH1SxN++XTynJvjAhlZsuVb4GuKMMwB51NS1CsRcghYTenIqHMwOXe7lSzY4HX3zvLJRs2mD+4vVvoqDuaBaiM5Fx08Pa1IXkhU80lf68/rxR/uvYtGKzzkW323rpDlg9toSckd0OGRwHMiFCZfR3s9u563EJx2E4CRTry7uRzFmW5FWFON4l4PKkAsLYSO14lgbDT/ol0HiFL/Nmo5HpECX8aONIvgMhGyJf5xQjEroRXYfPJOWPFv4QpIb381nrW0J9t/ipMirgcuovoor4XuuwU= 
Jul 05 17:03:41 pipeline.log:  X-Amz-Target: Kinesis_20131202.GetRecords 
Jul 05 17:03:41 pipeline.log:  Accept-Encoding: gzip 
Jul 05 17:03:41 pipeline.log: {"ShardIterator":"AAAAAAAAAAEeoL/YYr2CCrFAWl2N+UrspY+TGDxNGwq6pmWhHA85F4wbXQW3WZeE4F+nEcbpt4uGzjMZJDn6EsO7eZ4uSOHsZB7jsJKl891zoSLW7/px0TpBFye0th8WGv/Xy2NLNwU0R0FhdUA/rj2vVgH7o8BaNXhQn61yOSsj2sGj5hcsXG6lTw6GKBR+jm7DJtThN3DVCkzP0VDr5rYxxmW+bN6voqXK95YhsbXg9GpSdU/qrA=="}
Jul 05 17:03:41 pipeline.log:  -----------------------------------------------------
Jul 05 17:03:41 pipeline.log:  2016/07/05 15:03:41 DEBUG: Response kinesis/GetRecords Details:
Jul 05 17:03:41 pipeline.log:  ---[ RESPONSE ]--------------------------------------
Jul 05 17:03:41 pipeline.log:  <skipping body>
Jul 05 17:03:41 pipeline.log:  -----------------------------------------------------
Jul 05 17:15:46 pipeline.log:  2016/07/05 15:15:46 [DEBUGGER] Read 8 records from Kinesis (took=12m4.798487323s, lag=2016-07-05 15:03:41 +0000 UTC)
Jul 05 17:15:46 pipeline.log:  2016/07/05 15:15:46 [DEBUGGER] Request took more than 1 minute!
...

@jasdel
Contributor

jasdel commented Jul 6, 2016

Hi @fantyz, thanks for the updated information, and for the sample and output logs. The section that is marked as <skipping body> in the log is cut out by you, correct? It would be very interesting to see the response headers and status code. From your description it sounds like the SDK knows that an error occurs but doesn't know how to handle it.

I'll try to reproduce this on my end. Are you using Kinesis Streams or Firehose? In addition, about how many records do you have in your stream, and about how large are they?

@jasdel jasdel reopened this Jul 6, 2016
@jasdel jasdel removed the guidance Question that needs advice or information. label Jul 6, 2016
@fantyz
Author

fantyz commented Jul 7, 2016

@jasdel yes, but that is the minor hack I made to the SDK, shown in my reply above. It prints <skipping body> if and only if the status code is 200.

The GetRecords call that takes a very long time succeeds and returns actual records rather than an error (which matches the single request to Kinesis from the SDK and the subsequent <skipping body> on the 200 response). I have no idea what could be taking time after that.

I'm using Kinesis Streams. Our records vary a lot in size, averaging around 10 KB, but there are cases of 1 MB records (analytics coming in from clients that could potentially have been offline for a while, batching up a lot of analytics). Currently I think the throughput is around 10-15 MB/minute total on this system.

@jasdel jasdel added the bug This issue is a bug. label Jul 7, 2016
@jasdel
Contributor

jasdel commented Jul 7, 2016

Thanks for the update and additional info @fantyz. Updating this issue to a bug, because the SDK should at least provide more information about the error causing the retry. I'll work to see if I can reproduce this on my end using a similar setup to what you described.

Copying from our gitter discussion:

{"__type":"ProvisionedThroughputExceededException","message":"Rate exceeded for shard shardId-000000000003 in stream brim-prod-analytics under account xxx."}

ProvisionedThroughputExceededException is enumerated as a throttled exception, which overrides the fact that the server's response is a 400 status code. By default the SDK only retries 5xx status codes, unless the error code overrides the request to be retried.

When your code receives a ThroughputExceededException error, is the HTTP status code also a 400? If so, the SDK needs to add ThroughputExceededException to our list of throttle exceptions. This will tell the SDK it is allowed to retry that request.
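In the meantime, client code can recognise the throttling error itself and back off before retrying. A minimal sketch using the SDK's awserr package (checking both error-code spellings that appear in this thread):

import "github.com/aws/aws-sdk-go/aws/awserr"

// Sketch: recognise a Kinesis throttling error so the caller can back off and retry.
func isKinesisThrottle(err error) bool {
    aerr, ok := err.(awserr.Error)
    if !ok {
        return false
    }
    switch aerr.Code() {
    case "ProvisionedThroughputExceededException", "ThroughputExceededException":
        return true
    }
    // The HTTP status is also available when needed:
    // if rf, ok := err.(awserr.RequestFailure); ok { _ = rf.StatusCode() }
    return false
}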

The delay you're seeing with GetRecords may be related to a Provisioned Throughput set too low for your use case.

@fantyz
Author

fantyz commented Jul 7, 2016

@jasdel these are two unrelated issues though!

This issue is about the seemingly random hang I'm getting on GetRecords. That one has no errors involved. The logs show the request succeeds but takes an inordinately long time to complete.

The other issue popped up while trying to debug the first one. I think a new issue is in order for that one! :)

@jasdel
Contributor

jasdel commented Jul 7, 2016

In the case of the eventually successful request that was delayed, was retry logging enabled? If not, I'm thinking that the requests were just being throttled and retried multiple times. The alternative would be that Kinesis is holding the connection open and delaying the response, but this delay would be limited by the Go default HTTP timeouts of about a minute.

If you can provide me with a request ID I can forward this on to the Kinesis service team. In addition, it might be helpful to ask on the Kinesis AWS Forums, since others may have experienced similar issues.

@fantyz
Author

fantyz commented Jul 7, 2016

According to the log timestamps, the SDK has the full body at hand and is able to print it out (although I bypassed that to avoid flooding the logs with data) before the delay occurs. I don't think it has anything to do with the actual connection?

You can see the jump from 17:03:41 to 17:15:46 is after getting the response.

Jul 05 17:03:41 pipeline.log:  ---[ RESPONSE ]--------------------------------------
Jul 05 17:03:41 pipeline.log:  <skipping body>
Jul 05 17:03:41 pipeline.log:  -----------------------------------------------------
Jul 05 17:15:46 pipeline.log:  2016/07/05 15:15:46 [DEBUGGER] Read 8 records from Kinesis (took=12m4.798487323s, lag=2016-07-05 15:03:41 +0000 UTC)

@fantyz
Author

fantyz commented Jul 7, 2016

Retry logging was not explicitly enabled, but does it have to be when request logging is? I've seen many examples where the retries were clearly visible from the requests being logged with the current settings; the individual requests are shown (including the failing ones). The above example has only one single request, which succeeds, but the SDK does not return from the function call for a significant amount of time afterwards.

How do I find the request ID? I'll dig it out tomorrow, although I'm not sure what the Kinesis service team can do, as the GetRecords request is responded to in a timely manner (as shown by the logs).

It looks to me like it is printing the body of the successful GetRecords API response around 12 minutes before returning from the function call, or am I missing something obvious here?

@jasdel
Contributor

jasdel commented Jul 7, 2016

The base aws.LogDebug will log each request/response made, including retries. The biggest difference between LogDebug and LogDebugWithRequestErrors is that the error will be logged, along with whether the request will be retried. I agree it sounds like the requests aren't being retried in this case.

The debug responses you're seeing do not include the HTTP body, and LogDebugWithHTTPBody is not set as the log level, correct? The first thing that comes to mind is that the SDK is waiting for the response to download from the Kinesis endpoint. The GetRecords call won't be able to return until the full response body is downloaded and parsed. Are you able to see the Content-Length of the responses? From the logs listed above it also looks like the connection is being closed/broken in some cases, preventing download of the full response.

The request id should be visible in the logged response's HTTP Headers. With the request ID I can work with the Kinesis team to hopefully investigate why the request is taking so long to download.

In addition how many shards is your stream set for?

@jasdel
Contributor

jasdel commented Jul 7, 2016

I'm looking into reproducing the delay you're seeing, but haven't been able to yet. I created scripts to put data to Kinesis and retrieve it.

In my tests the retrievals are taking about 3s on average. I haven't seen the delay yet, but will let it run for a while.

@jasdel
Contributor

jasdel commented Jul 8, 2016

Hi @fantyz I let my test run overnight, but no luck reproducing the issue. The longest delay I found was 22s.

@fantyz
Author

fantyz commented Jul 8, 2016

It varies a lot how often it happens; normally not more than a couple of times a day. However, earlier today I updated my debugger to also log the status code and content length and to print the size of the records. I think it's been running for ~8 hours and there has been one occurrence:

As shown, the jump in time happens after the printing of the response but before the function returns.

Jul 08 19:03:52 pipeline.log:  2016/07/08 17:03:52 [DEBUGGER] Read 5 records from Kinesis (took=38.532997ms, lag=2016-07-08 17:03:52 +0000 UTC)
Jul 08 19:03:52 pipeline.log:  2016/07/08 17:03:52 [DEBUGGER] Sleeping for 961.467003ms
Jul 08 19:03:53 pipeline.log:  2016/07/08 17:03:53 DEBUG: Request kinesis/GetRecords Details:
Jul 08 19:03:53 pipeline.log:  ---[ REQUEST POST-SIGN ]-----------------------------
Jul 08 19:03:53 pipeline.log:  POST / HTTP/1.1 
Jul 08 19:03:53 pipeline.log:  Host: kinesis.us-east-1.amazonaws.com 
Jul 08 19:03:53 pipeline.log:  User-Agent: aws-sdk-go/1.0.11 (go1.6; linux; amd64) 
Jul 08 19:03:53 pipeline.log:  Content-Length: 268 
Jul 08 19:03:53 pipeline.log:  Authorization: AWS4-HMAC-SHA256 Credential=ASIAJX3ZO3GO2HYO6E7A/20160708/us-east-1/kinesis/aws4_request, SignedHeaders=host;x-amz-date;x-amz-security-token;x-amz-target, Signature=bf252509dca02b933b71360faff9361428a65983cc3a0eae0bbe95cf1e2f1903 
Jul 08 19:03:53 pipeline.log:  Content-Type: application/x-amz-json-1.1 
Jul 08 19:03:53 pipeline.log:  X-Amz-Content-Sha256: b7326a3657919307e21c8012ffaddcc4533a7dc5f9489569efe2e29db1d7ed91 
Jul 08 19:03:53 pipeline.log:  X-Amz-Date: 20160708T170353Z 
Jul 08 19:03:53 pipeline.log:  X-Amz-Security-Token: FQoDYXdzENf//////////wEaDN5V67Ir7x/AAehQBSKcAyvnugTclVN25and56pVD5jIerpza+Hgovo1ih2oNa0U5BwjtsBsfmbY7lG+JrR505MNpXv+bTcS0mf4+WglBf+r017YQhJ9xg1CjDpDglDnqFNwy1FVpP7EXamKO1sIDu/9A/XlrsQMCf6kAowaEeYxwZbJnsqW5qjH9PsV33vd7rrXzPNmx7WPVuj7uiFnTxuZp+V+PypYM4Q/xYDBS7OeObVbEjMHDR3KHVWYPUyAwIjqHcmFnOPwJP0PR2qtBAvbgSB9pUNBXXMuUSt7SBaHG1qBIoLjHZ4zmTapiNp2AhyDgTEsjRc//m8D0oxhPslt6xABwBFvM2XVwpTPDWoLlyz2HjgbrJmPNVlzBnUcuMxHaUlio1ZmZSsonAJ/vfDUYyLfoFqZvTEkxDJJub+gJ4LUdmuTJ8nj4bt00lsYYFeNLUerpCXigPTg9rnThT7VGKHF3hfxUdlEJkmEZTH7tnscoSi0nw5/q3ipBuafvPGyLL8XDQBlGP9KXbxgzet2ulAYY+ZbwLt1xPaS5xj/uukHGU4AhOKRGH8optz+uwU= 
Jul 08 19:03:53 pipeline.log:  X-Amz-Target: Kinesis_20131202.GetRecords 
Jul 08 19:03:53 pipeline.log:  Accept-Encoding: gzip 
Jul 08 19:03:53 pipeline.log: {"ShardIterator":"AAAAAAAAAAHprr6wW7jVknGCzR4EDZJBoEzT360fUy2PMMxEye7ibI7wAQ3ybnNRpcCbi6zMZSfATOqyP33dNnbnfisvoMJW8flafcyuoeUjynEjIujDDyg9j3HaracaaXf0235vK4BEwqapmP36MvcccRHo58FA8bOTQ/lk32R/w6bR04vXUV/JANW6bTnvEiijg4Q5Yo5U0k4I/GUarUscf7GIl1iECRV84vZWWUB3p+kUGHs97A=="}
Jul 08 19:03:53 pipeline.log:  -----------------------------------------------------
Jul 08 19:03:53 pipeline.log:  2016/07/08 17:03:53 DEBUG: Response kinesis/GetRecords Details:
Jul 08 19:03:53 pipeline.log:  ---[ RESPONSE ]--------------------------------------
Jul 08 19:03:53 pipeline.log:  HTTP/1.1 200 OK 
Jul 08 19:03:53 pipeline.log:  Content-Length: 49633 
Jul 08 19:03:53 pipeline.log:  Content-Type: application/x-amz-json-1.1 
Jul 08 19:03:53 pipeline.log:  Date: Fri, 08 Jul 2016 17:03:52 GMT 
Jul 08 19:03:53 pipeline.log:  Server: Apache-Coyote/1.1 
Jul 08 19:03:53 pipeline.log:  X-Amz-Id-2: 6z52FP2ur7fSTxaTtaoIqkHiKvD1GFG5/iZoxzJBjWqlZubJAmc4M2cqydK3KazYUquPV8BlkhA2Q2mpusChYaiWAoiA3ipq 
Jul 08 19:03:53 pipeline.log:  X-Amzn-Requestid: fb6566fb-6468-eeaf-b49c-b9c2d6f38b0f 
Jul 08 19:03:53 pipeline.log:  -----------------------------------------------------
Jul 08 19:10:01 cron:  Jul  8 17:10:01 ip-172-31-150-165 CROND[23747]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER] Read 8 records from Kinesis (took=7m48.178352365s, lag=2016-07-08 17:03:52 +0000 UTC)
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER] Request took more than 1 minute!
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER]   -> Record size: 535.00B
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER]   -> Record size: 5.92KB
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER]   -> Record size: 3.99KB
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER]   -> Record size: 5.91KB
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER]   -> Record size: 766.00B
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER]   -> Record size: 592.00B
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER]   -> Record size: 11.49KB
Jul 08 19:11:41 pipeline.log:  2016/07/08 17:11:41 [DEBUGGER]   -> Record size: 6.75KB

@jasdel
Contributor

jasdel commented Jul 8, 2016

Thanks for the update @fantyz. I've forwarded this information on to the Kinesis team to get their insight.

One thing that we could try is to add a custom io.Reader into the pipeline that unmarshals these responses. The idea here is that we can log the bytes read, to hopefully identify whether the hang-up is in receiving the bytes across the wire or in the SDK's unmarshaler.

I've updated the kinesisStream.go gist example to include this extra debugging. It adds detailed logging to the GetRecords request and adds a DebugReadCloser wrapping the HTTP response's Body that will log each block of bytes read from the HTTP body stream.

This will produce output like the following. All non-read times are relative to the time the GetRecords call was started.

Starting GetRecords build/sign request, took 156.776µs
Finished GetRecords Send, took 840.307398ms
Started GetRecords Unmarshal, took 840.367146ms
2016/07/08 15:59:55 DEBUG: read 512 bytes, took 56.036969ms
2016/07/08 15:59:55 DEBUG: read 1024 bytes, took 1.007µs
2016/07/08 15:59:55 DEBUG: read 2048 bytes, took 455ns
2016/07/08 15:59:55 DEBUG: read 512 bytes, took 1.106µs
2016/07/08 15:59:55 DEBUG: read 3584 bytes, took 3.127µs
2016/07/08 15:59:55 DEBUG: read 512 bytes, took 492ns
2016/07/08 15:59:55 DEBUG: read 7680 bytes, took 1.318µs
2016/07/08 15:59:55 DEBUG: read 512 bytes, took 917ns
2016/07/08 15:59:55 DEBUG: read 708 bytes, took 11.502µs
2016/07/08 15:59:55 DEBUG: read 15164 bytes, took 121.372523ms
...
2016/07/08 16:00:08 DEBUG: read 16384 bytes, took 2.966015ms
2016/07/08 16:00:08 DEBUG: read 1024 bytes, took 12.278µs
2016/07/08 16:00:08 DEBUG: read 16384 bytes, took 2.600777ms
2016/07/08 16:00:08 DEBUG: read 1024 bytes, took 96.543µs
2016/07/08 16:00:08 DEBUG: read 16384 bytes, took 12.09666ms
2016/07/08 16:00:08 DEBUG: read 1024 bytes, took 40.412µs
2016/07/08 16:00:08 DEBUG: read 16384 bytes, took 2.655483ms
2016/07/08 16:00:08 DEBUG: read 1024 bytes, took 17.227µs
2016/07/08 16:00:08 DEBUG: read 16384 bytes, took 3.108551ms
2016/07/08 16:00:08 DEBUG: read 1024 bytes, took 24.52µs
2016/07/08 16:00:08 DEBUG: read 5853 bytes, took 861.306µs
Finished GetRecords body read, took 14.634752912s
Finished GetRecords Unmarshal, took 14.634783236s
Finished GetRecords request, 14 records from shard 0, took 14.634786265s

It's a lot of extra information in the log, but it might help identify where things are getting hung up.
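For readers who can't follow the gist link: the DebugReadCloser boils down to a timing wrapper around the response body, installed with a request handler. A minimal sketch (not the exact gist code; it assumes the v1 SDK's request.Handlers API):

import (
    "io"
    "log"
    "time"

    "github.com/aws/aws-sdk-go/aws/request"
    "github.com/aws/aws-sdk-go/service/kinesis"
)

// DebugReadCloser wraps an io.ReadCloser and logs the size and duration of every Read call.
type DebugReadCloser struct {
    io.ReadCloser
}

func (d DebugReadCloser) Read(p []byte) (int, error) {
    start := time.Now()
    n, err := d.ReadCloser.Read(p)
    log.Printf("DEBUG: read %d bytes, took %s", n, time.Since(start))
    return n, err
}

// addBodyTiming wraps every response body before the SDK unmarshals it.
func addBodyTiming(svc *kinesis.Kinesis) {
    svc.Handlers.Unmarshal.PushFront(func(r *request.Request) {
        if r.HTTPResponse != nil && r.HTTPResponse.Body != nil {
            r.HTTPResponse.Body = DebugReadCloser{ReadCloser: r.HTTPResponse.Body}
        }
    })
}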

@jasdel
Contributor

jasdel commented Jul 8, 2016

I used the following to find long running requests:

AWS_REGION=<region> go run kinesisStream.go <stream> > /tmp/kinesisStream.out
grep "records from shard" /tmp/kinesisStream.out | awk '{print $10}' | go run durSort.go

durSort.go

14.634786265s
13.839473696s
13.650340904s
12.790674943s
12.221052297s
12.187944525s
10.668927175s
10.666806916s
9.716013485s
9.563715184s
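durSort.go itself is only linked above; a small stand-in with the same behaviour (read one Go duration per line on stdin, print them sorted longest first) could look like this:

package main

import (
    "bufio"
    "fmt"
    "os"
    "sort"
    "time"
)

func main() {
    var durs []time.Duration
    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        // Ignore lines that aren't valid Go durations.
        if d, err := time.ParseDuration(sc.Text()); err == nil {
            durs = append(durs, d)
        }
    }
    sort.Slice(durs, func(i, j int) bool { return durs[i] > durs[j] })
    for _, d := range durs {
        fmt.Println(d)
    }
}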

@fantyz
Author

fantyz commented Jul 12, 2016

I've made slight modifications to get your version to run in our system. You can see the source here.

Modifications done:

  • Hardcode stream name
  • Added .WithLogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors | aws.LogDebugWithHTTPBody)
  • Added if time.Since(start) > 30*time.Second { fmt.Printf("GetRecords took more than 30 seconds!") } right after GetRecords to easily grep it in the papertrail logs
  • Removed putRecords to allow it to run against prod where we see the issue happening

To be continued once it hopefully catches the issue.

@fantyz
Author

fantyz commented Jul 12, 2016

Caught one!

Jul 12 15:03:25 pipeline.log:  Starting GetRecords build/sign request, took 13.904µs
Jul 12 15:03:25 pipeline.log:  2016/07/12 13:03:25 DEBUG: Request kinesis/GetRecords Details:
Jul 12 15:03:25 pipeline.log:  ---[ REQUEST POST-SIGN ]-----------------------------
Jul 12 15:03:25 pipeline.log:  POST / HTTP/1.1 
Jul 12 15:03:25 pipeline.log:  Host: kinesis.us-east-1.amazonaws.com 
Jul 12 15:03:25 pipeline.log:  User-Agent: aws-sdk-go/1.0.11 (go1.6; linux; amd64) 
Jul 12 15:03:25 pipeline.log:  Content-Length: 282 
Jul 12 15:03:25 pipeline.log:  Authorization: AWS4-HMAC-SHA256 Credential=ASIAJACNTZW3NQXOLAXQ/20160712/us-east-1/kinesis/aws4_request, SignedHeaders=host;x-amz-date;x-amz-security-token;x-amz-target, Signature=eb842fd894a667916bb4653012e9472f3d0a3834ceaac4ef0aae936279b0639e 
Jul 12 15:03:25 pipeline.log:  Content-Type: application/x-amz-json-1.1 
Jul 12 15:03:25 pipeline.log:  X-Amz-Content-Sha256: d45ee2996ba45b8b09f719264c94b418cf5c738efa6d28ea363ff73b749255dd 
Jul 12 15:03:25 pipeline.log:  X-Amz-Date: 20160712T130325Z 
Jul 12 15:03:25 pipeline.log:  X-Amz-Security-Token: FQoDYXdzEEYaDGG2zMo1+mUcP+sZiSKcA3macZ233hc6PYH54Ln5hBZszt09jwM7LoMzVZx2X60LIwf66ZcC9nFwi0/s0C2nN8JsxcOgSZ/Yb8w4wdqBEZG6zuokkfkHUfliyrfaDZEL/aPNwMe16HPDaN5KwWDJP10VbYKwNDRA+rv8EuJoyf2/eoaiJU5udvVksDiqTxUmnIa/2TXTOix883maCYaWlDMoW1PRyFGHDxQMiepM7j4uoaXsoK7d4KdehMIb78YKOL/rjYMDT7jgrB8jNUWzGXywMVzX6c1YIiOZGLhTfD5VahY18oCw72NfS5UHgnXLUiJ2vbvF9rqeCisHHhoX/yLujNWSLKbnMSGwvLu6CLRctn63SRItXVC7iYWl1OGUJ/RaXqum8m9dILAvJB/jN1niYPpBlDVHR6mOdF65F62G1oKKwA+2CmcJPE49NYjbn7BKW/U9ut5Wo1LdlgNjF1o49HU/SHRe2uaRd8+FtqTjuZ+wSReCYvxIlDPeKrmEXHxQYZITgaksTg7K8xxT/zrc6kxURr/DhUnxc2A5DFU0uPiHLGbSf5tWPcMowsWTvAU= 
Jul 12 15:03:25 pipeline.log:  X-Amz-Target: Kinesis_20131202.GetRecords 
Jul 12 15:03:25 pipeline.log:  Accept-Encoding: gzip 
Jul 12 15:03:25 pipeline.log: {"Limit":10000,"ShardIterator":"AAAAAAAAAAFD6ICsKC0nix2BvU9x9UTtQ/1sPPXhNCsJy9D3xmvzU9l/qw2WTs5QNTwNgQqZ84x+HSHSmNq0F0wVCXyGz9KGp6BSBsD+kGaznRa1FRuVFkmJKXA7QHQ4618ksAgkIzCplHWqH09qNSmtIVErLoX9jVROlugEMqMqx3ZYfN6uDGbORaAhtdHR/KAU54+PxVBPtV9wPKWxxTnTODfdDuNuv7g1EOAJPdfcB5N+TzZ4AA=="}
Jul 12 15:03:25 pipeline.log:  -----------------------------------------------------
Jul 12 15:03:25 pipeline.log:  2016/07/12 13:03:25 DEBUG: Response kinesis/GetRecords Details:
Jul 12 15:03:25 pipeline.log:  ---[ RESPONSE ]--------------------------------------
Jul 12 15:03:25 pipeline.log:  HTTP/1.1 200 OK 
Jul 12 15:03:25 pipeline.log:  Content-Length: 417677 
Jul 12 15:03:25 pipeline.log:  Content-Type: application/x-amz-json-1.1 
Jul 12 15:03:25 pipeline.log:  Date: Tue, 12 Jul 2016 13:03:24 GMT 
Jul 12 15:03:25 pipeline.log:  Server: Apache-Coyote/1.1 
Jul 12 15:03:26 pipeline.log:  X-Amz-Id-2: Xp32ChP4khmreGN5rivkJljIOd7MNUVFe06m6hXywQ81NVM3f5QNesBrQ6ZigkrJXGguKMBbQEPOHGJIVNR4AMNuC7+52HVS10tEPVbv1TE= 
Jul 12 15:03:26 pipeline.log:  X-Amzn-Requestid: efca0e47-2989-1995-a034-e29a0e72b502 
Jul 12 15:03:26 pipeline.log:  -----------------------------------------------------
Jul 12 15:03:26 pipeline.log:  Finished GetRecords Send, took 60.192006ms
Jul 12 15:03:26 pipeline.log:  Started GetRecords Unmarshal, took 60.240635ms
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 512 bytes, took 25.831982ms
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 1024 bytes, took 11.141µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 2048 bytes, took 654ns
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 512 bytes, took 556ns
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 3584 bytes, took 22.321µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 512 bytes, took 714ns
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 7680 bytes, took 2.036µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 512 bytes, took 1.148µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 697 bytes, took 29.214µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 15175 bytes, took 8.580319ms
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 1209 bytes, took 2.009µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 1024 bytes, took 12.625µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 16384 bytes, took 8.423524ms
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:25 DEBUG: read 1024 bytes, took 42.287µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:26 DEBUG: read 13127 bytes, took 8.190222ms
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:26 DEBUG: read 3257 bytes, took 2.653µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:26 DEBUG: read 1024 bytes, took 14.178µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:26 DEBUG: read 16384 bytes, took 165.79µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:26 DEBUG: read 1024 bytes, took 16.645µs
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:26 DEBUG: read 16384 bytes, took 8.220345ms
Jul 12 15:03:26 pipeline.log:  2016/07/12 13:03:26 DEBUG: read 1024 bytes, took 13.322µs
Jul 12 15:03:27 pipeline.log:  2016/07/12 13:03:27 DEBUG: read 16384 bytes, took 1.473254589s
Jul 12 15:03:27 pipeline.log:  2016/07/12 13:03:27 DEBUG: read 1024 bytes, took 16.335µs
Jul 12 15:03:52 pipeline.log:  2016/07/12 13:03:52 DEBUG: read 9031 bytes, took 25.250034682s
Jul 12 15:03:52 pipeline.log:  2016/07/12 13:03:52 DEBUG: read 7353 bytes, took 4.84µs
Jul 12 15:04:19 pipeline.log:  2016/07/12 13:04:19 DEBUG: read 1024 bytes, took 26.959723302s
Jul 12 15:15:02 pipeline.log:  2016/07/12 13:15:02 DEBUG: read 16384 bytes, took 10m43.033927967s
Jul 12 15:15:02 pipeline.log:  2016/07/12 13:15:02 DEBUG: read 1024 bytes, took 17.575µs
Jul 12 15:27:05 pipeline.log:  2016/07/12 13:27:04 DEBUG: read 1024 bytes, took 17.762µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 2m0.319726961s
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 18.595µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 124.755µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 11.984µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 573.078µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 12.093µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 3.029766ms
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 16.442µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 1.670321ms
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 18.579µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 839 bytes, took 1.671051ms
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 3257 bytes, took 1.754µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 12288 bytes, took 2.942µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 35.459µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 929.825µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 23.253µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 1.488066ms
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 21.32µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 654.209µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 19.064µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 1.310997ms
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 43.762µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 648.015µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 20.848µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 644.039µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 31.212µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 500.04µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 45.292µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 16384 bytes, took 360.311µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 1024 bytes, took 18.789µs
Jul 12 15:29:05 pipeline.log:  2016/07/12 13:29:04 DEBUG: read 212 bytes, took 99.948µs
Jul 12 15:29:05 pipeline.log:  Finished GetRecords body read, took 25m39.11891238s
Jul 12 15:29:05 pipeline.log:  Finished GetRecords Unmarshal, took 25m39.119042824s
Jul 12 15:29:05 pipeline.log:  Finished GetRecords request, 67 records from shard 1, took 25m39.11912334s
Jul 12 15:29:05 pipeline.log:  GetRecords took more than 30 seconds!

@jasdel
Contributor

jasdel commented Jul 12, 2016

Thanks for the update @fantyz. The 16k read taking over 10 minutes, and the other multi-minute delays reading from the network, are surprising. I've included the request ID in my reach-out to the Kinesis team. This should help them investigate this issue further on their end.

@fantyz
Author

fantyz commented Jul 13, 2016

Thanks, let me know what they come up with!

In the meantime I'm implementing client-side retries on GetRecords failures and adding a timeout of a minute by setting my own http.Client.
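A minimal sketch of that workaround with the v1 SDK (the one-minute timeout and retry count are illustrative; WithHTTPClient and WithMaxRetries are the standard aws.Config helpers):

import (
    "net/http"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/kinesis"
)

// Cap each HTTP round trip at one minute and allow the SDK a few retries;
// a GetRecords call that hangs now fails fast and can be retried by the caller.
var kc = kinesis.New(session.New(aws.NewConfig().
    WithRegion("us-east-1").
    WithHTTPClient(&http.Client{Timeout: 1 * time.Minute}).
    WithMaxRetries(5)))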

@jasdel jasdel added service-api This issue is due to a problem in a service API, not the SDK implementation. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. bug This issue is a bug. labels Jul 13, 2016
@jasdel
Contributor

jasdel commented Jul 13, 2016

Good idea, I think this is the best workaround for the issue. I will get back to you when Kinesis has any additional information. I don't think there is very much we can do from the SDK side beyond the work you're doing to add retries.

@jasdel
Contributor

jasdel commented Aug 1, 2016

Hi @fantyz, I heard back from the service team. For the request ID you provided, the service finished processing and responding to the request quickly. This suggests the latency is being introduced client side or on the route between the server and the client.

In addition, Kinesis agreed that using timeouts and retry logic is appropriate for the rare cases where networking issues cause latency.

@jasdel jasdel added closing-soon This issue will automatically close in 4 days unless further comments are made. and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Aug 1, 2016
@jasdel jasdel closed this as completed Aug 1, 2016
@jasdel
Contributor

jasdel commented Apr 4, 2017

Hi, we've merged in PR #1166, which adds retrying of connection reset errors discovered while unmarshaling the response body, and read timeouts for Kinesis API Get operations. The read timeouts should prevent your application from hanging for significant amounts of time.
