
[Bug Report] Unable to utilize multiple instances in sagemaker batch transform request #3134

Closed
grantdelozier opened this issue Feb 4, 2022 · 5 comments


grantdelozier commented Feb 4, 2022

Describe the bug

Throughout the SageMaker batch transform documentation it is suggested that multiple instances can be used to fulfill inference requests, and the CreateTransformJob API accepts an InstanceCount parameter (under TransformResources).

However, whenever I create a transform job that requests more than one instance, only one instance is actually used to fulfill inferences. I can see in the logs that multiple instances are started, but only one of them serves requests.

It looks as though someone else noticed this previously, but that issue was closed without being resolved. In that thread @djarpin suggested that multiple instances will be used if multiple input files are provided. However, this doesn't seem to work either: if you point the TransformInput argument at a folder containing multiple files, both files are processed, but all invocations are still sent to a single instance.

To reproduce

Invoke a createTransformJob() request with

```js
TransformResources: {
  InstanceCount: instanceCount,
  InstanceType: model.defaultInstanceType
},
```

where instanceCount > 1. In CloudWatch, observe that all invocations are sent to a single instance while all other instances sit idle.
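If it helps anyone reproduce this, here is a rough sketch of how I check which instances received traffic. It assumes batch transform writes one log stream per instance under the /aws/sagemaker/TransformJobs log group; the function name is just a placeholder:

```js
// Sketch: list the per-instance log streams for a transform job.
// An idle instance shows up as a stream with few or no recent events.
const AWS = require('aws-sdk');
const cloudwatchlogs = new AWS.CloudWatchLogs();

async function inspectTransformLogs(jobName) {
  const res = await cloudwatchlogs.describeLogStreams({
    logGroupName: '/aws/sagemaker/TransformJobs', // log group used by batch transform
    logStreamNamePrefix: jobName
  }).promise();
  for (const stream of res.logStreams) {
    console.log(stream.logStreamName, '-> last event:', stream.lastEventTimestamp);
  }
}
```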

Here is the full list of parameters I include in my createTransformJob() request:

```js
const params = {
  ModelName: model.sagemakerEndpoint,
  TransformInput: {
    ContentType: 'application/json',
    DataSource: {
      S3DataSource: {
        S3DataType: 'S3Prefix',
        S3Uri: 's3://' + inferenceJob.s3Bucket + '/' + inferenceJob.s3InferenceArgsPath,
      }
    },
    SplitType: 'Line'
  },
  TransformJobName: inferenceJob.sagemakerJobName,
  TransformOutput: {
    S3OutputPath: 's3://' + inferenceJob.s3Bucket + '/' + inferenceJob.s3InferenceOutputPath,
    Accept: 'application/json',
    AssembleWith: 'Line',
  },
  TransformResources: {
    InstanceCount: 2,
    InstanceType: model.defaultInstanceType
  },
  ModelClientConfig: { InvocationsMaxRetries: 0 },
  BatchStrategy: 'SingleRecord',
  MaxConcurrentTransforms: 1,
  Tags: [
    {
      Key: 'ModelName',
      Value: model.name
    },
  ]
};
```

usbhub commented Jun 17, 2022

I ran into this same problem, and I'm really surprised the behavior assigns a whole file to a host. This can also cause subtle performance issues: if some files are much larger than others, it won't be immediately obvious that there's a problem, because the other hosts will still be doing some work. When I split the input up into one file per host, it did work as expected for me, though as was said in the previous thread, the sharding should happen at a record/batch level, not a file level.
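Roughly what the split looked like for me; this is only a sketch (the bucket, prefix, and file names are placeholders), assuming a JSONL input and the v2 JavaScript SDK:

```js
// Sketch: shard one JSONL input into N files (one per instance) and upload
// them under the S3 prefix that the transform job's S3Uri points at.
const fs = require('fs');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function splitAndUpload(localPath, bucket, prefix, numInstances) {
  const lines = fs.readFileSync(localPath, 'utf8').split('\n').filter(Boolean);
  // Round-robin the records so the shards end up roughly the same size.
  const shards = Array.from({ length: numInstances }, () => []);
  lines.forEach((line, i) => shards[i % numInstances].push(line));
  await Promise.all(shards.map((shard, i) =>
    s3.putObject({
      Bucket: bucket,
      Key: `${prefix}/part-${i}.jsonl`,
      Body: shard.join('\n')
    }).promise()
  ));
}
```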

@jholmes-godaddy

@grantdelozier, could you clarify why this was closed? I am running into the same behavior: only one instance being used, even when the number of input files greatly exceeds the number of instances.

@dwhite54

@jholmes-godaddy it looks like this is a feature, not a bug.

The solution is to split your input file into multiple pieces, though it seems that @grantdelozier also had trouble with that route.


grantdelozier commented Nov 30, 2022

The short answer to why I closed this issue is that it stopped happening to me. I deleted and re-created my SageMaker model artifact, rebuilt my inference container on ECR, and double- and triple-checked that my batch inference invocation parameters were correct, confirming through the SageMaker batch inference management UI that I had supplied the parameters and arguments correctly.

After doing this, everything started working as expected when I specified an InstanceCount > 1.

So I guess I had simply misconfigured something. I would encourage others struggling with this issue to go through the whole process of creating the SageMaker model, the ECR image, and the batch transform job again, verifying at each step that everything has been set up correctly.
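For anyone who wants the shape of the flow I'm describing, here is a minimal sketch (not my exact code; modelName, imageUri, roleArn, the S3 URIs, and the instance type are placeholders), using the v2 JavaScript SDK:

```js
// Sketch: re-create the model from the rebuilt ECR image and model artifact,
// then start a transform job with InstanceCount > 1.
const AWS = require('aws-sdk');
const sagemaker = new AWS.SageMaker();

async function recreateAndRun({ modelName, imageUri, modelDataUrl, roleArn,
                                inputS3Uri, outputS3Uri, jobName }) {
  // Re-create the model so it points at the freshly rebuilt container image.
  await sagemaker.createModel({
    ModelName: modelName,
    PrimaryContainer: { Image: imageUri, ModelDataUrl: modelDataUrl },
    ExecutionRoleArn: roleArn
  }).promise();

  // Start the batch transform job. SplitType: 'Line' lets SageMaker split
  // the input into records that can be spread across instances.
  await sagemaker.createTransformJob({
    TransformJobName: jobName,
    ModelName: modelName,
    TransformInput: {
      ContentType: 'application/json',
      DataSource: { S3DataSource: { S3DataType: 'S3Prefix', S3Uri: inputS3Uri } },
      SplitType: 'Line'
    },
    TransformOutput: { S3OutputPath: outputS3Uri, Accept: 'application/json', AssembleWith: 'Line' },
    TransformResources: { InstanceCount: 2, InstanceType: 'ml.m5.xlarge' },
    BatchStrategy: 'SingleRecord'
  }).promise();
}
```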

@Sandy4321

Can you share the workflow for how you did it?
