
[Bug Report] Unable to utilize multiple instances in sagemaker batch transform request #3134

Closed
grantdelozier opened this issue Feb 4, 2022 · 5 comments


grantdelozier commented Feb 4, 2022

Describe the bug

Throughout the SageMaker batch transform documentation it is suggested that multiple instances can be used to fulfill inference requests, and the CreateTransformJob API accepts an InstanceCount parameter (under TransformResources).

However, whenever I create a transform job that requests more than one instance, only one instance is actually used to fulfill inferences. I can see in the logs that multiple instances are started, but only one of them serves requests.

It looks as though someone else noticed this previously, but that issue was closed without being resolved. In that thread @djarpin suggested that multiple instances will be used if multiple input files are provided. However, this doesn't seem to work either: if you point the TransformInput argument at a folder containing multiple files, both files are processed, but all invocations are still sent to a single instance.

To reproduce

Invoke a createTransformJob() request with

```js
TransformResources: {
  InstanceCount: instanceCount,
  InstanceType: model.defaultInstanceType
},
```

where instanceCount > 1. In CloudWatch, observe that all invocations are sent to a single instance while all other instances sit idle.
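If it helps anyone reproduce this, here is a rough sketch of how I check which instances received traffic. It assumes batch transform writes one log stream per instance under the /aws/sagemaker/TransformJobs log group; the function name is just a placeholder:

```js
// Sketch: list the per-instance log streams for a transform job.
// An idle instance shows up as a stream with few or no recent events.
const AWS = require('aws-sdk');
const cloudwatchlogs = new AWS.CloudWatchLogs();

async function inspectTransformLogs(jobName) {
  const res = await cloudwatchlogs.describeLogStreams({
    logGroupName: '/aws/sagemaker/TransformJobs', // log group used by batch transform
    logStreamNamePrefix: jobName
  }).promise();
  for (const stream of res.logStreams) {
    console.log(stream.logStreamName, '-> last event:', stream.lastEventTimestamp);
  }
}
```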

Here is the full list of parameters I include in my createTransformJob() request:

```js
const params = {
  ModelName: model.sagemakerEndpoint,
  TransformInput: {
    ContentType: 'application/json',
    DataSource: {
      S3DataSource: {
        S3DataType: 'S3Prefix',
        S3Uri: 's3://' + inferenceJob.s3Bucket + '/' + inferenceJob.s3InferenceArgsPath,
      }
    },
    SplitType: 'Line'
  },
  TransformJobName: inferenceJob.sagemakerJobName,
  TransformOutput: {
    S3OutputPath: 's3://' + inferenceJob.s3Bucket + '/' + inferenceJob.s3InferenceOutputPath,
    Accept: 'application/json',
    AssembleWith: 'Line',
  },
  TransformResources: {
    InstanceCount: 2,
    InstanceType: model.defaultInstanceType
  },
  ModelClientConfig: { InvocationsMaxRetries: 0 },
  BatchStrategy: 'SingleRecord',
  MaxConcurrentTransforms: 1,
  Tags: [
    {
      Key: 'ModelName',
      Value: model.name
    },
  ]
};
```

usbhub commented Jun 17, 2022

I ran into this same problem, and I'm really surprised the behavior assigns a whole file to a host. This can also cause subtle performance issues: if some files are much larger than others, it won't be immediately obvious that there's a problem, because the other hosts will still be doing some work. When I split the input up into one file per host, it did work as expected for me, though as was said in the previous thread, the sharding should happen at a record/batch level, not a file level.
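Roughly what the split looked like for me; this is only a sketch (the bucket, prefix, and file names are placeholders), assuming a JSONL input and the v2 JavaScript SDK:

```js
// Sketch: shard one JSONL input into N files (one per instance) and upload
// them under the S3 prefix that the transform job's S3Uri points at.
const fs = require('fs');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function splitAndUpload(localPath, bucket, prefix, numInstances) {
  const lines = fs.readFileSync(localPath, 'utf8').split('\n').filter(Boolean);
  // Round-robin the records so the shards end up roughly the same size.
  const shards = Array.from({ length: numInstances }, () => []);
  lines.forEach((line, i) => shards[i % numInstances].push(line));
  await Promise.all(shards.map((shard, i) =>
    s3.putObject({
      Bucket: bucket,
      Key: `${prefix}/part-${i}.jsonl`,
      Body: shard.join('\n')
    }).promise()
  ));
}
```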

@jholmes-godaddy

@grantdelozier, could you clarify why this was closed? I am running into the same behavior: only one instance being used, even when the number of input files greatly exceeds the number of instances.

@dwhite54

@jholmes-godaddy it looks like this is a feature, not a bug.

The solution is to split your input file into multiple pieces, though it seems that @grantdelozier also had trouble with that route.


grantdelozier commented Nov 30, 2022

The short answer to why I closed this issue is that it stopped happening to me. I deleted and re-created my SageMaker model artifact, rebuilt my inference container on ECR, and double- and triple-checked that my batch inference invocation parameters were correct, confirming through the SageMaker batch inference management UI that I had supplied the parameters and arguments correctly.

After doing this, everything started working as expected when I specified an InstanceCount > 1.

So I guess I had simply misconfigured something. I would encourage others struggling with this issue to go through the whole process of creating the SageMaker model, the ECR image, and the batch transform job again, verifying at each step that everything has been set up correctly.
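For anyone who wants the shape of the flow I'm describing, here is a minimal sketch (not my exact code; modelName, imageUri, roleArn, the S3 URIs, and the instance type are placeholders), using the v2 JavaScript SDK:

```js
// Sketch: re-create the model from the rebuilt ECR image and model artifact,
// then start a transform job with InstanceCount > 1.
const AWS = require('aws-sdk');
const sagemaker = new AWS.SageMaker();

async function recreateAndRun({ modelName, imageUri, modelDataUrl, roleArn,
                                inputS3Uri, outputS3Uri, jobName }) {
  // Re-create the model so it points at the freshly rebuilt container image.
  await sagemaker.createModel({
    ModelName: modelName,
    PrimaryContainer: { Image: imageUri, ModelDataUrl: modelDataUrl },
    ExecutionRoleArn: roleArn
  }).promise();

  // Start the batch transform job. SplitType: 'Line' lets SageMaker split
  // the input into records that can be spread across instances.
  await sagemaker.createTransformJob({
    TransformJobName: jobName,
    ModelName: modelName,
    TransformInput: {
      ContentType: 'application/json',
      DataSource: { S3DataSource: { S3DataType: 'S3Prefix', S3Uri: inputS3Uri } },
      SplitType: 'Line'
    },
    TransformOutput: { S3OutputPath: outputS3Uri, Accept: 'application/json', AssembleWith: 'Line' },
    TransformResources: { InstanceCount: 2, InstanceType: 'ml.m5.xlarge' },
    BatchStrategy: 'SingleRecord'
  }).promise();
}
```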

@Sandy4321

Can you share the workflow for how you did it?
