
Tensorflowmodel points to images that do not exist #912

Closed
NoahDolev opened this issue Jul 7, 2019 · 16 comments

Comments

@NoahDolev commented Jul 7, 2019

Please fill out the form below.

System Information

  • Tensorflow:
  • Fails for all versions:
  • Fails for py3 and py2:
  • Fails for CPU and GPU:
  • No custom image:

Describe the problem

If I try to deploy a pre-built model like so:

sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model0100.tar.gz',
                                  role=role,
                                  framework_version='1.13', py_version='py3',
                                  entry_point='train.py')

It will fail upon deploying:

predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.p2.xlarge')

I receive:

ValueError: Error hosting endpoint sagemaker-tensorflow-2019-07-07-11-50-45-473: Failed Reason:  The image '520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.13-gpu-py3' does not exist.

I can get past this error by specifying the image (which is not well documented; it took a lot of digging to find a link that worked):

sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model0100.tar.gz',
                                  role=role,
                                  framework_version='1.13', py_version='py3',
                                  entry_point='train.py',
                                  image='763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu')

Any idea how to solve this?
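For reference, the manually specified URI above follows a region/account/tag pattern. A minimal sketch of how it is composed; the helper name is hypothetical, and the account ID and tag format are copied from the URIs quoted in this thread, not from an authoritative registry listing:

```python
# Hypothetical helper illustrating the pattern of the working image URI above.
# The account ID and tag format are taken from this thread and may differ for
# other regions or framework versions.

def tf_inference_image_uri(region, version, device):
    # ECR URI pattern for the newer tensorflow-inference images
    account = "763104351884"  # account seen in this thread (eu-west-1, us-east-1)
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"tensorflow-inference:{version}-{device}"
    )

print(tf_inference_image_uri("eu-west-1", "1.13", "gpu"))
# 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu
```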

@chuyang-deng (Contributor)

Hi @NoahDolev, thank you for using SageMaker! From the code you provided, it seems you want to train your model with train.py?

In order to use TensorFlow script mode to train your model (and then deploy it), you want to start with the TensorFlow Estimator class: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L188

You can set either script_mode=True or py_version="py3" to enable script mode.

@NoahDolev (Author)

Hi @ChuyangDeng ,

I am not sure that has anything to do with the issue I posted. I am reporting that the Docker image which SageMaker searches for by default is not correct for eu-west-1. Also, script_mode is not a valid flag of TensorFlowModel; to the best of my knowledge, that flag exists only in the TensorFlow estimator class.

Best,
Noah

@chuyang-deng (Contributor)

Hi @NoahDolev,

Are you trying to do training or hosting here? Our TensorFlow script mode is only supported for training, and the TensorFlowModel class is for hosting; that's why the Docker image URI is not correct (cannot be found).

If you are training your model, you should use TensorFlow estimator class so that you can train with our script mode image.

If you are deploying your trained model, you will use TensorFlowModel class, but no script mode is supported with deploying.

@yuchuang1979 commented Jul 10, 2019

@NoahDolev @ChuyangDeng I hit the same error when following this link:
https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/
to deploy a pre-trained model in SageMaker with a different model. Since my model uses py3, I have to specify py_version='py3' like this:

sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  py_version='py3',
                                  framework_version='1.12',
                                  entry_point='train.py')

predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.p2.xlarge')

ValueError: Error hosting endpoint sagemaker-tensorflow-2019-07-10-05-06-02-075: Failed Reason: The image '520713654638.dkr.ecr.us-east-2.amazonaws.com/sagemaker-tensorflow:1.12-gpu-py3' does not exist.

When I delete py_version='py3', there is no error anymore.

@NoahDolev (Author)

Hi @yuchuang1979 ,

Precisely what I am referring to. I am trying to deploy a model I trained elsewhere. You can also specify the image to solve the problem. My point, however, is that the default is pointing to the wrong docker image. It's a bug.

Best,
Noah

@yuchuang1979

@NoahDolev thanks for pointing out that there is another route by specifying the image. I am totally new to SageMaker and just began the work several days ago.

How could you create the image before specifying it in the function?

@ChoiByungWook (Contributor)

Just some context.

There are two TensorFlow solutions that handle serving in the Python SDK.

They have different class representations and documentation, as shown here:

  1. TensorFlowModel - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/model.py#L47
    Doc: https://github.com/aws/sagemaker-python-sdk/tree/v1.12.0/src/sagemaker/tensorflow#deploying-directly-from-model-artifacts
    Key difference: Uses a proxy gRPC client to send requests
    Container impl: https://github.com/aws/sagemaker-tensorflow-container/blob/master/src/tf_container/serve.py

  2. Model - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/serving.py#L96
    Doc: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst
    Key difference: Utilizes the TensorFlow serving rest API
    Container impl: https://github.com/aws/sagemaker-tensorflow-serving-container/blob/master/container/sagemaker/serve.py

Python 3 isn't supported with the TensorFlowModel object: the container uses the TensorFlow Serving API library in conjunction with the gRPC client to handle inference, but the TensorFlow Serving API isn't officially supported in Python 3, so there are only Python 2 versions of the containers for the TensorFlowModel object.

If you need Python 3, then you will need to use the Model object defined in item 2 above. The inference script format will change if you need to handle pre- and post-processing: https://github.com/aws/sagemaker-tensorflow-serving-container#prepost-processing.

Also, your inference requests will need to follow the TFS REST API:
https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint
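For a concrete sense of the request shape: the TFS REST predict API wraps the input tensors in an "instances" list. A minimal sketch, with made-up feature values:

```python
import json

# The TFS REST "predict" API expects a JSON body with an "instances" list,
# one entry per example to score. The feature values below are made up
# purely for illustration.
payload = {"instances": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]}
body = json.dumps(payload)
print(body)

# The endpoint's response comes back as a JSON object keyed by "predictions".
```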

Since you train externally, you're going to need to make sure your model artifacts follow the correct format. https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#deploying-more-than-one-model-to-your-endpoint
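As a rough sketch of what such an artifact can look like, the TFS-based container expects a SavedModel under a numbered version directory inside a model directory. The names "model1" and "00000001" below are placeholders; check the linked doc for the exact layout your SDK version expects:

```shell
# Build a placeholder model.tar.gz with the <model-name>/<version>/ layout.
# saved_model.pb and variables/ would come from your real TF export;
# here they are empty files purely to illustrate the archive structure.
mkdir -p model1/00000001/variables
touch model1/00000001/saved_model.pb
touch model1/00000001/variables/variables.index
tar -czf model.tar.gz model1        # archive rooted at the model directory
tar -tzf model.tar.gz               # list contents to verify the layout
```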

Here is an example that does, for the most part, what you're trying to do: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_serving_container/tensorflow_serving_container.ipynb

Sorry for the confusion and wall of text and links. Please let me know if there is anything I can clarify.

Thanks!

@yuchuang1979

@ChoiByungWook This is quite clear. Thanks!

@panfeng-hover commented Jul 23, 2019

@ChoiByungWook Thanks for the introduction! I am wondering when TF 1.14 will be supported for serving.

I tried the CPU, GPU, and Elastic Inference variants, but it seems none of the corresponding images are available:

The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.14-cpu' does not exist.

The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.14-gpu' does not exist.

I used your second one:

from sagemaker import get_execution_role
from sagemaker.tensorflow.serving import Model
role = get_execution_role()

sagemaker_model = Model(model_data = 's3://sagemaker-hover/Models/zulu/tpu/model.tar.gz',
                        role = role,
                        framework_version='1.14')
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.p2.xlarge',
                                   endpoint_name='test-001')

Also, the TensorFlowModel class seems to support only up to TF 1.12.

@tomislavmitic2012

We have to use the proxy server with circle to run this.

@keelerh commented Apr 13, 2020

Did the format for specifying images change after TensorFlow 2 support was added? Or are there just no pre-built images for TensorFlow frameworks 2.0 and 2.1? I get

UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-13-14-02-35-992: Failed. Reason:  The image '520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:2.1.0-cpu-py2' does not exist..
UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-13-14-02-35-992: Failed. Reason:  The image '520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:2.1.0-gpu-py2' does not exist..
UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-13-14-02-35-992: Failed. Reason:  The image '520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:2.1.0-cpu-py3' does not exist..
UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-13-14-02-35-992: Failed. Reason:  The image '520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:2.1.0-gpu-py3' does not exist..

When trying to specify

from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  framework_version='2.1.0',
                                  entry_point='train.py')

in the sample notebook available at https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/.

@ratulray commented May 6, 2020

@ChoiByungWook The container implementation code locations given above (for TensorFlowModel & Model) are outdated. Can you please point to the current implementations?

@laurenyu (Contributor) commented May 6, 2020

@keelerh @ratulray I believe the class you're looking for is sagemaker.tensorflow.serving.Model (the second one that @ChoiByungWook mentioned): https://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html#tensorflow-serving-model. That class should retrieve the correct image URI for the TF 2.x images.

If you have any further questions, please open a new issue (it'll help with our internal tracking).

@ratulray commented May 7, 2020

Thanks, Lauren, for your response. Actually, my question was not that; I opened a new issue: #1472

@abdelhamidnouh

[screenshots of an error attached]

What should I do?

@laurenyu (Contributor) commented Feb 3, 2021

@abdelhamidnouh you're commenting on an old, closed issue with an unrelated error message; can you open a new issue?
