Add Instructions for Contributing to the project #85

Open
@tlelson

I am trying to get to the bottom of a problem (#413) that causes my deployed TensorFlow model to fail.

The model is simple and deploys to GCP MLE with the basic instructions. The serving function, which errors out on SageMaker, works fine on MLE.

The problem seems to be in the way the sagemaker container processes the input.


I have therefore started debugging locally, but I am guessing at how to do that properly, and I am unsure how the local SageMaker container assumes the role passed to the TensorFlow constructor.

Currently, I am building the latest sagemaker-tensorflow-container image (v1.10.0) and calling it from a local notebook instance using the MNIST example provided by amazon-sagemaker-examples:

import os

from sagemaker.tensorflow import TensorFlow

# `role` is assumed to be defined earlier in the notebook.
mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             framework_version='1.10.0',
                             training_steps=10,
                             evaluation_steps=10,
                             train_instance_count=2,
                             train_instance_type='local',  # run in local Docker containers
                             image_name='my-sm-tensorflow:1.10.0-cpu-py2',  # locally built image
                             )

# mnist_estimator.fit(inputs)
# try local inputs
local_inputs = 'file://{}/data/'.format(os.getcwd())
mnist_estimator.fit(local_inputs)

However, the local container fails because it cannot get an object from S3:

INFO:sagemaker:Creating training-job with name: my-sm-tensorflow-2018-10-08-05-34-16-185
Creating tmp6pytpo_algo-2-GGF0S_1 ...
Creating tmp6pytpo_algo-1-GGF0S_1 ...
Attaching to tmp6pytpo_algo-1-GGF0S_1, tmp6pytpo_algo-2-GGF0S_1
algo-1-GGF0S_1  | 2018-10-08 05:34:25,817 INFO - root - running container entrypoint
algo-1-GGF0S_1  | 2018-10-08 05:34:25,818 INFO - root - starting train task
algo-1-GGF0S_1  | 2018-10-08 05:34:25,841 INFO - container_support.training - Training starting
algo-2-GGF0S_1  | 2018-10-08 05:34:26,845 INFO - root - running container entrypoint
algo-2-GGF0S_1  | 2018-10-08 05:34:26,846 INFO - root - starting train task
algo-2-GGF0S_1  | 2018-10-08 05:34:26,873 INFO - container_support.training - Training starting
algo-1-GGF0S_1  | 2018-10-08 05:34:26,974 INFO - botocore.credentials - Found credentials in shared credentials file: ~/.aws/credentials
algo-1-GGF0S_1  | Downloading s3://sagemaker-ap-southeast-2-167464700695/my-sm-tensorflow-2018-10-08-05-34-16-185/source/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-GGF0S_1  | 2018-10-08 05:34:27,433 ERROR - container_support.training - uncaught exception during training: An error occurred (403) when calling the HeadObject operation: Forbidden
algo-1-GGF0S_1  | Traceback (most recent call last):
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 36, in start
algo-1-GGF0S_1  |     fw.train()
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/tf_container/train_entry_point.py", line 140, in train
algo-1-GGF0S_1  |     env.download_user_module()
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/environment.py", line 89, in download_user_module
algo-1-GGF0S_1  |     cs.download_s3_resource(self.user_script_archive, tmp)
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/utils.py", line 41, in download_s3_resource
algo-1-GGF0S_1  |     script_bucket.download_file(script_key_name, target)
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 246, in bucket_download_file
algo-1-GGF0S_1  |     ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 172, in download_file
algo-1-GGF0S_1  |     extra_args=ExtraArgs, callback=Callback)
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 307, in download_file
algo-1-GGF0S_1  |     future.result()
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 73, in result
algo-1-GGF0S_1  |     return self._coordinator.result()
algo-1-GGF0S_1  |   File "/usr/local/lib/python2.7/dist-packages/s3transfer/futures.py", line 233, in result
algo-1-GGF0S_1  |     raise self._exception
algo-1-GGF0S_1  | ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
algo-1-GGF0S_1  |
algo-1-GGF0S_1  |
tmp6pytpo_algo-1-GGF0S_1 exited with code 1
Stopping tmp6pytpo_algo-2-GGF0S_1 ...
Aborting on container exit... ... done
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-9d694d5f5d5b> in <module>()
      4 # try local inputs
      5 local_inputs = 'file://{}/data/'.format(os.getcwd())
----> 6 mnist_estimator.fit(local_inputs)

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/tensorflow/estimator.pyc in fit(self, inputs, wait, logs, job_name, run_tensorboard_locally)
    248                 tensorboard.join()
    249         else:
--> 250             fit_super()
    251
    252     @classmethod

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/tensorflow/estimator.pyc in fit_super()
    230         """
    231         def fit_super():
--> 232             super(TensorFlow, self).fit(inputs, wait, logs, job_name)
    233
    234         if run_tensorboard_locally and wait is False:

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
    190         self._prepare_for_training(job_name=job_name)
    191
--> 192         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    193         if wait:
    194             self.latest_training_job.wait(logs=logs)

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/estimator.pyc in start_new(cls, estimator, inputs)
    432                                           resource_config=config['resource_config'], vpc_config=config['vpc_config'],
    433                                           hyperparameters=hyperparameters, stop_condition=config['stop_condition'],
--> 434                                           tags=estimator.tags)
    435
    436         return cls(estimator.sagemaker_session, estimator._current_job_name)

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/session.pyc in train(self, image, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags)
    277         LOGGER.info('Creating training-job with name: {}'.format(job_name))
    278         LOGGER.debug('train request: {}'.format(json.dumps(train_request, indent=4)))
--> 279         self.sagemaker_client.create_training_job(**train_request)
    280
    281     def tune(self, job_name, strategy, objective_type, objective_metric_name,

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/local_session.pyc in create_training_job(self, TrainingJobName, AlgorithmSpecification, InputDataConfig, OutputDataConfig, ResourceConfig, **kwargs)
     73         training_job = _LocalTrainingJob(container)
     74         hyperparameters = kwargs['HyperParameters'] if 'HyperParameters' in kwargs else {}
---> 75         training_job.start(InputDataConfig, hyperparameters)
     76
     77         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/entities.pyc in start(self, input_data_config, hyperparameters)
     58         self.state = self._TRAINING
     59
---> 60         self.model_artifacts = self.container.train(input_data_config, hyperparameters)
     61         self.end = datetime.datetime.now()
     62         self.state = self._COMPLETED

/Users/jimbo/Code/sagemaker-python-sdk/src/sagemaker/local/image.pyc in train(self, input_data_config, hyperparameters)
    124             # which contains the exit code and append the command line to it.
    125             msg = "Failed to run: %s, %s" % (compose_command, str(e))
--> 126             raise RuntimeError(msg)
    127
    128         s3_artifacts = self.retrieve_artifacts(compose_data)

RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/1x/gyr4jt_s3jqc2c88vy74btnm0000gn/T/tmp6PyTpo/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
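One way to see what the local container is given might be to look at the docker-compose.yaml named in the RuntimeError above. The sketch below only lists the names of any AWS_* variables mentioned in such a file and drops their values. It assumes the temporary file still exists after the run and that credentials, if passed at all, appear as AWS_* environment entries; neither is confirmed by the output above. The helper name `aws_env_names` is made up for illustration.

```python
import re


def aws_env_names(compose_text):
    """Return the sorted, de-duplicated AWS_* variable names found in the text.

    Only names are returned, so secret values are never printed.
    """
    return sorted(set(re.findall(r"\bAWS_[A-Z0-9_]+", compose_text)))


# Hypothetical usage: substitute the path printed in the RuntimeError.
# with open("/path/to/docker-compose.yaml") as handle:
#     print(aws_env_names(handle.read()))
```

If no credential variables show up, the container is presumably resolving credentials some other way, for example from the shared credentials file mentioned in the botocore.credentials log line.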

I have verified that the role can copy the object, which leads me to suspect that the container does not assume the role properly.

How is the container meant to assume the role?
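To narrow this down, it may help to compare what the default local credentials can read with what the role can read. The following is a minimal Python 3 sketch using boto3, not a description of what the container does internally. The function names, the session name "local-debug", and the idea that the default credentials are allowed to call `sts:AssumeRole` on the role are all assumptions for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def can_head_object(s3_client, uri):
    """Return True if the client may HEAD the object, False on a 403 or 404."""
    from botocore.exceptions import ClientError  # imported lazily; needs boto3 installed

    bucket, key = split_s3_uri(uri)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("403", "404"):
            return False
        raise


def compare_identities(role_arn, uri):
    """Print whether the default credentials and the assumed role can read uri."""
    import boto3  # imported lazily so the helpers above work without it

    default_session = boto3.Session()
    sts = default_session.client("sts")
    print("default identity:", sts.get_caller_identity()["Arn"])
    print("default can read:", can_head_object(default_session.client("s3"), uri))

    # Assumes the default credentials are permitted to assume role_arn.
    creds = sts.assume_role(RoleArn=role_arn,
                            RoleSessionName="local-debug")["Credentials"]
    role_session = boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print("role can read:", can_head_object(role_session.client("s3"), uri))


# Hypothetical usage, with the role from the notebook and the URI from the log:
# compare_identities(role, "s3://<bucket>/<job-name>/source/sourcedir.tar.gz")
```

If the default identity gets a 403 while the assumed role succeeds, that would be consistent with the container using the shared credentials file rather than the role.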


The instructions to build the container image locally are clear, thank you for that. I would like to see something in README.md or CONTRIBUTING.md that shows the recommended process for developing the container and calling the built image locally.
