--model_dir is inconsistent, confusing, and unnecessary

Raising this here (where I believe the implementation lives) rather than on the SageMaker Python SDK, which as I understand it just documents the functionality.

**Every other** SageMaker framework container that I've seen (see docs for [MXNet](https://sagemaker.readthedocs.io/en/stable/using_mxnet.html#for-versions-1-3-and-higher), [PyTorch](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html#prepare-a-pytorch-training-script), [Scikit-Learn](https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#prepare-a-scikit-learn-training-script), even [Chainer](https://sagemaker.readthedocs.io/en/stable/using_chainer.html#prepare-a-chainer-training-script)) advises determining the local path to store your model via a pattern like:

```python
import argparse
import os
parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
```

...which is nice and flexible: I can consume all my parameters through argparse, and choose when debugging locally whether to specify via env var or CLI. I simply save my model to `args.model_dir`.

(FWIW, the few other frameworks that I've checked closely don't seem to actually pass it in through the CLI: only the env var default gets used.)

I don't understand why the TF container insists on passing in the slightly different `--model_dir`, which:

- Can't easily be used together with `--model-dir`, because argparse converts hyphens to underscores in the attribute name by default, so both flags end up on the same `args.model_dir`
- Per [SageMaker SDK issue #1355](https://github.com/aws/sagemaker-python-sdk/issues/1355#issue-580309121), reports the *destination S3 URI* for single-instance training but the *local model folder* for multi-instance training!
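To illustrate the first point, here is a minimal sketch of what happens if a script declares both spellings. The bucket name and the `/opt/ml/model` default are placeholders I've made up for the demo, not values taken from the container:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Hyphenated spelling, as recommended in the other framework docs.
    parser.add_argument("--model-dir", type=str, default="/opt/ml/model")
    # Underscore spelling, as passed on the command line by the TF container.
    parser.add_argument("--model_dir", type=str, default=None)
    return parser


# Both options share the destination attribute `model_dir`, so whichever
# flag is supplied overwrites the other's value.
args = build_parser().parse_args(["--model_dir", "s3://example-bucket/job/model"])
print(vars(args))

# With no flags at all, the default of the first-declared option wins.
no_flag_args = build_parser().parse_args([])
print(vars(no_flag_args))
```

You could give one of them an explicit `dest=...` to keep them apart, but that's exactly the kind of extra ceremony the other containers don't need.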

I'd argue this doesn't work well for power users (because it's inconsistent with the consensus in other frameworks), but **also** doesn't work well for beginners, who have to grapple with a "dir" parameter that **isn't a directory at all** but an S3 location, and who may be confused into thinking they have to upload their model there when in fact SageMaker will do it for them.

For extra evidence of this, see also this repo's #115 (which looks like a case of exactly this confusion) and #130 (where the user has specifically commented in code that the param is externally useless).

I don't really understand what the use case for the S3 URI would ever be in user code. Regardless of whether it's single- or multi-instance training, SageMaker can upload the final contents of `SM_MODEL_DIR` (for multi-instance, you can just save the model only on your rank 0 instance, right?).
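For what it's worth, this is roughly what I mean by "save only on rank 0". It's a sketch, not the container's own logic: it assumes the `SM_HOSTS` (JSON list) and `SM_CURRENT_HOST` environment variables are set as in the other script-mode containers, and treating the first host in sorted order as "rank 0" is my own convention. `is_primary_host` and `save_model_if_primary` are hypothetical helper names.

```python
import json
import os


def is_primary_host(environ=os.environ):
    """Return True if this instance should be the one that saves the model.

    Assumption: the first host in sorted order is treated as "rank 0".
    Falls back to a single-host setup when the variables are missing,
    which makes local debugging easier.
    """
    hosts = json.loads(environ.get("SM_HOSTS", '["algo-1"]'))
    current = environ.get("SM_CURRENT_HOST", "algo-1")
    return current == sorted(hosts)[0]


def save_model_if_primary(save_fn, environ=os.environ):
    """Call save_fn(local_dir) on the primary host only.

    Returns the local directory used, or None if this host skipped saving.
    """
    if not is_primary_host(environ):
        return None
    model_dir = environ.get("SM_MODEL_DIR", "/opt/ml/model")
    save_fn(model_dir)
    return model_dir
```

In a Keras script, `save_fn` would be something like `model.save`. With Horovod you'd more likely check `hvd.rank() == 0` instead. Either way, nothing here needs the S3 URI.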

Nothing in the [amazon-sagemaker-script-mode](https://github.com/aws-samples/amazon-sagemaker-script-mode) samples seems to use the S3 URIs, and in many cases (e.g. [A](https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-batch-inference-script/code/train.py), [B](https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-distribution-options/code/train_hvd.py), [C](https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-distribution-options/code/train_ps.py), [D](https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-horovod-inference-pipeline/train.py)) they specifically implement a separate `--model_output_dir` to work around the issue.
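The workaround has roughly the following shape. This is my own paraphrase rather than code copied from those scripts, and the `/opt/ml/model` fallback is an assumption I've added so the sketch also runs outside a container:

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Passed in by the TF container. May be an S3 URI, so it is accepted
    # here only so that parsing succeeds; it is not used for saving.
    parser.add_argument("--model_dir", type=str, default=None)
    # Separate local output path that the script actually saves to.
    parser.add_argument(
        "--model_output_dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Ignore any other hyperparameters this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

So every script carries one argument it never uses and a second one that does the job `--model-dir` does everywhere else.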

[aws/sagemaker-python-sdk/issues/1355](https://github.com/aws/sagemaker-python-sdk/issues/1355#issue-580309121) was resolved as a documentation change... To be clear I'm not claiming that the documentation is inaccurate or that there aren't other ways to achieve the same result. I'm asking for a (potentially API-breaking) implementation change, on the basis that the current API is bad for developers.

Would be great to hear what people are using `--model_dir` S3 URIs for if I'm wrong!
