Description
Raising this here (where I believe the implementation lives) rather than on the SageMaker SDK, which as I understand it just documents the functionality.
Every other SageMaker framework container that I've seen (see docs for MXNet, PyTorch, Scikit-Learn, even Chainer) advises determining the local path to store your model via a pattern like:
import argparse
import os
parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
...which is nice and flexible: I can consume all my parameters through argparse, and choose when debugging locally whether to specify via env var or CLI. I simply save my model to args.model_dir.
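For reference, a minimal sketch of that pattern as I'd use it for local debugging. The `./model` fallback and the `parse_args` helper name are my own assumptions for illustration, not something the containers define:

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser()
    # Inside a SageMaker container SM_MODEL_DIR is set for you. The
    # "./model" fallback is a hypothetical local default so the script
    # also runs on a laptop without raising KeyError.
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "./model"),
    )
    return parser.parse_args(argv)


# Local debugging: override on the CLI instead of exporting the env var.
args = parse_args(["--model-dir", "/tmp/debug-model"])
print(args.model_dir)
```

Either way the script only ever reads args.model_dir, which is what makes the pattern pleasant to work with.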
(FWIW, the few other frameworks that I've checked closely don't seem to actually pass it in through the CLI: only the env var default gets used.)
I don't understand why the TF container insists on passing in the slightly different --model_dir, which:
- Can't easily be used together with --model-dir, because argparse shifts hyphens to underscores on the output object by default
- Per SageMaker SDK issue #1355, reports the destination S3 URI for single-instance training but the local model folder for multi-instance training!
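To make the first point concrete, here is a small sketch of the collision using plain argparse. The S3 URI is a made-up placeholder, and `s3_model_dir` is just a name I picked for the workaround:

```python
import argparse

# Hypothetical value of the kind the TF container passes on the CLI.
S3_URI = "s3://example-bucket/example-job/model"

# Both option strings map to the same attribute name, "model_dir",
# because argparse replaces hyphens with underscores when deriving dest.
parser = argparse.ArgumentParser()
parser.add_argument("--model-dir", type=str, default="/opt/ml/model")
parser.add_argument("--model_dir", type=str, default=None)
args = parser.parse_args(["--model_dir", S3_URI])
collided = args.model_dir  # the S3 URI has replaced the local default

# Workaround: give the underscore variant an explicit, different dest.
fixed = argparse.ArgumentParser()
fixed.add_argument("--model-dir", type=str, default="/opt/ml/model")
fixed.add_argument("--model_dir", dest="s3_model_dir", type=str, default=None)
fixed_args = fixed.parse_args(["--model_dir", S3_URI])

print(collided)
print(fixed_args.model_dir, fixed_args.s3_model_dir)
```

The `dest=` workaround works, but it's exactly the kind of boilerplate a user shouldn't need just to find out where to save a model.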
I'd argue this doesn't work well for power users (because it's inconsistent with the consensus in other frameworks), and it doesn't work well for beginners either: they have to grapple with this "dir" parameter that isn't a directory at all but an S3 location, and will be confused into thinking they have to upload their model there when in fact SageMaker will do it for them.
For extra evidence of this, see also this repo's #115 (looks like a case of confusion caused) and #130 (user has specifically commented in code that the param is externally useless).
I don't really understand what the use case for the S3 URI would ever be in user code? Regardless of whether it's single- or multi-instance training, SageMaker can upload the final contents of SM_MODEL_DIR (you can just only save the model on your rank 0 instance for multi, right?).
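A rough sketch of what I mean by saving only on one instance. I'm assuming here that the container exposes SM_HOSTS (a JSON list of host names) and SM_CURRENT_HOST, and that treating the first host in sorted order as "rank 0" is acceptable; `is_primary_host` and `save_model_if_primary` are hypothetical helpers, and the file written is a stand-in for a real model save:

```python
import json
import os


def is_primary_host(env=None):
    """Return True if this instance should be the one to save the model."""
    env = os.environ if env is None else env
    hosts = json.loads(env.get("SM_HOSTS", "[]"))
    current = env.get("SM_CURRENT_HOST")
    if not hosts or current is None:
        # Not in a SageMaker job (e.g. local debugging): always save.
        return True
    # Assumption: the first host in sorted order acts as "rank 0".
    return current == sorted(hosts)[0]


def save_model_if_primary(model_dir, payload, env=None):
    """Write a placeholder model file, but only on the primary host."""
    if not is_primary_host(env):
        return None
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.txt")
    with open(path, "w") as f:
        f.write(payload)
    return path
```

With something like this, every instance can run the same script and only one of them leaves anything in the local model directory for SageMaker to upload.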
Nothing in the amazon-sagemaker-script-mode samples seems to use the S3 URIs, and in many cases (e.g. A, B, C, D) they specifically implement a separate --model_output_dir to work around the issue.
aws/sagemaker-python-sdk/issues/1355 was resolved as a documentation change... To be clear I'm not claiming that the documentation is inaccurate or that there aren't other ways to achieve the same result. I'm asking for a (potentially API-breaking) implementation change, on the basis that the current API is bad for developers.
Would be great to hear what people are using --model_dir S3 URIs for if I'm wrong!