Hi,
This is my first time working with SageMaker. I trained a model successfully, but I'm having difficulty getting it to write evaluation metrics to the log files.
Here is a snippet of my model:
    def metric_fn(label_ids, predicted_labels):
        accuracy = tf.compat.v1.metrics.accuracy(label_ids, predicted_labels)
        recall = tf.compat.v1.metrics.recall(label_ids, predicted_labels)
        precision = tf.compat.v1.metrics.precision(label_ids, predicted_labels)
        return {"eval_accuracy": accuracy,
                "precision": precision,
                "recall": recall}

    if mode == tf.estimator.ModeKeys.EVAL:
        eval_metrics = metric_fn(label_ids, predicted_labels)
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metrics)
And this is how the model is fit:
    estimator = TensorFlow(
        entry_point='script.py',
        source_dir = [#Source_dir],
        train_instance_type='ml.m5.2xlarge',
        train_instance_count=4,
        output_path=s3_output_location,
        hyperparameters=hyperparameters,
        role=role,
        py_version='py3',
        framework_version='1.15.2',
        sagemaker_session=sess,
        metric_definitions=[{'Name': 'eval-accuracy', 'Regex': 'eval-accuracy=(\d\.\d+)'},
                            {'Name': 'precision', 'Regex': 'precision=(\d\.\d+)'},
                            {'Name': 'recall', 'Regex': 'recall=(\d\.\d+)'}],
        enable_sagemaker_metrics=True,
        distributions={'parameter_server': {'enabled': True}})
When the training finishes, I don't see any of these metrics in the logs, nor in the 'training jobs' section. This is how the Metrics section looks:
Metrics
    Name            Regex
    eval-accuracy   eval-accuracy=(\d.\d+)
    precision       precision=(\d.\d+)
    recall          recall=(\d.\d+)
I don't know why this should be so obscure. I've run the script multiple times with SageMaker, with no luck so far. I'd appreciate any help!
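One thing that can be checked locally is whether the regexes in metric_definitions can match an evaluation log line at all. Below is a minimal sketch of such a check. The sample log line is an assumption about what a TensorFlow estimator evaluation line might look like ("key = value" pairs); it is not copied from the actual job logs. The helper `match_metrics` and the `loosened_definitions` list are hypothetical names introduced only for this illustration.

```python
import re

# Regexes copied from the metric_definitions argument above.
metric_definitions = [
    {"Name": "eval-accuracy", "Regex": r"eval-accuracy=(\d\.\d+)"},
    {"Name": "precision", "Regex": r"precision=(\d\.\d+)"},
    {"Name": "recall", "Regex": r"recall=(\d\.\d+)"},
]

# ASSUMPTION: a made-up evaluation log line in "key = value" form.
# Replace it with a real line taken from the training job's logs.
sample_log_line = (
    "Saving dict for global step 100: eval_accuracy = 0.91, "
    "global_step = 100, loss = 0.35, precision = 0.88, recall = 0.9"
)


def match_metrics(definitions, line):
    """Return {metric name: captured value or None} for one log line."""
    results = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        results[definition["Name"]] = match.group(1) if match else None
    return results


# Hypothetical loosened variants: match the dictionary key as written in
# metric_fn (underscore, not hyphen) and allow spaces around "=".
loosened_definitions = [
    {"Name": "eval-accuracy", "Regex": r"eval_accuracy\s*=\s*(\d+\.\d+)"},
    {"Name": "precision", "Regex": r"precision\s*=\s*(\d+\.\d+)"},
    {"Name": "recall", "Regex": r"recall\s*=\s*(\d+\.\d+)"},
]

original_results = match_metrics(metric_definitions, sample_log_line)
loosened_results = match_metrics(loosened_definitions, sample_log_line)

print("original:", original_results)
print("loosened:", loosened_results)
```

Against this assumed line, the original regexes capture nothing, because the line contains `eval_accuracy` (underscore) rather than `eval-accuracy`, and has spaces around `=`. Whether the real logs have this shape still needs to be confirmed against the actual training job output.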