
The variable "end_training" in Bert_Large training is used incorrectly. #170


Description

@taotod

In the code below, the variable "end_training" is defined as a boolean flag that decides when training should stop.

https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L838

In the code below, which measures the time of one training iteration, the variable "end_training" is incorrectly reused to record the iteration's end timestamp.
https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L1006

"end_training" is set with a non-zero value in the code line 1006. As a result, after one data file is used for training, the training exits here and will never go to next data file.
https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L1079
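Here is a minimal sketch of the pattern, with simplified, hypothetical names (`train_step`, `data_files`, `start_time`) standing in for the script's actual logic, not a copy of run_pretrain_mlperf.py. It shows why overwriting the boolean flag with a timestamp makes the outer-loop check fire after the first data file:

```python
import time

def train_step(batch):
    """Stand-in for one training iteration (hypothetical)."""
    time.sleep(0.01)

data_files = [range(3), range(3)]   # two hypothetical pre-training data files

end_training = False                 # flag: should only become True when training must stop

for file_idx, data_file in enumerate(data_files):
    for batch in data_file:
        start_time = time.time()
        train_step(batch)

        # Bug: the flag is reused to record the end-of-iteration time.
        # Any non-zero float is truthy, so the flag is now effectively True.
        end_training = time.time()
        print(f"iteration time: {end_training - start_time:.4f}s")

    if end_training:                 # truthy timestamp -> exits after the first file
        print(f"exited after data file {file_idx}; remaining files are skipped")
        break
```

A straightforward fix is to store the iteration end time in a separate variable (for example `end_time = time.time()`) and keep "end_training" as a pure boolean flag that is only set when the stop condition is actually met.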
