Experiments on Dynamic Tanh (Paper: Transformers without Normalization)
-
Paper Review: https://haeun161.tistory.com/31
-
This paper builds on the observation that LayerNorm in the Transformer architecture produces an S-shaped input-output curve resembling a scaled tanh.
- Reconstruction Experiment: checks the input and output of LayerNorm in ViT
- Comparison Test: checks the input and output of BatchNorm in ResNet50 to see whether it shows the same S-shaped pattern
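Based on that observation, the paper replaces normalization layers with an element-wise scaled tanh (DyT). A minimal PyTorch sketch of such a layer (the class name and defaults here are my own; the paper's DyT uses a learnable scalar α plus per-channel weight and bias, like LayerNorm's affine parameters):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh sketch: y = weight * tanh(alpha * x) + bias.

    alpha is a learnable scalar; weight and bias are learnable
    per-channel parameters applied over the last dimension.
    """
    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias
```

For small inputs tanh(αx) ≈ αx, so the layer is nearly linear near zero and saturates for large activations, mimicking the S-shape observed for LayerNorm.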
-
According to the paper, a limitation of DyT was that it struggled to fully replace BatchNorm.
-
Experiment Environment
- Uses torchvision
- torchvision.transforms
- torchvision.models
- Model: ResNet50
- with BatchNorm
- with DyT
- Data: a mini version of ImageNet-1K
- Initialization of α:
- The paper uses α=0.5, but this did not work well for ResNet50-DyT
- In this experiment, α was initialized to 1.0 (α=1.0)
- Due to limited GPU availability, the model was trained for only 30 epochs.