Weight Standardization (Paper Explained)
It's common for neural networks to normalize their activations, e.g. with BatchNorm or GroupNorm. This paper extends normalization to the weights of the network as well. This surprisingly simple change boosts performance and - combined with GroupNorm - yields new state-of-the-art results.
https://arxiv.org/abs/1903.10520
Abstract:
In this paper, we propose Weight Standardization (WS) to accelerate deep network training. WS is targeted at the micro-batch training setting where each GPU typically has only 1-2 images for training. The micro-batch training setting is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performances of BN in large-batch training. Our WS ends this problem because when used with Group Normalization and trained with 1 image/GPU, WS is able to match or outperform the performances of BN trained with large batch sizes with only 2 more lines of code. In micro-batch training, WS significantly outperforms other normalization methods. WS achieves these superior results by standardizing the weights in the convolutional layers, which we show is able to smooth the loss landscape by reducing the Lipschitz constants of the loss and the gradients. The effectiveness of WS is verified on many tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. The code is available here: this https URL.
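For reference, here is a minimal PyTorch-style sketch of the idea (my own illustration under the paper's description, not the authors' released code): each convolution kernel is standardized to zero mean and unit variance per output channel right before the convolution, and the layer is paired with GroupNorm. The group count and layer sizes below are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized before every forward pass."""
    def forward(self, x):
        w = self.weight
        # Standardize over (in_channels, kH, kW) for each output filter.
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        w = (w - mean) / std
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# WS is intended to be used together with GroupNorm:
block = nn.Sequential(
    WSConv2d(64, 128, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=32, num_channels=128),
    nn.ReLU(inplace=True),
)
out = block(torch.randn(1, 64, 56, 56))

The "2 more lines of code" from the abstract are essentially the mean subtraction and the division by the standard deviation inside the forward pass.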
Authors: Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, Alan Yuille
Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher