Reformer: The Efficient Transformer
The Transformer for the masses! Reformer tackles the biggest problem with the famous Transformer model: its huge resource requirements. By cleverly combining locality-sensitive hashing with ideas from reversible networks, it drastically reduces the Transformer's classically huge memory footprint. Not only does the model use less memory, it can also process much longer input sequences, up to 16K tokens with just 16 GB of memory!
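The LSH trick can be illustrated with a minimal numpy sketch of the paper's angular hashing scheme: project each vector onto a few random directions and take the argmax over the projections and their negations, so nearby vectors tend to land in the same bucket. The function name `lsh_hash` and the toy sizes are illustrative, not from the paper's code.

```python
import numpy as np

def lsh_hash(vectors, n_buckets, rng):
    """Angular LSH: hash(x) = argmax([xR; -xR]) for a random
    projection R, as described in the Reformer paper."""
    d = vectors.shape[-1]
    # One random direction per half-bucket; negation covers the rest.
    R = rng.standard_normal((d, n_buckets // 2))
    projected = vectors @ R
    return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))          # 8 toy query/key vectors
buckets = lsh_hash(x, n_buckets=4, rng=rng)
# Attention is then restricted to pairs that share a bucket,
# which is what cuts the cost from quadratic to roughly L log L.
```

Since the hash only depends on a vector's direction, duplicated vectors always share a bucket under the same random projection.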
https://arxiv.org/abs/2001.04451
https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html
Abstract:
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
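The reversible residual idea from the abstract can be sketched in a few lines: split the activations into two halves, update them as y1 = x1 + F(x2), y2 = x2 + G(y1), and note that the inputs can be recomputed exactly from the outputs, so no per-layer activations need to be stored for backprop. Here F and G are stand-ins for the attention and feed-forward sublayers; the function names are mine, not the paper's.

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    # Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    # Recompute the inputs from the outputs; nothing was stored.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

F, G = np.tanh, np.sin   # toy stand-ins for the two sublayers
x1 = np.array([1.0, 2.0])
x2 = np.array([0.5, -0.5])
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
# r1, r2 match x1, x2 up to floating-point rounding.
```

Because every layer can be inverted this way, memory no longer grows with the number of layers N, which is the second saving the abstract describes.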
Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Video "Reformer: The Efficient Transformer" from the Yannic Kilcher channel