Linformer: Self-Attention with Linear Complexity (Paper Explained)
Transformers are notoriously resource-intensive because their self-attention mechanism requires memory and compute that grow quadratically with the length of the input sequence. The Linformer model gets around this by exploiting the fact that the actual information in the attention matrix is often of lower rank and can therefore be approximated.
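To make the idea concrete, here is a minimal single-head sketch of Linformer-style attention in NumPy. It assumes no masking and uses random matrices in place of the paper's learned k x n projections E and F; the point is only to show how projecting keys and values down to k rows shrinks the attention matrix from n x n to n x k.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, k=32, seed=0):
    # Q, K, V: (n, d). E, F stand in for the paper's learned (k, n)
    # projections; here they are random for illustration only.
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    E = rng.normal(size=(k, n)) / np.sqrt(n)   # projects keys:   (k, n) @ (n, d) -> (k, d)
    F = rng.normal(size=(k, n)) / np.sqrt(n)   # projects values: (k, n) @ (n, d) -> (k, d)
    scores = Q @ (E @ K).T / np.sqrt(d)        # (n, k) instead of (n, n)
    return softmax(scores, axis=-1) @ (F @ V)  # (n, d), linear in n

n, d = 512, 64
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linformer_attention(Q, K, V, k=32)
print(out.shape)  # (512, 64)
```

With k fixed, both the score matrix and the softmax cost O(nk) time and space rather than O(n²), which is the whole trick.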
OUTLINE:
0:00 - Intro & Overview
1:40 - The Complexity of Self-Attention
4:50 - Embedding Dimension & Multiple Heads
8:45 - Formal Attention
10:30 - Empirical Investigation into RoBERTa
20:00 - Theorem: Self-Attention is Low Rank
28:10 - Linear Self-Attention Method
36:15 - Theorem: Linear Self-Attention
44:10 - Language Modeling
46:40 - NLP Benchmarks
47:50 - Compute Time & Memory Gains
48:20 - Broader Impact Statement
49:55 - Conclusion
Paper: https://arxiv.org/abs/2006.04768
Abstract:
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
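The abstract's claim that the attention matrix is approximately low rank can be probed with a small experiment. This is not the paper's RoBERTa measurement, just a toy illustration with random queries and keys: we form a full softmax attention matrix and check how many singular values are needed to capture most of its spectral energy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 64
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

# Full (n, n) row-softmax attention matrix, as in standard self-attention.
scores = Q @ K.T / np.sqrt(d)
P = np.exp(scores - scores.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

# How many singular values carry 90% of the spectral energy?
s = np.linalg.svd(P, compute_uv=False)
cum = np.cumsum(s**2) / np.sum(s**2)
rank90 = int(np.searchsorted(cum, 0.9)) + 1
print(rank90, "of", n)
```

The effective rank comes out far below n, which is the kind of spectral concentration the paper measures on trained models to justify projecting down to k dimensions.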
Authors: Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma
Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Video "Linformer: Self-Attention with Linear Complexity (Paper Explained)" from the Yannic Kilcher channel