
∞-former: Infinite Memory Transformer (aka Infty-Former / Infinity-Former, Research Paper Explained)

#inftyformer #infinityformer #transformer

Vanilla Transformers are excellent sequence models, but suffer from very harsh constraints on the length of the sequences they can process. Several attempts have been made to extend the Transformer's sequence length, but few have successfully gone beyond a constant factor improvement. This paper presents a method, based on continuous attention mechanisms, to attend to an unbounded past sequence by representing the past as a continuous signal, rather than a sequence. This enables the Infty-Former to effectively enrich the current context with global information, which increases performance on long-range dependencies in sequence tasks. Further, the paper presents the concept of sticky memories, which highlight past events that are of particular importance and elevate their representation in the long-term memory.
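To make "representing the past as a continuous signal" a bit more concrete, here is a minimal NumPy sketch. It assumes the signal is a weighted sum of Gaussian radial basis functions fitted by ridge regression, and that attention is a Gaussian density over positions in [0, 1] whose expectation is approximated numerically on a grid. The function names, the basis width, the ridge strength and the numerical integration are illustrative assumptions, not code or settings taken from the paper.

```python
import numpy as np

def rbf_basis(t, centers, width):
    """Evaluate Gaussian radial basis functions at positions t in [0, 1].
    Returns an array of shape (len(t), len(centers))."""
    t = np.asarray(t, dtype=float)[:, None]
    return np.exp(-0.5 * ((t - centers[None, :]) / width) ** 2)

def fit_continuous_signal(X, centers, width, ridge=1e-3):
    """Fit coefficients B of shape (N, d) so that rbf_basis(t) @ B
    approximates the L vectors in X (shape (L, d)) spread evenly over [0, 1]."""
    t = np.linspace(0.0, 1.0, X.shape[0])
    F = rbf_basis(t, centers, width)                 # (L, N)
    A = F.T @ F + ridge * np.eye(len(centers))       # (N, N), regularised
    return np.linalg.solve(A, F.T @ X)               # (N, d)

def continuous_attention(B, centers, width, mu, sigma, grid_size=1000):
    """Approximate the expected signal value under a Gaussian density
    with mean mu and standard deviation sigma, restricted to [0, 1]."""
    grid = np.linspace(0.0, 1.0, grid_size)
    density = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    density /= density.sum()                         # normalise on the grid
    signal = rbf_basis(grid, centers, width) @ B     # (grid_size, d)
    return density @ signal                          # (d,)
```

The point of the sketch is that the memory is the coefficient matrix B, whose size depends on the number of basis functions and not on the sequence length L, and that a query only has to produce two numbers (mu and sigma) to read from it.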

OUTLINE:
0:00 - Intro & Overview
1:10 - Sponsor Spot: Weights & Biases
3:35 - Problem Statement
8:00 - Continuous Attention Mechanism
16:25 - Unbounded Memory via concatenation & contraction
18:05 - Does this make sense?
20:25 - How the Long-Term Memory is used in an attention layer
27:40 - Entire Architecture Recap
29:30 - Sticky Memories by Importance Sampling
31:25 - Commentary: Pros and cons of using heuristics
32:30 - Experiments & Results

Paper: https://arxiv.org/abs/2109.00301

Sponsor: Weights & Biases
https://wandb.me/start

Abstract:
Transformers struggle when attending to long contexts, since the amount of computation grows with the context length, and therefore they cannot model long-term memories effectively. Several variations have been proposed to alleviate this problem, but they all have a finite memory capacity, being forced to drop old information. In this paper, we propose the ∞-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the ∞-former's attention complexity becomes independent of the context length. Thus, it is able to model arbitrarily long contexts and maintain "sticky memories" while keeping a fixed computation budget. Experiments on a synthetic sorting task demonstrate the ability of the ∞-former to retain information from long sequences. We also perform experiments on language modeling, by training a model from scratch and by fine-tuning a pre-trained language model, which show benefits of unbounded long-term memories.
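As a rough illustration of how a memory can absorb arbitrarily many chunks while keeping a fixed size, the following sketch follows the "concatenation & contraction" idea named in the outline: the old signal is sampled, squeezed into the first part of [0, 1], the new vectors fill the rest, and a fixed number of basis functions is refitted. The helper names, the contraction factor `tau`, the number of samples and the ridge-regression fit are assumptions made for this example and are not taken from the paper's code.

```python
import numpy as np

def rbf_basis(t, centers, width):
    """Gaussian radial basis functions evaluated at positions t in [0, 1]."""
    t = np.asarray(t, dtype=float)[:, None]
    return np.exp(-0.5 * ((t - centers[None, :]) / width) ** 2)

def fit(X, t, centers, width, ridge):
    """Ridge-regression fit of coefficients B (N, d) to vectors X at positions t."""
    F = rbf_basis(t, centers, width)
    A = F.T @ F + ridge * np.eye(len(centers))
    return np.linalg.solve(A, F.T @ X)

def update_memory(B, X_new, centers, width, tau=0.5, num_samples=64, ridge=1e-3):
    """Fold a new chunk X_new (L, d) into the memory coefficients B (N, d).
    Pass B=None for an empty memory. The returned B always has shape (N, d)."""
    X_new = np.asarray(X_new, dtype=float)
    if B is None:
        return fit(X_new, np.linspace(0.0, 1.0, len(X_new)), centers, width, ridge)
    # Sample the old continuous signal on a regular grid.
    old_t = np.linspace(0.0, 1.0, num_samples)
    old_values = rbf_basis(old_t, centers, width) @ B
    # Contract the old signal into [0, tau]; place the new chunk in (tau, 1].
    t_old = tau * old_t
    t_new = np.linspace(tau, 1.0, len(X_new) + 1)[1:]
    # Concatenate and refit with the same, fixed set of basis functions.
    X_all = np.concatenate([old_values, X_new], axis=0)
    t_all = np.concatenate([t_old, t_new])
    return fit(X_all, t_all, centers, width, ridge)
```

Calling `update_memory` repeatedly never grows B, which is the sense in which the computation budget stays fixed; the price is that older content is represented at progressively lower resolution, since each update compresses it into a smaller share of the interval.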

Authors: Pedro Henrique Martins, Zita Marinho, André F. T. Martins

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Video: ∞-former: Infinite Memory Transformer (aka Infty-Former / Infinity-Former, Research Paper Explained), from the channel Yannic Kilcher
Video information
Published: September 6, 2021, 17:07:08
Duration: 00:36:37