Lecture 12.2 Transformers
ERRATA: In slide 31, the first part of the transformer block should read
y = self.layernorm(x)
y = self.attention(y)
Also, the code currently suggests that the same layer normalization is applied twice; it is more common to use two different layer normalizations within the same block.
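For reference, a minimal sketch of the corrected pre-norm block in PyTorch. The module and parameter names here (nn.MultiheadAttention standing in for the lecture's attention module, ff_mult, norm1, norm2) are assumptions for illustration, not the slide's exact code:

    import torch.nn as nn

    class TransformerBlock(nn.Module):
        # Pre-norm transformer block: normalize, attend, add the residual,
        # then the same pattern for the feed-forward part.
        # Note the two separate layer normalizations, as the erratum suggests.
        def __init__(self, emb, heads, ff_mult=4):
            super().__init__()
            self.attention = nn.MultiheadAttention(emb, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(emb)
            self.norm2 = nn.LayerNorm(emb)
            self.ff = nn.Sequential(
                nn.Linear(emb, ff_mult * emb),
                nn.ReLU(),
                nn.Linear(ff_mult * emb, emb))

        def forward(self, x):
            y = self.norm1(x)               # y = self.layernorm(x)
            y, _ = self.attention(y, y, y)  # y = self.attention(y)
            x = x + y                       # residual connection
            y = self.norm2(x)               # a second, separate layer norm
            return x + self.ff(y)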
How to take the basic self-attention mechanism and build it up into a Transformer. We discuss the basic transformer block, layer normalization, the causal block for autoregressive models, and three different ways to encode position information.
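As a rough illustration of the causal masking used in the autoregressive block mentioned above (the variable names are my own, not the lecture's): each position may only attend to itself and earlier positions, which is enforced by masking out the upper triangle of the attention scores before the softmax.

    import torch

    def causal_mask(t):
        # True above the diagonal: the future positions a token may not attend to.
        return torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)

    t = 5
    scores = torch.randn(t, t)                                  # raw attention scores
    scores = scores.masked_fill(causal_mask(t), float('-inf'))  # hide the future
    weights = torch.softmax(scores, dim=-1)                     # row i attends only to j <= i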
annotated slides: https://dlvu.github.io/sa
lecturer: Peter Bloem