Transformer
Abstract
- The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.
- We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
- Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
- We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Conclusion
- In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
- For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
- We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.
Strengths
- In the scaled dot-product attention equation, after noting the difference between the two dominant approaches (additive vs. dot-product attention), the paper logically analyzes why the extra scaling term was added; I found that part well done.
- While for small values of dk the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of dk. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by √dk.
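To see why the 1/√dk term matters, here is a minimal NumPy sketch of scaled dot-product attention (my own illustration, not the authors' code): for roughly unit-variance queries and keys, the raw dot products have standard deviation around √dk, and dividing by √dk keeps the softmax out of its near-flat, small-gradient regions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Illustrative sketch only."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scale so scores stay near unit variance
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                         # (n_q, d_v)

# Toy check: with d_k = 64, unscaled dot products of unit-variance vectors have
# std ≈ sqrt(64) = 8, which pushes the softmax toward saturation; scaling avoids this.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 32))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 32)
```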
- In the "Why Self-Attention" section, the paper systematically lays out three reasons why this approach (the Transformer, attention) is superior to existing techniques.
- As side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
- I also liked that the authors varied different parameters of the same model, grouped the runs into several experiments, and analyzed the results.
- In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
- In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings, and observe nearly identical results to the base model.
Weaknesses
- The Training section briefly touches on many details and then moves on, so I did not understand it well.
- We used the Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10⁻⁹. We varied the learning rate over the course of training. This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000. (I do not know what β1 = 0.9 and β2 = 0.98 refer to.)
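For my own reference: β1 and β2 are Adam's exponential-decay rates for its running estimates of the gradient's first and second moments, and the schedule the paper describes is lrate = dmodel^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)). A small sketch of that schedule (my own code, with illustrative step values):

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup for the first warmup_steps steps, then 1/sqrt(step) decay."""
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# The two terms inside min() cross exactly at step == warmup_steps (the peak LR).
for step in (1, 1000, 4000, 16000, 100000):
    print(step, f"{transformer_lrate(step):.6f}")
```

In a framework like PyTorch this would typically be applied by giving Adam a base lr of 1.0 with betas=(0.9, 0.98) and eps=1e-9, then multiplying by this factor every step with a lambda scheduler.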
- Label Smoothing During training, we employed label smoothing of value εls = 0.1. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
I do not really understand the mechanism by which label smoothing works.
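My understanding of how it works (the paper itself only cites Szegedy et al.): instead of a one-hot target, each target distribution puts 1 − εls on the correct token and spreads εls over the remaining tokens, so the model is trained to stay slightly unsure. A toy sketch with a made-up vocabulary of 5 tokens:

```python
import numpy as np

def smooth_labels(true_ids, vocab_size, eps=0.1):
    """One-hot targets -> smoothed targets: (1 - eps) on the true token,
    eps split evenly across the other vocab_size - 1 tokens."""
    targets = np.full((len(true_ids), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(true_ids)), true_ids] = 1.0 - eps
    return targets

dist = smooth_labels([2], vocab_size=5, eps=0.1)
print(dist)                   # [[0.025 0.025 0.9   0.025 0.025]]
print(round(dist.sum(), 6))   # 1.0 — still a valid distribution, just less confident
```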
Points to Improve & Questions
- Because I lack background on the architectures previously used in NLP, the paper's claims (CNN, RNN to Transformer) did not fully sink in.
- In the multi-head attention part, I did not really understand why multiple heads are used.
- Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively.
- On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
- Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. (I am not sure what this means.) With a single attention head, averaging inhibits this.
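The way I read it: each head applies its own learned projections, so it attends in a different learned subspace (one head might track nearby positions, another longer-range syntactic links), whereas a single head would have to average all of those patterns into one set of weights. A shape-level sketch with the paper's base dimensions (dmodel = 512, h = 8, so dk = dv = 64); the weight matrices here are random stand-ins, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h          # 64, keeping total computation comparable to one big head

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (seq_len, d_model); W_q/W_k/W_v: (h, d_model, d_k or d_v); W_o: (h*d_v, d_model)."""
    heads = []
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]      # project into head i's subspace
        A = softmax(Q @ K.T / np.sqrt(d_k))               # head-specific attention weights
        heads.append(A @ V)                               # (seq_len, d_v)
    return np.concatenate(heads, axis=-1) @ W_o           # concatenate and project back

X = rng.standard_normal((10, d_model))
W_q = rng.standard_normal((h, d_model, d_k))
W_k = rng.standard_normal((h, d_model, d_k))
W_v = rng.standard_normal((h, d_model, d_v))
W_o = rng.standard_normal((h * d_v, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```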
- In the Positional Encoding part, I did not fully understand the explanation of the formula.
- Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.
- The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed. In this work, we use sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)), PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)),
- where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos.
- We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results (see Table 3 row (E)).
We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
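A small sketch (my own) of the sinusoidal encoding above, building the full (max_len, dmodel) table so it can be added directly to the input embeddings:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))     # wavelengths from 2π up to 10000 · 2π
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512) — same dmodel as the embeddings, so the two can be summed
```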
References
https://www.youtube.com/watch?v=AA621UofTUA
https://wikidocs.net/22893 (What is attention?)
In a seq2seq model, the encoder compresses the input sequence into a single fixed-size vector representation called the context vector, and the decoder generates the output sequence from that context vector. However, RNN-based seq2seq models have two major problems.
First, squeezing all the information into one fixed-size vector causes information loss.
Second, the chronic RNN problem of vanishing gradients remains.
In machine translation this shows up as degraded translation quality when the input sentence is long. As a remedy, attention was introduced to compensate for the drop in output accuracy as the input sequence grows longer.
The basic idea of attention is that at every time step at which the decoder predicts an output word, it refers back to the entire input sentence from the encoder. However, rather than attending to the whole input sentence uniformly, it focuses (attends) more on the input words that are relevant to the word being predicted at that time step.
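A toy sketch of that idea (my own illustration, not from the linked post): at one decoder time step, score each encoder hidden state against the current decoder state, softmax the scores into attention weights, and use the weighted sum of encoder states as the context for predicting the next word.

```python
import numpy as np

rng = np.random.default_rng(0)
enc_states = rng.standard_normal((6, 128))   # one hidden state per input word
dec_state = rng.standard_normal(128)         # decoder state at the current time step

scores = enc_states @ dec_state              # dot-product score for each input word
weights = np.exp(scores - scores.max())
weights /= weights.sum()                     # softmax -> attention distribution
context = weights @ enc_states               # weighted sum: focus on relevant words

print(weights.round(3))   # attention weights over the 6 input words
print(context.shape)      # (128,) context vector for this prediction step
```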