Transformer
Abstract
- The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.
- We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
- Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
- We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Conclusion
- In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
- For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
- We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.
Strengths
- In the scaled dot-product attention equation, after noting the difference between the two dominant approaches (additive vs. dot-product attention), the paper logically analyzes why the extra scaling term was added; I found that part well done.
- While for small values of dk the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of dk. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by √dk.
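To see why the 1/√dk term matters, here is a minimal NumPy sketch of scaled dot-product attention (my own illustration, not the authors' code): for roughly unit-variance queries and keys, the raw dot products have standard deviation around √dk, and dividing by √dk keeps the softmax out of its near-flat, small-gradient regions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Illustrative sketch only."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scale so scores stay near unit variance
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                         # (n_q, d_v)

# Toy check: with d_k = 64, unscaled dot products of unit-variance vectors have
# std ≈ sqrt(64) = 8, which pushes the softmax toward saturation; scaling avoids this.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 32))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 32)
```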
- In the "Why Self-Attention" section, the paper systematically lays out three reasons why this approach (the Transformer, attention) is superior to existing techniques.
- As side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
- I also liked that the authors varied different parameters of the same model, grouped the runs into several experiments, and analyzed the results.
- In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
- In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings, and observe nearly identical results to the base model.
Weaknesses
- The Training section briefly touches on many details and then moves on, so I did not understand it well.
- We used the Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10⁻⁹. We varied the learning rate over the course of training. This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000. (I do not know what β1 = 0.9 and β2 = 0.98 refer to.)
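For my own reference: β1 and β2 are Adam's exponential-decay rates for its running estimates of the gradient's first and second moments, and the schedule the paper describes is lrate = dmodel^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)). A small sketch of that schedule (my own code, with illustrative step values):

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup for the first warmup_steps steps, then 1/sqrt(step) decay."""
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# The two terms inside min() cross exactly at step == warmup_steps (the peak LR).
for step in (1, 1000, 4000, 16000, 100000):
    print(step, f"{transformer_lrate(step):.6f}")
```

In a framework like PyTorch this would typically be applied by giving Adam a base lr of 1.0 with betas=(0.9, 0.98) and eps=1e-9, then multiplying by this factor every step with a lambda scheduler.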
- Label Smoothing During training, we employed label smoothing of value εls = 0.1. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
I do not really understand the mechanism by which label smoothing works.
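My understanding of how it works (the paper itself only cites Szegedy et al.): instead of a one-hot target, each target distribution puts 1 − εls on the correct token and spreads εls over the remaining tokens, so the model is trained to stay slightly unsure. A toy sketch with a made-up vocabulary of 5 tokens:

```python
import numpy as np

def smooth_labels(true_ids, vocab_size, eps=0.1):
    """One-hot targets -> smoothed targets: (1 - eps) on the true token,
    eps split evenly across the other vocab_size - 1 tokens."""
    targets = np.full((len(true_ids), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(true_ids)), true_ids] = 1.0 - eps
    return targets

dist = smooth_labels([2], vocab_size=5, eps=0.1)
print(dist)                   # [[0.025 0.025 0.9   0.025 0.025]]
print(round(dist.sum(), 6))   # 1.0 — still a valid distribution, just less confident
```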
Points to Improve & Questions
- Because I lack background on the architectures previously used in NLP, the paper's claims (CNN, RNN to Transformer) did not fully sink in.
- In the multi-head attention part, I did not really understand why multiple heads are used.
- Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively.
- On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
- Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. (I am not sure what this means.) With a single attention head, averaging inhibits this.
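The way I read it: each head applies its own learned projections, so it attends in a different learned subspace (one head might track nearby positions, another longer-range syntactic links), whereas a single head would have to average all of those patterns into one set of weights. A shape-level sketch with the paper's base dimensions (dmodel = 512, h = 8, so dk = dv = 64); the weight matrices here are random stand-ins, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h          # 64, keeping total computation comparable to one big head

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (seq_len, d_model); W_q/W_k/W_v: (h, d_model, d_k or d_v); W_o: (h*d_v, d_model)."""
    heads = []
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]      # project into head i's subspace
        A = softmax(Q @ K.T / np.sqrt(d_k))               # head-specific attention weights
        heads.append(A @ V)                               # (seq_len, d_v)
    return np.concatenate(heads, axis=-1) @ W_o           # concatenate and project back

X = rng.standard_normal((10, d_model))
W_q = rng.standard_normal((h, d_model, d_k))
W_k = rng.standard_normal((h, d_model, d_k))
W_v = rng.standard_normal((h, d_model, d_v))
W_o = rng.standard_normal((h * d_v, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```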
- In the Positional Encoding part, I did not fully understand the explanation of the formula.
- Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.
- The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed. In this work, we use sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)), PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)),
- where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos.
- We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results (see Table 3 row (E)).
We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
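A small sketch (my own) of the sinusoidal encoding above, building the full (max_len, dmodel) table so it can be added directly to the input embeddings:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))     # wavelengths from 2π up to 10000 · 2π
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512) — same dmodel as the embeddings, so the two can be summed
```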
References
https://www.youtube.com/watch?v=AA621UofTUA
https://wikidocs.net/22893 (What is attention?)
In a seq2seq model, the encoder compresses the input sequence into a single fixed-size vector representation called the context vector, and the decoder generates the output sequence from that context vector. However, RNN-based seq2seq models have two major problems.
First, squeezing all the information into one fixed-size vector causes information loss.
Second, the chronic RNN problem of vanishing gradients remains.
In machine translation this shows up as degraded translation quality when the input sentence is long. As a remedy, attention was introduced to compensate for the drop in output accuracy as the input sequence grows longer.
The basic idea of attention is that at every time step at which the decoder predicts an output word, it refers back to the entire input sentence from the encoder. However, rather than attending to the whole input sentence uniformly, it focuses (attends) more on the input words that are relevant to the word being predicted at that time step.
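A toy sketch of that idea (my own illustration, not from the linked post): at one decoder time step, score each encoder hidden state against the current decoder state, softmax the scores into attention weights, and use the weighted sum of encoder states as the context for predicting the next word.

```python
import numpy as np

rng = np.random.default_rng(0)
enc_states = rng.standard_normal((6, 128))   # one hidden state per input word
dec_state = rng.standard_normal(128)         # decoder state at the current time step

scores = enc_states @ dec_state              # dot-product score for each input word
weights = np.exp(scores - scores.max())
weights /= weights.sum()                     # softmax -> attention distribution
context = weights @ enc_states               # weighted sum: focus on relevant words

print(weights.round(3))   # attention weights over the 6 input words
print(context.shape)      # (128,) context vector for this prediction step
```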