[GoogleML] Recurrent Neural Networks


 

 

 

Why Sequence Models?

 

 

 

Notation

Suppose we split the sentence into words and identify which parts are names (name -> y = 1, otherwise 0).

X<1> -> the first word

The t-th word (element) of the i-th sample (sentence) is written X(i)<t>.

Tx(i) = 9 (meaning that sentence has 9 words)

 

 

 

Each word is represented as a one-hot vector, e.g. X<t>.

Given a dictionary (vocabulary), the position the word maps to is 1 and all the others are 0 -> one-hot.

If a word not in the vocabulary appears -> <UNK> (unknown).

With this representation we learn the mapping x -> y.
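
A minimal sketch of this one-hot representation; the toy vocabulary and sentence below are made-up examples:

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies are much larger (10k+ words).
vocab = ["a", "and", "harry", "hermione", "<UNK>", "<EOS>"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Map a word to a one-hot column vector; out-of-vocabulary words go to <UNK>."""
    idx = word_to_idx.get(word, word_to_idx["<UNK>"])
    v = np.zeros((len(vocab), 1))
    v[idx] = 1.0
    return v

sentence = "harry and hermione".split()   # x<1>, x<2>, x<3>
x = [one_hot(w) for w in sentence]        # each x<t> is a one-hot vector
print(x[0].ravel())                       # x<1> ("harry") -> [0. 0. 1. 0. 0. 0.]
```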

 

 

 

Recurrent Neural Network Model

๊ธฐ์กด์˜ ๊ตฌ์กฐ๋กœ๋Š” ํ‘œํ˜„ํ•˜๊ธฐ ์–ด๋ น๋ˆ„ sequential data

 

 

 

๊ฐ€์ค‘์น˜๋“ค์„ ๊ณต์œ ํ•œ๋‹ค! 

 

 

 

y3์„ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด์„œ x1, x2, x3๋ฅผ ํ™œ์šฉํ•œ๋‹ค

์ด๋•Œ Recurrent Neural Net์˜ ํ•œ๊ณ„ 

๋‹ค์Œ์— ์˜ฌ (๋ฏธ๋ž˜์˜, ๋’ค์˜) ๋‹จ์–ด๋“ค์€ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•œ๋‹ค. 

์•ž๋งŒ ๋ณธ๋‹ค๋ฉด ๋ฃจ์ฆˆ๋ฒจํŠธ๋ผ๋Š” ์ •๋ณด ์—†์ด, ๋‘˜์„ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์–ด๋ ต๋‹ค

 

 

 

forward ๊ณผ์ •์„ ์˜๋ฏธํ•œ๋‹ค 

 

 

 

Waa and Wax can be stacked into a single matrix Wa (the lecture also shows how to work out the dimensions).

This lets us write the equations more compactly.
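
A minimal numpy sketch of one forward step, assuming tanh for the hidden activation and softmax for the output; the sizes below are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_a, n_x, n_y = 5, 6, 6                      # hidden, input, output sizes (toy)
Waa = np.random.randn(n_a, n_a) * 0.01
Wax = np.random.randn(n_a, n_x) * 0.01
Wya = np.random.randn(n_y, n_a) * 0.01
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

def rnn_step(a_prev, x_t):
    # a<t> = tanh(Waa a<t-1> + Wax x<t> + ba),  y_hat<t> = softmax(Wya a<t> + by)
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t

# Compact form: Wa = [Waa | Wax], applied to a<t-1> stacked on top of x<t>
Wa = np.hstack([Waa, Wax])                   # shape (n_a, n_a + n_x)
a_prev, x_t = np.zeros((n_a, 1)), np.zeros((n_x, 1))
a_t_compact = np.tanh(Wa @ np.vstack([a_prev, x_t]) + ba)
```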

 

 

 

Backpropagation Through Time

cross entropy loss

๊ฐ step ๋ณ„ loss๋ฅผ ํ•ฉํ•˜์—ฌ ์ด loss๋ฅผ ํ‘œํ˜„ํ•˜๊ณ , updateํ•œ๋‹ค 

backpropagation through time
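
A small sketch of that total loss, assuming a per-step cross-entropy over the vocabulary (y_hat<t> is the softmax output, y<t> the one-hot label):

```python
import numpy as np

def total_loss(y_hats, ys):
    """L = sum_t L<t>(y_hat<t>, y<t>), with per-step cross-entropy
    L<t> = -sum_i y_i<t> * log(y_hat_i<t>)."""
    return sum(-np.sum(y * np.log(y_hat + 1e-12)) for y_hat, y in zip(y_hats, ys))

y_true = [np.array([0., 1.]), np.array([1., 0.])]   # one-hot labels per step (toy)
y_pred = [np.array([0.2, 0.8]), np.array([0.6, 0.4])]
print(total_loss(y_pred, y_true))                   # -(log 0.8 + log 0.6) ~= 0.734
```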

 

 

 

Different Types of RNNs

e.g. In translation, the number of words can change between input and output (sentence structure differs across languages), so the input length Tx and output length Ty need not be equal.

 

 

 

many to many / many to one 

 

 

 

one to many - music generation

many to many - machine translation (see the sketch below)

encoder - reads the input-language sentence into a code/vector

decoder - generates the output-language sentence from that code
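
A rough, hypothetical skeleton of this many-to-many (Tx != Ty) encoder/decoder layout; encoder_step and decoder_step stand in for real RNN/GRU/LSTM cells and are assumptions, not course code:

```python
def translate(xs, encoder_step, decoder_step, a0, Ty):
    """Hypothetical encoder/decoder skeleton: read all of xs, then emit Ty outputs."""
    a = a0
    for x_t in xs:                      # encoder: read the whole input sentence
        a = encoder_step(a, x_t)
    y_hats, y_prev = [], None
    for _ in range(Ty):                 # decoder: emit Ty output words
        a, y_hat = decoder_step(a, y_prev)
        y_hats.append(y_hat)
        y_prev = y_hat                  # feed the previous prediction back in
    return y_hats
```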

 

 

 

๋‹ค์–‘ํ•œ ๊ตฌ์กฐ๋“ค์ด ์žˆ๋‹ค (RNN)

 

 

 

Language Model and Sequence Generation

์–ธ์–ด๋ชจ๋ธ์ด ํ•˜๋Š” ๊ฒƒ์€ ๊ฐ output (์ž๋ฆฌ, ๋‹จ์–ด)์˜ P ํ™•๋ฅ ์„ ๋„์ถœํ•˜๋Š” ๊ฒƒ ! 

๊ฐ sequence์˜ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ 

 

 

 

tokenize - split the text into tokens

the step of mapping each word to a one-hot vector

append <EOS> at the end of the sentence

words not in the vocabulary get the special <UNK> token

 

 

 

y1์€ ์ฒซ๋ฒˆ์งธ ์˜ฌ ๋‹จ์–ด๊ฐ€ ๋ฌด์—‡์ธ์ง€ ์ถ”์ธกํ•˜๋Š” ๊ฒƒ 

์—ฌ๋Ÿฌ ๋‹จ์–ด์— ๋Œ€ํ•œ ํ™•๋ฅ  P(word) ์ค‘ ๊ฐ€์žฅ ํฐ ๊ฒƒ์„ pred

 

 

 

๊ทธ ๋‹ค์Œ ์„ธ๋ฒˆ์งธ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธก ์‹œ ,

y1, y2 ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ๋กœ P(y3)๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค

์ฆ‰, P(y3 | y1, y2)
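
So the probability of a whole sentence factorizes by the chain rule; a tiny sketch with made-up numbers:

```python
# P(y1, y2, y3) = P(y1) * P(y2 | y1) * P(y3 | y1, y2)
# Each factor is read off the RNN's softmax output at that time step.
p_y1 = 0.03            # illustrative values only
p_y2_given_y1 = 0.10
p_y3_given_y1_y2 = 0.25
p_sentence = p_y1 * p_y2_given_y1 * p_y3_given_y1_y2
print(p_sentence)      # 0.00075
```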

 

 

 

Sampling Novel Sequences

 

 

 

์บ๋ฆญํ„ฐ ๋ ˆ๋ฒจ ์–ธ์–ด ๋ชจ๋ธ

- UNK ์—†์ด ํ•˜๋‚˜ ํ•˜๋‚˜ ๋‹ค ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค

- ์บ๋ฆญํ„ฐ ํ•˜๋‚˜๋งˆ๋‹ค y<t>๊ฐ€ ๋งค์นญ๋˜๋Š” ๊ฒƒ 

- ํ•˜์ง€๋งŒ ๋‹จ์ , ๋” expensiveํ•˜๋‹ค (computational cost ์ธก๋ฉด์—์„œ)
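
To sample a novel sequence, at each step the next token is drawn from the model's softmax distribution and fed back in; a minimal sketch where fake_step is a hypothetical stand-in for a trained RNN step:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "b", "c", "<EOS>"]

def fake_step(a_prev, idx_prev):
    """Hypothetical stand-in for one RNN step: returns (new state, softmax over vocab)."""
    probs = rng.dirichlet(np.ones(len(vocab)))
    return a_prev, probs

def sample_sequence(max_len=20):
    a, idx, out = None, None, []
    for _ in range(max_len):
        a, probs = fake_step(a, idx)
        idx = rng.choice(len(vocab), p=probs)   # sample according to probs, don't argmax
        if vocab[idx] == "<EOS>":
            break
        out.append(vocab[idx])
    return " ".join(out)

print(sample_sequence())
```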

 

 

 

Vanishing Gradients with RNNs

Gradients arising in the later layers (time steps) have a hard time propagating back to the earlier ones.

So the model struggles to remember early content all the way to the later parts of the sequence.

 

 

 

๊ทผ์ฒ˜์˜ input์—๋งŒ ์˜ํ–ฅ์„ ๋งŽ์ด ์ฃผ๊ณ 

๋งŽ์ด ๋–จ์–ด์ง„ ๋‹จ์–ด์—๋Š” ์˜ํ–ฅ๋ ฅ์„ ๋ฏธ์น˜๊ธฐ ์–ด๋ ต๋‹ค 

vanishing gradient๋Š” RNN์—์„œ ์ค‘์š”ํ•œ ๋ฌธ์ œ 

 

 

 

Exploding gradients can produce NaN values.

-> This can be handled with gradient clipping.

Vanishing gradients, however, are harder to fix.

 

 

 

Gated Recurrent Unit (GRU)

 

 

 

๊ฐ๋งˆ, ๊ฒŒ์ดํŠธ 

 

 

 

GRU unit

๊ฐ๋งˆ๊ฐ€ ๋งค์šฐ ์ž‘์€ ๊ฐ’ -> vanishing gradient ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

Another version -> LSTM

 

 

 

Long Short Term Memory (LSTM)

a<t> is no longer simply c<t> as in the GRU.

The GRU's two gates become three gates in the LSTM (update, forget, output).

 

 

 

It looks a lot like a logic-circuit adder/subtractor.

In particular, the way it splits into two outputs, like a carry and a sum.

 

 

 

peephole connection - a variant in which the gate computations also take the previous memory cell c<t-1> as input

 

 

 

Bidirectional RNN

The standard unidirectional RNN

 

 

 

A BRNN adds a backward recurrent pass.

 

 

 

Both past and future information can then be used to predict the word at each position.
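
At each position the prediction combines the forward and backward activations; a small sketch of that output step (shapes are illustrative):

```python
import numpy as np

n_a, n_y = 4, 3                                    # toy sizes
Wy = np.random.randn(n_y, 2 * n_a) * 0.01          # acts on [a_fwd<t>; a_bwd<t>]
by = np.zeros((n_y, 1))

def brnn_output(a_fwd_t, a_bwd_t):
    # y_hat<t> = g(Wy [a_fwd<t>; a_bwd<t>] + by): uses past (forward) and future (backward) context
    concat = np.vstack([a_fwd_t, a_bwd_t])
    z = Wy @ concat + by
    e = np.exp(z - z.max())
    return e / e.sum()

y_hat = brnn_output(np.random.randn(n_a, 1), np.random.randn(n_a, 1))
```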

 

 

 

Deep RNNs