Gated Recurrent Units (GRU)
Source and translation: Dive into Deep Learning (动手学深度学习)
When computing gradients in a recurrent neural network, backpropagation through time involves long products of matrices, which can cause the gradients to vanish or explode.
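Roughly speaking (a sketch using generic notation h_t for the hidden state, not the specific matrices defined later in this section), backpropagating from step T back to step 1 multiplies T − 1 Jacobians:

∂h_T / ∂h_1 = (∂h_T / ∂h_{T−1}) · (∂h_{T−1} / ∂h_{T−2}) · … · (∂h_2 / ∂h_1)

If these factors are consistently smaller than 1 in norm, the product shrinks toward zero (vanishing gradients); if they are consistently larger than 1, it blows up (exploding gradients).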
We may encounter situations where an early observation is highly significant for predicting all future observations. Consider the somewhat contrived case where the first observation contains a checksum and the goal is to determine, at the end of the sequence, whether the checksum is correct. In this case the influence of the first token is vital. We would like some mechanism for storing important early information in a memory cell. Without such a mechanism, we would have to assign a very large gradient to this observation, since it affects all subsequent observations.
We may encounter situations where some tokens are irrelevant. For instance, when parsing a web page, there may be auxiliary HTML code that is irrelevant for assessing the sentiment conveyed on the page. We would like some mechanism for skipping such tokens in the latent state representation.
We may encounter situations where there is a logical break between parts of a sequence. For instance, there might be a transition between chapters of a book, or a transition between a bear market and a bull market in a stock market. In such cases it would be nice to have a means of resetting our internal state representation.
A number of methods have been proposed to address this. One of the earliest is the long short-term memory (LSTM) of Hochreiter and Schmidhuber, 1997. The gated recurrent unit (GRU) of Cho et al., 2014 is a slightly more streamlined variant that often offers comparable performance and is significantly faster to compute. See Chung et al., 2014 for more details.
1. Gating the Hidden State
The key distinction between a vanilla RNN and a GRU is that the latter supports gating of the hidden state. This means that we have dedicated mechanisms for determining when a hidden state should be updated and when it should be reset.
These mechanisms are learned, and they address the concerns listed above. For instance, if the first token is of great importance, we will learn not to update the hidden state after the first observation. Likewise, we will learn to skip irrelevant temporary observations. Finally, we will learn to reset the latent state whenever needed. We discuss this in detail below.
2. Reset Gates and Update Gates
The first things we need to introduce are the reset gate and the update gate. We engineer them to be vectors with entries in (0, 1), so that we can perform convex combinations, e.g., of a hidden state and an alternative. For instance, a reset gate allows us to control how much of the previous state we might still want to remember. Likewise, an update gate allows us to control how much of the new state is just a copy of the old state.
We begin by engineering gates to generate these variables. The figure below illustrates the inputs for both the reset and update gates in a GRU: the input X_t of the current time step and the hidden state H_{t−1} of the previous time step. The outputs are given by a fully connected layer with a sigmoid activation function.
Here, we assume there are h hidden units. For a given time step t, the minibatch input is X_t ∈ R^{n×d} (number of examples: n, number of inputs: d) and the hidden state of the previous time step is H_{t−1} ∈ R^{n×h}. Then the reset gate R_t ∈ R^{n×h} and the update gate Z_t ∈ R^{n×h} are computed as follows:
R_t = σ(X_t W_{xr} + H_{t−1} W_{hr} + b_r)
Z_t = σ(X_t W_{xz} + H_{t−1} W_{hz} + b_z)
from IPython.display import SVG
SVG(filename='../img/gru_1.svg')
Figure: Reset and update gates in a GRU.
Here, W_{xr}, W_{xz} ∈ R^{d×h} and W_{hr}, W_{hz} ∈ R^{h×h} are weight parameters, and b_r, b_z ∈ R^{1×h} are biases. We use sigmoid functions to map the values into the interval (0, 1).
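As a quick sanity check on the shapes involved (a minimal sketch with made-up dimensions and hypothetical variable names, independent of the full implementation later in this section), the two gates can be computed directly in PyTorch:

import torch

n, d, h = 4, 10, 8                       # batch size, input size, hidden size (illustrative only)
X_t = torch.randn(n, d)                  # minibatch input at time step t
H_prev = torch.zeros(n, h)               # hidden state from the previous time step
W_xr, W_hr, b_r = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_xz, W_hz, b_z = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)

R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)  # reset gate, shape (n, h)
Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)  # update gate, shape (n, h)
print(R_t.shape, Z_t.shape)              # torch.Size([4, 8]) torch.Size([4, 8])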
3. Reset Gate in Action
Let us begin by integrating the reset gate with a regular latent state updating mechanism. In a conventional RNN, we would have a hidden state update of the form
H_t = tanh(X_t W_{xh} + H_{t−1} W_{hh} + b_h)
This is essentially identical to the discussion in the previous section, except for the tanh nonlinearity, which ensures that the values of the hidden state remain in the interval (−1, 1). If we want to reduce the influence of previous states, we can multiply H_{t−1} elementwise by R_t. Whenever the entries of R_t are close to 1, we recover a conventional RNN. For all entries of R_t close to 0, the hidden state becomes the output of an MLP with X_t as input, so any pre-existing hidden state is "reset" to its defaults. This leads to the following candidate for a new hidden state, written H̃_t (note the tilde; it is only a candidate because we still need to incorporate the action of the update gate):
H̃_t = tanh(X_t W_{xh} + (R_t ⊙ H_{t−1}) W_{hh} + b_h)
The figure below illustrates the computational flow after the reset gate is applied. The symbol ⊙ denotes elementwise (Hadamard) multiplication between tensors.
4. Update Gate in Action
Next, we need to incorporate the effect of the update gate Z_t. This determines the extent to which the new state H_t is just the old state H_{t−1} and the extent to which the new candidate state H̃_t is used. The gating variable Z_t can be used for this purpose, simply by taking elementwise convex combinations of H_{t−1} and H̃_t. This leads to the final update equation of the GRU:
H_t = Z_t ⊙ H_{t−1} + (1 − Z_t) ⊙ H̃_t
SVG(filename='../img/gru_2.svg')
Fig. 10.8.2: Computation of the candidate hidden state in a GRU. The multiplication is carried out elementwise.
Whenever the update gate Z_t is close to 1, we simply retain the old state. In this case the information from X_t is essentially ignored, effectively skipping time step t in the dependency chain. Whenever Z_t is close to 0, the new latent state H_t approaches the candidate latent state H̃_t. These designs help cope with the vanishing gradient problem in RNNs and better capture dependencies across large time-step distances. In summary, GRUs have the following two distinguishing features (a short shape sketch follows the list below):
- Reset gates help capture short-term dependencies in time series.
- Update gates help capture long-term dependencies in time series.
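Continuing the shape sketch from above (same assumed dimensions and hypothetical variable names, not part of the original text), the candidate state and the final update can be written as:

W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
# Reset the previous state elementwise before it enters the recurrence
H_tilde = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)  # candidate hidden state, shape (n, h)
# Convex combination controlled by the update gate: entries of Z_t near 1
# keep the old state, entries near 0 adopt the candidate
H_t = Z_t * H_prev + (1 - Z_t) * H_tilde                        # new hidden state, shape (n, h)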
5. Implementation from Scratch
To better understand this model, let's implement a GRU from scratch.
1. Read dataset
We first read The Time Machine corpus we used in section 10.5. The code to read the data set is given below.
import sys
sys.path.insert(0, '..')

import d2l
import torch
import torch.nn as nn
from d2l import RNNModel
from d2l import load_data_time_machine
from d2l import train_and_predict_rnn
from d2l import train_and_predict_rnn_nn

torch.set_default_tensor_type('torch.cuda.FloatTensor')
corpus_indices, vocab = load_data_time_machine()
SVG(filename='../img/gru_3.svg')
Fig. 10.8.3: Computation of the hidden state in a GRU. As before, the multiplication is carried out elementwise.
2. Initialize model parameters
The next step is to initialize the model parameters. We draw the weights from a Gaussian distribution with standard deviation 0.01 and set the biases to 0. We instantiate all the terms relating to the update gate, the reset gate, and the candidate hidden state. Then we attach gradients to all of the parameters.
num_inputs, num_hiddens, num_outputs = len(vocab), 256, len(vocab)
device = d2l.try_gpu()
print('Using', device)
Using cpu
def get_params():
    def _one(shape):
        return torch.randn(shape, device=device).normal_(std=0.01)

    def _three():
        return (_one((num_inputs, num_hiddens)),
                _one((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    W_xz, W_hz, b_z = _three()  # Update gate parameters
    W_xr, W_hr, b_r = _three()  # Reset gate parameters
    W_xh, W_hh, b_h = _three()  # Candidate hidden state parameters
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # Attach gradients
    params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params
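As a quick sanity check (a hypothetical snippet, not part of the original), we can instantiate the parameters once and inspect the shapes of the update gate terms:

params = get_params()
print([tuple(p.shape) for p in params[:3]])  # W_xz, W_hz, b_z
# Expected: [(len(vocab), 256), (256, 256), (256,)]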
3. Define model
Now we define the hidden state initialization function init_gru_state. Just like the init_rnn_state function defined in section 10.5, it returns a tuple containing a tensor of shape (batch size, number of hidden units) whose values are all zero.
def init_gru_state(batch_size, num_hiddens, device):
    return (torch.zeros(size=(batch_size, num_hiddens), device=device), )
Now we are ready to define the actual model. Its structure is the same as that of a basic RNN cell, except that the update equations are more complex; the code below essentially transcribes the formulas defined above.
def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    sigmoid, tanh = nn.Sigmoid(), nn.Tanh()
    for X in inputs:
        # Update gate, reset gate, and candidate hidden state
        Z = sigmoid(torch.matmul(X.float(), W_xz) + torch.matmul(H.float(), W_hz) + b_z)
        R = sigmoid(torch.matmul(X.float(), W_xr) + torch.matmul(H.float(), W_hr) + b_r)
        H_tilda = tanh(torch.matmul(X.float(), W_xh) + torch.matmul(R * H.float(), W_hh) + b_h)
        # Convex combination of the old state and the candidate
        H = Z * H.float() + (1 - Z) * H_tilda
        Y = torch.matmul(H.float(), W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)
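As a quick smoke test (a hypothetical snippet with made-up sizes, not part of the original), we can feed a short dummy sequence through gru and check the output shapes:

test_steps, test_batch = 5, 2
X_seq = torch.zeros(test_steps, test_batch, num_inputs, device=device)  # stand-in for one-hot inputs
state = init_gru_state(test_batch, num_hiddens, device)
outputs, (H,) = gru(X_seq, state, get_params())
print(len(outputs), outputs[0].shape, H.shape)
# Expected: 5, (test_batch, num_outputs), (test_batch, num_hiddens)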
4. Training and Prediction
Training and prediction work exactly as before. That is, we need to define the number of epochs, the number of truncation steps, the minibatch size, the learning rate, and how aggressively we clip the gradients. Finally, we generate a 50-character string based on the prefixes 'traveller' and 'time traveller'.
num_epochs, num_steps, batch_size, lr, clipping_theta = 100, 35, 32, 1, 1
prefixes = ['traveller', 'time traveller']
train_and_predict_rnn(gru, get_params, init_gru_state, num_hiddens,
                      corpus_indices, vocab, device, False, num_epochs,
                      num_steps, lr, clipping_theta, batch_size, prefixes)
epoch 50, perplexity 11.929022, time 449.93 sec
epoch 100, perplexity 9.153454, time 436.90 sec
 - travellere the the the the the the the the the the the the
 - time travellere the the the the the the the the the the the the
6. Concise Implementation
In PyTorch, we can directly instantiate the GRU class from the nn module. This encapsulates all the configuration details that we made explicit above. The code is significantly faster, as it uses compiled operators rather than Python for many of the details spelled out before.
gru_layer = nn.GRU(input_size=num_inputs, hidden_size=num_hiddens)
model = RNNModel(gru_layer, num_hiddens, len(vocab))
model.to(device)
train_and_predict_rnn_nn(model, num_hiddens, init_gru_state, corpus_indices,
                         vocab, device, num_epochs*5, num_steps, lr,
                         clipping_theta, batch_size, prefixes)
7. Summary
1. Gated recurrent neural networks are better at capturing dependencies in time series with large time-step distances.
2. Reset gates help capture short-term dependencies in time series.
3. Update gates help capture long-term dependencies in time series.
4. Whenever the reset gate is switched on, the GRU contains the basic RNN as an extreme case; via the update gate it can also ignore parts of a sequence as needed.
8. Exercises
1. Compare the running time, perplexity, and generated output strings of the nn.RNN and nn.GRU implementations with each other.
2. Suppose we only want to use the input at time step t′ to predict the output at time step t > t′. What are the best values for the reset and update gates at each time step?
3. Adjust the hyperparameters and observe and analyze their impact on the running time, perplexity, and the generated text.
4. What happens if you implement only parts of the GRU? That is, implement a recurrent unit with only a reset gate, and, similarly, a recurrent unit with only an update gate.