Zaiyan Xu"A fulfilled life is moving from open options to sweet compulsions." - David Brooks
https://www.zaiyanxu.com/
Fri, 29 Mar 2024 13:35:47 +0000Fri, 29 Mar 2024 13:35:47 +0000Jekyll v3.9.5Title of post 1
<h3 id="introduction">Introduction</h3>
<p>Recurrent Neural Networks and their variants are prone to overfitting the training data. Unfolding each RNN cell over time produces a very large network, while the number of parameters (which are shared across time steps) and the amount of training data remain <em>relatively</em> small. As a result, the perplexities obtained on the test data are often considerably higher than expected. Several attempts have been made to reduce this problem using various <strong>regularization</strong> techniques. This paper tackles the issue by proposing a model that combines several such existing methods.</p>
<p><em>Merity et al</em>’s model is a modification of the standard <strong>LSTM</strong> in which <em>DropConnect</em> is applied to the hidden weights in the <em>recurrent</em> connections of the LSTM for regularization. The dropout mask for each weight is preserved and the same mask is used across all time steps, thereby adding negligible computation overhead. Apart from this, several other techniques have been incorporated :</p>
<ul>
<li><strong>Variational dropout</strong> : A dropout mask is sampled once and then reused at every time step within a given forward and backward pass. In the paper this is applied to the inputs and outputs of the LSTM, while the recurrent connections are handled by the DropConnect scheme described above. Each example in a mini-batch gets its own dropout mask, so the regularizing effect is not identical across different inputs.</li>
<li><strong>Embedding dropout</strong> : Dropout with probability \(p_e\) is applied to the embedding matrix at the word level, so the vectors of the dropped words become identically zero. The remaining word vectors are scaled by \(\frac{1}{1-p_e}\) as compensation.</li>
<li><strong>AR and TAR</strong> : AR (Activation Regularization) and TAR (Temporal Activation Regularization) are modifications of \(L_2\) regularization, wherein the standard technique is applied to dropped <em>output activations</em> and dropped <em>change in output activations</em> respectively. Mathematically, the additional terms in the cost function \(J\) are (here \(\alpha\) and \(\beta\) are scaling constants and \(\textbf{D}\) is the dropout mask) :</li>
</ul>
\[J_{AR}=\alpha L_2\left(\textbf{D}_l^t\odot h_l^t\right)\\
J_{TAR}=\beta L_2\left(\textbf{D}_l^t\odot\left(h_l^t - h_l^{t-1}\right)\right)\]
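<p>The two penalty terms above can be sketched in plain Python. This is only an illustration under stated assumptions: \(L_2\) is taken to be the Euclidean norm, the activations are flat lists, and the helper names (<code class="language-plaintext highlighter-rouge">l2_norm</code>, <code class="language-plaintext highlighter-rouge">ar_penalty</code>, <code class="language-plaintext highlighter-rouge">tar_penalty</code>) and the example mask are hypothetical.</p>

```python
import math

def l2_norm(vec):
    """Euclidean norm of a flat list (assumed meaning of L_2 here)."""
    return math.sqrt(sum(x * x for x in vec))

def ar_penalty(h_t, mask, alpha):
    """AR term: alpha * L_2(D * h_t), with D applied elementwise."""
    dropped = [m * h for m, h in zip(mask, h_t)]
    return alpha * l2_norm(dropped)

def tar_penalty(h_t, h_prev, mask, beta):
    """TAR term: beta * L_2(D * (h_t - h_prev)), with D applied elementwise."""
    dropped_diff = [m * (a - b) for m, a, b in zip(mask, h_t, h_prev)]
    return beta * l2_norm(dropped_diff)

# Hypothetical activations and an unscaled 0/1 mask, chosen for readability.
h_prev = [0.0, 1.0, 2.0]
h_t = [3.0, 1.0, 6.0]
mask = [1.0, 0.0, 1.0]

ar = ar_penalty(h_t, mask, alpha=2.0)            # 2 * sqrt(3^2 + 6^2)
tar = tar_penalty(h_t, h_prev, mask, beta=1.0)   # sqrt(3^2 + 4^2)
print(ar, tar)
```

<p>Both terms would simply be added to the cost \(J\) during training; they penalize large activations and large step-to-step changes in activations respectively.</p>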
<ul>
<li><strong>Weight tying</strong> : In this method, the parameters for word embeddings and the final output layer are shared.</li>
<li><strong>Variable backpropagation steps</strong> : Instead of a fixed number of BPTT steps, a random number is used whose mean is very close to the original fixed value (\(s\)). The BPTT step-size (\(x\)) is drawn from the following mixture (here \(\mathcal{N}\) is the Gaussian distribution, \(p\) is a probability close to 1, e.g. 0.95, and \(\sigma^2\) is the desired variance) :</li>
</ul>
\[x \sim p\cdot \mathcal{N}\left(s,\sigma^2\right) + (1-p)\cdot \mathcal{N}\left(\frac{s}{2},\sigma^2\right)\]
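<p>Sampling from this mixture can be sketched as below. The function name <code class="language-plaintext highlighter-rouge">sample_bptt_length</code> is hypothetical, and rounding to an integer and clamping to a minimum length are assumptions added so that the result is usable as a sequence length.</p>

```python
import random

def sample_bptt_length(s, sigma, p=0.95, rng=random, min_len=1):
    """Draw one BPTT length from p*N(s, sigma^2) + (1-p)*N(s/2, sigma^2).

    `sigma` is the standard deviation, i.e. the square root of the
    variance sigma^2 used in the formula.
    """
    # Pick the mixture component first, then sample from its Gaussian.
    mean = s if rng.random() < p else s / 2
    length = int(round(rng.gauss(mean, sigma)))
    # Assumption: never return a non-positive length.
    return max(min_len, length)

rng = random.Random(0)
lengths = [sample_bptt_length(70, 5 ** 0.5, p=0.95, rng=rng) for _ in range(5)]
print(lengths)
```

<p>With \(p\) close to 1 most draws land near \(s\), and the occasional draw near \(\frac{s}{2}\) shifts where the sequence boundaries fall.</p>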
<ul>
<li><strong>Independent sizes of word embeddings and hidden layer</strong> : The sizes of the hidden layer and word embeddings are kept independent of each other.</li>
</ul>
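<p>As an illustration of one item from the list above, embedding dropout could look roughly like this. It is a sketch, not the authors' implementation: the embedding matrix is a list of rows (one per word) and <code class="language-plaintext highlighter-rouge">embedding_dropout</code> is a hypothetical name.</p>

```python
import random

def embedding_dropout(embedding, p_e, rng):
    """Zero whole word vectors with probability p_e and rescale the
    surviving ones by 1 / (1 - p_e). Returns a new matrix."""
    scale = 1.0 / (1.0 - p_e)
    dropped = []
    for row in embedding:
        if rng.random() < p_e:
            # The whole word is dropped: its vector becomes identically zero.
            dropped.append([0.0] * len(row))
        else:
            dropped.append([x * scale for x in row])
    return dropped

rng = random.Random(0)
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 hypothetical words, size 2
print(embedding_dropout(embedding, p_e=0.1, rng=rng))
```

<p>Because the mask is drawn per row rather than per element, every occurrence of a dropped word is removed for that pass.</p>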
<p>The paper also introduces a new optimization algorithm, namely <strong>Non-monotonically Triggered Averaged Stochastic Gradient Descent</strong> or NT-ASGD, which can be programmatically described as follows :</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def NT_ASGD(f, stop, w0, n, L, lr):
    """
    Input parameters :
    f    - objective function
    stop - stopping criterion (callable, returns True when training should end)
    w0   - initial parameters
    n    - non-monotonicity interval (number of validation checks to look back past)
    L    - logging interval (iterations between validation checks)
    lr   - learning rate
    Returns :
    average of the iterates from the trigger point `T` onwards
    """
    # `params[i]` holds the iterate after `i` updates, so `params[0]` is `w0`
    k = 0; T = 0; t = 0; params = [w0]; logs = []
    w = w0
    while not stop(w):
        # `func_grad` (assumed helper) computes gradient of `f` at `w`
        w = w - lr * func_grad(f, w)
        params.append(w)
        k += 1
        # Validate every `L` iterations until averaging has been triggered
        if k % L == 0 and T == 0:
            # `perplexity` (assumed helper) evaluates the current parameters
            v = perplexity(w)
            # Trigger if `v` is worse than the best value logged more than
            # `n` checks ago
            if t > n and v > min(logs[:t-n]):
                T = k
            logs.append(v)
            t += 1
    # Return the average of the last `k-T+1` iterates (numeric or array-like)
    return sum(params[T:]) / (k - T + 1)</code></pre></figure>
<p>They also combined their <strong>AWD-LSTM</strong> (ASGD Weight-Dropped LSTM) with a neural cache model to reduce perplexities further. A <em>neural cache model</em> stores previous hidden states in memory and predicts the next word from a <em>convex combination</em> of the output computed from the stored states and the output of the AWD-LSTM.</p>
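<p>A rough sketch of such a convex combination is given below. The exact form of the cache distribution is an assumption here (similarity between the current state and each stored state, credited to the word that followed the stored state), and the names <code class="language-plaintext highlighter-rouge">cache_distribution</code>, <code class="language-plaintext highlighter-rouge">mix</code>, <code class="language-plaintext highlighter-rouge">theta</code> and <code class="language-plaintext highlighter-rouge">lam</code> are hypothetical.</p>

```python
import math

def cache_distribution(h_t, memory, vocab_size, theta=1.0):
    """Distribution over words built from stored states.

    `memory` is a non-empty list of (stored_state, next_word_id) pairs.
    Each stored state votes for the word that followed it, weighted by
    exp(theta * dot(h_t, stored_state)).
    """
    if not memory:
        raise ValueError("memory must contain at least one stored state")
    scores = [0.0] * vocab_size
    for h_i, word_id in memory:
        dot = sum(a * b for a, b in zip(h_t, h_i))
        scores[word_id] += math.exp(theta * dot)
    total = sum(scores)
    return [s / total for s in scores]

def mix(p_model, p_cache, lam):
    """Convex combination: (1 - lam) * model + lam * cache."""
    return [(1.0 - lam) * m + lam * c for m, c in zip(p_model, p_cache)]

p_model = [0.5, 0.3, 0.2]                         # hypothetical LSTM output
memory = [([1.0, 0.0], 2), ([0.0, 1.0], 0)]       # hypothetical stored states
p_cache = cache_distribution([1.0, 0.0], memory, vocab_size=3)
p_final = mix(p_model, p_cache, lam=0.2)
print(p_final)
```

<p>Since both inputs are probability distributions and <code class="language-plaintext highlighter-rouge">lam</code> lies in \([0,1]\), the mixed output is again a valid distribution.</p>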
<h3 id="network-description">Network description</h3>
<p><em>Merity et al</em>’s model used a 3-layer weight-dropped LSTM with dropout probability <code class="language-plaintext highlighter-rouge">0.5</code> for the <strong>PTB corpus</strong> and <code class="language-plaintext highlighter-rouge">0.65</code> for <strong>WikiText-2</strong>, combined with several of the above regularization techniques. The hyperparameters (as referred to in the discussion above) are as follows : hidden layer size (\(H\)) = <code class="language-plaintext highlighter-rouge">1150</code>, embedding size (\(D\)) = <code class="language-plaintext highlighter-rouge">400</code>, number of epochs = <code class="language-plaintext highlighter-rouge">750</code>, \(L\) = <code class="language-plaintext highlighter-rouge">1</code>, \(n\) = <code class="language-plaintext highlighter-rouge">5</code>, learning rate = <code class="language-plaintext highlighter-rouge">30</code>, gradients clipped at <code class="language-plaintext highlighter-rouge">0.25</code>, \(p\) = <code class="language-plaintext highlighter-rouge">0.95</code>, \(s\) = <code class="language-plaintext highlighter-rouge">70</code>, \(\sigma^2\) = <code class="language-plaintext highlighter-rouge">5</code>, \(\alpha\) = <code class="language-plaintext highlighter-rouge">2</code>, \(\beta\) = <code class="language-plaintext highlighter-rouge">1</code>, and dropout probabilities for the input, hidden outputs, final output and embeddings of <code class="language-plaintext highlighter-rouge">0.4</code>, <code class="language-plaintext highlighter-rouge">0.3</code>, <code class="language-plaintext highlighter-rouge">0.4</code> and <code class="language-plaintext highlighter-rouge">0.1</code> respectively.</p>
<p>Word embedding weights were initialized from \(\mathcal{U}\left[-0.1,0.1\right]\) and all other weights from \(\mathcal{U}\left[-\frac{1}{\sqrt{1150}},\frac{1}{\sqrt{1150}}\right]\). A mini-batch size of <code class="language-plaintext highlighter-rouge">40</code> was used for PTB and <code class="language-plaintext highlighter-rouge">80</code> for WikiText-2.</p>
<h3 id="result-highlights">Result highlights</h3>
<ul>
<li>3-layer AWD-LSTM with weight tying attained 57.3 PPL on PTB</li>
<li>3-layer AWD-LSTM with weight tying and a continuous cache pointer attained 52.8 PPL on PTB</li>
</ul>
Wed, 10 Jan 2018 15:10:00 +0000
https://www.zaiyanxu.com/blog/2018/01/post-1
https://www.zaiyanxu.com/blog/2018/01/post-1rnndiscussionTitle of post 2
<h3 id="introduction">Introduction</h3>
<p>Regularization is an important part of training neural networks. It helps the model generalize by reducing the risk of overfitting the training data. There are several regularization techniques, the major ones being L2, L1, elastic-net (a linear combination of L2 and L1 regularization), dropout and drop-connect. L2, L1 and elastic-net regularization work by preventing the trainable parameters (or <em>weights</em>) from attaining large values, so that slight changes in the input do not cause drastic changes in the output; in other words, they prefer <em>diffused</em> weights to <em>peaked</em> ones. <strong>Dropout</strong> and drop-connect instead work by averaging the task over a large, dynamically and randomly generated <em>ensemble</em> of networks. These networks are obtained by randomly disconnecting neurons (<em>dropout</em>) or weights (<em>drop-connect</em>) from the original network, giving a subnetwork on which training is carried out (each for only \(\approx1\) epoch, since the same subnetwork is very unlikely to be generated again).</p>
<p>Applying dropout to feedforward neural networks gives promising results. RNNs, however, are thought of as individual <em>cells</em> that are <em>unfolded</em> over several time-steps, with the input at each time-step being a token of the sequence. When dropout is used to regularize such a network, the ‘disturbance’ it generates at each time step propagates over a long interval, thereby decreasing the network’s ability to represent long-range dependencies. Thus, applying dropout in the standard manner to RNNs fails to give any improvement. This is where <em>Zaremba et al</em>’s research comes into the picture.</p>
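<p>The difference between dropping neurons and dropping weights can be made concrete with a small sketch. It is deliberately simplified: a single linear layer stored as a list of rows, no rescaling of the surviving values, and hypothetical function names.</p>

```python
import random

def dropout_forward(W, x, p, rng):
    """Dropout: zero whole input units, then apply the full weight matrix."""
    x_kept = [xi if rng.random() >= p else 0.0 for xi in x]
    return [sum(w * xi for w, xi in zip(row, x_kept)) for row in W]

def dropconnect_forward(W, x, p, rng):
    """Drop-connect: zero individual weights, keep every input unit."""
    W_kept = [[w if rng.random() >= p else 0.0 for w in row] for row in W]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_kept]

rng = random.Random(0)
W = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical 2x2 layer
x = [1.0, 1.0]
print(dropout_forward(W, x, p=0.5, rng=rng))
print(dropconnect_forward(W, x, p=0.5, rng=rng))
```

<p>Each call samples a fresh mask, which corresponds to picking a different subnetwork from the ensemble.</p>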
<h3 id="network-description">Network description</h3>
<p>The network architecture that <em>Zaremba et al</em> proposed is simple and intuitive. In a deep RNN (i.e. an RNN with several layers (\(h\)), where the output of \(h_{l-1}^{t}\) is used as the input for \(h_l^t\)), all connections between the cells in the unfolded state can be broadly classified into two categories - <em>recurrent</em> and <em>non-recurrent</em>. The connections between cells in the same layer i.e. \(h_l^t ~-~ h_l^{t+1}~~\forall t\) are <em>recurrent</em> connections, and those between cells in adjacent layers i.e. \(h_l^t ~-~ h_{l+1}^t~~\forall l\) are <em>non-recurrent</em> connections. <em>Zaremba et al</em> suggested that <strong>dropout</strong> should be applied only to non-recurrent connections, so that the disturbance it introduces is not carried along the recurrent path from one time step to the next.</p>
<p>The modified network for LSTM units can be mathematically represented as follows:</p>
<p>Denoting \(T_{m,n}\) as an affine transformation from \(\mathbb{R}^m\rightarrow\mathbb{R}^n\) ( i.e. \(T_{m,n}(x)=Wx+b\) where \(x\in\mathbb{R}^{m\times1}\), \(W\in\mathbb{R}^{n\times m}\) and \(b\in\mathbb{R}^{n\times1}\) and similarly for multiple inputs) and \(\otimes\) as elementwise multiplication, we have :</p>
\[f_l^t=\text{sigmoid}\left(T_{N,D}^1\left(\textbf{D}\left(h_{l-1}^t\right) ; h_l^{t-1}\right)\right) \\
i_l^t=\text{sigmoid}\left(T_{N,D}^2\left(\textbf{D}\left(h_{l-1}^t\right) ; h_l^{t-1}\right)\right) \\
o_l^t=\text{sigmoid}\left(T_{N,D}^3\left(\textbf{D}\left(h_{l-1}^t\right) ; h_l^{t-1}\right)\right) \\
u_l^t=\text{tanh}\left(T_{N,D}^4\left(\textbf{D}\left(h_{l-1}^t\right) ; h_l^{t-1}\right)\right) \\
c_l^t=c_l^{t-1}\otimes f_l^t + u_l^t\otimes i_l^t \\
h_l^t = \text{tanh}\left(c_l^t\right)\otimes o_l^t\]
<p>Here, \(\textbf{D}\) is the dropout <em>layer</em> or operator which sets a subset of its input randomly to zero with dropout probability <code class="language-plaintext highlighter-rouge">p</code>. This modification can be adopted for any other RNN architecture.</p>
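<p>One time step of such a cell can be sketched in plain Python as follows. This is a minimal illustration rather than the paper's code: vectors are lists, each gate has its own weight matrix and bias stored under the hypothetical keys <code class="language-plaintext highlighter-rouge">"f"</code>, <code class="language-plaintext highlighter-rouge">"i"</code>, <code class="language-plaintext highlighter-rouge">"o"</code>, <code class="language-plaintext highlighter-rouge">"u"</code>, and rescaling the kept values by \(\frac{1}{1-p}\) is an assumption.</p>

```python
import math
import random

def dropout(vec, p, rng):
    """Operator D: zero each entry with probability p (kept values are
    rescaled by 1/(1-p), which is an assumption of this sketch)."""
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in vec]

def affine(W, b, x):
    """T(x) = W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(params, h_below, h_prev, c_prev, p, rng):
    """One step for layer l at time t. Dropout touches only `h_below`
    (the non-recurrent input); `h_prev` and `c_prev` pass through intact."""
    x = dropout(h_below, p, rng) + h_prev          # concatenation of the inputs
    f = [sigmoid(z) for z in affine(*params["f"], x)]
    i = [sigmoid(z) for z in affine(*params["i"], x)]
    o = [sigmoid(z) for z in affine(*params["o"], x)]
    u = [math.tanh(z) for z in affine(*params["u"], x)]
    c = [cp * ft + ut * it for cp, ft, ut, it in zip(c_prev, f, u, i)]
    h = [math.tanh(ct) * ot for ct, ot in zip(c, o)]
    return h, c

rng = random.Random(0)
n = 3  # hypothetical hidden size
params = {
    gate: ([[rng.uniform(-0.05, 0.05) for _ in range(2 * n)] for _ in range(n)],
           [0.0] * n)
    for gate in "fiou"
}
h, c = lstm_step(params, [1.0, -1.0, 0.5], [0.1] * n, [0.2] * n, p=0.5, rng=rng)
print(h, c)
```

<p>The recurrent inputs never pass through <code class="language-plaintext highlighter-rouge">dropout</code>, which is the whole point of the scheme: information carried across time steps is left undisturbed.</p>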
<p><em>Zaremba et al</em> used the following two architectures in their experiments. In both, each cell was unrolled for 35 steps and the mini-batch size was 20 :</p>
<p><strong>Medium LSTM</strong> :
Hidden-layer dimension = <code class="language-plaintext highlighter-rouge">650</code>,
Weights initialized uniformly in <code class="language-plaintext highlighter-rouge">[-0.05,0.05]</code>,
Dropout probability = <code class="language-plaintext highlighter-rouge">0.5</code>,
Number of epochs = <code class="language-plaintext highlighter-rouge">39</code>,
Learning rate = <code class="language-plaintext highlighter-rouge">1</code>, held for the first 6 epochs and then divided by <code class="language-plaintext highlighter-rouge">1.2</code> after each subsequent epoch,
Gradients clipped at <code class="language-plaintext highlighter-rouge">5</code>.</p>
<p><strong>Large LSTM</strong> :
Hidden-layer dimension = <code class="language-plaintext highlighter-rouge">1500</code>,
Weights initialized uniformly in <code class="language-plaintext highlighter-rouge">[-0.04,0.04]</code>,
Dropout probability = <code class="language-plaintext highlighter-rouge">0.65</code>,
Number of epochs = <code class="language-plaintext highlighter-rouge">55</code>,
Learning rate = <code class="language-plaintext highlighter-rouge">1</code>, held for the first 14 epochs and then divided by <code class="language-plaintext highlighter-rouge">1.15</code> after each subsequent epoch,
Gradients clipped at <code class="language-plaintext highlighter-rouge">10</code>.</p>
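<p>The learning-rate schedules of the two configurations can be written as a small function. The sketch assumes one particular reading of the settings above: the rate stays constant for the first few epochs and is then divided by the decay factor after every later epoch. The name <code class="language-plaintext highlighter-rouge">learning_rate</code> and its parameters are hypothetical.</p>

```python
def learning_rate(epoch, base_lr=1.0, decay=1.2, hold_epochs=6):
    """Learning rate for a 1-indexed epoch.

    Constant for the first `hold_epochs` epochs, then divided by `decay`
    once for every epoch after that (assumed reading of the schedule).
    """
    return base_lr / decay ** max(0, epoch - hold_epochs)

# Medium LSTM: 39 epochs, decay 1.2 after epoch 6.
medium = [learning_rate(e) for e in range(1, 40)]
# Large LSTM: 55 epochs, decay 1.15 after epoch 14.
large = [learning_rate(e, decay=1.15, hold_epochs=14) for e in range(1, 56)]
print(medium[:8])
print(large[12:16])
```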
<h3 id="result-highlights">Result highlights</h3>
<ul>
<li>78.4 PPL on Penn TreeBank dataset using a single <strong>Large LSTM</strong></li>
<li>68.7 PPL on Penn TreeBank dataset using an ensemble of 38 <strong>Large LSTM</strong>s</li>
</ul>
Wed, 10 Jan 2018 15:01:00 +0000
https://www.zaiyanxu.com/blog/2018/01/post-2
https://www.zaiyanxu.com/blog/2018/01/post-2rnnregularization