Deep learning for sequence data

Ch 6 in Deep Learning with Python (DLP) deals with sequence data such as text (sequence of words) and time series.

Text Vectorization

We need to translate texts into numerical tensors for analysis (i.e., vectorize the text). There are 3 ways to do it: one-hot encoding of tokens, one-hot hashing of tokens, and token embedding. A token is the unit text is broken into, such as a word or a character.

One-hot encoding requires a dictionary of unique tokens, say words here. Each word is encoded as a vector that records a 1 at that word's position in the dictionary and 0 everywhere else. This is the technique we used in our Ch 3 IMDb example.
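
As a rough sketch of what this looks like in code (using NumPy and a toy two-sentence corpus rather than the actual IMDb data), word-level one-hot encoding can be done like this:

```python
import numpy as np

# Toy corpus; in the book these would be IMDb reviews.
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build a dictionary mapping each unique word to an integer index (0 is reserved).
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# One-hot encode: one vector per word, with a 1 at the word's index and 0 elsewhere.
max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.
```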

One-hot hashing is used when the required dictionary would be too big. Instead of maintaining an explicit index, we hash words into vectors of fixed size, which saves memory and allows online encoding. However, we must keep the hashing space much bigger than the total number of unique tokens being hashed; otherwise hash collisions can occur, where two different words get the same hash and the model can no longer tell them apart.
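
A sketch of the hashing variant on the same toy corpus; the dimensionality of 1,000 is an arbitrary illustrative choice, and Python's built-in hash() stands in for whatever hash function one prefers:

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Hash each word into a fixed-size space instead of keeping an explicit index.
# If dimensionality is not much larger than the number of unique words,
# collisions become likely and distinct words share the same encoding.
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = abs(hash(word)) % dimensionality  # built-in hash(), for illustration only
        results[i, j, index] = 1.
```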

Because one-hot encoding and hashing essentially hardcode the text, the resulting vectors are sparse and high-dimensional. A word embedding, on the other hand, is learned from data, and the resulting vectors are dense and low-dimensional. We can either learn the embedding on our own data, or use an embedding pre-trained on a different dataset.
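
As an illustration (assuming the Keras API used throughout the book; the vocabulary size, embedding dimension, and review length below are illustrative rather than the exact values from our experiments), an embedding can be learned jointly with a simple classifier:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

max_features = 10000  # vocabulary size (assumed)
maxlen = 20           # words kept per review (assumed)

model = Sequential([
    # Learns a dense 8-dimensional vector for each of the 10,000 tokens.
    Embedding(max_features, 8, input_length=maxlen),
    Flatten(),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```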

Our plots from the sentiment analysis of IMDb movie reviews indicate a better result when using a pre-trained embedding (see here for details). However, this certainly doesn’t have to be the case for other projects.

Recurrent Neural Network (RNN)

The networks we’ve seen up to this point do not keep any state between inputs. They treat each input independently; we call this type of network a feedforward network.

However, in sequence data the inputs are typically correlated with one another. For example, each movie review is a sequence of words: while reading the review, we keep a memory of the words that came before. Another example is the air temperature recorded every 10 minutes over 8 years in Jena, Germany. The temperature data form a time-series sequence: if the air temperature was high 10 minutes ago, the current temperature is likely to be high as well.

Therefore, one popular network for analyzing sequence data is the RNN. Ch 6 introduces 4 types of RNN: Simple RNN, LSTM, GRU, and bidirectional RNN. A Simple RNN unrolled over many timesteps behaves like a very deep network, so it can suffer from the vanishing gradient problem and become untrainable. LSTM and GRU are the solution to this problem.

LSTM saves information in a separate carry track and transports it across timesteps, so older signals are not gradually washed out. GRU works on the same principle but has a faster run time, albeit being somewhat less powerful. A bidirectional RNN processes a sequence in both directions (chronologically and antichronologically).
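
As a sketch (again assuming Keras, with illustrative layer sizes), the four variants differ only in which recurrent layer is plugged into an otherwise identical model:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

max_features = 10000  # vocabulary size (assumed)

model = Sequential([
    Embedding(max_features, 32),
    # Wrap the recurrent layer in Bidirectional(...) to read the sequence
    # both chronologically and antichronologically; swap in GRU(32) for a
    # cheaper alternative, or SimpleRNN(32) for the basic version.
    Bidirectional(LSTM(32)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```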

To regularize an RNN, we need to follow a specific dropout procedure: the same dropout mask must be applied at every timestep (rather than a new random mask per timestep), and a temporally constant dropout mask must also be applied to the inner recurrent activations of the layer (recurrent dropout). We tried all 4 RNNs and the dropout method to forecast air temperature in Jena, Germany (see here for details).
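
In Keras, this procedure corresponds to the dropout and recurrent_dropout arguments of the recurrent layers; the sketch below uses illustrative rates rather than the exact values from our Jena experiments:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

num_features = 14  # features per timestep in the Jena dataset (assumed)

model = Sequential([
    GRU(32,
        dropout=0.2,            # mask on the layer's inputs, identical at every timestep
        recurrent_dropout=0.2,  # temporally constant mask on the recurrent activations
        input_shape=(None, num_features)),
    Dense(1),  # regression output: the future temperature
])
model.compile(optimizer='rmsprop', loss='mae')
```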

1D Convnet

It’s clear that order matters for the Jena climate data: more recent observations play a bigger role in forecasting temperature, so an RNN is a good fit. For text data, however, a keyword found at the beginning of the text is just as important as one found at the end.

In Ch 5, we used 2D convnets for computer vision. Here, we use a 1D convnet on the movie reviews, our text data. Another strategy is to run the data first through a 1D convnet and then through an RNN, which is particularly helpful for long sequences: the 1D convnet shortens the sequences, and the shorter sequences make using an RNN feasible. A sketch of this combined model follows.
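
A minimal sketch of the convnet-then-RNN idea, assuming Keras and the Jena data's 14 features per timestep; the filter counts, window sizes, and dropout rates are illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GRU, Dense

num_features = 14  # features per timestep in the Jena dataset (assumed)

model = Sequential([
    # Conv1D + pooling downsample the long input sequence...
    Conv1D(32, 5, activation='relu', input_shape=(None, num_features)),
    MaxPooling1D(3),
    Conv1D(32, 5, activation='relu'),
    # ...so the (more expensive) recurrent layer sees a much shorter sequence.
    GRU(32, dropout=0.1, recurrent_dropout=0.5),
    Dense(1),
])
model.compile(optimizer='rmsprop', loss='mae')
```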