Deep Learning Introduction Recurrent Neural Network - Text Preprocessing


The core content comes from blog link 1 and blog link 2; please support the original authors.
This article is for my own notes, to prevent forgetting.

Recurrent Neural Networks - Text Preprocessing


In the previous section we examined the statistical tools needed for processing sequence data and the challenges involved in forecasting. Such data exists in many forms, and text is one of the most common examples: an article can simply be viewed as a sequence of words, or even a sequence of characters. In this section, we walk through the common preprocessing steps for parsing text. These steps usually include:
1. Load the text into memory as a string.
2. Split the string into tokens (e.g., words or characters).
3. Build a vocabulary that maps the tokens to numerical indices.
4. Convert the text into a sequence of numerical indices for easy model manipulation.
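The four steps above can be sketched end-to-end on a toy string before we apply them to a real corpus. This is a minimal standalone illustration (the sample sentence is made up for the example; the full `Vocab` class below adds frequency filtering and reserved tokens):

```python
import collections
import re

# Hypothetical mini-corpus standing in for a loaded text file (step 1).
raw = "The Time Machine, by H. G. Wells."

# Step 2: normalize (letters only, lowercase) and split into word tokens.
line = re.sub('[^A-Za-z]+', ' ', raw).strip().lower()
tokens = line.split()

# Step 3: build a vocabulary mapping each unique token to an index,
# reserving index 0 for the unknown token.
counter = collections.Counter(tokens)
idx_to_token = ['<unk>'] + [tok for tok, _ in counter.most_common()]
token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}

# Step 4: convert the token sequence into a sequence of indices.
indices = [token_to_idx[tok] for tok in tokens]
print(tokens)
print(indices)
```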

import collections
import re
from d2l import torch as d2l

1 Reading the Dataset

First, we load text from H. G. Wells's The Time Machine. This is a fairly small corpus of just over 30,000 words, but enough for us to experiment with, whereas real-world document collections may contain billions of words. The following function reads the dataset into a list of text lines, where each line is a string. For simplicity, we ignore punctuation and capitalization here.

d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',
                                '090b5e7e70c295757f55df93cb0a180b9691891a')

def read_time_machine():  #@save
    """Load the time machine dataset into a list of text lines"""
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    # Replace runs of non-letters with a space, then trim and lowercase
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()
print(f'# Total lines of text: {len(lines)}')


# Total lines of text: 3221
the time machine by h g wells
twinkled and his usually pale face was flushed and animated the
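The per-line cleanup done by read_time_machine can be seen on a single sample string; this is a standalone sketch of just the regex step (the raw line here is hypothetical):

```python
import re

# A made-up raw line with punctuation, mixed case, and a trailing newline.
raw_line = 'Twinkled, and his usually pale face was FLUSHED!\n'

# Replace every run of non-letters with a single space, then trim and
# lowercase, exactly as read_time_machine does for each line.
clean = re.sub('[^A-Za-z]+', ' ', raw_line).strip().lower()
print(clean)
```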

2 Tokenization

The tokenize function below takes a list of text lines (lines) as input, where each element of the list is a text sequence (such as a line of text). Each text sequence is split into a list of tokens; a token is the basic unit of text. The function returns a list of token lists, where each token is a string.

def tokenize(lines, token='word'):  #@save
    """Split lines of text into word or character tokens"""
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('Error: unknown token type: ' + token)

tokens = tokenize(lines)
for i in range(11):
    print(tokens[i])


['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']
['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
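Passing token='char' instead splits each line into individual characters, spaces included. A minimal standalone version of that branch:

```python
# Character-level tokenization: each line becomes a list of single
# characters, mirroring tokenize(lines, 'char').
lines = ['the time machine']
char_tokens = [list(line) for line in lines]
print(char_tokens[0][:8])
```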

3 Vocabulary

The token type is a string, while the model requires numerical inputs, so strings are not convenient for the model to use. Now, let us build a dictionary, often called a vocabulary, that maps string tokens to numerical indices starting from 0. To do so, we first merge all documents in the training set, count the unique tokens in them, and call the result a corpus. Each unique token is then assigned a numerical index according to its frequency of occurrence. Rarely occurring tokens are usually removed, which reduces complexity. In addition, any token that does not exist in the corpus or has been removed is mapped to a special unknown token "&lt;unk&gt;". We can optionally add a list of reserved tokens, for example: the padding token ("&lt;pad&gt;"), the sequence-start token ("&lt;bos&gt;"), and the sequence-end token ("&lt;eos&gt;").

class Vocab:  #@save
    """text vocabulary"""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # Sort by frequency
        counter = count_corpus(tokens)
        self._token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                   reverse=True)
        # Unknown tokens have an index of 0
        self.idx_to_token = ['<unk>'] + reserved_tokens
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}
        for token, freq in self._token_freqs:
            if freq < min_freq:
                break
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

    @property
    def unk(self):  # Unknown tokens have an index of 0
        return 0

    @property
    def token_freqs(self):
        return self._token_freqs

def count_corpus(tokens):  #@save
    """Count the frequency of words"""
    # The tokens here are 1D list or 2D list
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # Flattens a list of tokens into a list
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)
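The flattening trick in count_corpus can be checked on a small 2-D token list; this is a standalone sketch using the same comprehension and collections.Counter:

```python
import collections

# A 2-D token list like the output of tokenize(); flatten it into a
# single 1-D list before counting, as count_corpus does.
tokens = [['the', 'time', 'machine'], ['the', 'time', 'traveller']]
flat = [token for line in tokens for token in line]
counter = collections.Counter(flat)
print(counter.most_common(2))
```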

We first use the Time Machine dataset as a corpus to build a vocabulary, and then print the top few high-frequency tokens and their indices.

vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[:10])


[('<unk>', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7), ('in', 8), ('that', 9)]

Now, we can convert each text line into a numerically indexed list.

for i in [0, 10]:
    print('text:', tokens[i])
    print('index:', vocab[tokens[i]])


text: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
index: [1, 19, 50, 40, 2183, 2184, 400]
text: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
index: [2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]
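The mapping also works in reverse via to_tokens, with out-of-vocabulary words collapsing to "&lt;unk&gt;". A minimal standalone round trip (the tiny mapping here stands in for a real Vocab instance):

```python
# A toy index mapping standing in for Vocab: index 0 is reserved for <unk>.
idx_to_token = ['<unk>', 'the', 'time', 'machine']
token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}

words = ['the', 'time', 'machine', 'dinosaur']     # 'dinosaur' is out-of-vocabulary
indices = [token_to_idx.get(w, 0) for w in words]  # unknown words map to <unk> (0)
restored = [idx_to_token[i] for i in indices]      # indices back to tokens
print(indices)
print(restored)
```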

4 Integrating All Functions

Using the functions above, we pack all the functionality into the load_corpus_time_machine function, which returns corpus (a list of token indices) and vocab (the vocabulary of the Time Machine corpus). The changes we make here are:
1. To simplify training in later chapters, we tokenize the text into characters rather than words;
2. Each text line in the Time Machine dataset is not necessarily a sentence or a paragraph (it may be a single word), so the returned corpus is flattened into a single list of indices rather than a list of token lists.

def load_corpus_time_machine(max_tokens=-1):  #@save
    """Returns the lexical index list and vocabulary list of the Time Machine dataset"""
    lines = read_time_machine()
    tokens = tokenize(lines, 'char')
    vocab = Vocab(tokens)
    # Since each text line in the Time Machine dataset is not necessarily a
    # sentence or a paragraph, flatten all lines into a single list
    corpus = [vocab[token] for line in tokens for token in line]
    if max_tokens > 0:
        corpus = corpus[:max_tokens]
    return corpus, vocab

corpus, vocab = load_corpus_time_machine()
len(corpus), len(vocab)


(170580, 28)
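The vocabulary size of 28 can be accounted for directly: after cleaning, the text contains only the 26 lowercase letters and the space character, plus the reserved "&lt;unk&gt;" token. A quick standalone sanity check of that arithmetic (not using the dataset itself):

```python
import string

# 26 lowercase letters + the space character + the reserved '<unk>' token
vocab_size = len(string.ascii_lowercase) + 1 + 1
print(vocab_size)
```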

5 Summary

  • Text is one of the most common forms of sequence data.

  • To preprocess text, we usually split the text into tokens, build a vocabulary to map token strings to numeric indices, and convert text data into token indices for model manipulation.

Tags: Python Deep Learning rnn

Posted by shauny17 on Mon, 28 Nov 2022 23:57:48 +0530