Python Andrew Ng (Wu Enda) Deep Learning Assignment 24--Keyword Speech Recognition

Keyword Speech Recognition

In this week's videos, you learned how to apply deep learning to speech recognition. In this assignment, you will build a speech dataset and implement an algorithm for keyword detection, sometimes called wake-word or trigger word detection. Keyword detection is the technology that lets devices such as Amazon Alexa, Google Home, Apple Siri, and Baidu DuerOS wake up when they hear a specific word.

For this exercise, our trigger word will be "activate". Every time the algorithm hears you say "activate", it makes a chiming sound. By the end of the assignment, you will be able to record your own speech and have the algorithm trigger a chime whenever it detects you saying "activate".

After completing this assignment, perhaps you can also extend it to run on your laptop so that every time you say "activate", it starts your favorite application, turns on a network-connected lamp in your house, or triggers some other event.

In this assignment, you will learn:

  • Structure a speech recognition project
  • Synthesize and process audio recordings to create training/development datasets
  • Train a keyword detection model and make predictions
import numpy as np
from pydub import AudioSegment
import random
import sys
import io
import os
import glob
import IPython
from td_utils import *
%matplotlib inline
d:\vr\virtual_environment\lib\site-packages\pydub\ RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)

1 Data synthesis: Creating a speech dataset

Let's start by building a dataset for the trigger word detection algorithm. Ideally, a voice dataset should be as close as possible to the application on which it will run. In this case, you want to detect the word "activate" in your work environment (library, home, office, open space...). Therefore, you need to mix positive words ("activate") and negative words (random words other than activate) on different background sounds to create recordings. Let's see how to create such a dataset.

1.1 Listening Data

A friend of yours is helping you with this project by visiting libraries, cafes, restaurants, homes and offices throughout the region to record background noise and audio clips of people saying positive/negative words. The dataset includes people who speak with a variety of accents.

In the raw_data directory, you can find a subset of the raw audio files: positive words, negative words, and background noise. You will use these audio files to synthesize a dataset for training the model. The "activate" directory contains positive examples of people saying the word "activate". The "negatives" directory contains negative examples of people saying random words other than "activate". Each audio recording contains a single word. The "backgrounds" directory contains 10-second clips of background noise recorded in different environments.

Run the cells below to try out some examples.


You will use these three types of recordings (positives/negatives/backgrounds) to create a labeled dataset.

1.2 From Recording to Spectrogram

What exactly is an audio recording? A microphone records small variations in air pressure over time, and it is these same variations that your ear perceives as sound. You can think of an audio recording as a long list of numbers measuring the small air pressure changes detected by the microphone. We will use audio sampled at 44100 Hz (44.1 kHz). This means the microphone gives us 44100 numbers per second, so a 10-second audio clip is represented by $10 \times 44100 = 441000$ numbers.

It is quite difficult to tell from this "raw" representation of audio whether the word "activate" was said. To help your sequence model learn to detect trigger words more easily, we will compute a spectrogram of the audio. A spectrogram tells us how much of each frequency is present in an audio clip at each moment in time.

(If you've taken an advanced course on signal processing or Fourier transforms: a spectrogram is computed by sliding a window over the raw audio signal and using a Fourier transform to calculate the most active frequencies in each window. Don't worry if you don't understand the previous sentence.)

Let's take an example.

x = graph_spectrogram("audio_examples/example_train.wav")

The graph above shows the activity of each frequency (y-axis) over multiple time steps (x-axis).

Figure 1: Spectrogram of an audio recording, where the color shows how strongly each frequency is present (how loud it is) in the audio at different points in time. Green squares mean a frequency is more active (louder) in the audio clip; blue squares denote less active frequencies.

The dimensions of the output spectrogram depend on the hyperparameters of the spectrogram software and on the length of the input. In this notebook, we will use 10-second audio clips as the "standard length" for training examples. The spectrogram has 5511 time steps. You will see later that the spectrogram is the input $x$ to the network, so $T_x = 5511$.

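from scipy.io import wavfile  # assumed source of wavfile; the original cell gets it through td_utils
_, data = wavfile.read("audio_examples/example_train.wav")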
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)
Time steps in audio recording before spectrogram (441000,)
Time steps in input after spectrogram (101, 5511)
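
The shape (101, 5511) can be reproduced with simple sliding-window arithmetic. The sketch below is an illustration under assumptions: the window length (200 samples) and overlap (120 samples) are hypothetical values picked because they reproduce the shapes printed above. The actual settings live inside graph_spectrogram in td_utils and are not shown in this write-up.

```python
def spectrogram_shape(n_samples, window_len, overlap):
    """Return (n_freq, n_steps) for a one-sided sliding-window spectrogram.

    window_len and overlap are measured in samples. Both are assumed
    values here, not settings read from td_utils.
    """
    hop = window_len - overlap                # samples the window advances per step
    n_steps = (n_samples - overlap) // hop    # complete windows that fit in the signal
    n_freq = window_len // 2 + 1              # one-sided FFT bins for a real-valued signal
    return n_freq, n_steps


# 10 seconds of audio at 44100 Hz, with the assumed window settings
shape = spectrogram_shape(n_samples=441000, window_len=200, overlap=120)
print(shape)
```

With these assumed settings the result is (101, 5511), which matches x.shape above. Changing either setting changes Tx, which is why the text treats it as a hyperparameter.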

Now you can define:

Tx = 5511 # Number of time steps input to the model from the spectrogram
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram

Note that even with 10 seconds as our default training example length, 10 seconds can be discretized into different numbers of values. You have seen 441000 (raw audio) and 5511 (spectrogram). In the former case, each step represents $10/441000 \approx 0.000023$ seconds. In the latter case, each step represents $10/5511 \approx 0.0018$ seconds.

For a 10-second audio clip, the key values you will see in this assignment are:

  • $441000$ (raw audio)
  • $5511 = T_x$ (spectrogram output, and the input dimension of the neural network)
  • $10000$ (used by the pydub module to synthesize audio)
  • $1375 = T_y$ (the number of steps in the output of the GRU you will build)

Note that each of these representations corresponds to exactly 10 seconds of time; they just discretize it to different degrees. All of these are hyperparameters and can be changed (except 441000, which is a property of the microphone). We have chosen values that are within the standard ranges used for speech systems.

The value $T_y = 1375$ above means that for the output of the model, we discretize the 10 seconds into 1375 time intervals (each of length $10/1375 \approx 0.0072$ seconds) and try to predict, for each interval, whether someone recently finished saying "activate".

The value 10,000 above corresponds to discretizing the 10-second clip into intervals of 10/10000 = 0.001 seconds. 0.001 seconds is also called 1 millisecond, or 1 ms. So when we say we are discretizing at 1 ms intervals, it means we are using 10,000 steps.
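
To make the four discretizations concrete, here is a minimal sketch that computes the duration of one step in each representation and converts a pydub time in milliseconds to an index on the model's output grid. The names step_seconds and ms_to_output_step are hypothetical helpers for illustration; they are not part of td_utils.

```python
CLIP_SECONDS = 10

# Number of steps each representation uses for the same 10-second clip
discretizations = {
    "raw audio (44100 Hz)": 441000,
    "spectrogram (Tx)": 5511,
    "pydub (1 ms)": 10000,
    "model output (Ty)": 1375,
}

# Duration in seconds of a single step in each representation
step_seconds = {name: CLIP_SECONDS / steps for name, steps in discretizations.items()}

for name, seconds in step_seconds.items():
    print(f"{name:22s} {seconds:.6f} s per step")


def ms_to_output_step(time_ms, Ty=1375, clip_ms=10000):
    """Map a pydub time in ms to an index on the Ty-step output grid."""
    return int(time_ms * Ty / clip_ms)
```

For example, ms_to_output_step(5000) gives 687, the output step that corresponds to the 5-second mark.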

Ty = 1375 # The number of time steps in the output of our model

1.3 Generate a single training example

Because speech data is hard to acquire and label, you will synthesize your training data from the activate, negative, and background audio clips. Recording lots of 10-second audio clips with random "activate" utterances in them is very slow. Instead, it is easier to record many positive and negative words, and to record background noise separately (or download it from free online sources).

To synthesize a training sample, you will:

  • Pick a random 10-second background audio clip
  • Randomly insert 0-4 audio clips of "activate" into this 10-second clip
  • Randomly insert 0-2 audio clips of negative words into this 10-second clip

Because you synthesized the word "activate" into the background clip yourself, you know exactly when "activate" appears in the 10-second clip. As you will see later, this also makes it easier to generate the labels $y^{\langle t \rangle}$.

You will use the pydub package to manipulate audio. Pydub converts raw audio files into lists of Pydub data structures (it is not important to know the details here). Pydub uses 1 ms as its discretization interval (1 ms is 1 millisecond = 1/1000 seconds), which is why a 10-second clip is always represented using 10,000 steps.

# Loading Audio Fragments Using pydub 
activates, negatives, backgrounds = load_raw_audio()

print("background len: " + str(len(backgrounds[0])))    # Should be 10,000 because it is a 10-second clip
print("activate[0] len: " + str(len(activates[0])))     # Perhaps about 1,000, because "activate" audio clips typically take about 1 second (but vary a lot) 
print("activate[1] len: " + str(len(activates[1])))     # Different "activate" clips can have different lengths 
background len: 10000
activate[0] len: 721
activate[1] len: 731

Overlay positive/negative words on the background:

Given a 10-second background clip and a short audio clip (a positive or negative word), you need to be able to "add" or "insert" the word's short audio clip onto the background. To ensure that the audio segments inserted onto the background do not overlap, you will keep track of the times of previously inserted audio clips. You will be inserting multiple clips of positive/negative words onto the background, and you do not want to insert an "activate" or a random word somewhere that overlaps with another clip you previously added.

For clarity, when you insert a 1-second "activate" onto a 10-second clip of cafe noise, you end up with a 10-second clip that sounds like someone saying "activate" in a cafe, with "activate" superimposed on the background cafe noise. You do not end up with an 11-second clip. You will see later how pydub allows you to do this.

Create labels while overlaying:

Recall that the label $y^{\langle t \rangle}$ represents whether or not someone has just finished saying "activate". Given a background clip, we can initialize $y^{\langle t \rangle} = 0$ for all $t$, since the clip does not contain any "activate".

When you insert or overlay an "activate" clip, you will also update the labels $y^{\langle t \rangle}$, so that 50 steps of the output now have target label 1. You will train a GRU to detect when someone has finished saying "activate". For example, suppose the synthesized "activate" clip ends at the 5-second mark of the 10-second audio, exactly halfway into the clip. Recall that $T_y = 1375$, so time step $687 = \mathrm{int}(1375 \times 0.5)$ corresponds to the moment 5 seconds into the audio. So you will set $y^{\langle 688 \rangle} = 1$. Further, you would be quite satisfied if the GRU detects "activate" anywhere within a short time after this moment, so we actually set 50 consecutive values of the label $y^{\langle t \rangle}$ to 1. Specifically, we have $y^{\langle 688 \rangle} = y^{\langle 689 \rangle} = \cdots = y^{\langle 737 \rangle} = 1$.
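
The worked example above can be checked with a few lines of arithmetic. This is a small sketch rather than the assignment's labeling code; positive_label_indices is a hypothetical helper that lists which output indices receive label 1 for a clip ending at a given millisecond.

```python
Ty = 1375          # output steps for a 10-second clip
CLIP_MS = 10000    # pydub steps (milliseconds) for a 10-second clip


def positive_label_indices(segment_end_ms, n_labels=50):
    """Indices of y that are set to 1 for an "activate" clip ending at segment_end_ms."""
    end_step = int(segment_end_ms * Ty / CLIP_MS)   # output step where the word ends
    first = end_step + 1                            # labels start strictly after the end
    last = min(first + n_labels, Ty)                # exclusive bound, clipped to the array
    return list(range(first, last))


indices = positive_label_indices(5000)
print(indices[0], indices[-1], len(indices))
```

For a clip ending at 5000 ms this prints 688 737 50, matching the indices in the text.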

This is another reason for synthesizing the training data: it is relatively straightforward to generate the labels $y^{\langle t \rangle}$ as described above. In contrast, if you have 10 seconds of audio recorded on a microphone, it is quite time consuming for a person to listen to it and mark manually exactly when "activate" finished.

The figure below illustrates the labels $y^{\langle t \rangle}$ for a clip in which we have inserted "activate", "innocent", "activate", "baby". Note that the positive labels "1" are associated only with the positive words.

Figure 2

To implement the training set synthesis process, you will use the following helper functions. All of these functions use a 1 ms discretization interval, so the 10 seconds of audio is always discretized into 10,000 steps.

  1. get_random_time_segment(segment_ms) gets a random time segment in our background audio
  2. is_overlapping(segment_time, existing_segments) checks whether a time segment overlaps with existing segments
  3. insert_audio_clip(background, audio_clip, existing_times) inserts an audio clip at a random time in our background audio, using get_random_time_segment and is_overlapping
  4. insert_ones(y, segment_end_ms) inserts 1's into our label vector y after the word "activate"

The function get_random_time_segment(segment_ms) returns a random time segment onto which we can insert an audio clip of duration segment_ms. Read through the code to make sure you understand what it is doing.

def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.

    Arguments:
    segment_ms -- the duration of the audio clip in ms ("ms" stands for "milliseconds")

    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """
    segment_start = np.random.randint(low=0, high=10000-segment_ms)   # Make sure the segment doesn't run past the 10-second background
    segment_end = segment_start + segment_ms - 1
    return (segment_start, segment_end)

Next, suppose you insert an audio clip in (1000,1800) and (3400,4500) segments. That is, the first segment starts at 1000 steps and ends at 1800 steps. Now, if we consider inserting a new audio clip in (3000,3600), does this overlap with one of the previously inserted clips? In this case, (3000,3600) and (3400,4500) overlap, so we should decide not to insert the fragment here.

For the purpose of this function, (100,200) and (200,250) are defined as overlapping because they overlap at time step 200. However, (100,199) and (200,250) do not overlap.

Exercise: Implement is_overlapping(segment_time, existing_segments) to check whether a new time segment overlaps with any of the previous segments. You will need to carry out two steps:

  1. Create a "False" flag, which you will later set to "True" if you find an overlap.
  2. Loop over the start and end times in previous_segments. Compare these times to the new segment's start and end times. If there is an overlap, set the flag defined in (1) to True. You can use:
for ....:  
     if ... <= ... and ... >= ...:  

Tip: There is overlap if the new segment starts before the previous segment ends, and the new segment ends after the previous segment starts.

# GRADED FUNCTION: is_overlapping

def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.

    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments

    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """
    segment_start, segment_end = segment_time
    # Step 1: Initialize the overlap flag to False (1 line)
    overlap = False
    # Step 2: Loop over the start and end times in previous_segments.
    # Compare start/end times and set the flag to True if there is an overlap (3 lines)
    for previous_start, previous_end in previous_segments:
        if segment_start <= previous_end and segment_end >= previous_start:
            overlap = True
    return overlap
overlap1 = is_overlapping((950, 1430), [(2000, 2550), (260, 949)])
overlap2 = is_overlapping((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)])
print("Overlap 1 = ", overlap1)
print("Overlap 2 = ", overlap2)
Overlap 1 =  False
Overlap 2 =  True
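
The same inclusive-endpoint check can be written more compactly. The following is an alternative sketch, not the graded solution: is_overlapping_any is a hypothetical name, and it stops at the first overlap instead of scanning every segment.

```python
def is_overlapping_any(segment_time, previous_segments):
    """Inclusive-endpoint overlap test that returns as soon as one overlap is found."""
    segment_start, segment_end = segment_time
    return any(
        segment_start <= previous_end and segment_end >= previous_start
        for previous_start, previous_end in previous_segments
    )


print(is_overlapping_any((950, 1430), [(2000, 2550), (260, 949)]))
print(is_overlapping_any((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)]))
```

It returns False and True for the two test cases above, and it treats (100, 200) and (200, 250) as overlapping because both contain time step 200.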

Now let's use the previous auxiliary function to insert a new audio fragment into the background at a random time of 10 seconds, but make sure that no newly inserted fragment overlaps the previous one.

Exercise: Implement insert_audio_clip() to overlay an audio clip onto the 10-second background clip. You will need to carry out four steps:

  1. Get a random time segment of the right duration in ms.
  2. Make sure that the time segment does not overlap with any of the previous time segments. If it does, go back to step 1 and pick a new time segment.
  3. Add the new time segment to the list of existing time segments, to keep track of all the segments you have inserted.
  4. Overlay the audio clip onto the background using pydub. We have implemented this for you.
# GRADED FUNCTION: insert_audio_clip

def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the
    audio segment does not overlap with existing segments.

    Arguments:
    background -- a 10-second background audio recording
    audio_clip -- the audio clip to be inserted/overlaid
    previous_segments -- times where audio segments have already been placed

    Returns:
    new_background -- the updated background audio
    """
    # Get the duration of the audio clip in ms
    segment_ms = len(audio_clip)
    # Step 1: Use one of the helper functions to pick a random time segment onto which to insert
    # the new audio clip. (1 line)
    segment_time = get_random_time_segment(segment_ms)
    # Step 2: Check if the new segment_time overlaps with one of the previous_segments.
    # If so, keep picking a new segment_time at random until it doesn't overlap. (2 lines)
    while is_overlapping(segment_time, previous_segments):
        segment_time = get_random_time_segment(segment_ms)

    # Step 3: Add the new segment_time to the list of previous_segments (1 line)
    previous_segments.append(segment_time)
    # Step 4: Overlay the audio segment onto the background
    new_background = background.overlay(audio_clip, position = segment_time[0])
    return new_background, segment_time
audio_clip, segment_time = insert_audio_clip(backgrounds[0], activates[0], [(3790, 4400)])
audio_clip.export("insert_test.wav", format="wav")
print("Segment Time: ", segment_time)
Segment Time:  (2915, 3635)
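
One property of the retry loop in insert_audio_clip is worth noting: if the background is already full, it never finds a free slot and never terminates. The sketch below shows one possible guard, using only the standard library and no audio. place_clip, max_tries, and the rng argument are hypothetical names introduced for illustration; this is not how the assignment handles the case.

```python
import random


def place_clip(clip_ms, previous_segments, total_ms=10000, max_tries=1000, rng=random):
    """Pick a non-overlapping (start, end) slot in ms, or return None if none is found.

    On success the slot is appended to previous_segments, like Step 3 above.
    """
    for _ in range(max_tries):
        start = rng.randrange(0, total_ms - clip_ms)   # same range as the np.random.randint call
        end = start + clip_ms - 1
        clash = any(start <= p_end and end >= p_start
                    for p_start, p_end in previous_segments)
        if not clash:
            previous_segments.append((start, end))
            return (start, end)
    return None                                        # give up instead of looping forever


rng = random.Random(0)
segments = []
for clip_ms in (721, 731, 900, 650):
    place_clip(clip_ms, segments, rng=rng)

# A background that is already fully occupied cannot accept another clip
blocked = place_clip(500, [(0, 9999)], rng=rng)
```

With at most six short clips on a 10-second background, free slots are plentiful, so the unbounded loop in the assignment is unlikely to matter in practice. The guard becomes relevant if you raise the number of inserted clips.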

Finally, implement the code that updates the labels $y^{\langle t \rangle}$, assuming you have just inserted an "activate". In the code below, y is a (1, 1375)-dimensional vector, since $T_y = 1375$.

If the "activate" ended at time step $t$, then set $y^{\langle t+1 \rangle} = 1$, as well as up to 49 additional consecutive values. However, make sure you don't run off the end of the array and try to update y[0][1375]: since $T_y = 1375$, the valid indices are y[0][0] through y[0][1374]. So if "activate" ends at step 1370, you would only set y[0][1371] = y[0][1372] = y[0][1373] = y[0][1374] = 1.

Exercise: Implement insert_ones(). You can use a for loop. (If you are an expert in Python slice operations, feel free to use slicing to vectorize this.) If a segment ends at segment_end_ms (using the 10,000-step discretization), convert it to an index of the output $y$ (using the 1375-step discretization) with this formula:

    segment_end_y = int(segment_end_ms * Ty / 10000.0)
# GRADED FUNCTION: insert_ones

def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment
    should be set to 1. By strictly, we mean that the label at segment_end_y should be 0, while the
    50 following labels should be 1.

    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms

    Returns:
    y -- the updated labels
    """
    # Convert the segment end time from ms to output time steps
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    # Add 1 to the correct indices in the label vector (y)
    for i in range(segment_end_y + 1, segment_end_y + 51):
        if i < Ty:
            y[0, i] = 1
    return y
arr1 = insert_ones(np.zeros((1, Ty)), 9700)
plt.plot(insert_ones(arr1, 4251)[0,:])
print("sanity checks:", arr1[0][1333], arr1[0][634], arr1[0][635])
sanity checks: 0.0 1.0 0.0
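
The exercise text mentions that slicing can replace the for loop. Here is a minimal sketch of that idea. insert_ones_sliced is a hypothetical name, and the demo uses a nested list so it runs without numpy; the same indexing also works on a (1, Ty) numpy array. The upper bound is clamped explicitly so the row can never grow past Ty.

```python
Ty = 1375  # number of output time steps, as defined earlier


def insert_ones_sliced(y, segment_end_ms, n_labels=50):
    """Slice-based variant of insert_ones.

    y is indexed as y[0][i]; a nested list or a (1, Ty) numpy array both work.
    """
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    start = min(segment_end_y + 1, Ty)
    stop = min(segment_end_y + 1 + n_labels, Ty)   # clamp so the row never grows past Ty
    y[0][start:stop] = [1.0] * (stop - start)
    return y


labels = [[0.0] * Ty]
insert_ones_sliced(labels, 9700)
insert_ones_sliced(labels, 4251)
print("sanity checks:", labels[0][1333], labels[0][634], labels[0][635])
```

This reproduces the sanity-check values 0.0 1.0 0.0 from the loop version above.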

Finally, you can use insert_audio_clip and insert_ones to create a new training example.

Exercise: Implement create_training_example(). You need to perform the following steps:

  1. Initialize the label vector $y$ as a numpy array of zeros with shape $(1, T_y)$.
  2. Initialize the set of existing segments to an empty list.
  3. Randomly select 0 to 4 "activate" audio clips and insert them onto the 10-second clip. Also insert labels at the correct positions in the label vector $y$.
  4. Randomly select 0 to 2 negative audio clips and insert them onto the 10-second clip.
# GRADED FUNCTION: create_training_example

def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.

    Arguments:
    background -- a 10-second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"

    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """
    # Set up random seeds
    # Make the background quieter
    background = background - 20

    # Step 1: Initialize y (label vector) to 0 (1 line)
    y = np.zeros((1, Ty))

    # Step 2: Initialize the list of segment times as an empty list (1 line)
    previous_segments = []
    # Select 0-4 random "activate" audio clips from the entire "activate" recording list
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    random_activates = [activates[i] for i in random_indices]
    # Step 3: Cycle through the random selection of the "activate" clip to insert the background
    for random_activate in random_activates:
        # Insert Audio Clip to Background
        background, segment_time = insert_audio_clip(background, random_activate, previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert label in y
        y = insert_ones(y, segment_end_ms=segment_end)
    # Randomly select 0-2 negative recordings from the entire list of negative recordings
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    random_negatives = [negatives[i] for i in random_indices]

    # Step 4: Loop over the randomly selected negative clips and insert them into the background
    for random_negative in random_negatives:
        # Insert Audio Clip to Background
        background, _ = insert_audio_clip(background, random_negative, previous_segments)
    # Standardize the volume of audio clips 
    background = match_target_amplitude(background, -20.0)

    # Export new training samples 
    file_handle = background.export("train" + ".wav", format="wav")
    print("file (train.wav) Saved in your directory.")
    # Get and draw the spectrum of a new recording (background of positive and negative overlays)
    x = graph_spectrogram("train.wav")
    return x, y
x, y = create_training_example(backgrounds[0], activates, negatives)
file (train.wav) Saved in your directory.

Now you can listen to the training example you created and compare it with the spectrogram generated above.


Finally, you can draw associated labels for the generated training samples.


1.4 Complete Training Set

You have now implemented the code needed to generate a single training example. We used this process to generate a large training set. To save time, we have already generated a set of training examples.

# Loading preprocessed training samples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")

1.5 Development Set

To test our model, we recorded a development set of 25 examples. While our training data is synthesized, we want the development set to have the same distribution as the real inputs. So we recorded 25 10-second audio clips of people saying "activate" and other random words, and labeled them by hand. This follows the principle described in Course 3 that the development set should be as similar as possible to the test set distribution. That is why our dev set uses real audio rather than synthesized audio.

# Load Preprocessing Development Set Example
X_dev = np.load("./XY_dev/X_dev.npy")
Y_dev = np.load("./XY_dev/Y_dev.npy")

2 Model

Now that you have set up your dataset, let's write and train a keyword recognition model!

The model will use one-dimensional convolution, GRU, and dense layers. Let's load packages that use these layers in Keras. Loading may take a minute.

from keras.callbacks import ModelCheckpoint
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam
Using TensorFlow backend.

2.1 Modeling

This is the model architecture that we will use. Take some time to look at the model to see if it makes sense.

Figure 3

A key step of this model is the 1-D convolutional step (near the bottom of Figure 3). It takes in the 5511-step spectrogram and outputs a 1375-step output, which is then further processed by multiple layers to get the final $T_y = 1375$ step output. This layer plays a role similar to the 2-D convolutions you saw in Course 4: it extracts low-level features and generates an output of smaller dimension.

Computationally, the 1-D conv layer also helps speed up the model, because the GRU now has to process only 1375 time steps rather than 5511. The two GRU layers read the sequence of inputs from left to right, and a final dense + sigmoid layer makes a prediction for $y^{\langle t \rangle}$. Because $y$ is binary valued (0 or 1), we use a sigmoid output at the last layer to estimate the probability that the output is 1, which corresponds to the user having just said "activate".

Note that we use a unidirectional RNN rather than a bidirectional RNN. This is really important for keyword detection, since we want to be able to detect the trigger word almost immediately after it is said. If we used a bidirectional RNN, we would have to wait for the whole 10 seconds of audio to be recorded before we could tell whether "activate" was said in the first second of the audio clip.

There are four steps to implement the model:

Step 1: CONV layer. Use Conv1D() to implement this, with 196 filters,
a filter size of 15 (kernel_size = 15), and a stride of 4. [ See documentation.]

Step 2: First GRU layer. To generate a GRU layer, use:

X = GRU(units = 128, return_sequences = True)(X)

Setting return_sequences = True ensures that all of the GRU's hidden states are fed to the next layer. Remember to follow this with Dropout and BatchNorm layers.

Step 3: Second GRU layer. This is similar to the previous GRU layer (remember to use return_sequences = True), but there is an additional dropout layer.

Step 4: Create a time-distributed dense layer as follows:

X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)

This creates a dense layer followed by a sigmoid, so that the parameters used for the dense layer are the same for every time step. [ See documentation.]

Exercise: Implement the model(), whose architecture is shown in Figure 3.


def model(input_shape):
    """
    Function creating the model's graph in Keras.

    Arguments:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """
    X_input = Input(shape = input_shape)
    # Step 1: CONV layer (4 lines)
    X = Conv1D(196, 15, strides=4)(X_input)             # CONV1D
    X = BatchNormalization()(X)                         # Batch normalization
    X = Activation('relu')(X)                           # ReLU activation
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)

    # Step 2: First GRU layer (4 lines)
    X = GRU(units = 128, return_sequences=True)(X)      # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                         # Batch normalization

    # Step 3: Second GRU layer (4 lines)
    X = GRU(units = 128, return_sequences=True)(X)      # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                         # Batch normalization
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)

    # Step 4: Time-distributed dense layer (1 line)
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed (sigmoid)

    model = Model(inputs = X_input, outputs = X)
    return model  
model = model(input_shape = (Tx, n_freq))

Let's output a model summary to see the dimensions.

Model: "model_1"
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 5511, 101)         0         
conv1d_1 (Conv1D)            (None, 1375, 196)         297136    
batch_normalization_1 (Batch (None, 1375, 196)         784       
activation_1 (Activation)    (None, 1375, 196)         0         
dropout_1 (Dropout)          (None, 1375, 196)         0         
gru_1 (GRU)                  (None, 1375, 128)         124800    
dropout_2 (Dropout)          (None, 1375, 128)         0         
batch_normalization_2 (Batch (None, 1375, 128)         512       
gru_2 (GRU)                  (None, 1375, 128)         98688     
dropout_3 (Dropout)          (None, 1375, 128)         0         
batch_normalization_3 (Batch (None, 1375, 128)         512       
dropout_4 (Dropout)          (None, 1375, 128)         0         
time_distributed_1 (TimeDist (None, 1375, 1)           129       
Total params: 522,561
Trainable params: 521,657
Non-trainable params: 904

The output of the network is (None, 1375, 1) and the input is (None, 5511, 101). Conv1D reduces the number of steps from 5511 on the spectrum to 1375.
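
The numbers in the summary can be verified by hand. The sketch below recomputes the Conv1D output length and the main parameter counts. It assumes a convolution without padding and the classic GRU formulation with one bias vector per gate; both assumptions are inferred from the printed summary rather than stated in the text, and the helper names are hypothetical.

```python
def conv1d_output_steps(input_steps, kernel_size, stride):
    """Output length of a 1-D convolution with no padding."""
    return (input_steps - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights, and one bias vector (assumed form)."""
    return 3 * (units * (input_dim + units) + units)


steps = conv1d_output_steps(5511, kernel_size=15, stride=4)
params = {
    "conv1d": conv1d_params(in_channels=101, filters=196, kernel_size=15),
    "gru_1": gru_params(input_dim=196, units=128),
    "gru_2": gru_params(input_dim=128, units=128),
    "dense": 128 * 1 + 1,   # shared across all 1375 time steps by TimeDistributed
}
print(steps, params)
```

The dense layer contributes only 129 parameters even though it is applied at 1375 time steps, which is the weight sharing described for TimeDistributed above.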

2.2 Fitting Model

Keyword detection takes a long time to train. To save time, we have already trained a model for about three hours on a GPU, using the architecture you built above and a large training set of about 4000 examples. Let's load the model.

model = load_model('./models/tr_model.h5')

You can train the model further, using the Adam optimizer and binary cross-entropy loss, as follows. This will run quickly because we are training for just one epoch with a small training set of 26 examples.

opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)
Epoch 1/1
26/26 [==============================] - 10s 381ms/step - loss: 0.0893 - accuracy: 0.9717

2.3 Test Model

Finally, let's see how your model behaves on the development set.

loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)
25/25 [==============================] - 1s 37ms/step
Dev set accuracy =  0.9507200121879578

This looks pretty good! However, accuracy is not a great metric for this task: since the labels are heavily skewed toward 0, a neural network that only ever outputs 0 would already get slightly over 90% accuracy. We could define more useful metrics such as F1 score or precision/recall. But let's not bother with that here, and instead just see empirically how the model does.
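
To see why accuracy is misleading here, consider a predictor that always outputs 0. The sketch below uses a made-up label vector with two "activate" utterances (the positions 300 and 900 are arbitrary illustration values, not dev-set data) and compares accuracy with precision, recall, and F1.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for flat lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


Ty = 1375
# Hypothetical clip: two "activate" utterances, each labelled with 50 ones
y_true = [0] * Ty
for first in (300, 900):
    y_true[first:first + 50] = [1] * 50

all_zero = [0] * Ty   # a "model" that never detects anything
accuracy = sum(t == p for t, p in zip(y_true, all_zero)) / Ty
print(round(accuracy, 3), precision_recall_f1(y_true, all_zero))
```

The always-zero predictor scores about 92.7% accuracy on this clip while its recall and F1 are both 0, so F1 exposes the failure that accuracy hides.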

3 Prediction

Now that you have built a working model for trigger word detection, let's use it to make predictions. This code snippet runs audio (saved in a wav file) through the network.

You can use your model to predict new audio clips.

You first need to calculate the prediction of the input audio clip.

Exercise: Implement detect_triggerword(). You will need to do the following:

  1. Compute the spectrogram of the audio file
  2. Use np.swapaxes and np.expand_dims to reshape your input to size (1, Tx, n_freqs)
  3. Use forward propagation on your model to compute the prediction at each output step
def detect_triggerword(filename):
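    plt.subplot(2, 1, 1)

    x = graph_spectrogram(filename)
    # graph_spectrogram returns (freqs, Tx); the model expects (Tx, freqs)
    x = x.swapaxes(0, 1)
    # Add a batch dimension: (1, Tx, freqs)
    x = np.expand_dims(x, axis=0)
    predictions = model.predict(x)

    # Plot the predicted probability at each output step in the lower panel
    plt.subplot(2, 1, 2)
    plt.plot(predictions[0, :, 0])
    plt.ylabel('probability')
    plt.show()
    return predictions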
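
The reshaping steps can be checked in isolation with a dummy array. This sketch assumes the spectrogram has shape (101, 5511), that is (n_freqs, Tx), which is inferred from the model's input shape (None, 5511, 101); the zero array is only a stand-in for the output of graph_spectrogram.

```python
import numpy as np

n_freqs, Tx = 101, 5511           # assumed from the model input shape
x = np.zeros((n_freqs, Tx))       # stand-in for graph_spectrogram(filename)

x = x.swapaxes(0, 1)              # (n_freqs, Tx) -> (Tx, n_freqs)
x = np.expand_dims(x, axis=0)     # add the batch dimension -> (1, Tx, n_freqs)
print(x.shape)  # (1, 5511, 101)
```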

Once you have estimated the probability of having detected the word "activate" at each output step, you can trigger a chime sound when the probability exceeds a certain threshold. Moreover, $y^{\langle t \rangle}$ may be close to 1 for many consecutive values after "activate" is said, but we want to chime only once. We will therefore insert a chime at most once every 75 output steps, which helps prevent two chimes for a single instance of "activate". (This plays a role similar to non-max suppression in computer vision.)

Exercise: Implement chime_on_activate(). You need to do the following:

  1. Loop over the predicted probabilities at each output step
  2. Insert a chime in the original audio clip when the prediction is greater than the threshold and more than 75 consecutive time steps have elapsed since the last chime

Use the following code to convert from the 1,375-step output discretization to the 10,000-step (millisecond) discretization of the audio, and insert the chime using pydub:

audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds)*1000)
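
The position argument in the line above is simply the output step rescaled to milliseconds. A small sketch of that arithmetic, with a hypothetical helper name step_to_ms, is shown below; rounding to a whole millisecond is an addition made here for readability, whereas the line above passes the float directly.

```python
def step_to_ms(i, Ty, duration_seconds):
    """Map output step i (out of Ty) to a position in milliseconds."""
    return round((i / Ty) * duration_seconds * 1000)

# With Ty = 1375 output steps over a 10-second clip:
print(step_to_ms(0, 1375, 10.0))    # 0
print(step_to_ms(275, 1375, 10.0))  # 2000
```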

chime_file = "audio_examples/chime.wav"
def chime_on_activate(filename, predictions, threshold):
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    # Step 1: Initialize the number of consecutive output steps to 0
    consecutive_timesteps = 0
    # Step 2: Loop over the output steps in y
    for i in range(Ty):
        # Step 3: Increment the consecutive output step counter
        consecutive_timesteps += 1
        # Step 4: If the prediction is above the threshold and more than 75 consecutive output steps have passed
        if predictions[0,i,0] > threshold and consecutive_timesteps > 75:
            # Step 5: Overlay the chime on the audio clip using pydub
            audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds)*1000)
            # Step 6: Reset the consecutive output step counter to 0
            consecutive_timesteps = 0
    # Save the clip with the chimes added
    audio_clip.export("chime_output.wav", format='wav')
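
To see the 75-step suppression at work without any audio files, the same counter logic can be run on a synthetic prediction array. The function chime_steps and the toy predictions below are illustrative assumptions; the function only returns the output steps at which a chime would be inserted.

```python
import numpy as np

def chime_steps(predictions, threshold, min_gap=75):
    """Return the output steps where a chime would be inserted."""
    Ty = predictions.shape[1]
    steps = []
    consecutive_timesteps = 0
    for i in range(Ty):
        consecutive_timesteps += 1
        if predictions[0, i, 0] > threshold and consecutive_timesteps > min_gap:
            steps.append(i)
            consecutive_timesteps = 0
    return steps

# Two synthetic detections: 50 high steps starting at 100, 10 more at 250.
predictions = np.zeros((1, 300, 1))
predictions[0, 100:150, 0] = 0.9
predictions[0, 250:260, 0] = 0.9

print(chime_steps(predictions, threshold=0.5))  # [100, 250]
```

Each run of high probabilities produces a single chime, at the first step of the run. Note that the counter starts at 0, so a detection within the first 75 output steps of a clip would not chime.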

3.1 Test Development Set

Let's explore how our model performs on two unseen audio clips from the development set. First, let's listen to the two development set clips.


Now let's run the model on these audio clips to see if it adds a reminder tone after "activate"!

filename = "./raw_data/dev/1.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)

filename  = "./raw_data/dev/2.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)


This is what you should remember:

  • Data synthesis is an effective way to create large training sets for speech problems, especially trigger word detection.
  • Using a spectrogram, optionally followed by a 1-D convolutional layer, is a common preprocessing step before feeding audio data to an RNN, GRU, or LSTM.
  • A very effective trigger word detection system can be built using an end-to-end deep learning approach.

4 Try your own example!

In this optional exercise, you can try your model on your own audio clips!

Record a 10-second audio clip of yourself saying "activate" and other random words, and upload it to the Coursera hub as myaudio.wav. Make sure that the audio is uploaded as a WAV file. If your audio is recorded in another format (such as mp3), you can find free software online to convert it to wav. If your recording is not exactly 10 seconds long, the code below will trim or pad it to 10 seconds as needed.

# Preprocess audio to correct format
def preprocess_audio(filename):
    # Trim or pad the audio segment to 10000 ms
    padding = AudioSegment.silent(duration=10000)
    segment = AudioSegment.from_wav(filename)[:10000]
    segment = padding.overlay(segment)
    # Set frame rate to 44100
    segment = segment.set_frame_rate(44100)
    # Export to wav
    segment.export(filename, format='wav')
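
The trim-or-pad step above relies on pydub: the clip is cut to at most 10,000 ms and then overlaid on 10,000 ms of silence. The same idea can be sketched on a raw sample array with NumPy. This is an analogous illustration with a hypothetical helper pad_or_trim, not what pydub does internally.

```python
import numpy as np

def pad_or_trim(samples, target_len):
    """Cut samples to target_len, or pad the end with zeros (silence)."""
    out = np.zeros(target_len, dtype=samples.dtype)
    n = min(len(samples), target_len)
    out[:n] = samples[:n]
    return out

short_clip = np.ones(3)
long_clip = np.arange(10)
print(pad_or_trim(short_clip, 5))  # padded with two zeros at the end
print(pad_or_trim(long_clip, 4))   # only the first four samples are kept
```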

After uploading the audio file to Coursera, place the file path in the following variable.

your_filename = "audio_examples/my_audio.wav"
IPython.display.Audio(your_filename) # Listen to your uploaded audio

Finally, use the model to predict when you say "activate" in the 10-second audio clip and trigger a chime. If the chimes are not added at the right places, try adjusting chime_threshold.

chime_threshold = 0.5
prediction = detect_triggerword(your_filename)
chime_on_activate(your_filename, prediction, chime_threshold)


Tags: Python Deep Learning Voice recognition

Posted by mrherman on Sat, 06 Aug 2022 22:38:33 +0530