# Introduction

The Deep Deterministic Policy Gradient (DDPG) algorithm is an off-policy deep reinforcement learning algorithm proposed by the DeepMind team specifically for continuous control problems. It borrows several ideas from the Deep Q-Network (DQN) algorithm. This article walks you through the algorithm. The links to the paper and code are below.


# 1 Introduction to DDPG Algorithm

Before DDPG, there were two main ways to handle a continuous action space. The first was to discretize the continuous actions and then apply a reinforcement learning algorithm for discrete actions (such as DQN). The second was to apply a Policy Gradient (PG) algorithm (e.g. REINFORCE) directly. However, discretization departs from engineering practice to some extent, and PG algorithms often perform unsatisfactorily on continuous control problems. DDPG was proposed for this reason and has achieved very good results on many continuous control problems.

The DDPG algorithm is an off-policy deep reinforcement learning algorithm under the Actor-Critic (AC) framework. The algorithm therefore includes an Actor network and a Critic network. Each network is updated according to its own update rule so as to maximize the cumulative expected return.

# 2 DDPG algorithm principle

The DDPG algorithm combines the deterministic policy gradient algorithm with techniques from the DQN algorithm. When we discussed DQN, we explained two important techniques in detail: experience replay and the target network. Specifically, DDPG relies on the following three key techniques:

(1) Experience replay: the experience data $(s, a, r, s', done)$ collected by the agent is put into the Replay Buffer and sampled in batches when the network parameters are updated.

(2) Target network: in addition to the Actor network and the Critic network, a Target Actor network and a Target Critic network are used to estimate the target value. The target networks are updated with a soft update so that their parameters do not change too quickly.

(3) Noise exploration: a deterministic policy outputs deterministic actions and therefore lacks exploration of the environment. In the training phase, noise is added to the actions output by the Actor network so that the agent has some ability to explore.

## 2.1 Experience Replay

Experience replay is a technique that stabilizes the probability distribution of the experience used for learning, which improves the stability of training. It has two key steps, "storage" and "replay":

Storage: each experience is stored in the experience pool as a tuple of the form $(s, a, r, s', done)$.

Replay: one or more experiences are sampled from the experience pool according to some rule.

From the storage point of view, experience replay can be divided into centralized replay and distributed replay:

Centralized replay: a single agent runs in one environment and stores its experience in one experience pool.

Distributed replay: multiple agents run simultaneously in multiple environments and store their experience in one shared experience pool. Because several agents generate experience at the same time, experience is collected faster, at the cost of more resources.

From the sampling point of view, experience replay can be divided into uniform replay and prioritized replay:

Uniform replay: experiences are sampled from the experience pool with equal probability.

Prioritized replay: each experience in the experience pool is assigned a priority, and experiences with a higher priority are more likely to be sampled. The usual practice is that if experience $i$ has priority $p_i$, the probability of choosing it is $P_i = p_i / \sum_k p_k$. For details, see the paper "Prioritized Experience Replay".
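As a minimal sketch of the proportional rule above (not the code used later in this article, which samples uniformly), the snippet below turns priorities into sampling probabilities and draws a batch of indices. The function names and the priority values are hypothetical, and the extra exponent that the prioritized replay paper applies to the priorities is left out for simplicity.

```python
import random


def sampling_probabilities(priorities):
    """Turn non-negative priorities p_i into probabilities p_i / sum_k p_k."""
    total = sum(priorities)
    if total <= 0:
        raise ValueError("at least one priority must be positive")
    return [p / total for p in priorities]


def sample_indices(priorities, batch_size, rng=random):
    """Draw indices (with replacement) in proportion to their priority."""
    return rng.choices(range(len(priorities)), weights=priorities, k=batch_size)


priorities = [1.0, 3.0, 6.0]  # hypothetical priorities of three stored experiences
probs = sampling_probabilities(priorities)
batch = sample_indices(priorities, batch_size=5, rng=random.Random(0))
print(probs)   # experience 2 is six times as likely to be drawn as experience 0
print(batch)
```

Uniform replay is the special case in which all priorities are equal.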

Experience replay has two main advantages:

1. When training the Q network, it breaks the correlation between samples so that the data are closer to independent and identically distributed, which reduces the variance of the parameter updates and speeds up convergence.

2. Experience can be reused, so data efficiency is high, which is especially useful when data is difficult to collect.

Its disadvantage is that it cannot be applied to episodic-update or multi-step learning algorithms. Applying experience replay to Q-learning, however, circumvents this shortcoming.

The code uses centralized uniform replay, as follows:

```python
import numpy as np


class ReplayBuffer:
    def __init__(self, max_size, state_dim, action_dim, batch_size):
        self.mem_size = max_size
        self.batch_size = batch_size
        self.mem_cnt = 0  # total number of transitions stored so far

        self.state_memory = np.zeros((self.mem_size, state_dim))
        self.action_memory = np.zeros((self.mem_size, action_dim))
        self.reward_memory = np.zeros((self.mem_size, ))
        self.next_state_memory = np.zeros((self.mem_size, state_dim))
        # np.bool was removed from recent NumPy releases; the builtin bool is equivalent
        self.terminal_memory = np.zeros((self.mem_size, ), dtype=bool)

    def store_transition(self, state, action, reward, state_, done):
        # Ring buffer: once full, the oldest transition is overwritten
        mem_idx = self.mem_cnt % self.mem_size

        self.state_memory[mem_idx] = state
        self.action_memory[mem_idx] = action
        self.reward_memory[mem_idx] = reward
        self.next_state_memory[mem_idx] = state_
        self.terminal_memory[mem_idx] = done

        self.mem_cnt += 1

    def sample_buffer(self):
        # Sample uniformly, without replacement, from the filled part of the buffer
        mem_len = min(self.mem_size, self.mem_cnt)
        batch = np.random.choice(mem_len, self.batch_size, replace=False)

        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        states_ = self.next_state_memory[batch]
        terminals = self.terminal_memory[batch]

        return states, actions, rewards, states_, terminals

    def ready(self):
        # NOTE: the method header was lost from this listing; the name `ready` is assumed
        return self.mem_cnt >= self.batch_size
```

## 2.2 Target Network

Since the DDPG algorithm is based on the AC framework, it must contain an Actor network and a Critic network. Each of them also has a corresponding target network, so DDPG includes four networks: the Actor network $\mu(\cdot \mid \theta^{\mu})$, the Critic network $Q(\cdot \mid \theta^{Q})$, the Target Actor network $\mu'(\cdot \mid \theta^{\mu'})$ and the Target Critic network $Q'(\cdot \mid \theta^{Q'})$. This section introduces the update process of the DDPG algorithm, the update method of the target networks and the purpose of introducing them.

### 2.2.1 Algorithm update process

The algorithm update consists of updating the parameters of the Actor and Critic networks: the Actor network is updated by maximizing the cumulative expected return, and the Critic network is updated by minimizing the error between its estimated value and the target value. In the training phase, we sample a batch of data from the Replay Buffer; assume a sampled transition is $(s, a, r, s', done)$.

Critic network: first use the Target Actor network $\mu'$ to compute the action for the next state $s'$: $a' = \mu'(s' \mid \theta^{\mu'})$. Note that no noise is added to this action. Then use the Target Critic network $Q'$ to compute the target value of the state-action pair $(s, a)$: $y = r + \gamma (1 - done)\, Q'(s', a' \mid \theta^{Q'})$. Next use the Critic network to compute the estimated value of $(s, a)$: $q = Q(s, a \mid \theta^{Q})$. Finally, use gradient descent to minimize the squared difference between the estimated value and the target value, $L_c = (y - q)^2$, which updates the parameters of the Critic network. This process is very similar to the DQN algorithm.
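To make the target computation concrete, here is a small framework-free sketch. It assumes the target value has the form $y = r + \gamma (1 - done)\, Q'(s', a')$, which matches how the implementation later in this article zeroes the next-state value at terminal states; the function names and the numbers are hypothetical.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Return y = r + gamma * (1 - done) * Q'(s', a') for every transition."""
    return [
        r + gamma * (0.0 if done else q_next)
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


def critic_loss(q_values, targets):
    """Mean squared error between the Critic's estimates and the targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(targets)


rewards = [1.0, -100.0]   # hypothetical batch of two transitions
next_q = [10.0, 5.0]      # Q'(s', a') from the target networks
dones = [False, True]     # the second transition ends the episode

targets = td_targets(rewards, next_q, dones, gamma=0.9)
loss = critic_loss([9.0, -90.0], targets)
print(targets)  # the terminal transition keeps only its reward
print(loss)
```

For the terminal transition the bootstrap term is dropped, so its target is just the reward.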

Actor network: first use the Actor network to compute the action for state $s$: $a_{new} = \mu(s \mid \theta^{\mu})$. Note that, again, no noise is added to this action. Then use the Critic network to compute the estimated value (i.e. the cumulative expected return) of the state-action pair $(s, a_{new})$: $q_{new} = Q(s, a_{new} \mid \theta^{Q})$. Finally, use gradient ascent to maximize the cumulative expected return $q_{new}$, which updates the parameters of the Actor network. (The code implementation instead applies gradient descent to the loss $L_a = -q_{new}$; the two are equivalent.)
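For reference, the gradient that this Actor update follows can be written with the chain rule, as in the deterministic policy gradient used by DDPG. The block below is a sketch in the notation $\mu(s \mid \theta^{\mu})$ and $Q(s, a \mid \theta^{Q})$ for the Actor and Critic; $N$ is the batch size and $s_i$ is the $i$-th sampled state.

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_{a} Q(s, a \mid \theta^{Q}) \Big|_{s = s_i,\; a = \mu(s_i \mid \theta^{\mu})}
  \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \Big|_{s = s_i}
```

In an automatic differentiation framework this product does not need to be coded by hand: back-propagating the mean of $-Q(s, \mu(s))$ through the Critic into the Actor produces exactly this gradient (with the sign flipped for descent).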

So far we have completed the update of Actor and Critic network.

### 2.2.2 Update of the target network

The target networks in DDPG are updated with a soft update, which can also be called an Exponential Moving Average (EMA). That is, we introduce a learning rate (also called momentum) $\tau$, take a weighted average of the old target network parameters and the corresponding new network parameters, and assign the result to the target network:

Target Actor network: $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'}$

Target Critic network: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'}$

The learning rate (momentum) $\tau \in (0, 1)$ usually takes the value 0.005.
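The soft update is easy to check on a toy example. The sketch below stores parameters as plain floats in a dictionary (a hypothetical stand-in for network weights) and shows two things: `tau=1.0` copies the online parameters outright, and a small `tau` makes the target lag behind a changing online network.

```python
def soft_update(target_params, online_params, tau=0.005):
    """In-place EMA: target <- tau * online + (1 - tau) * target."""
    for name, online_value in online_params.items():
        target_params[name] = tau * online_value + (1.0 - tau) * target_params[name]
    return target_params


online = {"w": 1.0, "b": -2.0}  # hypothetical scalar "network parameters"
target = {"w": 0.0, "b": 0.0}

soft_update(target, online, tau=1.0)  # tau = 1 is a hard copy
hard_copy = dict(target)

for _ in range(100):
    online["w"] += 0.1                # pretend the online network keeps learning
    soft_update(target, online, tau=0.005)

print(hard_copy)
print(online["w"], target["w"])       # the target moves much more slowly
```

This is also why the implementation later in this article calls its update function once with `tau=1.0` at construction time: the target networks start as exact copies.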

### 2.2.3 The purpose of introducing the target network

As mentioned earlier, the purpose of introducing the target network is the same as in the DQN algorithm. Because the parameters of the target network are updated slowly by the soft update, its output is more stable, so the target value it computes is more stable too, which in turn makes the learning process of the Critic network more stable. Imagine computing the target value directly with the Critic network, $y = r + \gamma Q(s', a' \mid \theta^{Q})$: because the Critic network is updated constantly, it fluctuates strongly, and so does the target value $y$. If the Critic's estimate is chasing a rapidly changing target during learning, the network can easily oscillate, and the whole learning process can collapse.

That is one purpose; there is another. Computing the target value with the Critic network itself (as shown above) is essentially a bootstrapping process: we let $Q(s, a \mid \theta^{Q})$ keep approaching $y$, which can easily lead to overestimation. When $Q(s', a' \mid \theta^{Q})$ is overestimated, the error is passed back to $Q(s, a \mid \theta^{Q})$, which then becomes overestimated as well. This forms a positive feedback loop that eventually leads to overestimation across the whole network.

Bootstrapping: the computation of the current value function uses the value function (state value or action value) of a subsequent state, i.e. the estimate for the current state or state-action pair is built on the estimate for a later one.

So what is the problem with overestimation? If the overestimation is uniform, it does not affect the final decision; if it is not uniform, it can affect the final decision a great deal. An example makes this easy to see.

Consider the example illustrated in the original figure: suppose there are three actions whose true action values are 200, 100 and 230. Action 3 has the highest value, so the agent chooses action 3. If the network overestimates uniformly, say by 100, the estimated action values are 300, 200 and 330. Action 3 still has the highest value, and the agent still chooses action 3. Uniform overestimation therefore has no effect on the final decision.

Now take the same three actions with true values 200, 100 and 230, but suppose the network overestimates non-uniformly and the estimated action values are 280, 300 and 240. Action 2 now has the highest estimated value, so the agent chooses action 2, although its true value is the lowest, i.e. it is the worst action. Non-uniform overestimation can therefore have a large impact on the final decision.
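The numbers in this example can be checked in a few lines. The sketch below uses the action values quoted above and a hypothetical helper `greedy_action` that returns the index of the largest estimate.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


true_q = [200, 100, 230]
uniform = [q + 100 for q in true_q]   # every action overestimated by 100
non_uniform = [280, 300, 240]         # overestimated by 80, 200 and 10

for name, q in [("true", true_q), ("uniform", uniform), ("non-uniform", non_uniform)]:
    print(name, q, "-> action", greedy_action(q) + 1)
# true [200, 100, 230] -> action 3
# uniform [300, 200, 330] -> action 3
# non-uniform [280, 300, 240] -> action 2
```

Adding the same constant to every estimate never changes the argmax, which is why only the non-uniform case flips the decision.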

In practice, however, the network's overestimation is non-uniform, so the problem has to be avoided, which essentially means addressing the bootstrapping problem. Adopting a target network does this: we now let $Q(s, a \mid \theta^{Q})$ approach the target value $y = r + \gamma Q'(s', a' \mid \theta^{Q'})$, so the Critic no longer bootstraps from its own current parameters (compare this carefully with the definition of bootstrapping).

## 2.3 Noise exploration

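Exploration is crucial for an agent, and a deterministic policy inherently lacks the ability to explore, so we need to add noise to the output action artificially to give the agent that ability. In the DDPG algorithm, the authors use the Ornstein-Uhlenbeck (OU) process as the action noise. The OU process is defined by the following stochastic differential equation (taking the one-dimensional case as an example):

$$dN_t = \theta (\mu - N_t)\, dt + \sigma\, dB_t$$

where $\theta$, $\mu$ and $\sigma$ are parameters ($\theta > 0$, $\sigma > 0$) and $B_t$ is standard Brownian motion. When the initial disturbance is a point mass at the origin (i.e. $N_0 = 0$) and $\mu = 0$, the solution of the above equation is

$$N_t = \sigma \int_0^t e^{\theta (\tau - t)}\, dB_\tau$$

(Proof: substituting $dN_t = \theta (\mu - N_t)\, dt + \sigma\, dB_t$ into $d(N_t e^{\theta t}) = \theta N_t e^{\theta t}\, dt + e^{\theta t}\, dN_t$ gives $d(N_t e^{\theta t}) = \mu \theta e^{\theta t}\, dt + \sigma e^{\theta t}\, dB_t$. Integrating from $0$ to $t$ gives $N_t e^{\theta t} - N_0 = \mu (e^{\theta t} - 1) + \sigma \int_0^t e^{\theta \tau}\, dB_\tau$. Setting $N_0 = 0$ and $\mu = 0$ and simplifying gives the result.)

This solution has mean 0 and variance $\frac{\sigma^2}{2\theta} (1 - e^{-2\theta t})$, and its covariance is

$$\mathrm{Cov}(N_t, N_s) = \frac{\sigma^2}{2\theta} \left( e^{-\theta |t - s|} - e^{-\theta (t + s)} \right)$$

(Proof: since the mean is 0, $\mathrm{Cov}(N_t, N_s) = E[N_t N_s] = \sigma^2 e^{-\theta (t + s)}\, E\left[ \int_0^t e^{\theta \tau}\, dB_\tau \int_0^s e^{\theta \tau}\, dB_\tau \right]$. The Itô isometry tells us that $E\left[ \int_0^t e^{\theta \tau}\, dB_\tau \int_0^s e^{\theta \tau}\, dB_\tau \right] = \int_0^{\min(t, s)} e^{2\theta \tau}\, d\tau$, so $\mathrm{Cov}(N_t, N_s) = \sigma^2 e^{-\theta (t + s)} \int_0^{\min(t, s)} e^{2\theta \tau}\, d\tau$, and further simplification gives the result.)

For $t, s > 0$ we always have $|t - s| < t + s$, so $\mathrm{Cov}(N_t, N_s) > 0$. In other words, the OU process makes adjacent perturbations positively correlated, so that consecutive actions are shifted in similar directions.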
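The positive correlation can also be observed numerically. The following sketch simulates an OU path with the same Euler-style discretization and default parameters as the `OUActionNoise` class in this article, and compares the correlation between consecutive samples with that of independent Gaussian noise. The helper names, the seed and the number of steps are arbitrary choices for illustration.

```python
import math
import random


def ou_path(n_steps, theta=0.2, mu=0.0, sigma=0.15, dt=1e-2, x0=0.0, rng=random):
    """Simulate x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""
    x, path = x0, []
    for _ in range(n_steps):
        x = x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def lag1_correlation(xs):
    """Sample correlation between consecutive elements of a sequence."""
    a, b = xs[:-1], xs[1:]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)
ou_noise = ou_path(5000, rng=rng)
white_noise = [0.15 * rng.gauss(0.0, 1.0) for _ in range(5000)]

ou_corr = lag1_correlation(ou_noise)
white_corr = lag1_correlation(white_noise)
print("OU noise lag-1 correlation:   ", round(ou_corr, 3))
print("white noise lag-1 correlation:", round(white_corr, 3))
```

Consecutive OU samples are strongly correlated, while independent Gaussian samples are essentially uncorrelated.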

Code implementation of OU noise:

```python
import numpy as np


class OUActionNoise:
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.theta = theta
        self.mu = mu          # long-run mean, an array with the shape of the action
        self.sigma = sigma
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        # Euler-Maruyama step of dN = theta * (mu - N) dt + sigma dB
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x

        return x

    def reset(self):
        # Restart the process from x0 (or from zero if x0 is not given)
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
```

After reading about OU noise, many readers may find it overly complicated. In fact, OU noise is not necessary: it can be replaced entirely by noise drawn from a normal distribution, and experimental results confirm this. For this reason, the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm drops OU noise and uses normally distributed noise, which is simpler to implement.

In addition, remember that noise is added to the actions output by the Actor network only while the agent interacts with the environment during training. Do not add noise in the inference phase, and do not add noise when updating the network parameters, as mentioned earlier. The agent only needs the ability to explore during training; it is not needed during inference.
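The two points above, Gaussian noise instead of OU noise and noise during training only, can be sketched as follows. The function name, the default `noise_std` and the action bounds of [-1, 1] are assumptions for illustration; unlike the `choose_action` method later in this article, this sketch draws an independent noise sample for each action dimension.

```python
import random


def add_exploration_noise(action, noise_std=0.1, train=True,
                          low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise in training mode and clip to the action range."""
    if not train:
        return list(action)  # inference: use the deterministic action unchanged
    noisy = [a + rng.gauss(0.0, noise_std) for a in action]
    return [min(high, max(low, a)) for a in noisy]


rng = random.Random(42)
actor_output = [0.95, -0.3]  # hypothetical output of the Actor network

train_action = add_exploration_noise(actor_output, train=True, rng=rng)
eval_action = add_exploration_noise(actor_output, train=False)
print(train_action)  # perturbed, but still inside [-1, 1]
print(eval_action)   # identical to the Actor output
```

Clipping after adding the noise keeps the action valid even when the Actor output is already close to a bound.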

# 3 DDPG algorithm pseudo code

# 4 Code implementation

Code implementation of Actor and Critic network (networks.py):

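```python
import torch as T
import torch.nn as nn
import torch.optim as optim

device = T.device("cuda:0" if T.cuda.is_available() else "cpu")


def weight_init(m):
    # Xavier initialization for linear layers, unit scale and zero shift for batch norm
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0.0)
    elif isinstance(m, nn.BatchNorm1d):
        nn.init.constant_(m.weight, 1.0)
        nn.init.constant_(m.bias, 0.0)


class ActorNetwork(nn.Module):
    def __init__(self, alpha, state_dim, action_dim, fc1_dim, fc2_dim):
        super(ActorNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, fc1_dim)
        self.ln1 = nn.LayerNorm(fc1_dim)
        self.fc2 = nn.Linear(fc1_dim, fc2_dim)
        self.ln2 = nn.LayerNorm(fc2_dim)
        self.action = nn.Linear(fc2_dim, action_dim)

        # NOTE: the optimizer line is missing from this listing; Adam with
        # learning rate alpha is assumed (DDPG.learn uses self.actor.optimizer).
        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.apply(weight_init)
        self.to(device)

    def forward(self, state):
        x = T.relu(self.ln1(self.fc1(state)))
        x = T.relu(self.ln2(self.fc2(x)))
        action = T.tanh(self.action(x))  # actions are bounded to [-1, 1]

        return action

    def save_checkpoint(self, checkpoint_file):
        T.save(self.state_dict(), checkpoint_file)

    def load_checkpoint(self, checkpoint_file):
        # Assumed counterpart of save_checkpoint; not shown in the original listing
        self.load_state_dict(T.load(checkpoint_file, map_location=device))


class CriticNetwork(nn.Module):
    def __init__(self, beta, state_dim, action_dim, fc1_dim, fc2_dim):
        super(CriticNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, fc1_dim)
        self.ln1 = nn.LayerNorm(fc1_dim)
        self.fc2 = nn.Linear(fc1_dim, fc2_dim)
        self.ln2 = nn.LayerNorm(fc2_dim)
        self.fc3 = nn.Linear(action_dim, fc2_dim)
        self.q = nn.Linear(fc2_dim, 1)

        # NOTE: assumed, as in ActorNetwork; beta is the Critic learning rate
        self.optimizer = optim.Adam(self.parameters(), lr=beta)
        self.apply(weight_init)
        self.to(device)

    def forward(self, state, action):
        # State and action are embedded separately and merged by addition
        x_s = T.relu(self.ln1(self.fc1(state)))
        x_s = self.ln2(self.fc2(x_s))
        x_a = self.fc3(action)
        x = T.relu(x_s + x_a)
        q = self.q(x)

        return q

    def save_checkpoint(self, checkpoint_file):
        T.save(self.state_dict(), checkpoint_file)

    def load_checkpoint(self, checkpoint_file):
        # Assumed counterpart of save_checkpoint; not shown in the original listing
        self.load_state_dict(T.load(checkpoint_file, map_location=device))
```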

Note: the first two Linear layers in the Actor and Critic networks are each followed by a Layer Normalization (LN) layer. I found an interesting phenomenon during my experiments: without the LN layers, or with Batch Normalization (BN) layers in their place, the training process easily collapses or performs very poorly. I am not sure of the specific reason. If you are interested, clone the code and run it yourself, and if you know the reason, feel free to share it.

Code implementation of DDPG algorithm (DDPG.py):

```python
import torch as T
import torch.nn.functional as F
import numpy as np
from networks import ActorNetwork, CriticNetwork
from buffer import ReplayBuffer

device = T.device("cuda:0" if T.cuda.is_available() else "cpu")


class DDPG:
    def __init__(self, alpha, beta, state_dim, action_dim, actor_fc1_dim,
                 actor_fc2_dim, critic_fc1_dim, critic_fc2_dim, ckpt_dir,
                 gamma=0.99, tau=0.005, action_noise=0.1, max_size=1000000,
                 batch_size=256):
        self.gamma = gamma
        self.tau = tau
        self.action_noise = action_noise
        self.checkpoint_dir = ckpt_dir

        self.actor = ActorNetwork(alpha=alpha, state_dim=state_dim, action_dim=action_dim,
                                  fc1_dim=actor_fc1_dim, fc2_dim=actor_fc2_dim)
        self.target_actor = ActorNetwork(alpha=alpha, state_dim=state_dim, action_dim=action_dim,
                                         fc1_dim=actor_fc1_dim, fc2_dim=actor_fc2_dim)
        self.critic = CriticNetwork(beta=beta, state_dim=state_dim, action_dim=action_dim,
                                    fc1_dim=critic_fc1_dim, fc2_dim=critic_fc2_dim)
        self.target_critic = CriticNetwork(beta=beta, state_dim=state_dim, action_dim=action_dim,
                                           fc1_dim=critic_fc1_dim, fc2_dim=critic_fc2_dim)

        self.memory = ReplayBuffer(max_size=max_size, state_dim=state_dim, action_dim=action_dim,
                                   batch_size=batch_size)

        # tau=1.0 copies the online parameters, so the targets start as exact copies
        self.update_network_parameters(tau=1.0)

    def update_network_parameters(self, tau=None):
        # Soft update: target <- tau * online + (1 - tau) * target
        if tau is None:
            tau = self.tau

        for actor_params, target_actor_params in zip(self.actor.parameters(),
                                                     self.target_actor.parameters()):
            target_actor_params.data.copy_(tau * actor_params + (1 - tau) * target_actor_params)

        for critic_params, target_critic_params in zip(self.critic.parameters(),
                                                       self.target_critic.parameters()):
            target_critic_params.data.copy_(tau * critic_params + (1 - tau) * target_critic_params)

    def remember(self, state, action, reward, state_, done):
        self.memory.store_transition(state, action, reward, state_, done)

    def choose_action(self, observation, train=True):
        self.actor.eval()
        state = T.tensor(np.array([observation]), dtype=T.float).to(device)
        action = self.actor.forward(state).squeeze()

        if train:
            # Gaussian exploration noise, added only while collecting training data
            noise = T.tensor(np.random.normal(loc=0.0, scale=self.action_noise),
                             dtype=T.float).to(device)
            action = T.clamp(action+noise, -1, 1)
        self.actor.train()

        return action.detach().cpu().numpy()

    def learn(self):
        # NOTE: the guard condition was lost from this listing; `ready()` is the
        # assumed name of the buffer method that checks for a full batch.
        if not self.memory.ready():
            return

        states, actions, reward, states_, terminals = self.memory.sample_buffer()
        states_tensor = T.tensor(states, dtype=T.float).to(device)
        actions_tensor = T.tensor(actions, dtype=T.float).to(device)
        rewards_tensor = T.tensor(reward, dtype=T.float).to(device)
        next_states_tensor = T.tensor(states_, dtype=T.float).to(device)
        terminals_tensor = T.tensor(terminals).to(device)

        # Critic update: regress Q(s, a) onto y = r + gamma * Q'(s', mu'(s'))
        with T.no_grad():
            next_actions_tensor = self.target_actor.forward(next_states_tensor)
            q_ = self.target_critic.forward(next_states_tensor, next_actions_tensor).view(-1)
            q_[terminals_tensor] = 0.0  # no bootstrapping from terminal states
            target = rewards_tensor + self.gamma * q_
        q = self.critic.forward(states_tensor, actions_tensor).view(-1)

        critic_loss = F.mse_loss(q, target.detach())
        self.critic.optimizer.zero_grad()  # clear gradients left by earlier backward passes
        critic_loss.backward()
        self.critic.optimizer.step()

        # Actor update: maximize Q(s, mu(s)) by minimizing its negative
        new_actions_tensor = self.actor.forward(states_tensor)
        actor_loss = -T.mean(self.critic(states_tensor, new_actions_tensor))
        self.actor.optimizer.zero_grad()
        actor_loss.backward()
        self.actor.optimizer.step()

        self.update_network_parameters()

    def save_models(self, episode):
        self.actor.save_checkpoint(self.checkpoint_dir + 'Actor/DDPG_actor_{}.pth'.format(episode))
        print('Saving actor network successfully!')
        self.target_actor.save_checkpoint(self.checkpoint_dir +
                                          'Target_actor/DDPG_target_actor_{}.pth'.format(episode))
        print('Saving target_actor network successfully!')
        self.critic.save_checkpoint(self.checkpoint_dir + 'Critic/DDPG_critic_{}'.format(episode))
        print('Saving critic network successfully!')
        self.target_critic.save_checkpoint(self.checkpoint_dir +
                                           'Target_critic/DDPG_target_critic_{}'.format(episode))
        print('Saving target critic network successfully!')

    def load_models(self, episode):
        # NOTE: reconstructed from two orphaned path lines at the end of the original
        # listing; it mirrors save_models and relies on an assumed load_checkpoint.
        self.actor.load_checkpoint(self.checkpoint_dir + 'Actor/DDPG_actor_{}.pth'.format(episode))
        self.target_actor.load_checkpoint(self.checkpoint_dir +
                                          'Target_actor/DDPG_target_actor_{}.pth'.format(episode))
        self.critic.load_checkpoint(self.checkpoint_dir + 'Critic/DDPG_critic_{}'.format(episode))
        self.target_critic.load_checkpoint(self.checkpoint_dir +
                                           'Target_critic/DDPG_target_critic_{}'.format(episode))
```

The simulation environment is LunarLanderContinuous-v2 from the gym library, so gym needs to be set up first. Activate the corresponding Python environment in Anaconda and execute the following command:

`pip install gym`

However, the gym library installed this way includes only a small number of built-in environments, such as the algorithmic, toy text and classic control environments, and cannot run LunarLanderContinuous-v2. Some additional dependencies must therefore be installed. For details, please refer to this blog: "AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander' solution". If your environment is already configured, please ignore this paragraph.

Training script (train.py):

```python
import gym
import numpy as np
import argparse
from DDPG import DDPG
from utils import create_directory, plot_learning_curve, scale_action

parser = argparse.ArgumentParser("DDPG parameters")
# NOTE: the add_argument calls are missing from this listing. The argument names
# follow the description below; the default values are placeholders.
parser.add_argument('--max_episodes', type=int, default=1000)
parser.add_argument('--checkpoint_dir', type=str, default='./checkpoints/DDPG/')
parser.add_argument('--figure_file', type=str, default='./reward.png')

args = parser.parse_args()


def main():
    env = gym.make('LunarLanderContinuous-v2')
    # shape is a tuple such as (8,), so take its first entry as the dimension
    agent = DDPG(alpha=0.0003, beta=0.0003, state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.shape[0], actor_fc1_dim=400, actor_fc2_dim=300,
                 critic_fc1_dim=400, critic_fc2_dim=300, ckpt_dir=args.checkpoint_dir,
                 batch_size=256)
    create_directory(args.checkpoint_dir,
                     sub_paths=['Actor', 'Target_actor', 'Critic', 'Target_critic'])

    reward_history = []
    avg_reward_history = []
    for episode in range(args.max_episodes):
        done = False
        total_reward = 0
        observation = env.reset()
        while not done:
            # Act with exploration noise, store the transition, then learn from a batch
            action = agent.choose_action(observation, train=True)
            action_ = scale_action(action.copy(), env.action_space.high, env.action_space.low)
            observation_, reward, done, info = env.step(action_)
            agent.remember(observation, action, reward, observation_, done)
            agent.learn()
            total_reward += reward
            observation = observation_

        reward_history.append(total_reward)
        avg_reward = np.mean(reward_history[-100:])  # average over the last 100 episodes
        avg_reward_history.append(avg_reward)
        print('Ep: {} Reward: {:.1f} AvgReward: {:.1f}'.format(episode+1, total_reward, avg_reward))

        if (episode + 1) % 200 == 0:
            agent.save_models(episode+1)

    episodes = [i+1 for i in range(args.max_episodes)]
    plot_learning_curve(episodes, avg_reward_history, title='AvgReward',
                        ylabel='reward', figure_file=args.figure_file)


if __name__ == '__main__':
    main()
```

The training script has three parameters: max_episodes is the number of training episodes, checkpoint_dir is the path where the trained weights are saved, and figure_file is the path where the training result (a cumulative reward curve) is saved. All of them can be left at their defaults.

The plotting function and the directory-creation function are also used during training. They are placed in the utils.py script:

```python
import os
import numpy as np
import matplotlib.pyplot as plt


class OUActionNoise:
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.theta = theta
        self.mu = mu
        self.sigma = sigma
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x

        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)


def create_directory(path: str, sub_paths: list):
    # path is concatenated with each sub-path, so it should end with a separator
    for sub_path in sub_paths:
        if not os.path.exists(path + sub_path):
            os.makedirs(path + sub_path, exist_ok=True)
            print('Create path: {} successfully'.format(path+sub_path))
        else:
            # The body of this branch is missing from the listing; this message is a stand-in
            print('Path: {} already exists'.format(path+sub_path))


def plot_learning_curve(episodes, records, title, ylabel, figure_file):
    plt.figure()
    plt.plot(episodes, records, color='r', linestyle='-')
    plt.title(title)
    plt.xlabel('episode')
    plt.ylabel(ylabel)

    # Save before showing: after show() returns, the saved figure may be blank
    plt.savefig(figure_file)
    plt.show()


def scale_action(action, high, low):
    # Map an action from [-1, 1] to the environment's range [low, high]
    action = np.clip(action, -1, 1)
    weight = (high - low) / 2
    bias = (high + low) / 2
    action_ = action * weight + bias

    return action_
```
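The affine map inside `scale_action` is worth a quick worked example: it sends -1 to the lower bound, +1 to the upper bound and 0 to the midpoint of each action dimension. The sketch below re-implements the same formula on plain lists under the hypothetical name `rescale`, with made-up action bounds, so that it runs without NumPy.

```python
def rescale(action, low, high):
    """Map each component of `action` from [-1, 1] to [low_i, high_i]."""
    scaled = []
    for a, lo, hi in zip(action, low, high):
        a = max(-1.0, min(1.0, a))   # clip first, as scale_action does
        weight = (hi - lo) / 2.0
        bias = (hi + lo) / 2.0
        scaled.append(a * weight + bias)
    return scaled


low, high = [0.0, -2.0], [10.0, 2.0]   # hypothetical action bounds

print(rescale([-1.0, 0.5], low, high))  # [0.0, 1.0]
print(rescale([0.0, 3.0], low, high))   # [5.0, 2.0] (3.0 is clipped to 1.0 first)
```

When the environment's bounds are already -1 and 1 in every dimension, the map is the identity.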

In addition, we also provide test code, which is mainly used to test the training effect and observe the dynamic rendering of the environment (test.py):

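```python
import gym
import imageio
import argparse
from DDPG import DDPG
from utils import scale_action

parser = argparse.ArgumentParser()
# NOTE: the add_argument calls are missing from this listing. The argument names
# follow the description below; the default values are placeholders.
parser.add_argument('--filename', type=str, default='./LunarLander.gif')
parser.add_argument('--checkpoint_dir', type=str, default='./checkpoints/DDPG/')
parser.add_argument('--save_video', action='store_true')
parser.add_argument('--fps', type=int, default=30)
parser.add_argument('--render', action='store_true')

args = parser.parse_args()


def main():
    env = gym.make('LunarLanderContinuous-v2')
    # shape is a tuple such as (8,), so take its first entry as the dimension
    agent = DDPG(alpha=0.0003, beta=0.0003, state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.shape[0], actor_fc1_dim=400, actor_fc2_dim=300,
                 critic_fc1_dim=400, critic_fc2_dim=300, ckpt_dir=args.checkpoint_dir,
                 batch_size=256)
    # NOTE: the call that loads the trained weights is missing from this listing,
    # so as written the agent acts with freshly initialized networks.
    video = imageio.get_writer(args.filename, fps=args.fps)

    done = False
    observation = env.reset()
    while not done:
        if args.render:
            env.render()
        # No exploration noise at test time
        action = agent.choose_action(observation, train=False)
        action_ = scale_action(action.copy(), env.action_space.high, env.action_space.low)
        observation_, reward, done, info = env.step(action_)
        observation = observation_
        if args.save_video:
            video.append_data(env.render(mode='rgb_array'))

    video.close()


if __name__ == '__main__':
    main()
```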

The test script has five parameters: filename is the path where the animation of the environment is saved, checkpoint_dir is the path of the weights to load, save_video controls whether the animation is saved, fps is the frame rate of the animation, and render controls whether environment rendering is enabled. You only need to adjust save_video and render; the rest can be left at their defaults.

# 5 Experimental results

The average reward curve shows that the algorithm tends to converge after about 700 iterations.

This is the test rendering; the agent completes the landing task well!

# 6 Conclusion

This article explained the principle and code implementation of the DDPG algorithm. Although it is a very good algorithm, it still has problems that need to be addressed, such as overestimation. In a later article we will explain the TD3 algorithm, which builds on DDPG, overcomes some of its problems and thereby significantly improves performance.

Posted by shanejeffery86 on Wed, 01 Jun 2022 21:31:58 +0530