# Introduction

The Deep Deterministic Policy Gradient (DDPG) algorithm is an off-policy deep reinforcement learning algorithm proposed by the DeepMind team specifically for continuous control problems. It borrows several ideas from the Deep Q-Network (DQN) algorithm. This article walks you through the algorithm. Links to the paper and code are below.

Paper: https://arxiv.org/pdf/1509.02971.pdf

Code: https://github.com/indigoLovee/DDPG

Please star the repository if you like it!

# 1 Introduction to DDPG Algorithm

Before DDPG, there were two main ways to handle continuous action spaces. The first was to discretize the continuous actions and then apply a reinforcement learning algorithm for discrete actions (such as DQN). The second was to apply a Policy Gradient (PG) algorithm (e.g., REINFORCE) directly. However, discretization deviates from engineering practice to some extent, and PG algorithms often perform unsatisfactorily on continuous control problems. DDPG was proposed to address this and has achieved very good results on many continuous control problems.

DDPG is an off-policy deep reinforcement learning algorithm under the Actor-Critic (AC) framework. It therefore includes an Actor network and a Critic network. Each network is updated according to its own update rule, with the overall goal of maximizing the expected cumulative return.

# 2 DDPG algorithm principle

DDPG combines the deterministic policy gradient algorithm with techniques from DQN. When we discussed DQN, we explained two important techniques in detail: experience replay and the target network. Specifically, DDPG relies on the following three key techniques:

(1) Experience replay: the transitions $(s, a, r, s', \text{done})$ collected by the agent are stored in a Replay Buffer and sampled in batches when the network parameters are updated.

(2) Target network: in addition to the Actor and Critic networks, a separate pair of Target Actor and Target Critic networks is used to estimate the target value. The target networks are updated with a soft update so that their parameters do not change too quickly.

(3) Noise exploration: a deterministic policy outputs deterministic actions and therefore lacks exploration. During training, noise is added to the actions output by the Actor network to give the agent some ability to explore.

## 2.1 Experience Replay

Experience replay is a technique that stabilizes the probability distribution of the experience used for learning, which improves training stability. It has two key steps, "storage" and "replay":

Storage: experiences are stored in the experience pool in the form $(s, a, r, s', \text{done})$.

Replay: one or more experiences are sampled from the experience pool according to some rule.

From the storage point of view, experience replay can be divided into centralized replay and distributed replay:

Centralized replay: a single agent runs in a single environment and stores its experiences in one experience pool.

Distributed replay: multiple agents run simultaneously in multiple environments and store their experiences in a shared experience pool. Because multiple agents generate experience at the same time, experience is collected faster, at the cost of more resources.

From the sampling point of view, experience replay can be divided into uniform replay and prioritized replay:

Uniform replay: experiences are sampled from the experience pool with equal probability.

Prioritized replay: each experience in the pool is assigned a priority, and experiences with higher priority are more likely to be sampled. A common practice is that if experience $i$ has priority $p_i$, the probability of selecting it is:

$$P_i = \frac{p_i}{\sum_k p_k}$$

For details on prioritized replay, refer to the paper: Prioritized Experience Replay
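To make proportional sampling concrete, here is a minimal sketch assuming the simple rule "probability proportional to priority". The helper names `priority_probabilities` and `sample_prioritized` and the priority values are made up for illustration; they are not part of the repository, which uses uniform replay.

```python
import numpy as np


def priority_probabilities(priorities):
    """Turn non-negative priorities p_i into probabilities p_i / sum_k p_k."""
    priorities = np.asarray(priorities, dtype=np.float64)
    return priorities / priorities.sum()


def sample_prioritized(priorities, batch_size, rng):
    """Draw experience indices (with replacement), favouring high priorities."""
    probs = priority_probabilities(priorities)
    return rng.choice(len(probs), size=batch_size, p=probs)


# Hypothetical priorities for four stored experiences
priorities = [1.0, 1.0, 2.0, 4.0]
probs = priority_probabilities(priorities)  # 0.125, 0.125, 0.25, 0.5

rng = np.random.default_rng(0)
batch = sample_prioritized(priorities, batch_size=5, rng=rng)
print(probs)
print(batch)
```

The last experience has four times the priority of the first, so it is four times as likely to be drawn in each sample.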

Advantages of experience replay:

1. When training the Q network, it breaks the correlation between consecutive samples so that the data is closer to independent and identically distributed, which reduces the variance of parameter updates and speeds up convergence.

2. Experience can be reused, so data efficiency is high, which is especially useful when data is expensive to collect.

Disadvantages of experience replay:

It cannot be applied directly to episodic-update or multi-step learning algorithms. However, when experience replay is combined with Q-learning, which uses single-step updates, this limitation does not arise.

The code uses centralized uniform replay, as follows:

```python
import numpy as np


class ReplayBuffer:
    def __init__(self, max_size, state_dim, action_dim, batch_size):
        self.mem_size = max_size
        self.batch_size = batch_size
        self.mem_cnt = 0  # total number of transitions stored so far

        self.state_memory = np.zeros((self.mem_size, state_dim))
        self.action_memory = np.zeros((self.mem_size, action_dim))
        self.reward_memory = np.zeros((self.mem_size, ))
        self.next_state_memory = np.zeros((self.mem_size, state_dim))
        # np.bool was removed in recent NumPy releases; the builtin bool is equivalent
        self.terminal_memory = np.zeros((self.mem_size, ), dtype=bool)

    def store_transition(self, state, action, reward, state_, done):
        # Overwrite the oldest entry once the buffer is full (ring buffer)
        mem_idx = self.mem_cnt % self.mem_size

        self.state_memory[mem_idx] = state
        self.action_memory[mem_idx] = action
        self.reward_memory[mem_idx] = reward
        self.next_state_memory[mem_idx] = state_
        self.terminal_memory[mem_idx] = done

        self.mem_cnt += 1

    def sample_buffer(self):
        # Sample uniformly, without replacement, from the filled part of the buffer
        mem_len = min(self.mem_size, self.mem_cnt)
        batch = np.random.choice(mem_len, self.batch_size, replace=False)

        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        states_ = self.next_state_memory[batch]
        terminals = self.terminal_memory[batch]

        return states, actions, rewards, states_, terminals

    def ready(self):
        return self.mem_cnt >= self.batch_size
```

## 2.2 Target Network

Since DDPG is based on the AC framework, it must contain an Actor network and a Critic network. In addition, each network has a corresponding target network, so DDPG includes four networks: the Actor network $\mu(\cdot \mid \theta^{\mu})$, the Critic network $Q(\cdot \mid \theta^{Q})$, the Target Actor network $\mu'(\cdot \mid \theta^{\mu'})$ and the Target Critic network $Q'(\cdot \mid \theta^{Q'})$. This section introduces the update process of DDPG, how the target networks are updated, and why target networks are introduced.

### 2.2.1 Algorithm update process

The update adjusts the parameters of the Actor and Critic networks: the Actor network is updated by maximizing the expected cumulative return, and the Critic network is updated by minimizing the error between its estimated value and the target value. In the training phase, we sample a batch of data from the Replay Buffer. Assume a sampled transition is $(s, a, r, s', \text{done})$.

Critic network: first use the Target Actor network to compute the action in state $s'$:

$$a' = \mu'(s' \mid \theta^{\mu'})$$

Note that no noise is added to this action. Then use the Target Critic network to compute the target value of the state-action pair $(s, a)$:

$$y = r + \gamma (1 - \text{done}) \, Q'(s', a' \mid \theta^{Q'})$$

Then use the Critic network to compute the estimated value of the state-action pair $(s, a)$:

$$q = Q(s, a \mid \theta^{Q})$$

Finally, use gradient descent to minimize the difference $L_c$ between the estimated value and the target value, which updates the parameters of the Critic network:

$$L_c = (y - q)^2$$

This process is very similar to the DQN algorithm.

Actor network: first use the Actor network to compute the action in state $s$:

$$a_{new} = \mu(s \mid \theta^{\mu})$$

Note again that no noise is added to this action. Then use the Critic network to compute the estimated value of the state-action pair $(s, a_{new})$ (i.e., the expected cumulative return):

$$q_{new} = Q(s, a_{new} \mid \theta^{Q})$$

Finally, use gradient ascent to maximize the expected cumulative return $q_{new}$ (the code minimizes $-q_{new}$ with gradient descent, which is equivalent), which updates the parameters of the Actor network.

This completes the update of the Actor and Critic networks.
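The Critic target $y = r + \gamma (1 - \text{done}) Q'(s', a')$ and the squared-error loss can be checked by hand on a tiny batch. The sketch below uses NumPy only. The numbers and the helper names `critic_targets` and `critic_loss` are made up for illustration, and the loss is averaged over the batch as in a mean squared error.

```python
import numpy as np


def critic_targets(rewards, dones, next_q, gamma=0.99):
    """y = r + gamma * (1 - done) * Q'(s', a'); terminal transitions keep only r."""
    rewards = np.asarray(rewards, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * not_done * np.asarray(next_q, dtype=np.float64)


def critic_loss(q, y):
    """Mean squared error between the estimates q and the targets y."""
    q = np.asarray(q, dtype=np.float64)
    return float(np.mean((y - q) ** 2))


# Made-up batch of three transitions; the last one ends the episode
rewards = [1.0, -2.0, 10.0]
dones = [False, False, True]
next_q = [5.0, 3.0, 7.0]  # stand-in for Q'(s', a') from the target networks
q = [5.0, 1.0, 8.0]       # stand-in for Q(s, a) from the Critic

y = critic_targets(rewards, dones, next_q, gamma=0.9)
loss = critic_loss(q, y)
print(y)     # targets 5.5, 0.7 and 10.0
print(loss)  # (0.25 + 0.09 + 4.0) / 3
```

For the terminal transition the bootstrap term is dropped, so its target is just the reward of 10.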

### 2.2.2 Update of the target network

DDPG updates the target networks with a soft update, also known as an Exponential Moving Average (EMA). That is, a learning rate (also called momentum) $\tau$ is introduced, a weighted average of the old target network parameters and the corresponding new network parameters is computed, and the result is assigned to the target network:

Target Actor network:

$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$$

Target Critic network:

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$$

The learning rate (momentum) $\tau \in (0, 1)$ usually takes the value 0.005.
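A small NumPy sketch shows how slowly a soft update with weight `tau` moves the target parameters. The `soft_update` helper and the toy parameter arrays are assumptions for illustration; the repository applies the same formula to PyTorch parameters.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.005):
    """Return tau * online + (1 - tau) * target for every parameter array."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]


# Toy "networks": two parameter arrays each
online = [np.array([1.0, 2.0]), np.array([[0.5]])]
target = [np.zeros(2), np.zeros((1, 1))]

# tau = 1.0 copies the online parameters outright (a hard update)
hard = soft_update(target, online, tau=1.0)

# Repeated soft updates approach the online parameters geometrically:
# starting from zero, after n steps target = (1 - (1 - tau)^n) * online
steps = 100
for _ in range(steps):
    target = soft_update(target, online, tau=0.005)
expected_fraction = 1.0 - (1.0 - 0.005) ** steps
print(expected_fraction)
print(target[0])
```

With `tau = 0.005`, the target has covered only about 39% of the distance to the online parameters after 100 updates, which is what keeps the target value stable.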

### 2.2.3 The purpose of introducing the target network

As mentioned earlier, the purpose of introducing target networks is the same as in DQN. Because the parameters of the target network are updated slowly through soft updates, its output is more stable, so the target value it computes is also more stable, which in turn makes the learning of the Critic network more stable. Imagine instead computing the target value directly with the Critic network:

$$y = r + \gamma (1 - \text{done}) \, Q(s', a' \mid \theta^{Q})$$

Because the Critic network is constantly updated and fluctuates strongly, the target value $y$ also changes drastically. If the Critic's estimate is chasing a rapidly changing target, the network easily oscillates, which can make the whole learning process collapse.

That is one purpose; there is another. Computing the target value with the Critic network itself (as shown above) is essentially bootstrapping: $Q(s, a \mid \theta^{Q})$ is pushed toward $y$, which is itself computed from $Q(s', a' \mid \theta^{Q})$. This easily leads to overestimation, because when $Q(s', a' \mid \theta^{Q})$ is overestimated, the error is passed back to $Q(s, a \mid \theta^{Q})$, which is then overestimated too. This forms a positive feedback loop that eventually causes the whole network to overestimate.

Bootstrapping means that computing the current value function uses the value function (state value or action value) of a subsequent state or state-action pair.

So what is the problem with overestimation? If the overestimation is uniform, it does not affect the final decision; if it is not uniform, it can affect the final decision significantly. An example makes this easy to see.

In the figure above, we assume that there are three actions whose true action values are 200, 100, and 230. The action value of action 3 is the highest, so the agent chooses action 3. If the network overestimates uniformly, say by 100, the estimated action values become 300, 200, and 330. The action value of action 3 is still the highest, and the agent still chooses action 3. Therefore, uniform overestimation has no effect on the final decision.

Similarly, we assume that there are three actions, and the actual action value of each action is 200, 100, and 230. Obviously, the action value of action 3 is the highest, and the agent will choose action 3. If the network is unevenly overestimated, the estimated action values are 280, 300, and 240 in turn. At this time, it is obvious that the action value of action 2 is the highest, and the agent will choose action 2. But actually the true action value of action 2 is the lowest, that is, the action is the worst. Therefore, uneven overestimation can have a large impact on the final decision.
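The two cases above can be replayed in a few lines of plain Python. The `best_action` helper is a hypothetical name used only for this illustration; the values are the ones from the example.

```python
def best_action(values):
    """Return the 1-based index of the action with the highest value."""
    return max(range(len(values)), key=lambda i: values[i]) + 1


true_values = [200, 100, 230]

# Uniform overestimation: every action value is inflated by the same amount
uniform = [v + 100 for v in true_values]

# Non-uniform overestimation: the estimates from the second example
non_uniform = [280, 300, 240]

print(best_action(true_values))  # 3
print(best_action(uniform))      # 3, the ranking is unchanged
print(best_action(non_uniform))  # 2, the truly worst action now looks best
```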

In practice, however, the network's overestimation is non-uniform, so this problem needs to be avoided, which essentially means addressing the bootstrapping problem. Using the target network to compute the target value alleviates it:

$$y = r + \gamma (1 - \text{done}) \, Q'(s', a' \mid \theta^{Q'})$$

Now, when $Q(s, a \mid \theta^{Q})$ is pushed toward the target value $y$, it is no longer bootstrapping on its own output (compare this with the definition of bootstrapping above).

## 2.3 Noise exploration

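Exploration is crucial for an agent, and a deterministic policy inherently lacks it, so we need to add noise to the output action artificially. In DDPG, the authors use the Ornstein-Uhlenbeck (OU) process as the action noise. The OU process is defined by the following stochastic differential equation (taking the one-dimensional case as an example):

$$dN_t = \theta (\mu - N_t)\,dt + \sigma\,dB_t$$

where $\theta$, $\mu$, $\sigma$ are parameters ($\theta > 0$, $\sigma > 0$) and $B_t$ is standard Brownian motion. When the initial perturbation is a point mass at the origin (i.e., $N_0 = 0$) and $\mu = 0$, the solution of the above equation is

$$N_t = \sigma \int_0^t e^{\theta(\tau - t)}\,dB_\tau$$

(Proof: substituting $dN_t = \theta(\mu - N_t)\,dt + \sigma\,dB_t$ into $d(N_t e^{\theta t}) = \theta N_t e^{\theta t}\,dt + e^{\theta t}\,dN_t$ and simplifying gives $d(N_t e^{\theta t}) = \mu\theta e^{\theta t}\,dt + \sigma e^{\theta t}\,dB_t$. Integrating from 0 to $t$ gives $N_t e^{\theta t} - N_0 = \mu(e^{\theta t} - 1) + \sigma \int_0^t e^{\theta\tau}\,dB_\tau$. Setting $N_0 = 0$ and $\mu = 0$ and simplifying gives the result.)

This solution has mean 0, variance $\frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta t}\right)$, and covariance

$$\operatorname{Cov}(N_t, N_s) = \frac{\sigma^2}{2\theta}\left(e^{-\theta|t-s|} - e^{-\theta(t+s)}\right)$$

(Proof: since the mean is 0, $\operatorname{Cov}(N_t, N_s) = E[N_t N_s] = \sigma^2 e^{-\theta(t+s)} E\left[\int_0^t e^{\theta\tau}\,dB_\tau \int_0^s e^{\theta\tau}\,dB_\tau\right]$. The Itô isometry gives $E\left[\int_0^t e^{\theta\tau}\,dB_\tau \int_0^s e^{\theta\tau}\,dB_\tau\right] = \int_0^{\min(t,s)} e^{2\theta\tau}\,d\tau$, so $\operatorname{Cov}(N_t, N_s) = \sigma^2 e^{-\theta(t+s)} \int_0^{\min(t,s)} e^{2\theta\tau}\,d\tau$, and further simplification gives the result.)

For $t \neq s$ (with $t, s > 0$) we always have $|t - s| < t + s$, so $\operatorname{Cov}(N_t, N_s) > 0$. This shows that the OU process makes adjacent perturbations positively correlated, so that consecutive actions drift in similar directions.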
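The closed-form covariance of the OU process started at the origin with $\mu = 0$, $\operatorname{Cov}(N_t, N_s) = \frac{\sigma^2}{2\theta}\left(e^{-\theta|t-s|} - e^{-\theta(t+s)}\right)$, can be evaluated numerically to see the positive correlation. The function names `ou_cov` and `ou_var` and the chosen time points are made up for this sketch; the default `theta` and `sigma` match those used by the noise class in this article.

```python
import math


def ou_cov(t, s, theta=0.2, sigma=0.15):
    """Cov(N_t, N_s) for the OU process with N_0 = 0 and mu = 0."""
    return sigma ** 2 / (2.0 * theta) * (
        math.exp(-theta * abs(t - s)) - math.exp(-theta * (t + s))
    )


def ou_var(t, theta=0.2, sigma=0.15):
    """Var(N_t) for the same process."""
    return sigma ** 2 / (2.0 * theta) * (1.0 - math.exp(-2.0 * theta * t))


cov_near = ou_cov(1.0, 1.1)  # perturbations close together in time
cov_far = ou_cov(1.0, 5.0)   # perturbations far apart in time
print(cov_near, cov_far)

# Setting s = t in the covariance recovers the variance
print(ou_cov(2.0, 2.0), ou_var(2.0))
```

Both covariances are positive, and the nearby pair is more strongly correlated than the distant pair, which is the "drift in similar directions" behaviour described above.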

Code implementation of OU noise:

```python
import numpy as np


class OUActionNoise:
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.theta = theta
        self.mu = mu
        self.sigma = sigma
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        # Euler-Maruyama discretization of dN = theta * (mu - N) dt + sigma dB
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x

        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
```

After seeing OU noise, many readers may feel it is too complicated. In fact, OU noise is not necessary: it can be replaced with noise drawn from a normal distribution, and experimental results confirm this. For this reason, the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm drops OU noise in favor of normally distributed noise, which is simpler to implement.

Also, remember that noise is added to the Actor's output only when the agent interacts with the environment during training. Do not add noise in the inference phase, and do not add noise when updating network parameters, as noted earlier. The agent needs the ability to explore only while collecting experience during training; it is not needed at inference.
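As a rough illustration of Gaussian exploration noise, the sketch below perturbs a deterministic action only when a `train` flag is set and clips the result to $[-1, 1]$. The `explore` function, the noise scale and the example action are assumptions for this example and not the repository's exact code.

```python
import numpy as np


def explore(action, noise_std, rng, train=True):
    """Add zero-mean Gaussian noise to a deterministic action during training only."""
    action = np.asarray(action, dtype=np.float64)
    if not train:
        return action  # inference: use the deterministic action unchanged
    noise = rng.normal(loc=0.0, scale=noise_std, size=action.shape)
    # Keep the perturbed action inside the range produced by the tanh output layer
    return np.clip(action + noise, -1.0, 1.0)


rng = np.random.default_rng(0)
deterministic = np.array([0.95, -0.2])  # stand-in for the Actor's output

noisy = explore(deterministic, noise_std=0.1, rng=rng, train=True)
greedy = explore(deterministic, noise_std=0.1, rng=rng, train=False)
print(noisy)
print(greedy)
```

Here the noise is sampled independently for each action dimension, which is one possible design choice; sharing a single sample across dimensions is another.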

# 3 DDPG algorithm pseudo code

# 4 Code implementation

Code implementation of Actor and Critic network (networks.py):

```python
import torch as T
import torch.nn as nn
import torch.optim as optim

device = T.device("cuda:0" if T.cuda.is_available() else "cpu")


def weight_init(m):
    # Xavier initialization for linear layers, unit scale and zero shift for batch norm
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0.0)
    elif isinstance(m, nn.BatchNorm1d):
        nn.init.constant_(m.weight, 1.0)
        nn.init.constant_(m.bias, 0.0)


class ActorNetwork(nn.Module):
    def __init__(self, alpha, state_dim, action_dim, fc1_dim, fc2_dim):
        super(ActorNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, fc1_dim)
        self.ln1 = nn.LayerNorm(fc1_dim)
        self.fc2 = nn.Linear(fc1_dim, fc2_dim)
        self.ln2 = nn.LayerNorm(fc2_dim)
        self.action = nn.Linear(fc2_dim, action_dim)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.apply(weight_init)
        self.to(device)

    def forward(self, state):
        x = T.relu(self.ln1(self.fc1(state)))
        x = T.relu(self.ln2(self.fc2(x)))
        # tanh bounds every action component to [-1, 1]
        action = T.tanh(self.action(x))

        return action

    def save_checkpoint(self, checkpoint_file):
        T.save(self.state_dict(), checkpoint_file)

    def load_checkpoint(self, checkpoint_file):
        # map_location lets a checkpoint saved on GPU be loaded on a CPU-only machine
        self.load_state_dict(T.load(checkpoint_file, map_location=device))


class CriticNetwork(nn.Module):
    def __init__(self, beta, state_dim, action_dim, fc1_dim, fc2_dim):
        super(CriticNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, fc1_dim)
        self.ln1 = nn.LayerNorm(fc1_dim)
        self.fc2 = nn.Linear(fc1_dim, fc2_dim)
        self.ln2 = nn.LayerNorm(fc2_dim)
        self.fc3 = nn.Linear(action_dim, fc2_dim)
        self.q = nn.Linear(fc2_dim, 1)

        self.optimizer = optim.Adam(self.parameters(), lr=beta, weight_decay=0.001)
        self.apply(weight_init)
        self.to(device)

    def forward(self, state, action):
        # State and action are embedded separately, then summed before the output layer
        x_s = T.relu(self.ln1(self.fc1(state)))
        x_s = self.ln2(self.fc2(x_s))
        x_a = self.fc3(action)
        x = T.relu(x_s + x_a)
        q = self.q(x)

        return q

    def save_checkpoint(self, checkpoint_file):
        T.save(self.state_dict(), checkpoint_file)

    def load_checkpoint(self, checkpoint_file):
        self.load_state_dict(T.load(checkpoint_file, map_location=device))
```

Note: the first two Linear layers in the Actor and Critic networks are each followed by a Layer Normalization (LN) layer. During my experiments I found an interesting phenomenon: without the LN layers, or with Batch Normalization (BN) layers instead, training tends to collapse or performs very poorly. I am not entirely sure of the reason. Interested readers can clone the code and try it. If you know the reason, feel free to get in touch and discuss.

Code implementation of DDPG algorithm (DDPG.py):

```python
import torch as T
import torch.nn.functional as F
import numpy as np
from networks import ActorNetwork, CriticNetwork
from buffer import ReplayBuffer

device = T.device("cuda:0" if T.cuda.is_available() else "cpu")


class DDPG:
    def __init__(self, alpha, beta, state_dim, action_dim, actor_fc1_dim,
                 actor_fc2_dim, critic_fc1_dim, critic_fc2_dim, ckpt_dir,
                 gamma=0.99, tau=0.005, action_noise=0.1, max_size=1000000,
                 batch_size=256):
        self.gamma = gamma
        self.tau = tau
        self.action_noise = action_noise
        self.checkpoint_dir = ckpt_dir

        self.actor = ActorNetwork(alpha=alpha, state_dim=state_dim, action_dim=action_dim,
                                  fc1_dim=actor_fc1_dim, fc2_dim=actor_fc2_dim)
        self.target_actor = ActorNetwork(alpha=alpha, state_dim=state_dim, action_dim=action_dim,
                                         fc1_dim=actor_fc1_dim, fc2_dim=actor_fc2_dim)
        self.critic = CriticNetwork(beta=beta, state_dim=state_dim, action_dim=action_dim,
                                    fc1_dim=critic_fc1_dim, fc2_dim=critic_fc2_dim)
        self.target_critic = CriticNetwork(beta=beta, state_dim=state_dim, action_dim=action_dim,
                                           fc1_dim=critic_fc1_dim, fc2_dim=critic_fc2_dim)

        self.memory = ReplayBuffer(max_size=max_size, state_dim=state_dim, action_dim=action_dim,
                                   batch_size=batch_size)

        # tau=1.0 copies the online weights into the target networks at start-up
        self.update_network_parameters(tau=1.0)

    def update_network_parameters(self, tau=None):
        if tau is None:
            tau = self.tau

        # Soft update: target <- tau * online + (1 - tau) * target
        for actor_params, target_actor_params in zip(self.actor.parameters(),
                                                     self.target_actor.parameters()):
            target_actor_params.data.copy_(tau * actor_params + (1 - tau) * target_actor_params)

        for critic_params, target_critic_params in zip(self.critic.parameters(),
                                                       self.target_critic.parameters()):
            target_critic_params.data.copy_(tau * critic_params + (1 - tau) * target_critic_params)

    def remember(self, state, action, reward, state_, done):
        self.memory.store_transition(state, action, reward, state_, done)

    def choose_action(self, observation, train=True):
        self.actor.eval()
        # Wrapping in np.array avoids the slow list-of-ndarrays conversion path
        state = T.tensor(np.array([observation]), dtype=T.float).to(device)
        action = self.actor.forward(state).squeeze()

        if train:
            # A single scalar Gaussian sample is drawn and broadcast to all action dimensions
            noise = T.tensor(np.random.normal(loc=0.0, scale=self.action_noise),
                             dtype=T.float).to(device)
            action = T.clamp(action+noise, -1, 1)
        self.actor.train()

        return action.detach().cpu().numpy()

    def learn(self):
        if not self.memory.ready():
            return

        states, actions, reward, states_, terminals = self.memory.sample_buffer()
        states_tensor = T.tensor(states, dtype=T.float).to(device)
        actions_tensor = T.tensor(actions, dtype=T.float).to(device)
        rewards_tensor = T.tensor(reward, dtype=T.float).to(device)
        next_states_tensor = T.tensor(states_, dtype=T.float).to(device)
        terminals_tensor = T.tensor(terminals).to(device)

        # Critic target: y = r + gamma * Q'(s', mu'(s')), with Q' = 0 for terminal states
        with T.no_grad():
            next_actions_tensor = self.target_actor.forward(next_states_tensor)
            q_ = self.target_critic.forward(next_states_tensor, next_actions_tensor).view(-1)
            q_[terminals_tensor] = 0.0
            target = rewards_tensor + self.gamma * q_
        q = self.critic.forward(states_tensor, actions_tensor).view(-1)

        critic_loss = F.mse_loss(q, target.detach())
        self.critic.optimizer.zero_grad()
        critic_loss.backward()
        self.critic.optimizer.step()

        # Actor update: maximize Q(s, mu(s)) by minimizing its negative
        new_actions_tensor = self.actor.forward(states_tensor)
        actor_loss = -T.mean(self.critic(states_tensor, new_actions_tensor))
        self.actor.optimizer.zero_grad()
        actor_loss.backward()
        self.actor.optimizer.step()

        self.update_network_parameters()

    def save_models(self, episode):
        self.actor.save_checkpoint(self.checkpoint_dir + 'Actor/DDPG_actor_{}.pth'.format(episode))
        print('Saving actor network successfully!')
        self.target_actor.save_checkpoint(self.checkpoint_dir +
                                          'Target_actor/DDPG_target_actor_{}.pth'.format(episode))
        print('Saving target_actor network successfully!')
        self.critic.save_checkpoint(self.checkpoint_dir + 'Critic/DDPG_critic_{}'.format(episode))
        print('Saving critic network successfully!')
        self.target_critic.save_checkpoint(self.checkpoint_dir +
                                           'Target_critic/DDPG_target_critic_{}'.format(episode))
        print('Saving target critic network successfully!')

    def load_models(self, episode):
        self.actor.load_checkpoint(self.checkpoint_dir + 'Actor/DDPG_actor_{}.pth'.format(episode))
        print('Loading actor network successfully!')
        self.target_actor.load_checkpoint(self.checkpoint_dir +
                                          'Target_actor/DDPG_target_actor_{}.pth'.format(episode))
        print('Loading target_actor network successfully!')
        self.critic.load_checkpoint(self.checkpoint_dir + 'Critic/DDPG_critic_{}'.format(episode))
        print('Loading critic network successfully!')
        self.target_critic.load_checkpoint(self.checkpoint_dir +
                                           'Target_critic/DDPG_target_critic_{}'.format(episode))
        print('Loading target critic network successfully!')
```

The simulation environment is LunarLanderContinuous-v2 from the gym library, so gym needs to be installed first. Activate the corresponding Python environment in Anaconda and run the following command:

pip install gym

However, a gym installation done this way only includes a small number of built-in environments, such as the algorithmic, toy text and classic control environments, and cannot run LunarLanderContinuous-v2. Some additional dependencies must therefore be installed. For details, please refer to this blog post: AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander' solution. If your environment is already configured, you can skip this paragraph.

Training script (train.py):

```python
import gym
import numpy as np
import argparse
from DDPG import DDPG
from utils import create_directory, plot_learning_curve, scale_action

parser = argparse.ArgumentParser("DDPG parameters")
parser.add_argument('--max_episodes', type=int, default=1000)
parser.add_argument('--checkpoint_dir', type=str, default='./checkpoints/DDPG/')
parser.add_argument('--figure_file', type=str, default='./output_images/reward.png')

args = parser.parse_args()


def main():
    env = gym.make('LunarLanderContinuous-v2')
    agent = DDPG(alpha=0.0003, beta=0.0003, state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.shape[0], actor_fc1_dim=400, actor_fc2_dim=300,
                 critic_fc1_dim=400, critic_fc2_dim=300, ckpt_dir=args.checkpoint_dir,
                 batch_size=256)
    create_directory(args.checkpoint_dir,
                     sub_paths=['Actor', 'Target_actor', 'Critic', 'Target_critic'])

    reward_history = []
    avg_reward_history = []
    for episode in range(args.max_episodes):
        done = False
        total_reward = 0
        # Older gym API: reset() returns the observation, step() returns a 4-tuple
        observation = env.reset()
        while not done:
            action = agent.choose_action(observation, train=True)
            # The buffer stores the action in [-1, 1]; the environment gets the scaled action
            action_ = scale_action(action.copy(), env.action_space.high, env.action_space.low)
            observation_, reward, done, info = env.step(action_)
            agent.remember(observation, action, reward, observation_, done)
            agent.learn()
            total_reward += reward
            observation = observation_

        reward_history.append(total_reward)
        # Moving average over the last 100 episodes
        avg_reward = np.mean(reward_history[-100:])
        avg_reward_history.append(avg_reward)
        print('Ep: {} Reward: {:.1f} AvgReward: {:.1f}'.format(episode+1, total_reward, avg_reward))

        if (episode + 1) % 200 == 0:
            agent.save_models(episode+1)

    episodes = [i+1 for i in range(args.max_episodes)]
    plot_learning_curve(episodes, avg_reward_history, title='AvgReward',
                        ylabel='reward', figure_file=args.figure_file)


if __name__ == '__main__':
    main()
```

The training script has three parameters: max_episodes is the number of training episodes, checkpoint_dir is the path where the trained weights are saved, and figure_file is the path where the training result (the average reward curve) is saved. All of them can be left at their defaults.

The plotting function and the directory-creation function used during training are both placed in the utils.py script:

```python
import os
import numpy as np
import matplotlib.pyplot as plt


class OUActionNoise:
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.theta = theta
        self.mu = mu
        self.sigma = sigma
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x

        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)


def create_directory(path: str, sub_paths: list):
    for sub_path in sub_paths:
        if not os.path.exists(path + sub_path):
            os.makedirs(path + sub_path, exist_ok=True)
            print('Create path: {} successfully'.format(path+sub_path))
        else:
            print('Path: {} already exists'.format(path+sub_path))


def plot_learning_curve(episodes, records, title, ylabel, figure_file):
    plt.figure()
    plt.plot(episodes, records, color='r', linestyle='-')
    plt.title(title)
    plt.xlabel('episode')
    plt.ylabel(ylabel)

    # Save before show(): with interactive backends the figure can be empty after show()
    plt.savefig(figure_file)
    plt.show()


def scale_action(action, high, low):
    # Map an action in [-1, 1] linearly onto [low, high]
    action = np.clip(action, -1, 1)
    weight = (high - low) / 2
    bias = (high + low) / 2
    action_ = action * weight + bias

    return action_
```
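The `scale_action` helper in utils.py maps the Actor's output in $[-1, 1]$ onto the environment's action bounds. The short example below repeats the same arithmetic so that it runs on its own; the bounds `low` and `high` are made-up values chosen so the mapping is easy to verify by hand.

```python
import numpy as np


def scale_action(action, high, low):
    """Map an action in [-1, 1] linearly onto [low, high]."""
    action = np.clip(action, -1, 1)
    weight = (high - low) / 2
    bias = (high + low) / 2
    return action * weight + bias


# Hypothetical two-dimensional action space
low = np.array([-2.0, 0.0])
high = np.array([2.0, 10.0])

centre = scale_action(np.array([0.0, 0.0]), high, low)    # midpoint: 0 and 5
upper = scale_action(np.array([1.0, 1.0]), high, low)     # upper bounds: 2 and 10
clipped = scale_action(np.array([3.0, -3.0]), high, low)  # clipped first: 2 and 0
print(centre, upper, clipped)
```

For an environment whose bounds are already $[-1, 1]$, the mapping is the identity.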

In addition, we also provide test code, which is mainly used to test the training effect and observe the dynamic rendering of the environment (test.py):

```python
import gym
import imageio
import argparse
from DDPG import DDPG
from utils import scale_action

parser = argparse.ArgumentParser()
parser.add_argument('--filename', type=str, default='./output_images/LunarLander.gif')
parser.add_argument('--checkpoint_dir', type=str, default='./checkpoints/DDPG/')
# Note: with type=bool, argparse treats any non-empty string (even 'False') as True
parser.add_argument('--save_video', type=bool, default=True)
parser.add_argument('--fps', type=int, default=30)
parser.add_argument('--render', type=bool, default=True)

args = parser.parse_args()


def main():
    env = gym.make('LunarLanderContinuous-v2')
    agent = DDPG(alpha=0.0003, beta=0.0003, state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.shape[0], actor_fc1_dim=400, actor_fc2_dim=300,
                 critic_fc1_dim=400, critic_fc2_dim=300, ckpt_dir=args.checkpoint_dir,
                 batch_size=256)
    agent.load_models(1000)
    video = imageio.get_writer(args.filename, fps=args.fps)

    done = False
    observation = env.reset()
    while not done:
        if args.render:
            env.render()
        # train=False: no exploration noise at inference time
        action = agent.choose_action(observation, train=False)
        action_ = scale_action(action.copy(), env.action_space.high, env.action_space.low)
        observation_, reward, done, info = env.step(action_)
        observation = observation_
        if args.save_video:
            video.append_data(env.render(mode='rgb_array'))

    # Close the writer so the GIF is flushed to disk
    video.close()


if __name__ == '__main__':
    main()
```

The test script has five parameters: filename is the path where the animation of the environment is saved, checkpoint_dir is the path the weights are loaded from, save_video controls whether the animation is saved, fps is the frame rate of the animation, and render controls whether environment rendering is enabled. You only need to adjust save_video and render; the rest can be left at their defaults.
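One caveat when adjusting save_video and render from the command line: with `type=bool`, argparse calls `bool()` on the raw string, and `bool('False')` is `True`. The sketch below demonstrates this and shows one possible workaround. The `str2bool` helper is hypothetical and not part of the repository.

```python
import argparse


def str2bool(value):
    """Parse common textual spellings of a boolean for argparse."""
    text = str(value).strip().lower()
    if text in ('true', '1', 'yes', 'y'):
        return True
    if text in ('false', '0', 'no', 'n'):
        return False
    raise argparse.ArgumentTypeError('expected a boolean, got {!r}'.format(value))


# Same declaration style as in test.py
naive = argparse.ArgumentParser()
naive.add_argument('--render', type=bool, default=True)

# Variant that uses the hypothetical helper instead
fixed = argparse.ArgumentParser()
fixed.add_argument('--render', type=str2bool, default=True)

naive_args = naive.parse_args(['--render', 'False'])
fixed_args = fixed.parse_args(['--render', 'False'])
print(naive_args.render)  # True, because bool('False') is True
print(fixed_args.render)  # False
```

With the original declaration, the only way to get `False` from the command line is to pass an empty string, for example `--render ""`.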

# 5 Experimental results

The average reward curve shows that the algorithm tends to converge after about 700 episodes.

This is the test rendering, the agent can complete the landing task well!

# 6 Conclusion

This article mainly explains the principle and code implementation of the DDPG algorithm. Although it is a very good algorithm, there are still some problems that need to be improved, such as overestimation. Later, we will explain the TD3 algorithm, which actually makes some improvements on the basis of the DDPG algorithm, overcomes some problems in the DDPG algorithm, and thus significantly improves the performance of the algorithm.