Network migration, debugging and tuning for MindSpore learning (taking ResNet50 as an example)

Migration process

  • Migration objectives: network implementation, dataset, convergence accuracy, training performance.
  • Reproduction metrics: reproduce not only the training stage but also the inference stage. Slight differences are within the normal fluctuation range.
  • Reproduction steps: single-step reproduction + network integration. First reproduce the results of a single step, that is, obtain the state of the network after only the first step has been executed (the results after data preprocessing, weight initialization, forward calculation, loss calculation, backward gradient calculation and the optimizer update). Then iterate many times to reproduce the results of the entire network.
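
One way to check a single-step reproduction is to dump the values produced by each stage in both frameworks and compare them within a tolerance. The sketch below is only an illustration in plain Python: the function name compare_stage, the tolerances and the numbers are all made up for the example and are not part of MindSpore or PyTorch.

```python
import math

def compare_stage(name, reference, migrated, rel_tol=1e-3, abs_tol=1e-5):
    """Return (name, ok, max_abs_diff) for two equally long lists of floats."""
    if len(reference) != len(migrated):
        raise ValueError(f"{name}: length mismatch")
    diffs = [abs(r - m) for r, m in zip(reference, migrated)]
    ok = all(math.isclose(r, m, rel_tol=rel_tol, abs_tol=abs_tol)
             for r, m in zip(reference, migrated))
    return name, ok, max(diffs, default=0.0)

# Made-up numbers standing in for values dumped after the first step.
reference_run = {"forward": [0.12, -1.30, 2.05], "loss": [2.3026]}
migrated_run = {"forward": [0.12001, -1.30002, 2.05001], "loss": [2.3100]}

report = [compare_stage(k, reference_run[k], migrated_run[k]) for k in reference_run]
for name, ok, diff in report:
    print(f"{name}: {'match' if ok else 'MISMATCH'} (max abs diff {diff:.2e})")
```

Comparing stage by stage (preprocessing, forward, loss, gradients, update) makes it easier to see where the two implementations first diverge.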


Network analysis

  • MindSpore supports both a dynamic graph mode (PyNative) and a static graph mode (Graph). The dynamic graph mode is flexible and easy to debug, so it is mainly used for network debugging. The static graph mode has better performance and is mainly used for network training. When analyzing missing operators and functions, analyze the two modes separately.

  • If missing operators or functions are found, first consider composing them from operators or functions that are already available.

  • ResNet series network structure

  • Operator analysis: the operator mapping table can be used for reference.
    Supported operators (PyTorch - MindSpore): nn.Conv2d - nn.Conv2d, nn.BatchNorm2d - nn.BatchNorm2d, nn.ReLU - nn.ReLU, nn.MaxPool2d - nn.MaxPool2d, nn.Linear - nn.Dense, torch.flatten - nn.Flatten
    Missing operator: nn.AdaptiveAvgPool2d

  • Replacing the missing operator: in the ResNet50 network, the input image shape is fixed at (N, 3, 224, 224), where N is the batch size, 3 is the number of channels, and 224 and 224 are the height and width of the image. The operators that change the image size in the network are Conv2d and MaxPool2d, and their effect on the shape is fixed. Therefore, the input and output shapes of nn.AdaptiveAvgPool2d can be determined in advance. Once these shapes are calculated, nn.AdaptiveAvgPool2d can be implemented with nn.AvgPool2d or ops.ReduceMean, so the missing operator is replaceable and does not affect network training.
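
The shape argument can be checked with a few lines of arithmetic. This is a minimal sketch in plain Python, assuming 'same' padding (output size = ceil(input / stride)) and assuming the adaptive pooling target is 1x1, as in the usual ResNet head; the helper names are hypothetical.

```python
def same_pad_out(size, stride):
    """Output length of a conv/pool layer with 'same' padding: ceil(size / stride)."""
    return -(-size // stride)

def resnet50_feature_size(input_size=224):
    # Strides of the layers that change the spatial size:
    # 7x7 conv, 3x3 max pool, then the first block of layer1..layer4.
    strides = [2, 2, 1, 2, 2, 2]
    size = input_size
    for stride in strides:
        size = same_pad_out(size, stride)
    return size

def global_avg_pool(feature_map):
    """Mean over H and W of one channel, i.e. a 1x1 adaptive average pool for that channel."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

size = resnet50_feature_size(224)
print(size)  # spatial size entering the final pooling layer
print(global_avg_pool([[1.0, 2.0], [3.0, 4.0]]))
```

Because the feature map entering the last pooling layer always has the same spatial size for a fixed input, a mean over the H and W axes gives the same result as the adaptive operator.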

  • Comparison of other functions

PyTorch feature            MindSpore counterpart
nn.init.kaiming_normal_    initializer(init='HeNormal')
nn.init.constant_          initializer(init='Constant')
nn.Sequential              nn.SequentialCell
nn.Module                  nn.Cell
torch.distributed          context.set_auto_parallel_context
torch.optim.SGD            nn.optim.SGD or nn.optim.Momentum

Network script development

  • CIFAR-10, CIFAR-100 dataset download:

  • CIFAR-10: 10 classes, 60000 32*32 color images in total. Data format: binary files. The data is processed in dataset.py.

    • Training set: 50000 images
    • Test set: 10000 images

  • ImageNet2012: 1000 classes, 224*224 color images. Data format: JPEG. The data is processed in dataset.py.

    • Training set: 1281167 images in total
    • Test set: 50000 images in total

Dataset processing

  • Data preprocessing with MindData mainly includes the following steps:
  1. Pass in the data path and read the data files.
  2. Parse the data.
  3. Process the data (common operations such as splitting, shuffling and data augmentation).
  4. Distribute the data (data is dispatched in units of batch_size; distributed training also involves dispatching across multiple machines).
  • The ResNet50 network uses the ImageNet2012 dataset (PyTorch version):
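# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
input_image = Image.open(filename)  # filename: path to the input image (placeholder)
preprocess = transforms.Compose([
    transforms.Resize(256),      # size assumed from the standard ResNet sample
    transforms.CenterCrop(224),  # matches the 224x224 network input
    transforms.ToTensor(),       # converts to a CHW tensor scaled to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model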
  • The main operations are Resize, CenterCrop and Normalize

Data processing developed with MindData:

"""Create train or eval dataset."""
from mindspore import dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as C
import mindspore.dataset.transforms.c_transforms as C2

# Create a dataset (dataset_path: data path, batch_size, rank_size: number of devices,
# rank_id: index of this device among all devices, do_train: training mode)
def create_dataset(dataset_path, batch_size=32, rank_size=1, rank_id=0, do_train=True):
    # num_parallel_workers: degree of parallelism for data processing
    # num_shards: total number of devices in distributed training, which equals the number of data shards
    # shard_id: index of the current device among all distributed training devices,
    #           which equals the index of the data shard for the current device
    data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=do_train,
                                     num_shards=rank_size, shard_id=rank_id)

    mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
    std = [0.229 * 255, 0.224 * 255, 0.225 * 255]

    # define map operations
    # Decode, Resize(256), CenterCrop(224) and HWC2CHW are assumed here so that the list
    # mirrors the PyTorch preprocessing (Resize, CenterCrop, Normalize) for a 224x224 input
    trans = [
        C.Decode(),
        C.Resize(256),
        C.CenterCrop(224),
        C.Normalize(mean=mean, std=std),
        C.HWC2CHW()
    ]

    type_cast_op = C2.TypeCast(mstype.int32)  # Cast labels to int32

    # call data operations by map
    data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
    data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)

    # apply batch operations batch_size
    data_set = data_set.batch(batch_size, drop_remainder=do_train)

    return data_set
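
The mean and std in create_dataset are multiplied by 255 because the pixels are normalized while still in the 0-255 range, whereas PyTorch pipelines typically normalize after scaling pixels to [0, 1]. The following sketch, in plain Python with hypothetical helper names, checks that the two conventions give the same result.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_unit_range(pixel, channel):
    """Scale a 0-255 pixel to [0, 1] first, then normalize with the plain mean/std."""
    return (pixel / 255.0 - MEAN[channel]) / STD[channel]

def normalize_byte_range(pixel, channel):
    """Normalize the 0-255 pixel directly with mean/std multiplied by 255."""
    return (pixel - MEAN[channel] * 255.0) / (STD[channel] * 255.0)

for channel in range(3):
    for pixel in (0, 128, 255):
        a = normalize_unit_range(pixel, channel)
        b = normalize_byte_range(pixel, channel)
        assert abs(a - b) < 1e-9  # both conventions agree
        print(channel, pixel, round(a, 4))
```

Algebraically, (p / 255 - m) / s equals (p - 255 m) / (255 s), which is why scaling both constants by 255 keeps the migrated pipeline numerically aligned with the original one.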

  • Distributed training requires two additional parameters, num_shards and shard_id.
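
To picture what num_shards and shard_id do, the sketch below splits sample indices across devices and then batches them. It is plain Python for illustration only: the strided split and the helper names are assumptions, and the real MindSpore sampler may assign samples to shards differently.

```python
def shard_indices(num_samples, num_shards, shard_id):
    """Indices seen by one device under a simple strided split (illustrative only)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return list(range(shard_id, num_samples, num_shards))

def batches(indices, batch_size, drop_remainder=True):
    """Group indices into batches, optionally dropping the last incomplete one."""
    out = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
    if drop_remainder and out and len(out[-1]) < batch_size:
        out.pop()
    return out

for rank_id in range(2):
    idx = shard_indices(num_samples=10, num_shards=2, shard_id=rank_id)
    print(rank_id, batches(idx, batch_size=2))
```

Each device sees a disjoint part of the dataset, so one epoch over all devices still covers the data once (minus any dropped remainder).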

Subnet development: training subnet and loss subnet

  • Treating the different modules or sub-modules of the network as subnets and developing them separately allows each subnet to be developed in parallel without interfering with the others.

The ResNet50 network code can be divided into the following subnets:

  • conv1x1, conv3x3: convolutions with different kernel_size values.
  • BasicBlock: the smallest subnet of ResNet18 and ResNet34 in the ResNet series, composed of Conv, BN, ReLU and a residual connection.
  • BottleNeck: the smallest subnet of ResNet50, ResNet101 and ResNet152 in the ResNet series. It has one more Conv, BN and ReLU layer than BasicBlock, and the position of the down-sampling convolution is also different.
  • ResNet: a network that encapsulates the BasicBlock, BottleNeck and Layer structures. Different ResNet series networks can be constructed by passing in different parameters. This structure also uses some PyTorch custom initialization functions.

Redevelopment of conv3x3 and conv1x1

import mindspore.nn as nn

# 3x3 convolution
def _conv3x3(in_channel, out_channel, stride=1):
    return nn.Conv2d(in_channel, out_channel, kernel_size=3, stride=stride, padding=0, pad_mode='same')

# 1x1 convolution
def _conv1x1(in_channel, out_channel, stride=1):
    return nn.Conv2d(in_channel, out_channel, kernel_size=1, stride=stride, padding=0, pad_mode='same')

Redevelop BasicBlock and BottleNeck:

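# Residual subnet of ResNet50, ResNet101 and ResNet152
# (in_channel: input channels, out_channel: output channels, stride: convolution stride), e.g. ResidualBlock(3, 256, stride=2)
# _bn and _bn_last are BatchNorm helper functions that are not listed in this article
class ResidualBlock(nn.Cell):
    expansion = 4  # Ratio of the output channels to the middle (bottleneck) channels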

    def __init__(self, in_channel, out_channel, stride=1):
        super(ResidualBlock, self).__init__()
        self.stride = stride
        channel = out_channel // self.expansion
        self.conv1 = _conv1x1(in_channel, channel, stride=1)  # 1x1 convolution
        self.bn1 = _bn(channel)  # BatchNorm
        if self.stride != 1:  # Step size is not 1
            self.e2 = nn.SequentialCell([_conv3x3(channel, channel, stride=1), _bn(channel),
                                         nn.ReLU(), nn.MaxPool2d(kernel_size=2, stride=2, pad_mode='same')])
        else:  # Step size is 1
            self.conv2 = _conv3x3(channel, channel, stride=stride)
            self.bn2 = _bn(channel)

        self.conv3 = _conv1x1(channel, out_channel, stride=1)  # 1x1 convolution
        self.bn3 = _bn_last(out_channel)  # Last layer BatchNorm
        self.relu = nn.ReLU()  # Activation function

        self.down_sample = False  # Whether down-sampling is needed

        if stride != 1 or in_channel != out_channel:
            self.down_sample = True
        self.down_sample_layer = None

        if self.down_sample:  # Down-sampling branch for the shortcut
            self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride), _bn(out_channel)])

    def construct(self, x):
        identity = x

        out = self.conv1(x)  # 1x1 convolution
        out = self.bn1(out)  # BatchNorm
        out = self.relu(out)  # activation
        if self.stride != 1:  # Step size is not 1
            out = self.e2(out)
        else:  # Step size is 1
            out = self.conv2(out)
            out = self.bn2(out)
            out = self.relu(out)
        out = self.conv3(out)  # 1x1 convolution
        out = self.bn3(out)  # BatchNorm

        if self.down_sample:  # Dimension conversion is required for downsampling
            identity = self.down_sample_layer(identity)

        out = out + identity  # residual
        out = self.relu(out)  # activation

        return out

# Residual subnet of ResNet18 and ResNet34
# (in_channel: input channels, out_channel: output channels, stride: convolution stride), e.g. ResidualBlockBase(64, 128, stride=2)
class ResidualBlockBase(nn.Cell):
    def __init__(self, in_channel, out_channel, stride=1):
        super(ResidualBlockBase, self).__init__()
        self.conv1 = _conv3x3(in_channel, out_channel, stride=stride)  # 3x3 convolution
        self.bn1d = _bn(out_channel)  # BatchNorm
        self.conv2 = _conv3x3(out_channel, out_channel, stride=1)  # 3x3 convolution
        self.bn2d = _bn(out_channel)   # BatchNorm
        self.relu = nn.ReLU()  # activation

        self.down_sample = False  # With or without down sampling
        if stride != 1 or in_channel != out_channel:
            self.down_sample = True

        self.down_sample_layer = None  # Down sampling
        if self.down_sample:
            self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride),
                                                        _bn(out_channel)])

    # Structure from Figure 2 of the paper: the residual is added after two 3x3 convolutions
    def construct(self, x):
        identity = x  # Input

        out = self.conv1(x)  # 3x3 convolution with a configurable stride
        out = self.bn1d(out)  # BatchNorm
        out = self.relu(out)  # Activation

        out = self.conv2(out)  # 3x3 convolution with stride 1
        out = self.bn2d(out)  # BatchNorm

        if self.down_sample:  # Down-sampling: if the input and output dimensions differ, convert the input to the output dimension so the residual can be added
            identity = self.down_sample_layer(identity)

        out = out + identity  # residual
        out = self.relu(out)  # activation

        return out

Redevelop ResNet

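# ResNet50 as an example
import mindspore.ops as ops  # needed for ops.ReLU and ops.ReduceMean

class ResNet(nn.Cell):
    """
    ResNet architecture.

    Args:
        block (Cell): Subnet (residual block) type.
        layer_nums (list): Number of blocks in each subnet.
        in_channels (list): Input channels of each subnet.
        out_channels (list): Output channels of each subnet.
        strides (list): Stride of each subnet.
        num_classes (int): Number of classes.

    Examples:
        >>> ResNet(ResidualBlock,
        >>>        [3, 4, 6, 3],
        >>>        [64, 256, 512, 1024],
        >>>        [256, 512, 1024, 2048],
        >>>        [1, 2, 2, 2],
        >>>        10)
    """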

    def __init__(self, block, layer_nums, in_channels, out_channels, strides, num_classes):
        super(ResNet, self).__init__()

        if not len(layer_nums) == len(in_channels) == len(out_channels) == 4:  # Verify that the input is correct
            raise ValueError("the length of layer_num, in_channels, out_channels list must be 4!")
        # Layer 1: 7x7 convolution with stride 2
        self.conv1 = _conv7x7(3, 64, stride=2)
        self.bn1 = _bn(64)  # BatchNorm
        self.relu = ops.ReLU()  # Activation
        # 3x3 max pooling with stride 2 and 'same' padding
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="same")
        # Subnet 1: 3 blocks x 3 layers, 64 input channels, 256 output channels, stride 1
        self.layer1 = self._make_layer(block, layer_nums[0], in_channel=in_channels[0], out_channel=out_channels[0],
                                       stride=strides[0])
        # Subnet 2: 4 blocks x 3 layers, 256 input channels, 512 output channels, stride 2
        self.layer2 = self._make_layer(block, layer_nums[1], in_channel=in_channels[1], out_channel=out_channels[1],
                                       stride=strides[1])
        # Subnet 3: 6 blocks x 3 layers, 512 input channels, 1024 output channels, stride 2
        self.layer3 = self._make_layer(block, layer_nums[2], in_channel=in_channels[2], out_channel=out_channels[2],
                                       stride=strides[2])
        # Subnet 4: 3 blocks x 3 layers, 1024 input channels, 2048 output channels, stride 2
        self.layer4 = self._make_layer(block, layer_nums[3], in_channel=in_channels[3], out_channel=out_channels[3],
                                       stride=strides[3])
        # Output layer
        self.mean = ops.ReduceMean(keep_dims=True)  # Global average pooling
        self.flatten = nn.Flatten()  # Flatten
        self.end_point = _fc(out_channels[3], num_classes)  # Fully connected layer

    def _make_layer(self, block, layer_num, in_channel, out_channel, stride):
        """
        Build one subnet (stage) of ResNet.

        Args:
            block (Cell): Residual block type.
            layer_num (int): Number of blocks in the subnet.
            in_channel (int): Input channels of the subnet.
            out_channel (int): Output channels of the subnet.
            stride (int): Stride of the first block.

        Examples:
            >>> _make_layer(ResidualBlock, 3, 128, 256, 2)
        """
        layers = []  # Network layers

        resnet_block = block(in_channel, out_channel, stride=stride)
        layers.append(resnet_block)  # First block: changes the channel count and applies the given stride
        for _ in range(1, layer_num):  # Remaining blocks use stride 1
            resnet_block = block(out_channel, out_channel, stride=1)
            layers.append(resnet_block)
        return nn.SequentialCell(layers)  # Compose the subnet

    def construct(self, x):
        # Layer 1: 7x7 convolution with stride 2
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        # 3x3 max pooling with stride 2 and 'same' padding
        c1 = self.maxpool(x)

        # Layers 2-49: the 48 convolution layers of the four subnets
        c2 = self.layer1(c1)  # 2-10
        c3 = self.layer2(c2)  # 11-22
        c4 = self.layer3(c3)  # 23-40
        c5 = self.layer4(c4)  # 41-49
        # Output layer: average pooling + fully connected layer
        out = self.mean(c5, (2, 3))
        out = self.flatten(out)
        out = self.end_point(out)

        return out

Pass in ResNet50 layer information and construct the whole ResNet50 network:

# class_num: number of classes in the dataset, e.g. net = resnet50(10)
def resnet50(class_num=10):
    return ResNet(ResidualBlock,
                  [3, 4, 6, 3],
                  [64, 256, 512, 1024],
                  [256, 512, 1024, 2048],
                  [1, 2, 2, 2],
                  class_num)
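
The layer numbers in the comments of construct (2-10, 11-22, 23-40, 41-49) follow from the block counts. As a small worked example in plain Python (the function name is hypothetical, and the 1x1 convolutions on the shortcut branch are left out, which is the usual counting convention):

```python
def count_weighted_layers(layer_nums, convs_per_block):
    """1 stem convolution + convolutions in the residual blocks + 1 fully connected layer."""
    return 1 + sum(layer_nums) * convs_per_block + 1

print(count_weighted_layers([3, 4, 6, 3], convs_per_block=3))  # bottleneck blocks
print(count_weighted_layers([3, 4, 6, 3], convs_per_block=2))  # basic blocks
print(count_weighted_layers([2, 2, 2, 2], convs_per_block=2))  # basic blocks
```

The same block counts with two convolutions per block give the 34-layer variant, which is why one ResNet class can build the whole series by changing its arguments.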

Other modules

  • Backward construction, gradient clipping, optimizer, learning rate generation, etc.

ResNet50 training mainly involves the following items:

  • The SGD + Momentum optimizer is used
  • WeightDecay is used (but not for the gamma and bias of BatchNorm)
  • A cosine LR schedule is used
  • Label Smoothing is used

Implement the SGD optimizer with Momentum. WeightDecay is applied to all weights except the gamma and bias of BN:

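# SGD optimizer with Momentum
decayed_params = []
no_decayed_params = []
for param in net.trainable_params():
    if 'beta' not in param.name and 'gamma' not in param.name and 'bias' not in param.name:
        decayed_params.append(param)
    else:
        no_decayed_params.append(param)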

group_params = [{'params': decayed_params, 'weight_decay': weight_decay},
                {'params': no_decayed_params},
                {'order_params': net.trainable_params()}]
opt = Momentum(group_params, lr, momentum)
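
The grouping rule is a substring test on parameter names. The sketch below runs the same test on a few made-up names in plain Python, so it can be tried without a network; the function and the parameter names are hypothetical examples.

```python
def split_by_weight_decay(param_names, skip_keywords=("beta", "gamma", "bias")):
    """Split names into (decayed, no_decayed) using a substring match on each name."""
    decayed, no_decayed = [], []
    for name in param_names:
        if any(keyword in name for keyword in skip_keywords):
            no_decayed.append(name)
        else:
            decayed.append(name)
    return decayed, no_decayed

names = ["conv1.weight", "bn1.gamma", "bn1.beta", "end_point.weight", "end_point.bias"]
decayed, no_decayed = split_by_weight_decay(names)
print(decayed)
print(no_decayed)
```

Since the match is on substrings, any parameter whose name happens to contain one of the keywords is also excluded from weight decay, so it is worth printing both groups once to confirm the split.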

Define the Loss function and implement Label Smoothing:

import mindspore.nn as nn
from mindspore import Tensor
from mindspore import dtype as mstype
from mindspore.nn import LossBase
import mindspore.ops as ops

# define cross entropy loss
class CrossEntropySmooth(LossBase):
    def __init__(self, sparse=True, reduction='mean', smooth_factor=0., num_classes=1000):
        super(CrossEntropySmooth, self).__init__()
        self.onehot = ops.OneHot()
        self.sparse = sparse
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(1.0 * smooth_factor / (num_classes - 1), mstype.float32)
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction=reduction)

    def construct(self, logit, label):
        if self.sparse:
            label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value)
        loss = self.ce(logit, label)
        return loss
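
To see what on_value and off_value do, here is a small worked example in plain Python that builds the smoothed target for one label using the same two formulas as CrossEntropySmooth. The helper name and the numbers are chosen only for illustration.

```python
def smooth_one_hot(label, num_classes, smooth_factor):
    """Smoothed target: 1 - smooth_factor on the true class, the rest shared equally."""
    on_value = 1.0 - smooth_factor
    off_value = smooth_factor / (num_classes - 1)
    return [on_value if i == label else off_value for i in range(num_classes)]

target = smooth_one_hot(label=2, num_classes=5, smooth_factor=0.1)
print(target)
print(sum(target))
```

The smoothed target still sums to 1 (up to floating point rounding), so it remains a valid probability distribution for the cross entropy loss.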

Run through the whole process

Standalone training

  • Refactor the above code as follows:
├── scripts
│   ├──    # Start Ascend distributed training (8 devices)
│   ├──                # Start Ascend evaluation
│   └──    # Start Ascend standalone training (single device)
├── src
│   ├── config.py                # Configuration file
│   ├── cross_entropy_smooth.py  # Definition of the loss
│   ├── dataset.py               # Data preprocessing
│   └── resnet.py                # Network structure
├──                        # Inference process
└── train.py                     # Training process

train.py is defined as follows:

import os
import argparse
import ast
from mindspore import context, set_seed, Model
from mindspore.nn import Momentum
from mindspore.context import ParallelMode
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor
from mindspore.communication import init
from mindspore.common import initializer
import mindspore.nn as nn

from src.config import config
from src.dataset import create_dataset
from src.resnet import resnet50
from src.cross_entropy_smooth import CrossEntropySmooth

# Set seed
set_seed(1)  # the value 1 is an assumption; any fixed integer makes runs reproducible

# Load parameters
parser = argparse.ArgumentParser(description='Image classification')
parser.add_argument('--run_distribute', type=ast.literal_eval, default=False, help='Run distribute')  # Distributed training
parser.add_argument('--device_num', type=int, default=1, help='Device num.')  # Number of devices

parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')  # Dataset storage path
parser.add_argument('--device_target', type=str, default='GPU', choices=("Ascend", "GPU", "CPU"),
                    help='Device target, support Ascend,GPU,CPU')  # Target device
args_opt = parser.parse_args()

if __name__ == '__main__':
    # 1. Parse parameters and set up the basic environment
    # Device information for distributed training is read from environment variables
    device_id = int(os.getenv('DEVICE_ID', '0'))  # Default device
    rank_size = int(os.getenv('RANK_SIZE', '1'))
    rank_id = int(os.getenv('RANK_ID', '0'))
    # Initialize the training context
    # Dynamic graph mode: PYNATIVE_MODE; static graph mode: GRAPH_MODE; platforms: Ascend, GPU, CPU
    context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=device_id)
    # Multi-device distributed training
    if rank_size > 1:
        init()  # initialize the communication backend (imported above but otherwise unused)
        context.set_auto_parallel_context(device_num=rank_size, parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True)
        context.set_auto_parallel_context(all_reduce_fusion_config=[85, 160])

    # 2 define datasets
    # data set
    dataset = create_dataset(args_opt.dataset_path, config.batch_size, rank_size, rank_id)
    step_size = dataset.get_dataset_size()

    # 3 define network structure
    # Define network
    net = resnet50(class_num=config.class_num)
    # Weight initialization
    for _, cell in net.cells_and_names():
        if isinstance(cell, nn.Conv2d):  # Convolution weights: XavierUniform
            cell.weight.set_data(initializer.initializer(initializer.XavierUniform(), cell.weight.shape,
                                                         cell.weight.dtype))
        if isinstance(cell, nn.Dense):  # Dense weights: TruncatedNormal
            cell.weight.set_data(initializer.initializer(initializer.TruncatedNormal(), cell.weight.shape,
                                                         cell.weight.dtype))

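    # 4. Define the loss function and optimizer
    # Cosine learning rate decay
    # config.lr (the maximum learning rate) is an assumed field name
    lr = nn.dynamic_lr.cosine_decay_lr(config.lr_end, config.lr, config.epoch_size * step_size, step_size, config.warmup)
    # SGD optimizer with Momentum; weight decay is not applied to beta, gamma and bias
    decayed_params = []
    no_decayed_params = []
    for param in net.trainable_params():
        if 'beta' not in param.name and 'gamma' not in param.name and 'bias' not in param.name:
            decayed_params.append(param)
        else:
            no_decayed_params.append(param)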

    group_params = [{'params': decayed_params, 'weight_decay': config.weight_decay},
                    {'params': no_decayed_params},
                    {'order_params': net.trainable_params()}]
    opt = Momentum(group_params, lr, config.momentum)

    # Cross entropy loss with label smoothing
    loss = CrossEntropySmooth(sparse=True, reduction="mean", smooth_factor=config.label_smooth_factor,
                              num_classes=config.class_num)

    # 5. Define the model and the callbacks
    # Model definition: network, loss, optimizer and metrics
    model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})  # metrics={'top_1_accuracy', 'top_5_accuracy'}
    # Callbacks: report step time and loss, save checkpoints, etc.
    time_cb = TimeMonitor(data_size=step_size)
    loss_cb = LossMonitor()
    cb = [time_cb, loss_cb]
    if config.save_checkpoint:  # Save checkpoints when enabled in the configuration
        # config_ck = CheckpointConfig(save_checkpoint_steps=config.save_checkpoint_epochs * step_size,
        config_ck = CheckpointConfig(save_checkpoint_steps=5, keep_checkpoint_max=config.keep_checkpoint_max)
        ckpt_cb = ModelCheckpoint(prefix="resnet", directory=config.save_checkpoint_path, config=config_ck)
        cb += [ckpt_cb]

    model.train(config.epoch_size, dataset, callbacks=cb, sink_size=step_size, dataset_sink_mode=False)
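
The learning rate list passed to the optimizer has one value per training step. The sketch below reimplements a cosine schedule in plain Python to show the shape of that list. It follows the formula documented for cosine_decay_lr as far as I understand it (the rate changes once per epoch and stops decaying after decay_epoch), so treat it as an approximation rather than the library code.

```python
import math

def cosine_decay_lr(min_lr, max_lr, total_step, step_per_epoch, decay_epoch):
    """Per-step learning rates that follow a cosine curve, updated once per epoch."""
    lrs = []
    for step in range(total_step):
        epoch = min(step // step_per_epoch, decay_epoch)
        cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / decay_epoch))
        lrs.append(min_lr + (max_lr - min_lr) * cosine)
    return lrs

# Toy setting: 3 epochs of 2 steps each, decaying from 0.1 towards 0.0.
lrs = cosine_decay_lr(min_lr=0.0, max_lr=0.1, total_step=6, step_per_epoch=2, decay_epoch=3)
for step, value in enumerate(lrs):
    print(step, round(value, 4))
```

All steps within one epoch share the same value, and the list length must equal the total number of training steps (epoch_size * step_size in train.py).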

Run training

source activate  py37_ms16

pip install easydict

python train.py --dataset_path=./data/imagenet_original/train/

Error summary

Out of memory: device (id:0) memory isn't enough and allocate failed, kernel name:

  • Resolved by reducing batch_size.
  • Causes of insufficient memory: batch_size too large, model too large, data shape too large, etc.
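
As a rough illustration of how batch_size and data shape scale memory use, the sketch below computes the size of the input tensor alone, assuming float32 values and a 3 x 224 x 224 image. The function name is hypothetical, and the real footprint is much larger because activations, weights, gradients and optimizer state are not counted.

```python
def input_batch_bytes(batch_size, channels=3, height=224, width=224, bytes_per_value=4):
    """Bytes needed to hold one input batch (float32 by default)."""
    return batch_size * channels * height * width * bytes_per_value

for batch_size in (256, 128, 64, 32):
    mib = input_batch_bytes(batch_size) / (1024 ** 2)
    print(f"batch_size={batch_size}: input tensor ~ {mib:.1f} MiB")
```

The size grows linearly with batch_size, and per-layer activations scale the same way, which is why halving batch_size is usually the first thing to try.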


Tags: Deep Learning

Posted by samadams83 on Wed, 01 Jun 2022 17:11:34 +0530