Network migration, debugging, and tuning for MindSpore
- ResNet50 as an example
Migration process
- Migration objectives: network implementation, dataset, convergence accuracy, training performance
- Reproduction criteria: not only the training stage but also the inference stage must be reproduced; small differences within the normal range of fluctuation are acceptable.
- Reproduction steps: single-step reproduction plus whole-network integration. First reproduce the result of a single step, i.e. the state of the network after only the first step has executed (covering data preprocessing, weight initialization, forward computation, loss computation, backward gradient computation, and optimizer update), then iterate to reproduce the results of the whole network over many steps.
Preparation
- Install MindSpore, Python, and the other required environments
- Download the reference source code: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
- ResNet50 is a classic deep neural network in CV and the mainstream implementation of the ResNet series (ResNet18, ResNet34, ResNet50, ResNet101, ResNet152). The dataset used by ResNet50 is ImageNet2012.
Network analysis
- MindSpore supports both dynamic graph (PyNative) mode and static graph (Graph) mode. Dynamic graph mode is flexible and easy to debug, so it is mainly used for network debugging; static graph mode has better performance and is mainly used for network training. When analyzing missing operators and functions, both modes should be analyzed separately.
- If missing operators or functions are found, first consider composing the missing operator or function from the operators and functions that are currently available.
- ResNet series network structure
- Operator analysis: refer to the operator mapping reference.
Supported operators (PyTorch -> MindSpore): nn.Conv2d -> nn.Conv2d, nn.BatchNorm2d -> nn.BatchNorm2d, nn.ReLU -> nn.ReLU, nn.MaxPool2d -> nn.MaxPool2d, nn.Linear -> nn.Dense, torch.flatten -> nn.Flatten
Missing operator: nn.AdaptiveAvgPool2d
Alternative to the missing operator: in the ResNet50 network, the input image shape is fixed at (N, 3, 224, 224), where N is the batch size, 3 is the number of channels, and 224 and 224 are the height and width of the image. The operators that change the feature map size in the network are Conv2d and MaxPool2d, and their effect on the shape is fixed, so the input and output shapes of nn.AdaptiveAvgPool2d can be determined in advance. Once those shapes are known, nn.AdaptiveAvgPool2d can be replaced with nn.AvgPool2d or ops.ReduceMean, so the missing operator is replaceable and does not block network training (a replacement sketch follows the function comparison table below).
- Comparison of other functions:
PyTorch feature | MindSpore equivalent |
---|---|
nn.init.kaiming_normal_ | initializer(init='HeNormal') |
nn.init.constant_ | initializer(init='Constant') |
nn.Sequential | nn.SequentialCell |
nn.Module | nn.Cell |
torch.distributed | context.set_auto_parallel_context |
torch.optim.SGD | nn.SGD or nn.Momentum |
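As a concrete illustration of the operator replacement described above, here is a minimal sketch of two possible substitutes for nn.AdaptiveAvgPool2d((1, 1)). It assumes the fixed (N, 3, 224, 224) input, for which the feature map reaching the global pooling layer of ResNet50 is (N, 2048, 7, 7); the variable names are only illustrative:

```python
import numpy as np
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor

# With a fixed 224x224 input, the feature map entering the global pooling layer
# of ResNet50 has shape (N, 2048, 7, 7), so AdaptiveAvgPool2d((1, 1)) can be replaced by:
avg_pool = nn.AvgPool2d(kernel_size=7, stride=1)   # fixed-size average pooling
reduce_mean = ops.ReduceMean(keep_dims=True)       # or a mean over the H/W axes

x = Tensor(np.random.randn(1, 2048, 7, 7).astype(np.float32))
print(avg_pool(x).shape)             # (1, 2048, 1, 1)
print(reduce_mean(x, (2, 3)).shape)  # (1, 2048, 1, 1)
```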
Network script development
- CIFAR-10 / CIFAR-100 dataset download: http://www.cs.toronto.edu/~kriz/cifar.html
- CIFAR-10: 10 classes, 60000 32*32 color images in binary format; processed in dataset.py.
- Training set: 50000 images
- Test set: 10000 images
- ImageNet2012 download: https://image-net.org/
- ImageNet2012: 1000 classes of color images in JPEG format, used at 224*224 resolution; processed in dataset.py.
- Training set: 1281167 images in total
- Test set: 50000 images in total
Dataset processing
- Data preprocessing using MindData mainly includes the following steps:
- Pass in the data path and read the data file.
- Parse the data.
- Data processing (such as common data splitting, shuffling, data augmentation, etc.).
- Data distribution (data is batched by batch_size; distributed training also involves multi-machine distribution).
- The ResNet50 network uses the ImageNet2012 dataset; its PyTorch preprocessing is:
```python
# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms

input_image = Image.open(filename)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model
```
- The main operations are Resize, CenterCrop and Normalize
Data processing developed with MindData:
""" create train or eval dataset. """ from mindspore import dtype as mstype import mindspore.dataset as ds import mindspore.dataset.vision.c_transforms as C import mindspore.dataset.transforms.c_transforms as C2 # Create a dataset (path, batch\u size, rank\u size:device number, rank\u id:device serial number in all machines, training mode) def create_dataset(dataset_path, batch_size=32, rank_size=1, rank_id=0, do_train=True): # num_paralel_workers: parallel degree of data process # num_shards: total number devices for distribute training, which equals number shard of data # Number of devices # shard_id: the sequence of current device in all distribute training devices, # Serial number of device in all machines # which equals the data shard sequence for current device data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=do_train, num_shards=rank_size, shard_id=rank_id) mean = [0.485 * 255, 0.456 * 255, 0.406 * 255] std = [0.229 * 255, 0.224 * 255, 0.225 * 255] # define map operations trans = [ C.Decode(), C.Resize(256), C.CenterCrop(224), C.Normalize(mean=mean, std=std), C.HWC2CHW() ] type_cast_op = C2.TypeCast(mstype.int32) # Precision conversion # call data operations by map data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8) data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8) # apply batch operations batch_size data_set = data_set.batch(batch_size, drop_remainder=do_train) return data_set
- Distributed training additionally requires the two parameters num_shards and shard_id.
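For example, in a data-parallel job the shard information can be obtained from the communication interface rather than from environment variables (the train.py script later in this guide reads RANK_SIZE/RANK_ID from the environment instead). A minimal sketch, with a hypothetical dataset path, reusing create_dataset from the block above:

```python
from mindspore.communication import init, get_rank, get_group_size

init()                        # initialize the communication service for distributed training
rank_id = get_rank()          # index of the current device
rank_size = get_group_size()  # total number of devices

# Each device reads only its own shard of the training data
dataset = create_dataset("./data/imagenet_original/train/", batch_size=32,
                         rank_size=rank_size, rank_id=rank_id, do_train=True)
```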
Subnet development: training subnet and loss subnet
- Treating different modules or submodules of the network as separate subnets and developing them individually allows each subnet to be developed and verified in parallel without interference.
The ResNet50 network code can be divided into the following subnets:
- conv1x1, conv3x3: convolutions with different kernel sizes.
- BasicBlock: the minimal subnet of ResNet18 and ResNet34 in the ResNet series, composed of Conv, BN, ReLU, and a residual connection.
- BottleNeck: the minimal subnet of ResNet50, ResNet101, and ResNet152 in the ResNet series; it has one more Conv-BN-ReLU stage than BasicBlock, and the position of the downsampling convolution is also different.
- ResNet: a network that encapsulates the BasicBlock, BottleNeck, and Layer structures. Different ResNet networks can be constructed by passing in different parameters. This structure also uses some PyTorch-specific custom initialization functions.
Redevelopment of conv3x3 and conv1x1
```python
import mindspore.nn as nn


# 3x3 convolution
def _conv3x3(in_channel, out_channel, stride=1):
    return nn.Conv2d(in_channel, out_channel, kernel_size=3, stride=stride,
                     padding=0, pad_mode='same')


# 1x1 convolution
def _conv1x1(in_channel, out_channel, stride=1):
    return nn.Conv2d(in_channel, out_channel, kernel_size=1, stride=stride,
                     padding=0, pad_mode='same')
```
Redevelop BasicBlock and BottleNeck:
```python
# Residual subnet of ResNet50 / ResNet101 / ResNet152
# (input channel, output channel, stride: convolution stride), e.g. ResidualBlock(3, 256, stride=2)
class ResidualBlock(nn.Cell):
    expansion = 4

    def __init__(self, in_channel, out_channel, stride=1):
        super(ResidualBlock, self).__init__()
        self.stride = stride
        channel = out_channel // self.expansion
        self.conv1 = _conv1x1(in_channel, channel, stride=1)  # 1x1 convolution
        self.bn1 = _bn(channel)                               # BatchNorm
        if self.stride != 1:  # stride is not 1
            self.e2 = nn.SequentialCell([_conv3x3(channel, channel, stride=1), _bn(channel),
                                         nn.ReLU(), nn.MaxPool2d(kernel_size=2, stride=2, pad_mode='same')])
        else:                 # stride is 1
            self.conv2 = _conv3x3(channel, channel, stride=stride)
            self.bn2 = _bn(channel)
        self.conv3 = _conv1x1(channel, out_channel, stride=1)  # 1x1 convolution
        self.bn3 = _bn_last(out_channel)                       # last-layer BatchNorm
        self.relu = nn.ReLU()                                  # activation
        self.down_sample = False                               # downsampling flag
        if stride != 1 or in_channel != out_channel:           # downsampling needed
            self.down_sample = True
        self.down_sample_layer = None
        if self.down_sample:                                   # downsampling layer
            self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride),
                                                        _bn(out_channel)])

    def construct(self, x):
        identity = x
        out = self.conv1(x)    # 1x1 convolution
        out = self.bn1(out)    # BatchNorm
        out = self.relu(out)   # activation
        if self.stride != 1:   # stride is not 1
            out = self.e2(out)
        else:                  # stride is 1
            out = self.conv2(out)
            out = self.bn2(out)
            out = self.relu(out)
        out = self.conv3(out)  # 1x1 convolution
        out = self.bn3(out)    # BatchNorm
        if self.down_sample:   # downsampling needs a dimension conversion of the identity
            identity = self.down_sample_layer(identity)
        out = out + identity   # residual addition
        out = self.relu(out)   # activation
        return out


# Residual subnet of ResNet18 and ResNet34
# (input channel, output channel, stride: convolution stride), e.g. ResidualBlockBase(3, 256, stride=2)
class ResidualBlockBase(nn.Cell):

    def __init__(self, in_channel, out_channel, stride=1):
        super(ResidualBlockBase, self).__init__()
        self.conv1 = _conv3x3(in_channel, out_channel, stride=stride)  # 3x3 convolution
        self.bn1d = _bn(out_channel)                                   # BatchNorm
        self.conv2 = _conv3x3(out_channel, out_channel, stride=1)      # 3x3 convolution
        self.bn2d = _bn(out_channel)                                   # BatchNorm
        self.relu = nn.ReLU()                                          # activation
        self.down_sample = False                                       # whether downsampling is needed
        if stride != 1 or in_channel != out_channel:
            self.down_sample = True
        self.down_sample_layer = None                                  # downsampling layer
        if self.down_sample:
            self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride),
                                                        _bn(out_channel)])

    # Residual added after two 3x3 subnets, as in Figure 2 of the paper
    def construct(self, x):
        identity = x            # input
        out = self.conv1(x)     # 3x3 convolution, custom stride
        out = self.bn1d(out)    # BatchNorm
        out = self.relu(out)    # activation
        out = self.conv2(out)   # 3x3 convolution, stride 1
        out = self.bn2d(out)    # BatchNorm
        if self.down_sample:    # downsampling: convert the identity when input/output dimensions differ
            identity = self.down_sample_layer(identity)
        out = out + identity    # residual addition
        out = self.relu(out)    # activation
        return out
```
Redevelop ResNet
```python
# Take ResNet50 as an example
class ResNet(nn.Cell):
    """
    block (Cell): residual block
    layer_nums (list): number of blocks in each layer
    in_channels (list): input channels of each layer
    out_channels (list): output channels of each layer
    strides (list): stride of each layer

    Examples:
        >>> ResNet(ResidualBlock,
        >>>        [3, 4, 6, 3],
        >>>        [64, 256, 512, 1024],
        >>>        [256, 512, 1024, 2048],
        >>>        [1, 2, 2, 2],
        >>>        10)
    """

    def __init__(self, block, layer_nums, in_channels, out_channels, strides, num_classes):
        super(ResNet, self).__init__()

        if not len(layer_nums) == len(in_channels) == len(out_channels) == 4:  # validate the inputs
            raise ValueError("the length of layer_num, in_channels, out_channels list must be 4!")
        # Layer 1: 7x7 convolution, stride 2, followed by pooling
        self.conv1 = _conv7x7(3, 64, stride=2)
        self.bn1 = _bn(64)      # BatchNorm
        self.relu = ops.ReLU()  # activation
        # Max pooling, 3x3 kernel, stride 2, same padding
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="same")
        # 1st layer: 3*3 blocks, input 64, output 256, stride 1
        self.layer1 = self._make_layer(block, layer_nums[0], in_channel=in_channels[0],
                                       out_channel=out_channels[0], stride=strides[0])
        # 2nd layer: 4*3 blocks, input 256, output 512, stride 2
        self.layer2 = self._make_layer(block, layer_nums[1], in_channel=in_channels[1],
                                       out_channel=out_channels[1], stride=strides[1])
        # 3rd layer: 6*3 blocks, input 512, output 1024, stride 2
        self.layer3 = self._make_layer(block, layer_nums[2], in_channel=in_channels[2],
                                       out_channel=out_channels[2], stride=strides[2])
        # 4th layer: 3*3 blocks, input 1024, output 2048, stride 2
        self.layer4 = self._make_layer(block, layer_nums[3], in_channel=in_channels[3],
                                       out_channel=out_channels[3], stride=strides[3])
        # Output layer
        self.mean = ops.ReduceMean(keep_dims=True)           # average pooling
        self.flatten = nn.Flatten()                          # flatten
        self.end_point = _fc(out_channels[3], num_classes)   # fully connected layer

    def _make_layer(self, block, layer_num, in_channel, out_channel, stride):
        """
        Args:
            block (Cell): residual block
            layer_num (int): number of blocks in the layer
            in_channel (int): input channels of the layer
            out_channel (int): output channels of the layer
            stride (int): stride of the first convolution.

        Examples:
            >>> _make_layer(ResidualBlock, 3, 128, 256, 2)
        """
        layers = []  # network layers

        resnet_block = block(in_channel, out_channel, stride=stride)  # first residual block
        layers.append(resnet_block)  # the first block uses the given stride and changes the channel dimension
        for _ in range(1, layer_num):  # the remaining blocks use stride 1
            resnet_block = block(out_channel, out_channel, stride=1)
            layers.append(resnet_block)
        return nn.SequentialCell(layers)  # compose the sublayer

    def construct(self, x):
        # Layer 1: 7x7 convolution, stride 2
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        # Max pooling, 3x3 kernel, stride 2, same padding
        c1 = self.maxpool(x)
        # The 48 residual layers in the middle (layers 2-49 of the 50-layer network)
        c2 = self.layer1(c1)  # layers 2-10
        c3 = self.layer2(c2)  # layers 11-22
        c4 = self.layer3(c3)  # layers 23-40
        c5 = self.layer4(c4)  # layers 41-49
        # Output layer: average pooling + fully connected layer
        out = self.mean(c5, (2, 3))
        out = self.flatten(out)
        out = self.end_point(out)
        return out
```
Pass in ResNet50 layer information and construct the whole ResNet50 network:
```python
# class_num: number of classes in the dataset, e.g. net = resnet50(10)
def resnet50(class_num=10):
    return ResNet(ResidualBlock,
                  [3, 4, 6, 3],
                  [64, 256, 512, 1024],
                  [256, 512, 1024, 2048],
                  [1, 2, 2, 2],
                  class_num)
```
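A quick sanity check of the assembled network can be done with a random input. This is a minimal sketch, assuming the remaining helpers from the complete resnet.py (_conv7x7, _bn, _bn_last, _fc) are also defined; the class count of 10 is just an example:

```python
import numpy as np
from mindspore import Tensor, context

context.set_context(mode=context.PYNATIVE_MODE)  # dynamic graph mode, easier to debug

net = resnet50(class_num=10)                     # 10 classes, e.g. for CIFAR-10
dummy_input = Tensor(np.random.randn(1, 3, 224, 224).astype(np.float32))
logits = net(dummy_input)
print(logits.shape)                              # expected: (1, 10)
```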
Other modules
- Backward construction, gradient clipping, optimizer, learning rate generation, etc.
ResNet50 training mainly involves the following items:
- SGD optimizer with Momentum
- WeightDecay (not applied to the gamma and bias of BatchNorm)
- Cosine LR schedule
- Label Smoothing
Implement the SGD optimizer with Momentum; apply WeightDecay to all weights except the gamma and bias of BN:
```python
# SGD optimizer with Momentum
decayed_params = []
no_decayed_params = []
for param in net.trainable_params():
    if 'beta' not in param.name and 'gamma' not in param.name and 'bias' not in param.name:
        decayed_params.append(param)
    else:
        no_decayed_params.append(param)

group_params = [{'params': decayed_params, 'weight_decay': weight_decay},
                {'params': no_decayed_params},
                {'order_params': net.trainable_params()}]
opt = Momentum(group_params, lr, momentum)
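The cosine LR schedule listed above can be generated as a per-step list and passed to the optimizer in place of a scalar lr. A minimal sketch; the step count, epoch count, and learning-rate bounds below are illustrative, not the values from config.py:

```python
import mindspore.nn as nn

step_size = 1251   # steps per epoch (illustrative, e.g. 1281167 images / batch size 1024)
epoch_size = 90    # total number of training epochs (illustrative)

# cosine_decay_lr(min_lr, max_lr, total_step, step_per_epoch, decay_epoch)
# returns one learning rate per training step
lr = nn.dynamic_lr.cosine_decay_lr(1e-5, 0.1, epoch_size * step_size, step_size, epoch_size)

print(len(lr))         # epoch_size * step_size entries
print(lr[0], lr[-1])   # starts at max_lr and decays towards min_lr
# The resulting list can be passed to Momentum in place of a scalar lr:
# opt = Momentum(group_params, lr, momentum=0.9)
```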
Define the Loss function and implement Label Smoothing:
```python
import mindspore.nn as nn
from mindspore import Tensor
from mindspore import dtype as mstype
from mindspore.nn import LossBase
import mindspore.ops as ops


# define cross entropy loss
class CrossEntropySmooth(LossBase):
    """CrossEntropy with label smoothing"""
    def __init__(self, sparse=True, reduction='mean', smooth_factor=0., num_classes=1000):
        super(CrossEntropySmooth, self).__init__()
        self.onehot = ops.OneHot()
        self.sparse = sparse
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(1.0 * smooth_factor / (num_classes - 1), mstype.float32)
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction=reduction)

    def construct(self, logit, label):
        if self.sparse:
            label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value)
        loss = self.ce(logit, label)
        return loss
```
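A quick check that the loss behaves as expected on dummy logits and sparse labels; the values below are illustrative:

```python
import numpy as np
from mindspore import Tensor
from mindspore import dtype as mstype

loss_fn = CrossEntropySmooth(sparse=True, reduction='mean', smooth_factor=0.1, num_classes=10)
logits = Tensor(np.random.randn(4, 10).astype(np.float32))  # batch of 4, 10 classes
labels = Tensor(np.array([1, 3, 5, 7]), mstype.int32)       # sparse integer labels
print(loss_fn(logits, labels))                              # scalar loss value
```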
Running the end-to-end process
Standalone training
- Organize the code above into the following directory structure:
```
.
├── scripts
│   ├── run_distribute_train.sh    # launch Ascend distributed training (8 devices)
│   ├── run_eval.sh                # launch Ascend evaluation
│   └── run_standalone_train.sh    # launch Ascend standalone training (single device)
├── src
│   ├── config.py                  # configuration file
│   ├── cross_entropy_smooth.py    # loss definition
│   ├── dataset.py                 # data preprocessing
│   └── resnet.py                  # network structure
├── eval.py                        # inference / evaluation
└── train.py                       # training
```
train.py is defined as follows:
```python
import os
import argparse
import ast

from mindspore import context, set_seed, Model
from mindspore.nn import Momentum
from mindspore.context import ParallelMode
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor
from mindspore.communication import init
from mindspore.common import initializer
import mindspore.nn as nn

from src.config import config
from src.dataset import create_dataset
from src.resnet import resnet50
from src.cross_entropy_smooth import CrossEntropySmooth

# Set the random seed
set_seed(1)

# Parse arguments
parser = argparse.ArgumentParser(description='Image classification')
parser.add_argument('--run_distribute', type=ast.literal_eval, default=False,
                    help='Run distribute')   # distributed training switch
parser.add_argument('--device_num', type=int, default=1, help='Device num.')          # number of devices
parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')    # dataset storage path
parser.add_argument('--device_target', type=str, default='GPU', choices=("Ascend", "GPU", "CPU"),
                    help='Device target, support Ascend, GPU, CPU')                   # target platform
args_opt = parser.parse_args()

if __name__ == '__main__':
    # 1. Parse arguments and set up the basic environment
    # Device information for distributed training is read from environment variables
    device_id = int(os.getenv('DEVICE_ID', '0'))
    rank_size = int(os.getenv('RANK_SIZE', '1'))
    rank_id = int(os.getenv('RANK_ID', '0'))

    # init context: PYNATIVE_MODE for dynamic graph, GRAPH_MODE for static graph; platform: Ascend, GPU, CPU
    context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=device_id)
    if rank_size > 1:
        # Multi-device distributed training
        context.set_auto_parallel_context(device_num=rank_size, parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        context.set_auto_parallel_context(all_reduce_fusion_config=[85, 160])
        init()

    # 2. Define the dataset
    dataset = create_dataset(args_opt.dataset_path, config.batch_size, rank_size, rank_id)
    step_size = dataset.get_dataset_size()

    # 3. Define the network structure
    net = resnet50(class_num=config.class_num)
    # Weight initialization
    for _, cell in net.cells_and_names():
        if isinstance(cell, nn.Conv2d):  # convolution: XavierUniform
            cell.weight.set_data(initializer.initializer(initializer.XavierUniform(),
                                                         cell.weight.shape,
                                                         cell.weight.dtype))
        if isinstance(cell, nn.Dense):   # dense layer: TruncatedNormal
            cell.weight.set_data(initializer.initializer(initializer.TruncatedNormal(),
                                                         cell.weight.shape,
                                                         cell.weight.dtype))

    # 4. Define the loss function and optimizer
    # Cosine-decay learning rate with warmup
    lr = nn.dynamic_lr.cosine_decay_lr(config.lr_end, config.lr, config.epoch_size * step_size,
                                       step_size, config.warmup)
    # SGD optimizer with Momentum; weight decay is not applied to BN gamma/beta or bias
    decayed_params = []
    no_decayed_params = []
    for param in net.trainable_params():
        if 'beta' not in param.name and 'gamma' not in param.name and 'bias' not in param.name:
            decayed_params.append(param)
        else:
            no_decayed_params.append(param)

    group_params = [{'params': decayed_params, 'weight_decay': config.weight_decay},
                    {'params': no_decayed_params},
                    {'order_params': net.trainable_params()}]
    opt = Momentum(group_params, lr, config.momentum)
    # Cross-entropy loss with label smoothing
    loss = CrossEntropySmooth(sparse=True, reduction="mean", smooth_factor=config.label_smooth_factor,
                              num_classes=config.class_num)

    # 5. Define the model and callbacks
    model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
    # metrics={'top_1_accuracy', 'top_5_accuracy'}

    # Callbacks: timing, loss monitoring, checkpoint saving
    time_cb = TimeMonitor(data_size=step_size)
    loss_cb = LossMonitor()
    cb = [time_cb, loss_cb]
    if config.save_checkpoint:
        # config_ck = CheckpointConfig(save_checkpoint_steps=config.save_checkpoint_epochs * step_size,
        config_ck = CheckpointConfig(save_checkpoint_steps=5,
                                     keep_checkpoint_max=config.keep_checkpoint_max)
        ckpt_cb = ModelCheckpoint(prefix="resnet", directory=config.save_checkpoint_path, config=config_ck)
        cb += [ckpt_cb]

    model.train(config.epoch_size, dataset, callbacks=cb, sink_size=step_size, dataset_sink_mode=False)
```
Launch training:
```bash
source activate py37_ms16
pip install easydict
python train.py --dataset_path=./data/imagenet_original/train/
```
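The directory layout above also lists eval.py for the inference stage. A minimal sketch of what it might contain, assuming the same src modules as train.py and a hypothetical --checkpoint_path argument:

```python
# eval.py (sketch): evaluate a trained checkpoint on the validation set
import argparse
from mindspore import context, Model, load_checkpoint, load_param_into_net

from src.config import config
from src.dataset import create_dataset
from src.resnet import resnet50
from src.cross_entropy_smooth import CrossEntropySmooth

parser = argparse.ArgumentParser(description='Image classification evaluation')
parser.add_argument('--dataset_path', type=str, required=True, help='Eval dataset path')
parser.add_argument('--checkpoint_path', type=str, required=True, help='Checkpoint file path')
parser.add_argument('--device_target', type=str, default='GPU', choices=("Ascend", "GPU", "CPU"))
args_opt = parser.parse_args()

if __name__ == '__main__':
    context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target)

    # Evaluation dataset: no shuffling, keep the remainder batch
    dataset = create_dataset(args_opt.dataset_path, config.batch_size, do_train=False)

    net = resnet50(class_num=config.class_num)
    param_dict = load_checkpoint(args_opt.checkpoint_path)  # load trained weights
    load_param_into_net(net, param_dict)
    net.set_train(False)

    loss = CrossEntropySmooth(sparse=True, reduction="mean",
                              smooth_factor=config.label_smooth_factor,
                              num_classes=config.class_num)
    model = Model(net, loss_fn=loss, metrics={'top_1_accuracy', 'top_5_accuracy'})
    print("result:", model.eval(dataset))
```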
Error summary
Out of memory: device (id:0) memory isn't enough and allocate failed, kernel name:
- Fix: reduce batch_size (this resolved the issue).
- Causes of insufficient memory: batch_size too large, model too large, data shape too large, etc.
References:
https://www.mindspore.cn/docs/zh-CN/master/migration_guide/sample_code.html
https://gitee.com/mindspore/docs/tree/master/docs/sample_code/migration_sample
https://gitee.com/mindspore/models/tree/master/official/cv/resnet