Visualization of the MindInsight training optimization process

Neural network training is essentially an optimization process over a high-dimensional non-convex function. In general, a minimum can be found with gradient descent (as shown in Figure 1). However, a typical neural network has tens of thousands or even hundreds of thousands of parameters, so its loss landscape cannot be displayed directly in three-dimensional space. With this function, users can display the optimization space around the network's training path based on dimension reduction and rendering computation.
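
To make the idea concrete, here is a minimal sketch (plain NumPy, not MindInsight's implementation; the names landscape_grid, loss_fn, theta, delta, and eta are illustrative) of the usual dimension-reduction trick: pick two direction vectors in weight space and evaluate the loss on a grid around the trained weights, which produces the surface that is later rendered.

import numpy as np

def landscape_grid(loss_fn, theta, delta, eta, span=1.0, resolution=10):
    """Evaluate loss_fn on a resolution x resolution grid in the plane
    theta + a*delta + b*eta, with a and b ranging over [-span, span]."""
    alphas = np.linspace(-span, span, resolution)
    betas = np.linspace(-span, span, resolution)
    grid = np.empty((resolution, resolution))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            # Shift the weights along the two directions and measure the loss.
            grid[i, j] = loss_fn(theta + a * delta + b * eta)
    return alphas, betas, grid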

Usage steps

Usage consists of two steps. The example below uses LeNet on the MNIST classification dataset; the sample code is as follows:

  1. Training data collection: during training, SummaryCollector is used to collect the forward network weights of the model at multiple points, together with the parameters required for drawing the loss landscape (expected drawing intervals, landscape resolution, etc.).
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as CV
import mindspore.dataset.transforms.c_transforms as C
from mindspore.dataset.vision import Inter
from mindspore import dtype as mstype
import mindspore.nn as nn

from mindspore.common.initializer import Normal
from mindspore import set_context, GRAPH_MODE
from mindspore import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor, SummaryCollector
from mindspore import Model
from mindspore.nn import Accuracy
from mindspore import set_seed

set_seed(1)

def create_dataset(data_path, batch_size=32, repeat_size=1,
                   num_parallel_workers=1):
    """
    create dataset for train or test
    """
    # define dataset
    mnist_ds = ds.MnistDataset(data_path, shuffle=False)

    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0
    rescale_nml = 1 / 0.3081
    shift_nml = -1 * 0.1307 / 0.3081

    # define map operations
    resize_op = CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR)  # Bilinear mode
    rescale_nml_op = CV.Rescale(rescale_nml, shift_nml)
    rescale_op = CV.Rescale(rescale, shift)
    hwc2chw_op = CV.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.int32)

    # apply map operations on images
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=resize_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_nml_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=hwc2chw_op, input_columns="image", num_parallel_workers=num_parallel_workers)

    # apply DatasetOps
    buffer_size = 10000
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)  # 10000 as in LeNet train script
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    mnist_ds = mnist_ds.repeat(repeat_size)

    return mnist_ds

class LeNet5(nn.Cell):
    """
    Lenet network

    Args:
        num_class (int): Number of classes. Default: 10.
        num_channel (int): Number of channels. Default: 1.

    Returns:
        Tensor, output tensor
    Examples:
    LeNet(num_class=10)

    """
    def __init__(self, num_class=10, num_channel=1, include_top=True):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid', weight_init=Normal(0.02))
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid', weight_init=Normal(0.02))
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.include_top = include_top
        if self.include_top:
            self.flatten = nn.Flatten()
            self.fc1 = nn.Dense(16 * 5 * 5, 120)
            self.fc2 = nn.Dense(120, 84)
            self.fc3 = nn.Dense(84, num_class)

    def construct(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        if not self.include_top:
            return x
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def train_lenet():
    set_context(mode=GRAPH_MODE, device_target="GPU")
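    # YOUR_DATA_PATH is a placeholder; replace it with the path to the MNIST training set.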
    data_path = YOUR_DATA_PATH
    ds_train = create_dataset(data_path)

    network = LeNet5(10)
    net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    net_opt = nn.Momentum(network.trainable_params(), 0.01, 0.9)
    time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())
    config_ck = CheckpointConfig(save_checkpoint_steps=1875, keep_checkpoint_max=10)
    ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet", config=config_ck)
    model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()})
    summary_dir = "./summary/lenet_test2"
    interval_1 = [x for x in range(1, 4)]
    interval_2 = [x for x in range(7, 11)]
    # Collect landscape information
    summary_collector = SummaryCollector(summary_dir, keep_default_action=True,
                                         collect_specified_data={'collect_landscape': {'landscape_size': 10,
                                                                                       'unit': "epoch",
                                                                                       'create_landscape': {'train': True,
                                                                                                            'result': True},
                                                                                       'num_samples': 512,
                                                                                        'intervals': [interval_1,
                                                                                                      interval_2
                                                                                                      ]
                                                                                        }
                                                                },
                                        collect_freq=1)

    print("============== Starting Training ==============")
    model.train(10, ds_train, callbacks=[time_cb, ckpoint_cb, LossMonitor(), summary_collector])

if __name__ == "__main__":
    train_lenet()

  summary_dir specifies the directory where the data is saved. summary_collector is the initialized SummaryCollector instance, and collect_landscape inside collect_specified_data contains, as a dictionary, all the parameter settings required for drawing the loss landscape:
    • landscape_size: the resolution of the loss landscape. 10 means the landscape resolution is 10*10. The higher the resolution, the finer the landscape texture and the longer the computation takes. The default is 40.
    • unit: the interval unit for saving parameters during training, either epoch or step. When using step, dataset_sink_mode must be set to False in model.train (see the sketch after this list). The default is step.
    • create_landscape: how the landscape is drawn. Currently, a training-process landscape (with the training trajectory) and a training-result landscape (without the trajectory) are supported. The default is {'train': True, 'result': True}.
    • num_samples: the number of samples in the landscape dataset. 512 means 512 samples are used for the landscape. The more samples, the more accurate the landscape and the longer the computation takes. The default is 2048.
    • intervals: the intervals for which the landscape is drawn. For example, interval_1 means drawing the landscape with the training trajectory for epochs 1-3.
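
For reference, here is a minimal sketch of collecting landscape data by step instead of by epoch; it reuses the model and ds_train defined in train_lenet above, and the step interval and summary directory are illustrative. With unit set to "step", data sinking must be disabled in model.train:

interval_steps = [x for x in range(100, 201)]  # draw the landscape for steps 100-200
summary_collector_step = SummaryCollector("./summary/lenet_step_test",
                                          collect_specified_data={'collect_landscape': {'landscape_size': 10,
                                                                                        'unit': "step",
                                                                                        'create_landscape': {'train': True,
                                                                                                             'result': True},
                                                                                        'num_samples': 512,
                                                                                        'intervals': [interval_steps]}},
                                          collect_freq=1)
# With unit "step", dataset_sink_mode must be False so that individual steps are visible to the callback.
model.train(10, ds_train, callbacks=[summary_collector_step], dataset_sink_mode=False)
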
  2. Landscape drawing: using the model parameters saved during training, and with the model and dataset kept consistent with training, start a new script and generate the landscape information through forward computation only, without any further training. (Applicable to drawing the loss landscape on a single device or with multi-device parallel computation.)
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as CV
import mindspore.dataset.transforms.c_transforms as C
from mindspore.dataset.vision import Inter
from mindspore import dtype as mstype
import mindspore.nn as nn

from mindspore.common.initializer import Normal
from mindspore import Model
from mindspore.nn import Loss
from mindspore import SummaryLandscape

def create_dataset(data_path, batch_size=32, repeat_size=1,
                   num_parallel_workers=1):
    """
    create dataset for train or test
    """
    # define dataset
    mnist_ds = ds.MnistDataset(data_path, shuffle=False)

    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0
    rescale_nml = 1 / 0.3081
    shift_nml = -1 * 0.1307 / 0.3081

    # define map operations
    resize_op = CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR)  # Bilinear mode
    rescale_nml_op = CV.Rescale(rescale_nml, shift_nml)
    rescale_op = CV.Rescale(rescale, shift)
    hwc2chw_op = CV.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.int32)

    # apply map operations on images
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=resize_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_nml_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=hwc2chw_op, input_columns="image", num_parallel_workers=num_parallel_workers)

    # apply DatasetOps
    buffer_size = 10000
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)  # 10000 as in LeNet train script
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    mnist_ds = mnist_ds.repeat(repeat_size)

    return mnist_ds

class LeNet5(nn.Cell):
    """
    Lenet network

    Args:
        num_class (int): Number of classes. Default: 10.
        num_channel (int): Number of channels. Default: 1.

    Returns:
        Tensor, output tensor
    Examples:
    LeNet(num_class=10)

    """
    def __init__(self, num_class=10, num_channel=1, include_top=True):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid', weight_init=Normal(0.02))
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid', weight_init=Normal(0.02))
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.include_top = include_top
        if self.include_top:
            self.flatten = nn.Flatten()
            self.fc1 = nn.Dense(16 * 5 * 5, 120)
            self.fc2 = nn.Dense(120, 84)
            self.fc3 = nn.Dense(84, num_class)

    def construct(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        if not self.include_top:
            return x
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def callback_fn():
    network = LeNet5(10)
    net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    metrics = {"Loss": Loss()}
    model = Model(network, net_loss, metrics=metrics)
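    # YOUR_DATA_PATH is a placeholder; point it at the same MNIST data used during training.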
    data_path = YOUR_DATA_PATH
    ds_eval = create_dataset(data_path)
    return model, network, ds_eval, metrics

if __name__ == "__main__":
    interval_1 = [x for x in range(1, 4)]
    interval_2 = [x for x in range(7, 11)]
    summary_landscape = SummaryLandscape('./summary/lenet_test2')
    # generate loss landscape
    summary_landscape.gen_landscapes_with_multi_process(callback_fn,
                                                        collect_landscape={"landscape_size": 10,
                                                                           "create_landscape": {"train": True,
                                                                                                "result": True},
                                                                           "num_samples": 512,
                                                                           "intervals": [interval_1, interval_2
                                                                                        ]},
                                                        device_ids=[1, 2])

  • callback_fn: the user needs to define the function callback_fn. It takes no input and returns model (mindspore.Model), network (mindspore.nn.Cell), dataset (mindspore.dataset), and metrics (mindspore.nn.Metrics).
  • collect_landscape: the parameter definitions are the same as in SummaryCollector; users can freely modify the drawing parameters here.
  • device_ids: specifies the device IDs used for drawing the loss landscape; single-machine multi-device computation is supported.

After drawing is complete, start MindInsight:

# start-up 
mindinsight start --port 8000 --summary-base-dir /home/workspace/dockers/study/logs/summary
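
Once the service is running, the loss landscape can be viewed in a browser (with the default settings above, typically at http://127.0.0.1:8000) by opening the corresponding training job. To shut the service down afterwards, assuming the same port:

# stop MindInsight
mindinsight stop --port 8000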

Multidimensional analysis of the loss function

The multidimensional analysis of the loss function describes the model's trajectory during training; users can view it to understand how the model moved through the loss landscape as training progressed.

  1. Contour map, topographic map, and 3D map: these are different display forms of the same data; users can freely choose which to view.
  2. Step selection: users can choose to display images for different intervals through "Please select the interval range".
  3. Visual display settings: by adjusting these settings, users can view the image from different angles and change the landscape colors as well as the trajectory colors and widths. The number of contour lines can also be adjusted in the contour map (Figure 5) and the topographic map to control the density of the displayed image.
  4. Basic training information (Figure 5): basic information about the model is shown here, such as the network name, optimizer, learning rate (currently a fixed learning rate is displayed), dimension reduction method, sampling-point resolution, and step/epoch.

Notes

  1. When drawing the loss landscape, the drawing time is directly related to the number of model parameters, the dataset sample count num_samples, and the resolution landscape_size: the larger the model, num_samples, and landscape_size, the longer it takes. For example, for LeNet at a resolution of 40*40, one landscape takes about 4 minutes; with two devices, this drops to about 2 minutes. For ResNet-50 at the same resolution, drawing with 4 devices takes about 20 minutes.
  2. On the MindInsight start page, large training log files take longer to parse; please wait patiently.
  3. This function currently only supports models defined through mindspore.Model.
  4. Currently supported backends: Ascend/GPU/CPU; mode: static graph mode; platform: Linux.
  5. This function currently supports only single-device and single-machine multi-device modes.
  6. Data sink mode is not supported when drawing the loss landscape.
