# TVM: Use scheduling templates and AutoTVM optimization operators

This section learns how to use the TVM Tensor Expression (TE) language to write scheduling templates that can be searched by autoTVM to find the best scheduling. This process, called auto-Tuning, helps to optimize the automated process of tensor calculation.

This section is based on How to write matrix multiplication using TE On the basis of

The steps for auto-tuning are as follows:

• First Search Space
• The second run a search algorithm to explore this space

### Install dependencies

To use autoTVM packages in TVM, you need to install some additional dependencies

```pip3 install --user psutil xgboost cloudpickle
```

To make the TVM run faster in tuning, cython is recommended as the FFI of the TVM. In the root directory of the TVM, execute:

FFI:Foreign Function Interface, a program written in one programming language can call functions written in another programming language

```pip3 install --user cython
sudo make cython3
```

Import the required libraries:

```import logging
import sys

import numpy as np
import tvm
from tvm import te
import tvm.testing

# the module is called `autotvm`
from tvm import autotvm
```

### Basic Matrix Multiplication with TE

Wrap TE matrix multiplication with a python function definition. For simplicity, focus on split optimization, using a fixed value to define the reordered block size

```def matmul_basic(N, L, M, dtype):

A = te.placeholder((N, L), name="A", dtype=dtype)
B = te.placeholder((L, M), name="B", dtype=dtype)

k = te.reduce_axis((0, L), name="k")
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
s = te.create_schedule(C.op)

# schedule
y, x = s[C].op.axis
k = s[C].op.reduce_axis

yo, yi = s[C].split(y, 8)
xo, xi = s[C].split(x, 8)

s[C].reorder(yo, xo, k, yi, xi)

return s, [A, B, C]
```

### Matrix Multiplication with AutoTVM

In the above matrix multiplication code, the constant "8" is used as the tiling factor. This may not be the best, and the best tiling factor depends on the actual hardware environment and the input shape.
If you want the dispatch code to be portable on a wider range of input shapes and target hardware, it is best to define a set of candidate values and select the best values based on measurements on the target hardware.

In autotvm, we can define an adjustable parameter, or a "knob", for such values.

```# Matmul V1: List candidate values
@autotvm.template("tutorial/matmul_v1")  # 1. use a decorator
def matmul_v1(N, L, M, dtype):
A = te.placeholder((N, L), name="A", dtype=dtype)
B = te.placeholder((L, M), name="B", dtype=dtype)

k = te.reduce_axis((0, L), name="k")
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
s = te.create_schedule(C.op)

# schedule
y, x = s[C].op.axis
k = s[C].op.reduce_axis

# 2. get the config object
cfg = autotvm.get_config()

# 3. define search space
cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
cfg.define_knob("tile_x", [1, 2, 4, 8, 16])

# 4. schedule according to config
yo, yi = s[C].split(y, cfg["tile_y"].val)
xo, xi = s[C].split(x, cfg["tile_x"].val)

s[C].reorder(yo, xo, k, yi, xi)

return s, [A, B, C]
```

Here, we make four modifications to the previous dispatch code to get a dispatchable "template". We can explain these changes one by one:

1. Mark this function as a simple template using an adorner.
2. Gets the config object. This cfg is considered a parameter of this function, but it can be obtained in different ways. With this parameter, the function is no longer a deterministic scheduling. Instead, we can pass different configurations to this function and get different schedules. A function that uses configuration objects like this is called a Template.
To make template functions more compact, there are two things we can do to define the parameter search space in a single function.
1. Defines a search space that spans a set of values. This is achieved by making cfg a ConfigSpace object. It will collect all the adjustable knobs in this function and create a search space from them.
2. Schedule according to an entity in this space. This is achieved by making CFG a ConfigEntity object. When it is a ConfigEntity, it ignores all spatial definition API s (that is, cfg.define_XXXXX(...)). Instead, it stores determinate values for all dispatchable knobs that we schedule against.

During the automatic scheduling process, we will first build the search space by invoking the template with the ConfigSpace object. We then invoke the template with different ConfigEntities in the built space to obtain different schedules. Finally, we'll measure the code generated by different schedules and pick the best one.

1. Define two dispatchable knobs. The first is tile_y, there are five possible values. The second is tile_x, has the same list of possible values. The two knobs are independent, so they span a search space of 25=5x5 in size.

2. The configuration knob is passed to the split dispatch operation, enabling us to schedule according to the 5x5 determinant we previously defined in the cfg.

#### A Matrix Multiplication Template with the Advanced Parameter API

In the previous template, all possible values for a knob are manually listed. This is the lowest API that defines space and gives a clear list of the parameter spaces to search for. However, we also provide another set of APIs to make the definition of search space easier and smarter. Where possible, we accept that you use this higher level API.
In the following example, we use ConfigSpace.define_split to define a split knob. It will list all possible ways to split an axis and build space.
There is also ConfigSpace.define_reorder is used to reorder knobs and ConfigSpace.define_annotate is used for annotations such as unroll, vectorization, thread binding, and so on. When an advanced API does not meet your requirements, you can always fall back to using a lower level API.

```@autotvm.template("tutorial/matmul")
def matmul(N, L, M, dtype):
A = te.placeholder((N, L), name="A", dtype=dtype)
B = te.placeholder((L, M), name="B", dtype=dtype)

k = te.reduce_axis((0, L), name="k")
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
s = te.create_schedule(C.op)

# schedule
y, x = s[C].op.axis
k = s[C].op.reduce_axis

##### define space begin #####
cfg = autotvm.get_config()
cfg.define_split("tile_y", y, num_outputs=2)
cfg.define_split("tile_x", x, num_outputs=2)
##### define space end #####

# schedule according to config
yo, yi = cfg["tile_y"].apply(s, C, y)
xo, xi = cfg["tile_x"].apply(s, C, x)

s[C].reorder(yo, xo, k, yi, xi)

return s, [A, B, C]
```

PS: About cfg. Define_ More explanations for split
In this template, cfg.define_split("tile_y", y, num_outputs=2) lists all possible combinations that divide axis Y into two axes with a factor of Y length. For example, if y has a length of 32 and we want to split it into two axes with a factor of 32, then (the length of the outer axis, the length of the inner axis) has six possible values for (32, 1), (16, 2), (8, 4, 8), (2, 16) or (1, 32). These are tile_ There are six possible values for y.

During the scheduling process, CFG ["tile_y"] is a SplitEntity object. We store the lengths of the outer and inner axes in cfg['tile_y']. In size (a tuple with two elements). In this template, we use yo, yi = cfg['tile_y'].apply(s, C, y) to apply it. In fact, this is equivalent to yo, yi = s[C].split(y, cfg["tile_y"].size) or yo, Yi = s[C]. Split(y, nparts=cfg['tile_y'].size)

Use cfg. The benefit of the apply API is that it makes multilevel splitting easier when num_outputs >= 3.

### Step 2: Use AutoTVM to Optimize the Matrix Multiplication

In Step 1, we wrote a matrix multiplication template that allows us to parameterize the block sizes used in split scheduling. We can now search for this parameter space. The next step is to select an adjuster to guide the exploration of this space.

#### Auto-tuners in TVM

The adjustments are shown in the following pseudocode:

```ct = 0
while ct < max_number_of_trials:
propose a batch of configs
measure this batch of configs on real hardware and get results
ct += batch_size
```

When proposing the next batch of configurations, tuner can take different strategies. Some tuner strategies offered by TVM include:

• tvm.autotvm.tuner.RandomTuner: Enumerates spaces in random order.
• tvm.autotvm.tuner.GridSearchTuner: Enumerate space as a grid search.
• tvm.autotvm.tuner.GATuner: Searching for space using genetic algorithms
• tvm.autotvm.tuner.XGBTuner: Use a model-based approach. Train an XGBoost model to predict the rate of IR reduction and select the next batch based on the predicted results.

Tuners can be selected based on your space size, your time budget, and other factors. For example, if your space is very small (less than 1000), grid search tuners or random tuners are sufficient. If your space is at the level (this is the size of the conv2d operator on the CUDA GPU), XGBoostTuner can more effectively explore and find better configurations.

#### Begin tuning

First, create a tuning task, or check the initialization space. In this example, for a 512 * 512 matrix multiplication, the space size plays 10 * 10 = 100. Note that this task is independent of the search space and the tuner selected.

```N, L, M = 512, 512, 512
```

Output results:

```ConfigSpace (len=100, space_map=
0 tile_y: Split(policy=factors, product=512, num_outputs=2) len=10
1 tile_x: Split(policy=factors, product=512, num_outputs=2) len=10
)
```

Then you need to define a code to measure production and a selected tuner. Because we have very little space, a random tuner will do
In practice, you can do more trials based on your time budget. We will record the tuning results in a log file. This file can be used to select the best configuration that the tuner will discover later.

```# logging config (for printing tuning log to the screen)
logging.getLogger("autotvm").setLevel(logging.DEBUG)
```

There are two steps to measuring a configuration: build and run. By default, all CPU cores are used to compile programs. Then measure in order. To reduce the difference, this is measured five times and averaged

```measure_option = autotvm.measure_option(builder="local", runner=autotvm.LocalRunner(number=5))

# Begin tuning with RandomTuner, log records to file `matmul.log`
# You can use alternatives like XGBTuner.
tuner.tune(
n_trial=10,
measure_option=measure_option,
callbacks=[autotvm.callback.log_to_file("matmul.log")],
)
```

Output results:

```waiting for device...
waiting for device...It queries the scheduling context with its parameters and gets the best configuration with the same parameters.
device available
device available
Get devices for measurement successfully!
Get devices for measurement successfully!
No: 1   GFLOPS: 8.17/8.17       result: MeasureResult(costs=(0.0328701162,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.6089231967926025, timestamp=1658363130.4992497)  [('tile_y', [-1, 16]), ('tile_x', [-1, 16])],None,44
No: 1   GFLOPS: 8.17/8.17       result: MeasureResult(costs=(0.0328701162,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.6089231967926025, timestamp=1658363130.4992497)  [('tile_y', [-1, 16]), ('tile_x', [-1, 16])],None,44
No: 2   GFLOPS: 7.45/8.17       result: MeasureResult(costs=(0.0360509632,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.695908784866333, timestamp=1658363131.2425098)   [('tile_y', [-1, 8]), ('tile_x', [-1, 16])],None,43
No: 2   GFLOPS: 7.45/8.17       result: MeasureResult(costs=(0.0360509632,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.695908784866333, timestamp=1658363131.2425098)   [('tile_y', [-1, 8]), ('tile_x', [-1, 16])],None,43
No: 3   GFLOPS: 25.86/25.86     result: MeasureResult(costs=(0.010381651799999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.3583974838256836, timestamp=1658363131.6205783)  [('tile_y', [-1, 32]), ('tile_x', [-1, 128])],None,75
No: 3   GFLOPS: 25.86/25.86     result: MeasureResult(costs=(0.010381651799999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.3583974838256836, timestamp=1658363131.6205783)  [('tile_y', [-1, 32]), ('tile_x', [-1, 128])],None,75
No: 4   GFLOPS: 30.35/30.35     result: MeasureResult(costs=(0.0088443436,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.31720852851867676, timestamp=1658363131.9254267) [('tile_y', [-1, 1]), ('tile_x', [-1, 64])],None,60
No: 4   GFLOPS: 30.35/30.35     result: MeasureResult(costs=(0.0088443436,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.31720852851867676, timestamp=1658363131.9254267) [('tile_y', [-1, 1]), ('tile_x', [-1, 64])],None,60
No: 5   GFLOPS: 9.35/30.35      result: MeasureResult(costs=(0.028700182600000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7562305927276611, timestamp=1658363132.5526962)  [('tile_y', [-1, 16]), ('tile_x', [-1, 128])],None,74
No: 5   GFLOPS: 9.35/30.35      result: MeasureResult(costs=(0.028700182600000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7562305927276611, timestamp=1658363132.5526962)  [('tile_y', [-1, 16]), ('tile_x', [-1, 128])],None,74
No: 6   GFLOPS: 28.15/30.35     result: MeasureResult(costs=(0.0095366542,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.29946398735046387, timestamp=1658363132.9198403) [('tile_y', [-1, 8]), ('tile_x', [-1, 512])],None,93
No: 6   GFLOPS: 28.15/30.35     result: MeasureResult(costs=(0.0095366542,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.29946398735046387, timestamp=1658363132.9198403) [('tile_y', [-1, 8]), ('tile_x', [-1, 512])],None,93
No: 7   GFLOPS: 19.80/30.35     result: MeasureResult(costs=(0.0135571626,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.30517101287841797, timestamp=1658363133.3043413) [('tile_y', [-1, 512]), ('tile_x', [-1, 512])],None,99
No: 7   GFLOPS: 19.80/30.35     result: MeasureResult(costs=(0.0135571626,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.30517101287841797, timestamp=1658363133.3043413) [('tile_y', [-1, 512]), ('tile_x', [-1, 512])],None,99
No: 8   GFLOPS: 5.75/30.35      result: MeasureResult(costs=(0.0467114476,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.8827676773071289, timestamp=1658363134.2409544)  [('tile_y', [-1, 8]), ('tile_x', [-1, 32])],None,53
No: 8   GFLOPS: 5.75/30.35      result: MeasureResult(costs=(0.0467114476,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.8827676773071289, timestamp=1658363134.2409544)  [('tile_y', [-1, 8]), ('tile_x', [-1, 32])],None,53
No: 9   GFLOPS: 27.67/30.35     result: MeasureResult(costs=(0.0097023816,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.38503575325012207, timestamp=1658363134.582773)  [('tile_y', [-1, 2]), ('tile_x', [-1, 128])],None,71
No: 9   GFLOPS: 27.67/30.35     result: MeasureResult(costs=(0.0097023816,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.38503575325012207, timestamp=1658363134.582773)  [('tile_y', [-1, 2]), ('tile_x', [-1, 128])],None,71
No: 10  GFLOPS: 2.62/30.35      result: MeasureResult(costs=(0.1026008602,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.708867073059082, timestamp=1658363136.3652241)   [('tile_y', [-1, 8]), ('tile_x', [-1, 1])],None,3
No: 10  GFLOPS: 2.62/30.35      result: MeasureResult(costs=(0.1026008602,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.708867073059082, timestamp=1658363136.3652241)   [('tile_y', [-1, 8]), ('tile_x', [-1, 1])],None,3
```

When tuning is complete, select the configuration from the log file that has the best measurement performance for v and compile the schedule with the appropriate parameters. A quick validation can also be done to ensure that the schedule produces the correct answer. Can autotvm. Apply_ History_ The matmul function is called directly in the best context. When this function is called, it queries the scheduling context with its parameters and gets the best configuration with the same parameters.

```# apply history best from log file
with autotvm.apply_history_best("matmul.log"):
with tvm.target.Target("llvm"):
s, arg_bufs = matmul(N, L, M, "float32")
func = tvm.build(s, arg_bufs)

# check correctness
a_np = np.random.uniform(size=(N, L)).astype(np.float32)
b_np = np.random.uniform(size=(L, M)).astype(np.float32)
c_np = a_np.dot(b_np)

c_tvm = tvm.nd.empty(c_np.shape)
func(tvm.nd.array(a_np), tvm.nd.array(b_np), c_tvm)

tvm.testing.assert_allclose(c_np, c_tvm.numpy(), rtol=1e-4)
```

Output results:

```Finish loading 10 records 