This section shows how to use the TVM Tensor Expression (TE) language to write scheduling templates that AutoTVM can search over to find the best schedule. This process, called auto-tuning, automates the optimization of tensor computations.
This section builds on the earlier material on how to write matrix multiplication using TE.
The steps for auto-tuning are as follows:
- First, define a search space.
- Second, run a search algorithm to explore this space.
Install dependencies
To use the autotvm package in TVM, you need to install some additional dependencies:
```bash
pip3 install --user psutil xgboost cloudpickle
```
To make TVM run faster during tuning, it is recommended to use Cython as the FFI of TVM. In the root directory of TVM, execute:
FFI (Foreign Function Interface): a mechanism that lets a program written in one programming language call functions written in another.
```bash
pip3 install --user cython
sudo make cython3
```
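As an aside, the FFI concept itself is easy to see outside TVM. Here is a minimal sketch in plain Python using ctypes to call the C standard library's `sqrt` (the `libm.so.6` path assumes a Linux system):

```python
import ctypes

# Load the C math library and declare sqrt's C signature.
libm = ctypes.CDLL("libm.so.6")
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

# Python calling a C function through an FFI.
print(libm.sqrt(2.0))  # 1.4142135623730951
```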
Import the required libraries:
```python
import logging
import sys

import numpy as np
import tvm
from tvm import te
import tvm.testing

# the module is called `autotvm`
from tvm import autotvm
```
Basic Matrix Multiplication with TE
Begin by wrapping the TE matrix multiplication in a Python function definition. For simplicity, we focus on the split optimization, using a fixed value that defines the block size of the reordering.
```python
def matmul_basic(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    yo, yi = s[C].split(y, 8)
    xo, xi = s[C].split(x, 8)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]
```
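Before adding any tuning machinery, it may help to see that this is an ordinary, fixed schedule. A small sketch (relying on the imports above) that builds the function and runs it on random inputs:

```python
# Build the fixed schedule and run it once as a sanity check.
s, args = matmul_basic(512, 512, 512, "float32")
func = tvm.build(s, args, target="llvm")

a = tvm.nd.array(np.random.uniform(size=(512, 512)).astype("float32"))
b = tvm.nd.array(np.random.uniform(size=(512, 512)).astype("float32"))
c = tvm.nd.empty((512, 512))
func(a, b, c)
```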
Matrix Multiplication with AutoTVM
In the matrix multiplication code above, the constant "8" is used as the tiling factor. This may not be optimal, because the best tiling factor depends on the actual hardware environment and the input shape.
If you want the schedule code to be portable across a wider range of input shapes and target hardware, it is better to define a set of candidate values and pick the best one based on measurements on the target hardware.
In autotvm, we can define a tunable parameter, or a "knob", for such values.
```python
# Matmul V1: List candidate values
@autotvm.template("tutorial/matmul_v1")  # 1. use a decorator
def matmul_v1(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    # 2. get the config object
    cfg = autotvm.get_config()

    # 3. define search space
    cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
    cfg.define_knob("tile_x", [1, 2, 4, 8, 16])

    # 4. schedule according to config
    yo, yi = s[C].split(y, cfg["tile_y"].val)
    xo, xi = s[C].split(x, cfg["tile_x"].val)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]
```
Here we make four modifications to the previous schedule code to obtain a tunable "template". Let's explain the changes one by one:
- Mark this function as a simple template using a decorator.
- Get the config object. You can regard this cfg as an argument of the function, though we obtain it in a different way. With this argument, the function is no longer a deterministic schedule: we can pass different configurations to it and get different schedules. A function that uses a configuration object like this is called a "template".
To make the template function more compact, we do two things to define the parameter search space within a single function:
- Define a search space that spans a set of values. This is done by making cfg a ConfigSpace object, which collects all the tunable knobs in the function and builds a search space from them.
- Schedule according to an entity in this space. This is done by making cfg a ConfigEntity object. When it is a ConfigEntity, it ignores all the space-definition APIs (namely cfg.define_XXXXX(...)) and instead stores deterministic values for all the tunable knobs, which we schedule against.
During auto-tuning, we first call this template with a ConfigSpace object to build the search space. We then call the template with different ConfigEntity objects from the built space to get different schedules. Finally, we measure the code generated by the different schedules and pick the best one.
- Define two tunable knobs. The first is tile_y with 5 possible values; the second is tile_x with the same list of possible values. The two knobs are independent, so they span a search space of size 5x5 = 25.
- The configuration knobs are passed to the split schedule operations, allowing us to schedule according to the 5x5 deterministic values we previously defined in cfg. See the sketch after this list for how the two phases look in practice.
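To see the two-phase mechanism concretely, here is a short sketch (assuming the matmul_v1 template above is registered) that builds the search space from the template and inspects individual entities in it:

```python
# Phase 1: instantiate the template as a task; this invokes matmul_v1 with a
# ConfigSpace object and collects the knobs into a search space.
task = autotvm.task.create(
    "tutorial/matmul_v1", args=(512, 512, 512, "float32"), target="llvm"
)
print(len(task.config_space))    # 25 = 5 x 5, one entry per knob combination

# Phase 2 (normally done by a tuner): pick one ConfigEntity from the space.
print(task.config_space.get(0))  # a single deterministic configuration
```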
A Matrix Multiplication Template with the Advanced Parameter API
In the previous template, all possible values for a knob were listed manually. This is the lowest-level API for defining the space, and it gives an explicit enumeration of the parameter space to search. However, there is another set of APIs that makes defining the search space easier and smarter. Where possible, we recommend using this higher-level API.
In the following example, we use ConfigSpace.define_split to define a split knob. It enumerates all the possible ways to split an axis and builds the space.
There is also ConfigSpace.define_reorder for reorder knobs and ConfigSpace.define_annotate for annotations such as unroll, vectorization, thread binding, and so on. When the high-level API does not meet your requirements, you can always fall back to the low-level API.
```python
@autotvm.template("tutorial/matmul")
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    ##### define space begin #####
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    ##### define space end #####

    # schedule according to config
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]
```
Note: more explanation of cfg.define_split
In this template, cfg.define_split("tile_y", y, num_outputs=2) enumerates all the possible ways to split axis y into two axes whose lengths are factors of the length of y. For example, if y has a length of 32 and we want to split it into two axes using factors of 32, then there are 6 possible values for (length of outer axis, length of inner axis): (32, 1), (16, 2), (8, 4), (4, 8), (2, 16), or (1, 32). These are the 6 possible values of tile_y.
During scheduling, cfg["tile_y"] is a SplitEntity object. We store the lengths of the outer and inner axes in cfg["tile_y"].size (a tuple with two elements). In this template, we apply it with yo, yi = cfg["tile_y"].apply(s, C, y). In fact, this is equivalent to yo, yi = s[C].split(y, cfg["tile_y"].size[1]) or yo, yi = s[C].split(y, nparts=cfg["tile_y"].size[0]).
The benefit of using the cfg.apply API is that it makes multi-level splitting (when num_outputs >= 3) easier.
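For example, here is a minimal sketch of a three-way split (the template name "tutorial/matmul_3way" is hypothetical, introduced only for illustration). The split stays a single knob, and one apply call performs all the underlying splits:

```python
@autotvm.template("tutorial/matmul_3way")
def matmul_3way(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    cfg = autotvm.get_config()
    # enumerate all 3-way factorizations of the length of y
    cfg.define_split("tile_y", y, num_outputs=3)

    # one apply call returns all three axes; doing this by hand would
    # require chaining two s[C].split calls with the right factors
    yo, ym, yi = cfg["tile_y"].apply(s, C, y)
    s[C].reorder(yo, k, ym, yi)

    return s, [A, B, C]
```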
Step 2: Use AutoTVM to Optimize the Matrix Multiplication
In Step 1, we wrote a matrix multiplication template that parameterizes the block sizes used in the split schedule. We can now search over this parameter space. The next step is to pick a tuner to guide the exploration of the space.
Auto-tuners in TVM
The tuning process is described by the following pseudocode:
```
ct = 0
while ct < max_number_of_trials:
    propose a batch of configs
    measure this batch of configs on real hardware and get results
    ct += batch_size
```
When proposing the next batch of configurations, the tuner can apply different strategies. Some of the tuner strategies provided by TVM include:
- tvm.autotvm.tuner.RandomTuner: Enumerates the space in random order.
- tvm.autotvm.tuner.GridSearchTuner: Enumerates the space in grid-search order.
- tvm.autotvm.tuner.GATuner: Explores the space using a genetic algorithm.
- tvm.autotvm.tuner.XGBTuner: Uses a model-based method: it trains an XGBoost model to predict the speed of the lowered IR and selects the next batch based on the prediction.
Tuners can be selected based on your space size, your time budget, and other factors. For example, if your space is very small (fewer than 1000 configurations), a grid-search tuner or a random tuner is good enough. If your space is at the level of 10^9 (the size of the space for a conv2d operator on a CUDA GPU), XGBoostTuner can explore it more efficiently and find better configurations.
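Switching strategies only changes how the tuner is constructed; the tuning loop stays the same. A sketch, assuming a `task` object like the one created in the next step (in real code you would keep only one of these lines):

```python
tuner = autotvm.tuner.RandomTuner(task)                     # random order
tuner = autotvm.tuner.GridSearchTuner(task)                 # grid order
tuner = autotvm.tuner.GATuner(task, pop_size=50)            # genetic algorithm
tuner = autotvm.tuner.XGBTuner(task, feature_type="knob")   # model-based
```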
Begin tuning
First, create a tuning task and inspect the initialized search space. In this example, for a 512x512 matrix multiplication, the space size is 10x10 = 100. Note that the task and the search space are independent of the tuner selected.
```python
N, L, M = 512, 512, 512
task = autotvm.task.create("tutorial/matmul", args=(N, L, M, "float32"), target="llvm")
print(task.config_space)
```
Output results:
```
ConfigSpace (len=100, space_map=
   0 tile_y: Split(policy=factors, product=512, num_outputs=2) len=10
   1 tile_x: Split(policy=factors, product=512, num_outputs=2) len=10
)
```
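The len=10 of each knob follows from the factors policy: every candidate split factor must divide the axis length, and 512 = 2^9 has exactly 10 divisors. A quick check:

```python
# 512 = 2**9, so it has exactly 10 divisors, one per candidate split factor.
factors = [d for d in range(1, 513) if 512 % d == 0]
print(factors)       # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(len(factors))  # 10
```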
Then we need to define how to measure the generated code, and pick a tuner. Since our space is small, a random tuner is fine.
In practice, you can do more trials depending on your time budget. We will log the tuning results into a file, which can later be used to get the best configuration discovered by the tuner.
```python
# logging config (for printing tuning log to the screen)
logging.getLogger("autotvm").setLevel(logging.DEBUG)
logging.getLogger("autotvm").addHandler(logging.StreamHandler(sys.stdout))
```
There are two steps to measuring a configuration: build and run. By default, all CPU cores are used to compile the program, and the measurements then run sequentially. To reduce variance, each configuration is measured five times and the results are averaged.
```python
measure_option = autotvm.measure_option(builder="local", runner=autotvm.LocalRunner(number=5))

# Begin tuning with RandomTuner, log records to file `matmul.log`
# You can use alternatives like XGBTuner.
tuner = autotvm.tuner.RandomTuner(task)
tuner.tune(
    n_trial=10,
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("matmul.log")],
)
```
Output results:
```
waiting for device...
device available
Get devices for measurement successfully!
No: 1  GFLOPS: 8.17/8.17   result: MeasureResult(costs=(0.0328701162,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.6089231967926025, timestamp=1658363130.4992497) [('tile_y', [-1, 16]), ('tile_x', [-1, 16])],None,44
No: 2  GFLOPS: 7.45/8.17   result: MeasureResult(costs=(0.0360509632,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.695908784866333, timestamp=1658363131.2425098) [('tile_y', [-1, 8]), ('tile_x', [-1, 16])],None,43
No: 3  GFLOPS: 25.86/25.86 result: MeasureResult(costs=(0.010381651799999999,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.3583974838256836, timestamp=1658363131.6205783) [('tile_y', [-1, 32]), ('tile_x', [-1, 128])],None,75
No: 4  GFLOPS: 30.35/30.35 result: MeasureResult(costs=(0.0088443436,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.31720852851867676, timestamp=1658363131.9254267) [('tile_y', [-1, 1]), ('tile_x', [-1, 64])],None,60
No: 5  GFLOPS: 9.35/30.35  result: MeasureResult(costs=(0.028700182600000002,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7562305927276611, timestamp=1658363132.5526962) [('tile_y', [-1, 16]), ('tile_x', [-1, 128])],None,74
No: 6  GFLOPS: 28.15/30.35 result: MeasureResult(costs=(0.0095366542,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.29946398735046387, timestamp=1658363132.9198403) [('tile_y', [-1, 8]), ('tile_x', [-1, 512])],None,93
No: 7  GFLOPS: 19.80/30.35 result: MeasureResult(costs=(0.0135571626,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.30517101287841797, timestamp=1658363133.3043413) [('tile_y', [-1, 512]), ('tile_x', [-1, 512])],None,99
No: 8  GFLOPS: 5.75/30.35  result: MeasureResult(costs=(0.0467114476,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.8827676773071289, timestamp=1658363134.2409544) [('tile_y', [-1, 8]), ('tile_x', [-1, 32])],None,53
No: 9  GFLOPS: 27.67/30.35 result: MeasureResult(costs=(0.0097023816,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.38503575325012207, timestamp=1658363134.582773) [('tile_y', [-1, 2]), ('tile_x', [-1, 128])],None,71
No: 10 GFLOPS: 2.62/30.35  result: MeasureResult(costs=(0.1026008602,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.708867073059082, timestamp=1658363136.3652241) [('tile_y', [-1, 8]), ('tile_x', [-1, 1])],None,3
```
When tuning is complete, select the configuration with the best measured performance from the log file and compile the schedule with the corresponding parameters. A quick verification can also be run to ensure that the schedule produces correct answers. The matmul function can be called directly under the autotvm.apply_history_best context: when the function is invoked, it queries the dispatch context with its arguments and retrieves the best configuration for those arguments.
```python
# apply history best from log file
with autotvm.apply_history_best("matmul.log"):
    with tvm.target.Target("llvm"):
        s, arg_bufs = matmul(N, L, M, "float32")
        func = tvm.build(s, arg_bufs)

# check correctness
a_np = np.random.uniform(size=(N, L)).astype(np.float32)
b_np = np.random.uniform(size=(L, M)).astype(np.float32)
c_np = a_np.dot(b_np)

c_tvm = tvm.nd.empty(c_np.shape)
func(tvm.nd.array(a_np), tvm.nd.array(b_np), c_tvm)

tvm.testing.assert_allclose(c_np, c_tvm.numpy(), rtol=1e-4)
```
Output results:
Finish loading 10 records
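Beyond correctness, you may also want to check the speed of the tuned kernel. A small sketch using TVM's time_evaluator, reusing func, a_np, b_np, and c_tvm from the snippet above:

```python
# Time the compiled function; number=10 averages over 10 runs per measurement.
dev = tvm.cpu()
evaluator = func.time_evaluator(func.entry_name, dev, number=10)
mean_seconds = evaluator(tvm.nd.array(a_np, dev), tvm.nd.array(b_np, dev), c_tvm).mean
print("tuned matmul: %.3f ms" % (mean_seconds * 1e3))
```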
The contents of matmul.log are as follows: