Introduction

Cambricon® is a company developing machine learning processors for cloud servers, smart devices, etc. Products we have released include IP cores for smart devices (e.g., Cambricon-1A, Cambricon-1H and Cambricon-1M) and AI accelerator cards for servers (e.g., MLU100).

The Cambricon software stack is shown in Figure 1. As shown, CNRT™ (Cambricon Neuware™ Runtime Library) is a runtime library; CNML™ (Cambricon Neuware Machine Learning Library) is a high-performance computing library for machine learning which depends on CNRT.

Fig 1. The Cambricon Software Stack

After MXNet is integrated with CNML/CNRT, developers can run MXNet on Cambricon MLUs.

This proposal first introduces CNML and CNRT, then presents the design for integrating CNML/CNRT into MXNet, and discusses future work at the end.

Notes:

The Cambricon software stack applies to both IP cores and AI accelerator cards. This proposal uses MLU (machine learning unit) to refer to both.

Unless otherwise specified, an "operator" denotes a CNML operator, and an "MLU operator" denotes an MXNet operator whose device type is "mlu". Moreover, imperative mode and symbolic mode denote the two programming styles in CNML, while imperative programming and symbolic programming denote the two programming styles in MXNet.

CNML and CNRT

Features of CNML and CNRT

MLUs have an instruction set designed specifically for deep learning algorithms. Computational tasks are compiled by the CPU into Cambricon instructions, which are stored in the host memory. Executing a computational task has three phases: a) the MLU reads the instructions and the data associated with the task from the host memory; b) the MLU executes the instructions to process the data; c) the MLU writes the results back to the host memory. To improve performance and reduce memory usage, CNML uses JIT (Just-In-Time) compilation to compile operators before they are executed. The paradigm of programming with CNML is as follows:

  1. Create an operator instance: set the attributes of the operator, associate the input tensor descriptors with the operator instance.
  2. Compile the operator instance in a JIT manner: the operator instance is compiled into a set of Cambricon instructions; meanwhile, input tensors such as the weight tensor are transformed for better performance.
  3. Execute the operator instance: call the Forward() method of the operator instance to execute the forward computation.

Create operator instances

The attributes (including data types) of tensors on the MLU may differ from those on the CPU. CNML provides cnmlCpuTensor and cnmlTensor as the descriptors of tensors on the CPU and the MLU respectively. Some tensors are read-only, and some can be updated during the computation. CNML uses the enumerated values CNML_TENSOR, CNML_FILTER and CNML_CONST - of type cnmlTensorType_t - to label the read/write characteristics of tensors. Neural network inference involves many CNML_FILTER and CNML_CONST tensors. CNML provides an interface to bind cnmlCpuTensors and data addresses to their cnmlTensors when an operator is created. The data of the CNML_FILTER and CNML_CONST tensors is preprocessed during the compilation of the operator instance.

Compile operator instances in a JIT manner

The basic operator of CNML is represented by the structure BaseOp. Developers can use cnmlCompileBaseOp() to compile a BaseOp. In general, a neural network is composed of many BaseOps. Compared with compiling these BaseOps one by one, compiling them as a computation subgraph - which we call fusion compilation - may achieve better performance. CNML uses a FusionOp structure to represent the BaseOps included in the computation subgraph as a whole, and provides cnmlFuseOp() to add multiple BaseOps to a FusionOp one by one. Finally, the FusionOp is compiled by cnmlCompileFusionOp().

Execute operator instances

One important method of BaseOp and FusionOp is Forward(). When Forward() is executed, instructions and data are copied from host memory to MLU memory, then the MLU reads them and executes the instructions.

Forward() is executed asynchronously for better performance. Specifically, after the host process invokes the Forward() method, it can proceed with other tasks without waiting for the computation task on the MLU to finish. The asynchronous execution mechanism of the Forward() method is implemented through the cnrtQueue. The task types in the cnrtQueue include computation, synchronization and so on. Before the cnrtQueue is used, a cnrtQueue instance must be created. Every time the Forward() method of a BaseOp or FusionOp is invoked, a computation task is generated and then pushed into the cnrtQueue instance. Every time the cnrtSyncQueue() interface is invoked, a synchronization task is generated and then pushed into the cnrtQueue instance. When a host process invokes the cnrtSyncQueue() interface, the corresponding host process will be blocked until the synchronization task in the cnrtQueue instance is finished. The cnrtQueue has the following characteristics:

  • Tasks in one cnrtQueue are executed in a FIFO fashion.
  • Tasks from different cnrtQueues can be executed in parallel.

CNML and CNRT APIs

This section briefly introduces the inference APIs of CNML and CNRT. More details are described in the attached document Introduction to CNML&CNRT APIs.pdf.

Tensor Descriptor

Create a tensor descriptor

  • cnmlCreateCpuTensor
  • cnmlCreateTensor

Destroy a tensor descriptor

  • cnmlDestroyCpuTensor
  • cnmlDestroyTensor

Bind data for CNML_FILTER and CNML_CONST tensors

  • cnmlBindConstData

Handle MLU memory for CNML_TENSOR tensors after operator compilation

  • cnmlMallocBuffer
  • cnmlMemcpyTensorToDevice
  • cnmlMemcpyTensorToHost
  • cnmlFreeBuffer

Operator

Common operations on a BaseOp

  • cnmlCreateConvOp
  • cnmlCreateActiveOp
  • cnmlCreate......
  • cnmlCompileBaseOp
  • cnmlComputeConvOpForward
  • cnmlComputeActiveOpForward
  • cnmlCompute......
  • cnmlDestroyBaseOp

Common operations on a FusionOp

  • cnmlCreateFusionOp
  • cnmlFuseOp
  • cnmlSetFusionIO
  • cnmlCompileFusionOp
  • cnmlComputeFusionOpForward
  • cnmlDestroyFusionOp

cnrtQueue

Create and Destroy a cnrtQueue

  • cnrtCreateQueue
  • cnrtDestroyQueue

Push a synchronization task into a cnrtQueue instance

  • cnrtSyncQueue

Working with CNML

With CNML, MXNet runs a neural network model in one of two modes - the imperative mode or symbolic mode. This section illustrates how to work with CNML in a neural network which consists of a series of CNML operators.

In the imperative mode, BaseOps in the neural network are compiled by cnmlCompileBaseOp(). The Forward() method of each BaseOp is invoked in a topological order. The imperative mode can satisfy the need of imperative execution and facilitate development and debugging.

In the symbolic mode, BaseOps are put into one FusionOp by cnmlFuseOp(). The FusionOp is then compiled by cnmlCompileFusionOp(), and the Forward() of the FusionOp is executed for inference. When MXNet repeatedly performs inference on the same network, the symbolic mode can improve the utilization of the MLU hardware resources and achieve better performance than the imperative mode.

Working with CNML in the imperative mode

  1. Create operator instances in a neural network: Create each operator instance in the neural network in turn.
  2. Compile and execute all operators in the neural network in a topological order: Process each BaseOp in the neural network following the three steps below.
    1. Compile the BaseOp: Invoke the cnmlCompileBaseOp() interface to compile the BaseOp.
    2. Allocate MLU memory and copy input data from the host memory to the MLU memory: Allocate MLU memory for input/output tensor of the BaseOp, and copy input data of the BaseOp from the host memory to the MLU memory.
    3. Perform forward computation for the BaseOp: Invoke the Forward() method of the BaseOp, and push a synchronization task into the created cnrtQueue instance by invoking the cnrtSyncQueue() interface.
  3. Complete the neural network inference and release resources: Inference is complete once all synchronization tasks pushed into the cnrtQueue are done. Copy the results of the neural network inference to the host memory, and then release resources.

In the second step above, each CNML operator is executed as soon as it has been compiled. Other orderings are also possible, as long as each operator is compiled before it is executed; for example, compile all the operators in a neural network first, then execute them in topological order.

Working with CNML in the symbolic mode

  1. Create operator instances in a neural network and add them into a FusionOp: Create each operator instance in the neural network in turn. Then, put all BaseOps in the neural network into a FusionOp by invoking the cnmlFuseOp() interface. Set the input/output cnmlTensors of FusionOp.
  2. Compile and execute the FusionOp in the neural network: Process the FusionOp following the three steps below.
    1. Compile the FusionOp: Compile the FusionOp by invoking the cnmlCompileFusionOp() interface.
    2. Allocate MLU memory and copy input data from the host memory to the MLU memory: Allocate MLU memory for input/output tensor of the FusionOp, and copy input data of the FusionOp from the host memory to the MLU memory.
    3. Perform forward computation for the FusionOp: Invoke the Forward() method of the FusionOp, and push the synchronization task into the created cnrtQueue instance by invoking the cnrtSyncQueue() interface.
  3. Complete the neural network inference and release resources: Inference is complete once all synchronization tasks pushed into the cnrtQueue are done. Copy the results of the neural network inference to the host memory, and then release resources.

Examples of working with CNML

In this section, we construct a simple neural network which consists of "convolution + activation". Based on this demo network, we give pseudocode for performing network inference in the two modes.

Example of working with CNML in imperative mode

/* Create all operator instances in the neural network.
* The conv_in_data, conv_filter_data, conv_bias_data and act_out_data are host memory pointers.
*/
cnmlCreateCpuTensor(&conv_in_cpu_tensor, CNML_TENSOR, ...)
/* The conv_filter_cpu_tensor(CNML_FILTER), conv_bias_cpu_tensor(CNML_CONST)
* and act_out_cpu_tensor(CNML_TENSOR) are also created.
*/
cnmlCreateTensor(&conv_in_tensor, CNML_TENSOR, ...)
/* The conv_filter_tensor(CNML_FILTER), conv_bias_tensor(CNML_CONST),
* conv_out_tensor(CNML_TENSOR) and act_out_tensor(CNML_TENSOR) are also created.
*/
cnmlBindConstData (conv_filter_tensor, conv_filter_cpu_tensor, conv_filter_data)
cnmlBindConstData (conv_bias_tensor, conv_bias_cpu_tensor, conv_bias_data)
cnmlCreateConvParam(&convParam, stride, dilation, ...)
cnmlCreateConvOp(&conv_op, convParam, ...)
cnmlActiveFunction_t actFuncParam = CNML_ACTIVE_SIGMOID
cnmlCreateActiveOp(&act_op, actFuncParam, ...)
cnrtCreateQueue(&queue)
/* Compile and execute all the operators in the neural network. */
cnmlCompileBaseOp(conv_op, ...)
/* The conv_in, conv_out are MLU memory pointers. */
cnmlMemcpyTensorToDevice(conv_in_cpu_tensor, conv_in_data, conv_in_tensor, conv_in)
cnrtInvokeFuncParam_t invoke1(affinity, ...)
cnmlComputeConvOpForward(conv_op, conv_in, conv_out, &invoke1, queue, ...)
cnrtSyncQueue(queue)
cnmlCompileBaseOp(act_op, ...)
cnrtInvokeFuncParam_t invoke2(affinity, ...)
/* The act_out is MLU memory pointer. */
cnmlComputeActiveOpForward(act_op, conv_out, act_out, &invoke2, queue,  ...)
cnrtSyncQueue(queue)
/* Finish the computation and release computing resources. */
cnmlMemcpyTensorToHost(act_out_tensor, act_out, act_out_cpu_tensor, act_out_data)
free data and resource ...

Example of working with CNML in symbolic mode

/* Create operator instances of the neural network, and add them into the same FusionOp.
* The conv_in_data, conv_filter_data, conv_bias_data and act_out_data are host memory pointers.
*/
cnmlCreateCpuTensor(&conv_in_cpu_tensor, CNML_TENSOR, ...)
/* The conv_filter_cpu_tensor(CNML_FILTER), conv_bias_cpu_tensor(CNML_CONST)
* and act_out_cpu_tensor(CNML_TENSOR) are also created.
*/
cnmlCreateTensor(&conv_in_tensor, CNML_TENSOR, ...)
/* The conv_filter_tensor(CNML_FILTER), conv_bias_tensor(CNML_CONST),
* conv_out_tensor(CNML_TENSOR) and act_out_tensor(CNML_TENSOR) are also created.
*/
cnmlBindConstData (conv_filter_tensor, conv_filter_cpu_tensor, conv_filter_data)
cnmlBindConstData (conv_bias_tensor, conv_bias_cpu_tensor, conv_bias_data)
cnmlCreateConvParam(&convParam, stride, dilation, ...)
cnmlCreateConvOp(&conv_op, convParam, ...)
cnmlActiveFunction_t actFuncParam = CNML_ACTIVE_SIGMOID
cnmlCreateActiveOp(&act_op, actFuncParam, ...)
cnmlCreateFusionOp(&fusion_op)
cnmlFuseOp(conv_op, fusion_op)
cnmlFuseOp(act_op, fusion_op)
cnmlSetFusionIO(fusion_op, ...)
cnrtCreateQueue(&queue)
/* Compile and execute the fusion_op in the neural network. */
cnmlCompileFusionOp(fusion_op)
/* The conv_in and act_out are MLU memory pointers. */
cnmlMemcpyTensorToDevice(conv_in_cpu_tensor, conv_in_data, conv_in_tensor, conv_in)
cnrtInvokeFuncParam_t invoke(affinity, ...)
cnmlComputeFusionOpForward(fusion_op, &conv_in, 1, &act_out, 1, &invoke, queue, ...)
cnrtSyncQueue(queue)
/* Finish the inferences and release computing resources. */
cnmlMemcpyTensorToHost(act_out_tensor, act_out, act_out_cpu_tensor, act_out_data)
free data and resource ...

Proposed approach of CNML/CNRT integration

Design and analysis

One design principle of integrating CNML/CNRT into MXNet is to let MXNet use MLU operators to run neural networks efficiently with symbolic programming.

For an existing MXNet operator, we use one or more CNML operators to compose an MLU operator whose function is the same as that of the MXNet operator. A user can set the device type of MXNet operators to "mlu" so that they run on the MLU.

We need to integrate the creation, compilation and execution of MLU operators into MXNet's existing symbolic-programming inference flow.

The steps for running inference with the MXNet Python symbolic programming APIs are as follows:

  1. Build a neural network and get the symbol object for the network.
  2. Call the SimpleBind() or Bind() method of the symbol object and get the executor object for the symbol object.
  3. Fill the input data and weight data into arg_dict of the executor object.
  4. Invoke the Forward() method of the executor object and run inference.

We don't want to change the above workflow while integrating CNML/CNRT into MXNet, so we propose to integrate the creation, compilation and execution of MLU operators as follows:

  • MLU operator creation: MLU operator creation must be finished before users fill data into arg_dict of the executor, so we put MLU operator creation into the SimpleBind() and Bind() methods of the symbol.
  • MLU operator compilation: when CNML compiles an MLU operator, it needs to preprocess data such as the weight tensor, so we put MLU operator compilation into the Forward() method of the executor, which is invoked after filling data into arg_dict of the executor.
  • MLU operator execution: since MLU operator execution must begin after compilation is done, we put MLU operator execution after MLU operator compilation in the Forward() method of the executor.

We make slight modifications to the Operator and Executor modules. We also update some data structures - such as NDArray, TBlob and the mShadow Tensor - so that they can handle MLU memory. Lastly, we add a new value "mlu" to the device type attribute in the python::context module in MXNet. An overview of the MXNet modules that need to be modified in the integration design is shown below.

Fig 2. Overview of MLU modified modules

Next, we introduce the details of the integration design. For ease of understanding, we introduce the CNML/CNRT integration for the imperative mode first, and then for the symbolic mode.

CNML/CNRT integration for imperative mode

Extend MXNet context for MLU

A new value “mlu” will be introduced to the attribute of the device type in the context module in MXNet, so that users can set the device type of nodes in the network to “mlu”.

Extend MXNet operators and OpExecutor for MLU

An operator executor contains the data, status flags and methods needed to execute an MXNet operator. MLU operator compilation and execution depend on results produced during MLU operator creation. We therefore add a new abstract class MluOpExecutor, which inherits from OpExecutor, to adapt the OpExecutor for the creation, compilation and execution of MLU operators without affecting the original execution flow of the OpExecutor. The class MluOpExecutor has two new fields, internal_ops and inter_data. The internal_ops field saves the BaseOps that constitute an MLU operator, and inter_data saves the intermediate results produced by the BaseOps in internal_ops. In addition, MluOpExecutor adds a new virtual method Init() for MLU operator creation. Furthermore, we add two new classes, MluFComputeExecutor and MluStatefulComputeExecutor, which inherit from the abstract class MluOpExecutor. Before introducing the two new MLU operator executors, we first introduce the implementation of the MLU operator.
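
A minimal C++ sketch of the proposed abstract class is given below; the concrete member types (cnmlBaseOp_t for the BaseOps, NDArray for the intermediate results) and the exact OpExecutor base interface are simplifying assumptions made only for illustration.

// Sketch only: the real OpExecutor lives in src/executor/exec_pass.h; the member
// types here are assumptions.
class MluOpExecutor : public OpExecutor {
 public:
  // New virtual method: create the CNML BaseOps (and intermediate buffers)
  // that constitute this MLU operator. Invoked once in_array/out_array are set.
  virtual void Init() = 0;

 protected:
  std::vector<cnmlBaseOp_t> internal_ops;  // BaseOps constituting the MLU operator
  std::vector<NDArray> inter_data;         // intermediate results of the BaseOps
};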

Extend MXNet operators for MLU

An MXNet operator is currently registered in one of two ways: a stateless MXNet operator is registered with NNVM, and a stateful MXNet operator is defined in the legacy way.

For an existing stateless MXNet operator, our implementation of the corresponding MLU operator is as follows:

  • MLU operator creation: we add a new method Init() to the MLU operator to perform MLU operator creation. The Init() method creates all the BaseOps which constitute an MLU operator instance, as well as the data structures for the intermediate results. The internal_ops and inter_data fields of the MluFComputeExecutor act as output parameters of the Init() method and are filled with the created BaseOps and data structures, respectively (a sketch is given after this list).
  • MLU operator compilation: during MLU operator compilation, each BaseOp in internal_ops is compiled. The process is the same for all MLU operators, so the MluFComputeExecutor is responsible for stateless MLU operator compilation uniformly.
  • MLU operator execution: since the Forward() method of a CNML operator needs the cnrtQueue to implement asynchronous execution, the Forward() method of the MLU operator obtains the cnrtQueue from the mShadow Stream (the integration of the cnrtQueue into the mShadow Stream is introduced later) for the BaseOps in internal_ops. For now, we only implement a Forward() method of the MLU operator that handles the TBlob data structure. The Forward() method receives internal_ops from the MluFComputeExecutor and invokes the Forward() methods of the BaseOps in internal_ops in topological order, pushing the computational tasks into the cnrtQueue instance obtained from the mShadow Stream.
  • Other methods of the MLU operator: for an MLU operator, its InferType() method, InferShape() method and so on are the same as those of the corresponding existing MXNet operator.
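
Below is a minimal sketch of what the Init()/Forward() pair of a stateless MLU operator might look like. The function names (MLUConvolutionInit/MLUConvolutionForward), the extra parameters carrying internal_ops and inter_data, and the cnmlBaseOp_t type are assumptions; the first five parameters of the Forward() follow MXNet's standard FCompute signature.

// Sketch only: hypothetical entry points of a stateless MLU convolution operator.
void MLUConvolutionInit(const nnvm::NodeAttrs& attrs,
                        const std::vector<TBlob>& inputs,
                        const std::vector<TBlob>& outputs,
                        std::vector<cnmlBaseOp_t>* internal_ops,  // filled for the executor
                        std::vector<TBlob>* inter_data) {         // intermediate results
  // Create cnmlTensor descriptors from the shapes/dtypes of inputs/outputs,
  // bind the CNML_FILTER/CNML_CONST data, create the BaseOp(s), e.g.
  //   cnmlCreateConvOp(&conv_op, conv_param, ...);
  // and record them: internal_ops->push_back(conv_op);
}

void MLUConvolutionForward(const nnvm::NodeAttrs& attrs,
                           const OpContext& ctx,
                           const std::vector<TBlob>& inputs,
                           const std::vector<OpReqType>& req,
                           const std::vector<TBlob>& outputs,
                           const std::vector<cnmlBaseOp_t>& internal_ops,
                           const std::vector<TBlob>& inter_data) {
  // Obtain the cnrtQueue from the mShadow Stream<mlu> in ctx.run_ctx, then invoke
  // the Forward() of every compiled BaseOp in topological order, e.g.
  //   cnmlComputeConvOpForward(conv_op, in_ptr, out_ptr, ..., queue, ...);
}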

For an existing stateful MXNet operator, our implementation of the corresponding MLU operator is similar to that of a stateless MLU operator. The main difference is as follows. Since a stateful MXNet operator is a class object, we can add the two new fields internal_ops and inter_data directly to the class of the stateful MXNet operator. In this case, the Init() method of the stateful MLU operator saves the BaseOps into the operator's own internal_ops and the intermediate-result data structures into its own inter_data, so the Forward() method of the MLU operator can use the two fields directly instead of getting them from its operator executor. However, since the MluStatefulComputeExecutor needs internal_ops during MLU operator compilation, internal_ops is also returned by the Init() method of the MLU operator and assigned to the internal_ops of the MluStatefulComputeExecutor.

OpExecutors for stateless MLU operator and stateful MLU operator

After introducing the implementation of the stateless and stateful MLU operators, we now introduce the extension of the OpExecutor for MLU operators. At present, both kinds of MLU operators implement Forward() methods that only handle the TBlob data structure, so we add two new operator executors, MluFComputeExecutor and MluStatefulComputeExecutor, for the two kinds of MLU operators, respectively.

The MluFComputeExecutor is the operator executor for the present type of Forward() method of the stateless MLU operator. For the creation, compilation and execution of the stateless MLU operator, the implementation of the MluFComputeExecutor is as follows:

  • MLU operator creation: the Init() method of the MluFComputeExecutor invokes the Init() method of the MLU operator to complete MLU operator creation and update the internal_ops and inter_data of the MluFComputeExecutor.
  • MLU operator compilation: since MLU operator compilation is performed during neural network inference, we put the compilation in the Run() method of the MluFComputeExecutor. The compilation must be done before the Run() method invokes the Forward() method of the MLU operator. During MLU operator compilation, the MluFComputeExecutor compiles each BaseOp in its internal_ops. After the compilation is finished, the Run() method sets the field compile_ to true so that the MLU operator is not compiled again on the next invocation (see the sketch after this list).
  • MLU operator execution: in the Run() method of the MluFComputeExecutor, the Forward() method of the MLU operator is invoked after MLU operator compilation is finished. When the Run() method invokes the Forward() method of the MLU operator, it passes internal_ops and inter_data as parameters.
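
A simplified sketch of the compile-once-then-execute flow in Run() is given below; the fcompute_ member and its argument list are assumptions, and the CNML call arguments are elided as in the examples above.

// Sketch only: illustrative Run() of MluFComputeExecutor.
void MluFComputeExecutor::Run(RunContext rctx, bool is_gpu) {
  op_ctx.run_ctx = rctx;
  if (!compile_) {
    // JIT-compile every BaseOp created by Init() exactly once.
    for (cnmlBaseOp_t op : internal_ops) {
      cnmlCompileBaseOp(op, ...);
    }
    compile_ = true;
  }
  // Execute the MLU operator: its Forward() pushes computation tasks into the
  // cnrtQueue held by the Stream<mlu> of this RunContext.
  fcompute_(attrs_, op_ctx, in_data_, req_, out_data_, internal_ops, inter_data);
}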

The MluStatefulComputeExecutor is the operator executor for the present type of Forward() method of the stateful MLU operator; its implementation is similar to that of the MluFComputeExecutor. The main difference comes from the fact that the stateless and stateful MLU operators save internal_ops and inter_data in different places. After the Init() method of the MluStatefulComputeExecutor invokes the Init() method of the MLU operator, it only updates the internal_ops of the MluStatefulComputeExecutor for MLU operator compilation. Moreover, when the Forward() method of the MLU operator is invoked, it does not need to receive internal_ops and inter_data from the Run() method of the MluStatefulComputeExecutor.

Integrate the mShadow Stream with the cnrtQueue

In the introduction to the implementation of the MLU operator, we mention that the cnrtQueue will be integrated into the mShadow Stream. Next, we will introduce the design of the cnrtQueue integration.

We specialize the mShadow Stream template for the MLU and implement the class mShadow Stream<mlu>. The Stream<mlu> has a new field queue_, which is a pointer to a cnrtQueue instance, and provides a Wait() method which pushes a synchronization task into the cnrtQueue instance pointed to by queue_. In addition, we implement interfaces such as SetDevice(), NewStream(), DeleteStream() and so on for the mShadow Stream<mlu>.
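
A minimal sketch of the specialization follows; the mlu device tag (analogous to mshadow::cpu/gpu) and the cnrtQueue_t type name are assumptions, and the remaining Stream members are omitted.

// Sketch only: the proposed Stream specialization for the MLU device tag.
namespace mshadow {
template<>
struct Stream<mlu> {
  cnrtQueue_t queue_ = nullptr;  // cnrtQueue used by MLU operators on this stream

  // Push a synchronization task into queue_ and block the calling host thread
  // until all tasks already in the queue are finished.
  inline void Wait() {
    cnrtSyncQueue(queue_);
  }
};
}  // namespace mshadow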

As for usage of the mShadow Stream<mlu>, the NaiveEngine creates and destroys the Stream<mlu> via the NewStream() and DeleteStream() interfaces and puts it into the RunContext for the MLU operator.

Adapt GraphExecutor for MluOpExecutor

The SimpleBind()/Bind() method of the MXNet front-end Symbol creates and initializes a GraphExecutor instance. The Forward() method of the front-end Executor invokes the Forward() method of the GraphExecutor to perform network inference. In the backend, the OpExecutor instances are created and initialized in the Init() method of the GraphExecutor, and in the Forward() method of the GraphExecutor the Run() method of each OpExecutor is pushed into the MXNet Engine as an execution task. Thus, the GraphExecutor plays an important role in the compatibility between the front-end and the backend. We adapt the Init() and Forward() methods of the GraphExecutor for the MluOpExecutor.

Adapt GraphExecutor::Init for MluOpExecutor

The main process of the Init() method of the GraphExecutor is as follows:

  1. Create and initialize a computation graph: create a computation graph, and infer the attributes of the computation graph, such as context, dtype, shape, storage_type and so on.
  2. Create MXNet operator executors: the AttachOpExecs() interface creates a suitable OpExecutor instance for each MXNet operator node in the computation graph, and puts all the OpExecutors into the op_execs attribute of the computation graph in topological order.
  3. Allocate global resources for the OpExecutors: in the AttachOpResource() interface, if the OpExecutor in the op_execs attribute needs global resources, the ResourceManager will allocate the needed global resources for the OpExecutor.
  4. Initialize the field op_node_ of the GraphExecutor: the InitCachedOps() method of the GraphExecutor updates fields such as in_array, out_array and so on of the OpExecutors in op_execs, and then fields such as opr_name, exec, cached_opr and so on of op_node_.
  5. Segment the computation graph: the InitOpSegs() method of the GraphExecutor creates and initializes the segment operators for the possible segments in the computation graph.

Analyzing the process above, the operator executor for the MLU operator is best created in the AttachOpExecs() interface. When AttachOpExecs() creates an operator executor instance, if the dispatch mode of the operator node is not "kFComputeEx" and the device type of the operator node is "mlu", it creates the suitable MLU operator executor instance. Moreover, MLU operator creation does not need the actual input data, only the shapes and data types of the inputs/outputs. So after the InitCachedOps() method of the GraphExecutor updates the in_array and out_array of the OpExecutors in op_execs, it invokes the Init() method of each OpExecutor in op_execs to complete the MLU operator creation in the computation graph.
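
The snippet below sketches the extra branch AttachOpExecs() would take; the kMLU device-type value and the executor constructors are assumptions made for illustration, and the constructor argument lists are elided.

// Sketch only: inside AttachOpExecs(), when creating the executor for node nid.
if (dispatch_modes[nid] != DispatchMode::kFComputeEx &&
    vctx[nid].dev_type == Context::kMLU) {   // kMLU: the proposed new device type
  if (fcreate_op_state.count(op)) {
    ret[nid] = std::make_shared<MluStatefulComputeExecutor>(...);  // stateful MLU operator
  } else {
    ret[nid] = std::make_shared<MluFComputeExecutor>(...);         // stateless MLU operator
  }
}
// Later, after InitCachedOps() has filled in_array/out_array of every OpExecutor,
// GraphExecutor calls the Init() of each MluOpExecutor to create the CNML BaseOps.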

Adapt GraphExecutor::Forward for MluOpExecutor

The flow of the Forward() method of the GraphExecutor does not need to change. The Forward() method of the GraphExecutor pushes the cached_opr of each OpNode in the computation graph into the MXNet Engine in topological order to complete MLU operator compilation and execution.

Extend NDArray, TBlob and mShadow Tensor for MLU

We need to adapt the MXNet data structures NDArray, TBlob and mShadow Tensor so that MXNet can handle MLU memory through them.

To describe data residing on the MLU, we add some new fields to NDArray, TBlob and the mShadow Tensor as follows:

NDArray
class NDArray {
  public:
    ......
  private:
    struct Chunk {
      cnmlTensor_t cnml_tensor; // new field, a pointer to the cnmlTensor.
      void* mptr; // new field, a pointer to the MLU memory address of the NDArray data.
      ......
    };
    ......
};
TBlob
class TBlob {
  public:
    // new field, a pointer to the pointer to cnmlTensor.
    cnmlTensor_t* cnml_tensor_;
    // new field, a pointer to the cnmlTensorType_t.
    std::shared_ptr<cnmlTensorType_t> cnml_tensor_type_; 
    // new field, a pointer to the pointer to the MLU memory address of the TBlob data.
    void** mptr;
    ......
  private:
    ......
};
mShadow Tensor
template<typename Device, int dimension, typename DType ......>
struct Tensor : public TRValue<Tensor<Device, dimension, DType>, ......> {
  public:
    // new field, a pointer to the pointer to cnmlTensor.
    cnmlTensor_t* cnml_tensor_;
    // new field, a pointer to the cnmlTensorType_t.
    std::shared_ptr<cnmlTensorType_t> cnml_tensor_type_;
    // new field, a pointer to the pointer to the MLU memory address of the Tensor data.
    void** mptr;
    ......
};

Besides the new fields above, we also add some new interfaces for the mShadow Tensor with device type "mlu" to the mShadow module, such as InitCnmlTensor(), GetMptr(), AllocSpace(), FreeSpace(), Copy() and so on. The mShadow Tensor uses these interfaces to handle its MLU memory and to copy its data between the host and the MLU. Meanwhile, NDArray implements its cross-device copy between the host and the MLU via these new interfaces in the mShadow module.
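
The declarations below sketch what these interfaces might look like; the template parameters and argument lists are assumptions modelled on the existing cpu/gpu versions in mShadow.

// Sketch only: assumed declarations of the new MLU-specific mShadow interfaces.
namespace mshadow {
// Create the cnmlTensor descriptor backing an mlu Tensor.
template<int dim, typename DType>
void InitCnmlTensor(Tensor<mlu, dim, DType>* dst, cnmlTensorType_t type);

// Return the MLU memory pointer held by the tensor (valid after compilation).
template<int dim, typename DType>
void* GetMptr(const Tensor<mlu, dim, DType>& src);

// Allocate / free MLU memory for a tensor once its operator has been compiled.
template<int dim, typename DType>
void AllocSpace(Tensor<mlu, dim, DType>* obj);
template<int dim, typename DType>
void FreeSpace(Tensor<mlu, dim, DType>* obj);

// Cross-device copies between the host and the MLU.
template<int dim, typename DType>
void Copy(Tensor<mlu, dim, DType> dst, const Tensor<cpu, dim, DType>& src, Stream<mlu>* stream);
template<int dim, typename DType>
void Copy(Tensor<cpu, dim, DType> dst, const Tensor<mlu, dim, DType>& src, Stream<mlu>* stream);
}  // namespace mshadow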

In the present design of CNML, we cannot allocate MLU memory for the tensors of an MLU operator until the operator is compiled. However, MLU operator compilation happens during the execution of the Forward() method of the GraphExecutor, so the input data of the MLU operator needs to be kept temporarily in a host buffer until the operator is compiled. We use storage::NaiveStorageManager<storage::CPUDeviceStorage> to manage the host buffer of an NDArray whose device type is "mlu".

Integrate the MXNet LibraryInitializer with cnmlInit

To initialize/exit CNML during MXNet initialization/exit, the following work is needed:

  1. We add a new class scopedCNMLLibraryInitializer to the file initialize.cc. The constructor of the scopedCNMLLibraryInitializer invokes the cnmlInit() interface to initialize CNML, and the destructor invokes the cnmlExit() interface to exit CNML (see the sketch after this list).
  2. We add a new field library_loader_, whose type is scopedCNMLLibraryInitializer, to the class LibraryInitializer. The constructor of the LibraryInitializer constructs library_loader_ to initialize CNML, and its destructor destructs library_loader_ to exit CNML.
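
A minimal sketch is given below; the cnmlInit()/cnmlExit() argument lists are elided and assumed to follow the CNML documentation.

// Sketch only: RAII wrapper tying the CNML lifetime to MXNet's.
class scopedCNMLLibraryInitializer {
 public:
  scopedCNMLLibraryInitializer() {
    cnmlInit(...);   // initialize CNML when MXNet is initialized
  }
  ~scopedCNMLLibraryInitializer() {
    cnmlExit();      // exit CNML when MXNet exits
  }
};

// Inside class LibraryInitializer (initialize.cc):
//   scopedCNMLLibraryInitializer library_loader_;  // constructed/destructed with MXNet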

CNML/CNRT integration for symbolic mode

When we use MXNet to perform neural network inference, it is possible that most of the nodes in the network have device type "mlu" while the rest have the host device type. In this case, if we want to perform network inference in the symbolic mode, we cannot put all the MXNet operators into one FusionOp. Hence we need to divide the MLU operators of an initialized computation graph into several fusion segments before the fusion compilation. Each fusion segment is a computation subgraph consisting of MLU operators that can be put into the same FusionOp.

To solve the problem above, we add a new class MluFuseGraphExecutor, which inherits from the class GraphExecutor and acts as the computation graph executor of the network in the symbolic mode. Instead of compiling and executing each MLU operator in a fusion segment individually, the MluFuseGraphExecutor compiles and executes one FusionOp per fusion segment. Thus all of the integration work for the imperative mode is also needed for the symbolic mode, except the parts concerning per-operator MLU compilation and execution.

Extend GraphExecutor for symbolic mode

In symbolic mode, the SimpleBind()/Bind() method of the front-end Symbol invokes the MluFuseSimpleBind()/MluFuseBind() method of the MluFuseGraphExecutor instead of the SimpleBind()/Bind() method of the GraphExecutor. The MluFuseSimpleBind()/MluFuseBind() initializes the MluFuseGraphExecutor via its own Init() method, which does the following. Firstly, the Init() method of the parent class GraphExecutor is invoked to complete the computation graph initialization, MLU operator and operator executor creation and so on. Then, the initialized computation graph is checked and the possible fusion segments are generated. Lastly, if necessary, the Init() method of the MluFuseGraphExecutor creates a FusionOp for each fusion segment and sets the I/O of the FusionOp. In the MluFuseGraphExecutor initialization, generating fusion segments from the computation graph is an important task. Before we introduce the algorithm for it, we introduce two important concepts about a fusion segment, namely its input operators and output operators.

  • Input operator of a fusion segment: an MLU operator in a fusion segment is an input operator of the segment if and only if it has input data and part of that input data is not generated by an MLU operator in the segment.
  • Output operator of a fusion segment: an MLU operator in a fusion segment is an output operator of the segment if and only if it has output data and part of that output data is not consumed as input by an MLU operator in the segment.

With these two concepts, the present algorithm for generating fusion segments from a graph is as follows (a simplified sketch is given after the list):

  1. The Init() method of the parent class GraphExecutor generates an initialized graph whose nodes are sorted in topological order.
  2. The algorithm puts all the MLU operators (except the copy operator) of the initialized graph into a preliminary set.
  3. By the MLU device id of the MLU operator, the algorithm divides the preliminary set into several subsets, so that the MLU operators in each subset all have the same MLU device id.
  4. The algorithm divides each subset into several fusion segments. Generating a fusion segment must satisfy the following conditions: a) for a fusion segment, the maximal topological index of its input operators is smaller than the minimal topological index of its output operators; b) subject to condition a), the algorithm puts MLU operators from the same subset into the same fusion segment as much as possible.
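
The following self-contained C++ sketch illustrates one possible greedy implementation of step 4 on a simplified node representation; the Node struct and the non-strict comparison (chosen so that a single-node segment is always valid) are assumptions, and the actual implementation may group operators more aggressively.

// Sketch only: greedy fusion-segment generation over a simplified graph.
#include <algorithm>
#include <climits>
#include <unordered_set>
#include <vector>

struct Node {                     // simplified stand-in for a graph node (topo index == position)
  bool is_mlu = false;            // device type of the node is "mlu" (copy operators excluded)
  int dev_id = 0;                 // MLU device id
  std::vector<int> inputs;        // topological indices of producer nodes
  std::vector<int> consumers;     // topological indices of consumer nodes
};

// Condition a): the largest topological index among the segment's input operators must
// not exceed the smallest topological index among its output operators.
bool SatisfiesConditionA(const std::vector<Node>& graph,
                         const std::unordered_set<int>& segment) {
  int max_input_op = INT_MIN, min_output_op = INT_MAX;
  for (int nid : segment) {
    const Node& n = graph[nid];
    for (int p : n.inputs)                       // input operator: some producer is outside
      if (!segment.count(p)) { max_input_op = std::max(max_input_op, nid); break; }
    bool is_output = n.consumers.empty();        // graph outputs count as external consumers
    for (int c : n.consumers)                    // output operator: some consumer is outside
      if (!segment.count(c)) { is_output = true; break; }
    if (is_output) min_output_op = std::min(min_output_op, nid);
  }
  return max_input_op <= min_output_op;
}

// Step 4 (simplified): walk the topologically sorted MLU nodes of one device id and grow
// the current segment greedily, closing it whenever adding a node would break condition a).
std::vector<std::vector<int>> GenerateFusionSegments(const std::vector<Node>& graph,
                                                     int dev_id) {
  std::vector<std::vector<int>> segments;
  std::unordered_set<int> current;
  for (int nid = 0; nid < static_cast<int>(graph.size()); ++nid) {
    if (!graph[nid].is_mlu || graph[nid].dev_id != dev_id) continue;
    current.insert(nid);
    if (!SatisfiesConditionA(graph, current)) {
      current.erase(nid);                                        // close the valid segment
      segments.emplace_back(current.begin(), current.end());
      current = {nid};                                           // start a new segment
    }
  }
  if (!current.empty()) segments.emplace_back(current.begin(), current.end());
  return segments;
}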

A front-end Executor object of the graph is obtained by invoking the SimpleBind()/Bind() method of the front-end Symbol. The Forward() method of the front-end Executor performs the network inference by invoking the MluRunOps() method of the MluFuseGraphExecutor. The MluRunOps() method compiles every fusion segment in the graph, and then completes the network inference by processing every node of the initialized graph in topological order as follows (a simplified sketch of this dispatch loop follows the list):

  • Case 1: if the node is not in any fusion segment, the MluRunOps() method processes the node in the same way as the RunOps() method of the GraphExecutor does.
  • Case 2: if the node is in a fusion segment and is the input operator with the largest topological index among the segment's input operators, the MluRunOps() method invokes the Forward() method of the FusionOp related to that fusion segment.
  • Case 3: if the node is in a fusion segment but does not satisfy the condition in Case 2, the MluRunOps() method skips it and does nothing.
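
The loop below sketches this dispatch; segment_of_node_, last_input_op_of_segment_ and the two helper calls are hypothetical names used only for illustration.

// Sketch only: per-node dispatch inside MluRunOps().
for (size_t nid = topo_start; nid < topo_end; ++nid) {
  const int seg = segment_of_node_[nid];       // -1 if the node is in no fusion segment
  if (seg < 0) {
    RunSingleNode(nid);                        // Case 1: same handling as GraphExecutor::RunOps()
  } else if (nid == last_input_op_of_segment_[seg]) {
    RunFusionOp(seg);                          // Case 2: all external inputs of the segment
                                               // are ready, so execute the whole FusionOp
  }
  // Case 3: any other node inside a fusion segment is skipped; its computation is
  // covered by the FusionOp executed in Case 2.
}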

After the CNML/CNRT integration design is implemented, MXNet can perform neural network inference on the MLU in both the imperative mode and the symbolic mode. In this pull request, the MXNet integration with CNML/CNRT only supports symbolic programming and the MXNet NaiveEngine for network inference on the MLU.

Future works

  • Implement more MLU operators corresponding to the existing MXNet operators.
  • Adapt the MXNet quantization for the MLU.
  • Adapt MXNet to support the generation of MLU offline models for better deployment of deep learning models.
  • Support imperative programming for network inference on the MLU.
  • Support more kinds of MXNet Engines.
