PR: https://github.com/apache/incubator-mxnet/pull/15210

How Custom Operators Work

MXNet allows users to create custom operators if the existing NDArray operators cannot meet their needs. However, profiling custom operators is currently not well supported.

To understand the issues, we first need to understand how custom operators work. CustomOperator is a singleton class with its own task queue and worker threads. When a custom operator is executed by the engine, the engine thread calls Push() in CustomOperator. Push() then creates the actual workload (a callback to forward() or backward(), see below) and pushes it onto CustomOperator's task queue, where one of its worker threads picks the task up and runs it.
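
For intuition, the mechanism roughly follows the classic producer/consumer pattern sketched below in Python (the real implementation is the C++ CustomOperator class; the names here are only illustrative):

import queue
import threading

class CustomOperatorSketch(object):
    """Illustrative stand-in for the C++ CustomOperator singleton."""

    def __init__(self):
        self._tasks = queue.Queue()
        # CustomOperator owns its worker threads (one shown here for simplicity).
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def push(self, func):
        # Called from an engine thread: enqueue the callback (forward()/backward())
        # instead of running it on the engine thread.
        self._tasks.put(func)

    def _run(self):
        # Worker thread: pick tasks up from the queue and run them.
        while True:
            self._tasks.get()()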

When users create a new custom operator, they need to define both the forward() and backward() functions in Python. When the operator is executed, one of these two functions is called. Those two functions consist of two kinds of code: 1) Python and NumPy code (which I call pure python code), which runs on CustomOperator's own worker thread; and 2) code that calls NDArray operators (which I call sub-operators), which is pushed to the engine and runs asynchronously with respect to CustomOperator's worker threads.
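
As a minimal illustration (the operator below is made up for this example), a single forward() typically mixes both kinds of code:

import numpy as np
import mxnet as mx
from mxnet import nd

class MyOp(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # 1) Pure python code: runs on CustomOperator's worker thread.
        scale = np.float32(2.0)
        # 2) Sub-operators: NDArray calls that are pushed to the engine and
        #    run asynchronously with respect to the worker thread.
        y = nd.sqrt(in_data[0]) * scale
        self.assign(out_data[0], req[0], y)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], out_grad[0])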


Also, in CustomOperator's Push(), a special callback named “CustomOperator” (now renamed to “Dummy_Wait”; refer to the screenshot above, and we will use the new name below) is pushed to the engine. The idea is that “Dummy_Wait” depends on the custom operator and therefore gets executed last, so that the custom operator event spans the execution of both the pure python code and the sub-operators.

Issues and Motivation

With the above said, there are some issues with the current profiler:

* The start timestamp of a custom operator event is correct, but the end timestamp waits for “Dummy_Wait”, which itself waits for a number of variable deletions. As a result, custom operator events appear much longer than the actual computation. Also, because of parallelism, users cannot tell whether, at a given point within a custom operator event, we are executing pure python code, executing sub-operators, or simply waiting for dependencies to clear.
* Because a custom operator event does not end until “Dummy_Wait” starts, the former does not wrap around the latter. This violates the TraceCompass format and messes up the chrome://tracing timeline (shown below). Each event has a begin and an end timestamp; however, Chrome treats such an event as two separate ones, reporting that one has no “BEGIN” and the other has no “END” (refer to the second screenshot below).
* All the custom operator events are called “Custom”. If users have more than one custom operator, they are not able to distinguish them.

[screenshot: chrome://tracing timeline showing custom operator events split into mismatched BEGIN/END pairs]

To avoid confusion, those issues need to be fixed.

New Design

Main changes

Regarding custom operators, users care about the performance of both the pure python code and the sub-operators. So, in our enhanced custom operator profiling, we should dissect custom operator calls into fine-grained events for both categories. Specifically, we will create a new domain called “Custom Operators”. There, we will have: 1) events that represent the execution of the pure python code; 2) events that represent the execution of the sub-operators; and 3) a different namespace prefix on the events of each custom operator, so that different custom operators can be told apart.

This way, all the events will have accurate begin and end timestamps; there will also be no confusion about wait time.

To achieve 1), we will create a new singleton class called CustomOpProfiler in src/profiler. This singleton class has the APIs OnCustomBegin() and OnCustomEnd(). We will call them right before and after the callback to forward() or backward() (as defined in Python) in custom-inl.h to profile the pure python code. We also put those events in the new “Custom Operator” domain.
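
Conceptually, the two hooks bracket the pure python part of each call; the Python below is only a sketch of that behavior (the real CustomOpProfiler is a C++ singleton in src/profiler):

import time
import threading

class CustomOpProfilerSketch(object):
    """Illustrative stand-in for the C++ CustomOpProfiler singleton."""

    def __init__(self):
        self._begin = {}  # worker thread id -> (custom op name, start time)

    def on_custom_begin(self, op_name):
        # Called right before the forward()/backward() callback:
        # opens the pure-python event for this custom operator call.
        self._begin[threading.get_ident()] = (op_name, time.time())

    def on_custom_end(self):
        # Called right after the callback returns: closes the pure-python event.
        op_name, start = self._begin.pop(threading.get_ident())
        print('%s pure python code took %.3f ms' % (op_name, (time.time() - start) * 1e3))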

For 2), we already have operator profiling.

For 3), we will also use CustomOpProfiler. We will create a mapping from CustomOperator worker thread_id to the registered name of that custom operator. Then, in PushAsync() in threaded_engine.cc, we will call GenerateDisplayName() in CustomOpProfiler to see whether we are in a CustomOperator worker thread. If so, the operator being pushed to the engine is a sub-operator of a custom operator, and we create a display name by prefixing this operator's name with the custom operator's name, something like “MyOp::_plus_scalar”. Furthermore, in class ProfileOperator in profiler.h, we check the display name of the operator: if the name contains “::”, we profile it within the “Custom Operator” domain.
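
The following Python sketch mirrors that logic (the actual code is C++; the helper for registering the worker thread is hypothetical, only GenerateDisplayName() is named above):

import threading

# Conceptual sketch of the backend logic (the real implementation is C++).
_thread_to_custom_op = {}   # CustomOperator worker thread id -> registered custom op name

def register_custom_thread(custom_op_name):
    # Hypothetical helper: record which custom operator the current
    # CustomOperator worker thread is executing.
    _thread_to_custom_op[threading.get_ident()] = custom_op_name

def generate_display_name(op_name):
    # Mirrors GenerateDisplayName(): if the current thread is a CustomOperator
    # worker thread, the operator being pushed is a sub-operator, so prefix it
    # with the custom operator's name; otherwise return the name unchanged.
    custom_name = _thread_to_custom_op.get(threading.get_ident())
    if custom_name is None:
        return op_name
    return '%s::%s' % (custom_name, op_name)

# e.g. inside MyOp's worker thread, "_plus_scalar" becomes "MyOp::_plus_scalar".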

More discussions

  • With this enhanced custom operator profiling, we also want to get rid of profiling "Custom" and “Dummy_Wait” entirely. This is done by adding a check in ProfileOperator in profiler.h.
  • Notice that because we are adding a call to GenerateDisplayName() in PushAsync(), we risk adding overhead to every operator call (we need to get the thread id, and the function takes a lock). In practice, because this function is short and has early-return checks, the overhead is small enough to be acceptable. On my machine (2017 MacBook Pro 13" i7), for regular operator calls the overhead is on average less than 1 microsecond (it shows up as 0), and for sub-operator calls it is always under 10 microseconds and averages under 5 microseconds. Compare this with the ~150 microseconds taken to execute NDArray plus scalar on a 100*100 matrix (see the timing sketch after this list); note that the relatively larger overhead only affects sub-operators of custom operators.
  • Currently CustomOperator has its own thread pool. In the future this may change so that custom operator calls use engine worker threads directly. Should that happen, the proposed mapping from thread_id to custom operator names would continue to work.
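
The numbers above are specific to my machine; a rough way to reproduce the ~150 microsecond baseline for NDArray plus scalar on a 100*100 matrix is:

import timeit
import mxnet as mx

a = mx.nd.zeros((100, 100))

def plus_scalar():
    # Force synchronization so we time the full operator execution.
    (a + 1).wait_to_read()

plus_scalar()  # warm up
avg = timeit.timeit(plus_scalar, number=1000) / 1000
print('average _plus_scalar latency: %.1f us' % (avg * 1e6))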

Limitations

The first limitation is that, with my current implementation, we distinguish regular operators from sub-operators in ProfileOperator (profiler.h) based solely on the name: if the name contains "::", we treat it as a sub-operator, and if the name is "Custom", we treat it as a custom operator. However, when users register a new custom operator, they could actually name it "Custom" or "XX::XX", in which case ProfileOperator would misclassify it. A solution is to add a check when users register new custom operators and reject "Custom" or any name containing "::" (as sketched below), but this risks forcing users to change existing models.
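
For illustration, such a registration-time check could be as simple as the sketch below (hypothetical; it is not part of this PR):

def validate_custom_op_name(op_name):
    # Hypothetical check at custom operator registration time: reject names
    # that would collide with the profiler's naming convention.
    if op_name == 'Custom' or '::' in op_name:
        raise ValueError('"%s" is a reserved custom operator name; please pick '
                         'a name that is not "Custom" and contains no "::"' % op_name)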


Alternatively, we could use a context manager to help distinguish regular operators from sub-operators. This would involve wrapping the frontend custom op forward() and backward() with a context manager and adding a bool "is_custom" to all the backend APIs along the call stack through which we push an operator to the engine. This method would be far more invasive than the proposed thread_id-mapping method; a rough sketch follows.
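
For reference, the frontend half of that alternative might look roughly like the sketch below; the backend half (threading the "is_custom" flag through every API down to the engine push) is what makes it so invasive:

import contextlib
import threading

_current_custom_op = threading.local()  # hypothetical frontend-side bookkeeping

@contextlib.contextmanager
def custom_op_scope(op_name):
    # Hypothetical context manager wrapped around the frontend forward()/backward()
    # call; in the real alternative it would also have to pass an "is_custom" flag
    # (plus op_name) through the backend APIs when operators are pushed to the engine.
    _current_custom_op.name = op_name
    try:
        yield
    finally:
        _current_custom_op.name = None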

Visualization

Below is the new visualization after my change:



This is to be compared with the old visualization:
[screenshot: old visualization]

The two screenshots are produced by the same code:

import mxnet as mx
from mxnet import nd
from mxnet import profiler
from mxnet import context
import threading

# Define Custom Operator - Element wise Addition Multiplication
class CustomAddOne(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        self.assign(out_data[0], req[0], in_data[0] + 1)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], out_grad[0])


@mx.operator.register("CustomAddOne")
class CustomAddOneProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(CustomAddOneProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['in']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # inputs, outputs, aux
        return [in_shape[0]], [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return CustomAddOne()


class CustomAddTwo(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        y = nd.sqrt(in_data[0])
        self.assign(out_data[0], req[0], in_data[0] + 2)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], out_grad[0])


@mx.operator.register("CustomAddTwo")
class CustomAddTwoProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(CustomAddTwoProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['in']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # inputs, outputs, aux
        return [in_shape[0]], [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return CustomAddTwo()


inp = mx.nd.zeros(shape=(500, 500))
inp2 = mx.nd.zeros(shape=(500, 500))

profiler.set_config(profile_all=True, aggregate_stats=True, continuous_dump=True)
profiler.set_state('run')

w = nd.Custom(inp, op_type="CustomAddOne")
x = nd.sqrt(inp)
y = nd.Custom(inp, op_type="CustomAddOne")
z = nd.Custom(inp, op_type="CustomAddTwo")

mx.nd.waitall()

profiler.set_state('stop')
profiler.dump()
print(profiler.dumps(reset=True))

Also, we can try using Gluon:

import numpy as np
import mxnet as mx
from mxnet import profiler
from mxnet import gluon, autograd
import os

class Dense(mx.operator.CustomOp):
    def __init__(self, bias):
        self._bias = bias

    def forward(self, is_train, req, in_data, out_data, aux):
        x = in_data[0].asnumpy()
        weight = in_data[1].asnumpy()
        y = x.dot(weight.T) + self._bias
        mx.nd.zeros(shape=(500, 500))
        self.assign(out_data[0], req[0], mx.nd.array(y))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        x = in_data[0].asnumpy()
        dy = out_grad[0].asnumpy()
        dx = dy.T.dot(x)
        self.assign(in_grad[0], req[0], mx.nd.array(dx))


@mx.operator.register("dense")
class DenseProp(mx.operator.CustomOpProp):
    def __init__(self, bias):
        super(DenseProp, self).__init__(True)
        # we use constant bias here to illustrate how to pass arguments
        # to operators. All arguments are in string format so you need
        # to convert them back to the type you want.
        self._bias = float(bias)

    def list_arguments(self):
        return ['data', 'weight']

    def list_outputs(self):
        # this can be omitted if you only have 1 output.
        return ['output']

    def infer_shape(self, in_shapes):
        data_shape = in_shapes[0]
        weight_shape = in_shapes[1]
        output_shape = (data_shape[0], weight_shape[0])
        # return 3 lists representing inputs shapes, outputs shapes, and aux data shapes.
        return (data_shape, weight_shape), (output_shape,), ()

    def create_operator(self, ctx, in_shapes, in_dtypes):
        # create and return the CustomOp class.
        return Dense(self._bias)


class DenseBlock(mx.gluon.Block):
    def __init__(self, in_channels, channels, bias, **kwargs):
        super(DenseBlock, self).__init__(**kwargs)
        self._bias = bias
        self.weight = self.params.get('weight', shape=(channels, in_channels))

    def forward(self, x):
        ctx = x.context
        return mx.nd.Custom(x, self.weight.data(ctx), bias=self._bias, op_type='dense')


dense = DenseBlock(3, 5, 0.1)
dense.initialize()
x = mx.nd.uniform(shape=(4, 3))

profiler.set_config(profile_all=True, aggregate_stats=True, continuous_dump=True)
profiler.set_state('run')

y = dense(x)

mx.nd.waitall()
profiler.set_state('stop')
profiler.dump()
print(profiler.dumps(reset=True))



3 Comments

  1. Awesome for making efforts to improve this. One thing that I don't understand is why CustomOp has its own thread pool instead of using the engine one. Could you clarify this if you know the answer?

    1. I asked Anirudh and his answer was: "It's to overcome the GIL limitation. You want to run python code individually, but you should be able to execute backend operators in parallel."

  2. Oh, I think I heard this before. Maybe it would be better to have it push the work directly into the engine; I don't see why it has to have its own thread pool. But anyway, that is out of the scope of your work unless you want to dive deeper. Thanks.