Upstreaming of MXNet HIP port

Changes required to merge the HIP MXNet port into MXNet:

HIP is an open-source toolkit designed to convert CUDA code to portable C++. After conversion, the source code can be compiled to run on AMD GPUs (compatible with the ROCm platform) or on NVIDIA GPUs (with the same performance as the native CUDA code). The HIP runtime and C++ language are familiar to CUDA programmers, so the conversion process is lightweight and does little damage to the original code. The HIP code is easy to maintain since it stays close to CUDA, and it provides full C++ support including templates, classes, namespaces, etc.

AMD has developed an open-source optimized machine intelligence library called MIOpen. It contains convolution, pooling, and other kernels that have been optimized for AMD GPUs. The MIOpen host runtime API accepts HIP parameters for things like streams, and also defines tensor and other structures for memory management.

We have developed a HIP port of MXNet (https://github.com/ROCmSoftwarePlatform/mxnet/). We want to merge this HIP port with the master branch, so that MXNet works on both NVIDIA and AMD hardware.

We have initiated a discussion with the MXNet community at https://github.com/apache/incubator-mxnet/issues/6257.



The proposed design changes are as follows:

1) Build system changes:

As per the discussion, the community wants to retain the flag USE_CUDA. Removing it would break all legacy build systems and documents that have USE_CUDA=1 enabled.

Proposed build system changes:

We plan to introduce a new build flag, USE_HIP, for HIP code. All HIP-related build changes will be guarded by USE_HIP=1. In addition, USE_CUDA and USE_HIP must not both be enabled at the same time, so there will be additional logic to verify that only one of the two is set:

ifeq ($(USE_CUDA),1)
ifeq ($(USE_HIP),1)
$(error Both CUDA and HIP backend cannot be enabled together. Please enable only one of the two)
endif
endif

The current (CUDA-based) MXNet build system: [build-flow diagram not reproduced]
The proposed MXNet build system changes for the HIP port: [build-flow diagram not reproduced]
The cuDNN-acceleration-related build changes: [build-flow diagram not reproduced]

2) Code changes:

Sample code changes for the runtime APIs, the math libraries, and the convolutional neural network acceleration library are shown below.

1) Runtime API changes (sample code):

Native MXNet: https://github.com/dmlc/mshadow/blob/463c0dffe3eae8c39caf7989c85b7244823df27e/mshadow/stream_gpu-inl.h#L61

inline void Wait(void) {
   MSHADOW_CUDA_CALL(cudaStreamSynchronize(stream_));
 }
 /*!
  * \brief query whether the stream is idle
  * \return true if the stream is idle and all the job have been completed
  */
 inline bool CheckIdle(void) {
   cudaError_t err = cudaStreamQuery(stream_);
   if (err == cudaSuccess) return true;
   if (err == cudaErrorNotReady) return false;
   LOG(FATAL) << cudaGetErrorString(err);
   return false;
}


Proposed changes:

 inline void Wait(void) {
#if USE_CUDA==1
   MSHADOW_CUDA_CALL(cudaStreamSynchronize(stream_));
#endif //USE_CUDA
#if USE_HIP==1
   MSHADOW_CUDA_CALL(hipStreamSynchronize(stream_));
#endif //USE_HIP
 }
 /*!
  * \brief query whether the stream is idle
  * \return true if the stream is idle and all the job have been completed
  */
 inline bool CheckIdle(void) {
#if USE_CUDA==1
   cudaError_t err = cudaStreamQuery(stream_);
   if (err == cudaSuccess) return true;
   if (err == cudaErrorNotReady) return false;
   LOG(FATAL) << cudaGetErrorString(err);
   return false;
#endif //USE_CUDA
#if USE_HIP==1
   hipError_t err = hipStreamQuery(stream_);
   if (err == hipSuccess) return true;
   if (err == hipErrorNotReady) return false;
   LOG(FATAL) << hipGetErrorString(err);
   return false;
#endif //USE_HIP
 }


Kernel launch sample code:

Native MXNet: https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/im2col.cuh#L114

inline void im2col_gpu(mshadow::Stream<gpu>* s,
                      const DType* data_im, const int channels,
                      const int height, const int width,
                      const int kernel_h, const int kernel_w,
                      const int pad_h, const int pad_w,
                      const int stride_h, const int stride_w,
                      const int dilation_h, const int dilation_w,
                      DType* data_col) {
 // We are going to launch channels * height_col * width_col kernels, each
 // kernel responsible for copying a single-channel grid.
 int height_col = (height + 2 * pad_h -
     (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;
 int width_col = (width + 2 * pad_w -
     (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;
 int num_kernels = channels * height_col * width_col;
 using namespace mxnet_op;
 // NOLINT_NEXT_LINE(whitespace/operators)
 im2col_gpu_kernel<DType><<<cuda_get_num_blocks(num_kernels), mshadow::cuda::kBaseThreadNum,
                            0, mshadow::Stream<gpu>::GetStream(s)>>>(
     num_kernels, data_im, height, width, kernel_h, kernel_w, pad_h,
     pad_w, stride_h, stride_w, dilation_h, dilation_w, height_col,
     width_col, data_col);
 MSHADOW_CUDA_POST_KERNEL_CHECK(im2col_gpu_kernel);
}

Proposed changes:
inline void im2col_gpu(mshadow::Stream<gpu>* s,
                      const DType* data_im, const int channels,
                      const int height, const int width,
                      const int kernel_h, const int kernel_w,
                      const int pad_h, const int pad_w,
                      const int stride_h, const int stride_w,
                      const int dilation_h, const int dilation_w,
                      DType* data_col) {
 // We are going to launch channels * height_col * width_col kernels, each
 // kernel responsible for copying a single-channel grid.
 int height_col = (height + 2 * pad_h -
     (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;
 int width_col = (width + 2 * pad_w -
     (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;
 int num_kernels = channels * height_col * width_col;
 using namespace mxnet_op;
 // NOLINT_NEXT_LINE(whitespace/operators)
#if USE_CUDA == 1
im2col_gpu_kernel<DType><<<cuda_get_num_blocks(num_kernels), mshadow::cuda::kBaseThreadNum,
                            0, mshadow::Stream<gpu>::GetStream(s)>>>(
#endif //USE_CUDA
#if USE_HIP == 1
 hipLaunchKernelGGL(HIP_KERNEL_NAME(im2col_gpu_kernel<DType>),
                    dim3(cuda_get_num_blocks(num_kernels)),
                    dim3(mshadow::cuda::kBaseThreadNum), 0,
                    mshadow::Stream<gpu>::GetStream(s),
#endif //USE_HIP
     num_kernels, data_im, height, width, kernel_h, kernel_w, pad_h,
     pad_w, stride_h, stride_w, dilation_h, dilation_w, height_col,
     width_col, data_col);
 MSHADOW_CUDA_POST_KERNEL_CHECK(im2col_gpu_kernel);
}





2) Math Libraries

Sample code for math-library-related changes:

https://github.com/dmlc/mshadow/blob/463c0dffe3eae8c39caf7989c85b7244823df27e/mshadow/dot_engine-inl.h#L677
Original reference code

 inline static void dot(Stream<gpu> *stream,
                        int n,
                        const float* X, int incX,
                        const float* Y, int incY,
                        float *ret) {
   cublasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        CUBLAS_POINTER_MODE_DEVICE);
   cublasStatus_t err = cublasSdot(Stream<gpu>::GetBlasHandle(stream),
                                   n, X, incX, Y, incY, ret);
   CHECK_EQ(err, CUBLAS_STATUS_SUCCESS) << "Cublas: Dot fail";
   cublasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        CUBLAS_POINTER_MODE_HOST);
}

Proposed changes:
 inline static void dot(Stream<gpu> *stream,
                        int n,
                        const float* X, int incX,
                        const float* Y, int incY,
                        float *ret) {
#if USE_CUDA==1
   cublasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        CUBLAS_POINTER_MODE_DEVICE);
   cublasStatus_t err = cublasSdot(Stream<gpu>::GetBlasHandle(stream),
                                   n, X, incX, Y, incY, ret);
   CHECK_EQ(err, CUBLAS_STATUS_SUCCESS) << "Cublas: Dot fail";
   cublasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        CUBLAS_POINTER_MODE_HOST);
#endif //USE_CUDA
#if USE_HIP==1
   hipblasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        HIPBLAS_POINTER_MODE_DEVICE);
   hipblasStatus_t err = hipblasSdot(Stream<gpu>::GetBlasHandle(stream),
                                   n, X, incX, Y, incY, ret);
   CHECK_EQ(err, HIPBLAS_STATUS_SUCCESS) << "Hipblas: Dot fail";
   hipblasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        HIPBLAS_POINTER_MODE_HOST);
#endif //USE_HIP
 }
3) Acceleration-related libraries (cuDNN/MIOpen)
https://github.com/dmlc/mshadow/blob/463c0dffe3eae8c39caf7989c85b7244823df27e/mshadow/stream_gpu-inl.h#L166

Native MXNet:

 inline void CreateDnnHandle() {
// #if MSHADOW_USE_CUDNN == 1 && defined(__CUDACC__)
#if MSHADOW_USE_CUDNN == 1
   this->DestroyDnnHandle();
   cudnnStatus_t err = cudnnCreate(&dnn_handle_);
   CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
   // At this point, we have the resource which may need to be freed
   this->dnn_handle_ownership_ = OwnHandle;
   err = cudnnSetStream(dnn_handle_, stream_);
   CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
#endif
}


Proposed changes:

inline void CreateDnnHandle() {
#if MSHADOW_USE_CUDNN == 1 && USE_CUDA == 1
   this->DestroyDnnHandle();
   cudnnStatus_t err = cudnnCreate(&dnn_handle_);
   CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
   err = cudnnSetStream(dnn_handle_, stream_);
   CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
   this->dnn_handle_ownership_ = OwnHandle;
#endif
#if MSHADOW_USE_CUDNN == 1 && USE_HIP==1 && defined (HIP_PLATFORM_HCC)
   this->DestroyDnnHandle();
   miopenStatus_t  err = miopenCreate(&dnn_handle_);
   CHECK_EQ(err, miopenStatusSuccess) << (err);
   err = miopenSetStream(dnn_handle_, stream_);
   CHECK_EQ(err, miopenStatusSuccess) << (err);
   this->dnn_handle_ownership_ = OwnHandle;
#endif
 }




4) Pros and cons of this design approach


Pro: MXNet can be leveraged on AMD GPU platforms while retaining support for NVIDIA GPU platforms.

Con: The main drawback of this approach is maintaining parity, since it takes more effort to maintain the two supported code paths. For example, if any enhancements or changes are made in the CUDA path, the same must be replicated in the HIP path as well.


5 Comments

  1. Hi, the following feedback was given by Pedro Larroy regarding the previous version.


    "The calls to synchronize and wait looks very similar. Shall we use
    polymorphism or a bridge pattern to abstract this common calls instead of
    using the preprocessor? both seem to use the same abstraction (streams).
    Using the suggested pattern instead of preprocessor would lead to code that
    is easier to maintain and instrument. A shortcoming would be if the APIs to
    abstract would be too different.".


  2. Hi Srihari - thanks for the proposal to upstream MXNet HIP port.

    As others, I'm concerned about having duplicated code for CUDA and HIP. Your examples show a lot of similarities between the CUDA and HIP APIs.

    Is there not a more efficient approach using macros or a thin interface layer to transform a CUDA call into a HIP call?

  3. Hi, thanks for the comments on the design to upstream the MXNet HIP port.

    We earlier proposed an interface layer which maps each API call to either a CUDA call or a HIP call and eliminates duplicate code, along similar lines to what Steffen suggested.

    Example:
    cudaMalloc would be replaced by gpuMalloc, and based on the platform selected in the make system it would call
    cudaMalloc for CUDA and hipMalloc for HIP. Please refer to this link for the earlier discussion of the proposal: https://github.com/apache/incubator-mxnet/issues/6257.

    In the make system, USE_GPU was used to turn on either the USE_CUDA or the USE_HIP backend in the Makefile and CMake.

    But the comment from the community was that this change would break all legacy build systems and documents with USE_CUDA=1 enabled, so we took the alternate approach even though it duplicates code.

  4. For the cuDNN-acceleration-related build changes, why is USE_CUDA checked multiple times?