
Upstreaming of the MXNet HIP Port

HIP MXNet port changes to merge with MXNet:

HIP is an open-source toolkit designed to convert CUDA code to portable C++. After conversion, the source code can be compiled to run on AMD GPUs (compatible with the ROCm platform) or on NVIDIA GPUs (with the same performance as the native CUDA code). The HIP runtime and C++ language are familiar to CUDA programmers, so the conversion process is lightweight and does little damage to the original code. HIP code is easy to maintain since it stays close to CUDA, and it provides full C++ support including templates, classes, and namespaces.
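To illustrate how mechanical the conversion is, here is a small hand-written sketch (not taken from the MXNet port) showing a CUDA fragment and its HIP equivalent; the comments mark the CUDA call each HIP call replaces:

```cuda
#include <hip/hip_runtime.h>

__global__ void square(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= x[i];
}

int main() {
  const int n = 256;
  float h_x[n];
  for (int i = 0; i < n; ++i) h_x[i] = static_cast<float>(i);

  float* d_x = nullptr;
  hipMalloc(&d_x, n * sizeof(float));                             // was cudaMalloc
  hipMemcpy(d_x, h_x, n * sizeof(float), hipMemcpyHostToDevice);  // was cudaMemcpy
  hipLaunchKernelGGL(square, dim3(1), dim3(n), 0, 0, d_x, n);     // was square<<<1, n>>>(d_x, n)
  hipMemcpy(h_x, d_x, n * sizeof(float), hipMemcpyDeviceToHost);
  hipFree(d_x);                                                   // was cudaFree
  return 0;
}
```

The device kernel itself is unchanged; only the host-side runtime calls and the launch syntax differ.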

AMD has developed an open-source optimized machine intelligence library called MIOpen. It contains convolution, pooling, and other kernels that have been optimized for AMD GPUs. The MIOpen host runtime API accepts HIP parameters for things like streams, and also defines tensor and other structures for memory management.

We have developed a HIP port of MXNet (https://github.com/ROCmSoftwarePlatform/mxnet/). We want to merge this HIP port into the master branch, so that MXNet works on both NVIDIA and AMD hardware.

We have initiated a discussion with the MXNet community at https://github.com/apache/incubator-mxnet/issues/6257.



The proposed design changes are as follows:

1) Build system changes

As per the discussion, the community wants to retain the USE_CUDA flag. Removing it would break all legacy build systems and documentation that enable USE_CUDA=1.

Proposed build system changes:

We plan to introduce a new build flag, USE_HIP, for HIP code. All HIP-related build changes will be guarded by USE_HIP=1. In addition, USE_CUDA and USE_HIP must not both be enabled at the same time, so additional logic tests that only one of the two is set:

ifeq ($(USE_CUDA),1)
ifeq ($(USE_HIP),1)
$(error Both CUDA and HIP backends cannot be enabled together. Please enable only one of the two)
endif
endif

The current CUDA-based MXNet build flow is:

[build-flow diagram not preserved in this version of the page]

The proposed build-system changes for the HIP port are:

[build-flow diagram not preserved in this version of the page]

The cuDNN-acceleration-related build changes are:

[build-flow diagram not preserved in this version of the page]

2) Code changes:

Sample code changes for the runtime APIs, the math libraries, and the convolutional neural network acceleration library are shown below.

1) Runtime API changes (sample code):

Native MXNet: https://github.com/dmlc/mshadow/blob/463c0dffe3eae8c39caf7989c85b7244823df27e/mshadow/stream_gpu-inl.h#L61

 inline void Wait(void) {
   MSHADOW_CUDA_CALL(cudaStreamSynchronize(stream_));
 }
 /*!
  * \brief query whether the stream is idle
  * \return true if the stream is idle and all the jobs have been completed
  */
 inline bool CheckIdle(void) {
   cudaError_t err = cudaStreamQuery(stream_);
   if (err == cudaSuccess) return true;
   if (err == cudaErrorNotReady) return false;
   LOG(FATAL) << cudaGetErrorString(err);
   return false;
 }


Proposed changes:

 inline void Wait(void) {
#if USE_CUDA == 1
   MSHADOW_CUDA_CALL(cudaStreamSynchronize(stream_));
#elif USE_HIP == 1
   // MSHADOW_CUDA_CALL is assumed to expand to a HIP error check when USE_HIP=1
   MSHADOW_CUDA_CALL(hipStreamSynchronize(stream_));
#endif
 }
 /*!
  * \brief query whether the stream is idle
  * \return true if the stream is idle and all the jobs have been completed
  */
 inline bool CheckIdle(void) {
#if USE_CUDA == 1
   cudaError_t err = cudaStreamQuery(stream_);
   if (err == cudaSuccess) return true;
   if (err == cudaErrorNotReady) return false;
   LOG(FATAL) << cudaGetErrorString(err);
   return false;
#elif USE_HIP == 1
   hipError_t err = hipStreamQuery(stream_);
   if (err == hipSuccess) return true;
   if (err == hipErrorNotReady) return false;
   LOG(FATAL) << hipGetErrorString(err);
   return false;
#endif
 }
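An alternative that would reduce the per-function #if duplication is a thin alias header mapping a neutral gpu* prefix onto whichever backend is enabled. This is a hypothetical sketch, not part of the current proposal or the mshadow sources:

```cuda
// gpu_runtime.h -- hypothetical shim, not in the MXNet/mshadow sources
#if USE_HIP == 1
  #include <hip/hip_runtime.h>
  typedef hipError_t gpuError_t;
  #define gpuSuccess            hipSuccess
  #define gpuErrorNotReady      hipErrorNotReady
  #define gpuStreamSynchronize  hipStreamSynchronize
  #define gpuStreamQuery        hipStreamQuery
  #define gpuGetErrorString     hipGetErrorString
#else  // USE_CUDA == 1
  #include <cuda_runtime.h>
  typedef cudaError_t gpuError_t;
  #define gpuSuccess            cudaSuccess
  #define gpuErrorNotReady      cudaErrorNotReady
  #define gpuStreamSynchronize  cudaStreamSynchronize
  #define gpuStreamQuery        cudaStreamQuery
  #define gpuGetErrorString     cudaGetErrorString
#endif

// CheckIdle() could then be written once, with no per-backend #if blocks:
// inline bool CheckIdle(void) {
//   gpuError_t err = gpuStreamQuery(stream_);
//   if (err == gpuSuccess) return true;
//   if (err == gpuErrorNotReady) return false;
//   LOG(FATAL) << gpuGetErrorString(err);
//   return false;
// }
```

The trade-off is one extra indirection layer in exchange for eliminating most of the duplicated conditional blocks.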


Kernel launch sample code:

Native MXNet: https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/im2col.cuh#L114

inline void im2col_gpu(mshadow::Stream<gpu>* s,
                      const DType* data_im, const int channels,
                      const int height, const int width,
                      const int kernel_h, const int kernel_w,
                      const int pad_h, const int pad_w,
                      const int stride_h, const int stride_w,
                      const int dilation_h, const int dilation_w,
                      DType* data_col) {
 // We are going to launch channels * height_col * width_col kernels, each
 // kernel responsible for copying a single-channel grid.
 int height_col = (height + 2 * pad_h -
     (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;
 int width_col = (width + 2 * pad_w -
     (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;
 int num_kernels = channels * height_col * width_col;
 using namespace mxnet_op;
 // NOLINT_NEXT_LINE(whitespace/operators)
 im2col_gpu_kernel<DType><<<cuda_get_num_blocks(num_kernels), mshadow::cuda::kBaseThreadNum,
                            0, mshadow::Stream<gpu>::GetStream(s)>>>(
     num_kernels, data_im, height, width, kernel_h, kernel_w, pad_h,
     pad_w, stride_h, stride_w, dilation_h, dilation_w, height_col,
     width_col, data_col);
 MSHADOW_CUDA_POST_KERNEL_CHECK(im2col_gpu_kernel);
}

Proposed changes:

inline void im2col_gpu(mshadow::Stream<gpu>* s,
                      const DType* data_im, const int channels,
                      const int height, const int width,
                      const int kernel_h, const int kernel_w,
                      const int pad_h, const int pad_w,
                      const int stride_h, const int stride_w,
                      const int dilation_h, const int dilation_w,
                      DType* data_col) {
 // We are going to launch channels * height_col * width_col kernels, each
 // kernel responsible for copying a single-channel grid.
 int height_col = (height + 2 * pad_h -
     (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;
 int width_col = (width + 2 * pad_w -
     (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;
 int num_kernels = channels * height_col * width_col;
 using namespace mxnet_op;
 // NOLINT_NEXT_LINE(whitespace/operators)
#if USE_CUDA == 1
 im2col_gpu_kernel<DType><<<cuda_get_num_blocks(num_kernels), mshadow::cuda::kBaseThreadNum,
                            0, mshadow::Stream<gpu>::GetStream(s)>>>(
#elif USE_HIP == 1
 hipLaunchKernelGGL(HIP_KERNEL_NAME(im2col_gpu_kernel<DType>),
                    dim3(cuda_get_num_blocks(num_kernels)),
                    dim3(mshadow::cuda::kBaseThreadNum),
                    0, mshadow::Stream<gpu>::GetStream(s),
#endif
     num_kernels, data_im, height, width, kernel_h, kernel_w, pad_h,
     pad_w, stride_h, stride_w, dilation_h, dilation_w, height_col,
     width_col, data_col);
 MSHADOW_CUDA_POST_KERNEL_CHECK(im2col_gpu_kernel);
}
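For reference, hipLaunchKernelGGL is a macro that takes the kernel, grid and block dimensions (as dim3), the dynamic shared-memory size, and the stream, followed by the kernel arguments; HIP_KERNEL_NAME guards template argument lists (which may contain commas) from being split during macro expansion. A minimal standalone sketch with hypothetical names:

```cuda
#include <hip/hip_runtime.h>

template <typename DType>
__global__ void add_one(DType* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += DType(1);
}

template <typename DType>
void launch_add_one(DType* d_x, int n, hipStream_t stream) {
  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;
  // Equivalent CUDA launch: add_one<DType><<<blocks, threads, 0, stream>>>(d_x, n);
  // HIP_KERNEL_NAME wraps the templated kernel name so that any commas inside
  // the template argument list are not parsed as macro-argument separators.
  hipLaunchKernelGGL(HIP_KERNEL_NAME(add_one<DType>),
                     dim3(blocks), dim3(threads), 0, stream, d_x, n);
}
```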





2) Math libraries

Sample code for the math-library-related changes:

https://github.com/dmlc/mshadow/blob/463c0dffe3eae8c39caf7989c85b7244823df27e/mshadow/dot_engine-inl.h#L677

Original reference code:

 inline static void dot(Stream<gpu> *stream,
                        int n,
                        const float* X, int incX,
                        const float* Y, int incY,
                        float *ret) {
   cublasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        CUBLAS_POINTER_MODE_DEVICE);
   cublasStatus_t err = cublasSdot(Stream<gpu>::GetBlasHandle(stream),
                                   n, X, incX, Y, incY, ret);
   CHECK_EQ(err, CUBLAS_STATUS_SUCCESS) << "Cublas: Dot fail";
   cublasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        CUBLAS_POINTER_MODE_HOST);
}

Proposed changes:

 inline static void dot(Stream<gpu> *stream,
                        int n,
                        const float* X, int incX,
                        const float* Y, int incY,
                        float *ret) {
#if USE_CUDA == 1
   cublasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        CUBLAS_POINTER_MODE_DEVICE);
   cublasStatus_t err = cublasSdot(Stream<gpu>::GetBlasHandle(stream),
                                   n, X, incX, Y, incY, ret);
   CHECK_EQ(err, CUBLAS_STATUS_SUCCESS) << "Cublas: Dot fail";
   cublasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                        CUBLAS_POINTER_MODE_HOST);
#elif USE_HIP == 1
   hipblasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                         HIPBLAS_POINTER_MODE_DEVICE);
   hipblasStatus_t err = hipblasSdot(Stream<gpu>::GetBlasHandle(stream),
                                     n, X, incX, Y, incY, ret);
   CHECK_EQ(err, HIPBLAS_STATUS_SUCCESS) << "Hipblas: Dot fail";
   hipblasSetPointerMode(Stream<gpu>::GetBlasHandle(stream),
                         HIPBLAS_POINTER_MODE_HOST);
#endif
 }
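For readers unfamiliar with hipBLAS, here is a minimal standalone dot-product sketch (hypothetical helper name, error handling elided) showing the same handle and pointer-mode calls in isolation; host pointer mode is used so the result lands in a host-side float:

```cuda
#include <hip/hip_runtime.h>
#include <hipblas.h>

float gpu_dot(const float* h_x, const float* h_y, int n) {
  float *d_x = nullptr, *d_y = nullptr;
  hipMalloc(&d_x, n * sizeof(float));
  hipMalloc(&d_y, n * sizeof(float));
  hipMemcpy(d_x, h_x, n * sizeof(float), hipMemcpyHostToDevice);
  hipMemcpy(d_y, h_y, n * sizeof(float), hipMemcpyHostToDevice);

  hipblasHandle_t handle;
  hipblasCreate(&handle);
  // Host pointer mode: the scalar result is written to host memory.
  hipblasSetPointerMode(handle, HIPBLAS_POINTER_MODE_HOST);

  float result = 0.0f;
  hipblasSdot(handle, n, d_x, 1, d_y, 1, &result);

  hipblasDestroy(handle);
  hipFree(d_x);
  hipFree(d_y);
  return result;
}
```

The mshadow code above uses device pointer mode instead because `ret` points to device memory; the only difference is the pointer-mode flag.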
3) Acceleration-related libraries (cuDNN/MIOpen)

https://github.com/dmlc/mshadow/blob/463c0dffe3eae8c39caf7989c85b7244823df27e/mshadow/stream_gpu-inl.h#L166

Native MXNet:

 inline void CreateDnnHandle() {
// #if MSHADOW_USE_CUDNN == 1 && defined(__CUDACC__)
#if MSHADOW_USE_CUDNN == 1
   this->DestroyDnnHandle();
   cudnnStatus_t err = cudnnCreate(&dnn_handle_);
   CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
   // At this point, we have the resource which may need to be freed
   this->dnn_handle_ownership_ = OwnHandle;
   err = cudnnSetStream(dnn_handle_, stream_);
   CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
#endif
}


Proposed changes:

inline void CreateDnnHandle() {
#if MSHADOW_USE_CUDNN == 1 && USE_CUDA == 1
   this->DestroyDnnHandle();
   cudnnStatus_t err = cudnnCreate(&dnn_handle_);
   CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
   err = cudnnSetStream(dnn_handle_, stream_);
   CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
   // At this point, we have the resource which may need to be freed
   this->dnn_handle_ownership_ = OwnHandle;
#elif MSHADOW_USE_CUDNN == 1 && USE_HIP == 1 && defined(HIP_PLATFORM_HCC)
   this->DestroyDnnHandle();
   miopenStatus_t err = miopenCreate(&dnn_handle_);
   CHECK_EQ(err, miopenStatusSuccess) << err;
   err = miopenSetStream(dnn_handle_, stream_);
   CHECK_EQ(err, miopenStatusSuccess) << err;
   this->dnn_handle_ownership_ = OwnHandle;
#endif
 }
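For completeness, the corresponding MIOpen handle lifecycle can be sketched standalone (hypothetical function name; error handling reduced to early returns):

```cuda
#include <hip/hip_runtime.h>
#include <miopen/miopen.h>

void use_dnn_handle(hipStream_t stream) {
  miopenHandle_t handle;
  miopenStatus_t status = miopenCreate(&handle);  // analogous to cudnnCreate
  if (status != miopenStatusSuccess) return;

  status = miopenSetStream(handle, stream);       // analogous to cudnnSetStream
  if (status != miopenStatusSuccess) {
    miopenDestroy(handle);
    return;
  }

  // ... use the handle with MIOpen convolution/pooling calls ...

  miopenDestroy(handle);                          // analogous to cudnnDestroy
}
```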




4) Pros and cons of the proposed design approach

Pro: MXNet can be used on AMD GPU platforms while retaining support for NVIDIA GPU platforms.

Con: The main drawback of this approach is maintaining parity between the two supported code paths; any enhancement or change made in the CUDA path has to be replicated in the HIP path as well.
