
Link to dev List discussion

https://lists.apache.org/thread.html/464712f0136fb51916ca9f1b702b99847e108dbdbd0b6a2b73fc91f1@%3Cdev.mxnet.apache.org%3E

Feature Shepherd

Need volunteer to help shepherd

Problem

Adding backend support for new accelerators currently requires modifying the MXNet source code. This is difficult and imposes a steep learning curve on accelerator vendors, who must come up to speed on MXNet internals before they can integrate their hardware.

Proposed Approach

In this project, we will create a set of abstractions, exposed through an API, that allows accelerator vendors to write external libraries interfacing their custom hardware to MXNet without modifying the MXNet code base. We'll streamline how MXNet interacts with processors and add a user-facing API to dynamically load accelerator libraries at runtime. This allows accelerator vendors to distribute their libraries separately from MXNet, decoupling MXNet releases from accelerator library releases.

Here is one example diagram showing the interaction of the accelerator library with MXNet:

As shown above, there are four interfaces:

  • Processor Info - a discovery API for the library to provide context info to MXNet
  • Supported Ops - a mechanism for MXNet to check if an operator and a particular set of inputs/outputs/attributes can be executed on the processor
  • Executor - a mechanism for giving a set of work to the processor's execution engine
  • Notify - a mechanism for the processor's execution engine to let MXNet know about completion of work

Each of these interfaces will have a set of APIs for the low-level operations required (i.e., transferring graphs, transferring input data, returning output data, etc.).
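As a rough illustration, these four interfaces might surface in the vendor-facing header as a handful of C entry points. Everything in the sketch below (function names, types, signatures) is an assumption for illustration only; the actual API is an open question addressed later in this proposal.

#include <cstddef>

// Hypothetical sketch of the four interfaces as C entry points exported
// by an accelerator library. All names, types, and signatures here are
// illustrative assumptions, not the final API.
extern "C" {
  // Processor Info: discovery API providing context info to MXNet.
  const char* getProcessorName();
  int getNumDevices();

  // Supported Ops: can this op, with these inputs/outputs/attributes,
  // execute on the processor?
  bool isOpSupported(const char* op_name,
                     const char* const* attr_keys,
                     const char* const* attr_vals,
                     std::size_t num_attrs);

  // Executor: hand a unit of work (an opaque op/subgraph handle) to the
  // processor's execution engine.
  int execute(void* work);

  // Notify: MXNet registers a callback that the execution engine invokes
  // when a unit of work completes.
  typedef void (*CompletionCallback)(void* work, int status);
  void registerNotify(CompletionCallback cb);
}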

User Experience

ML users will interact with this new feature by calling an API to load an accelerator library:

Python API
import mxnet as mx

# load accelerator library
acc = mx.context.load_acc("/path/to/libmyacc.so")


Accelerator vendors will interact with this new feature by creating a library that implements functions defined in a header file "mxnet_acc.h":

mxnet_acc.h
#include <string>

// Function type for an operator's compute implementation.
typedef int (FCompute)(int, void*);

// Discovery: returns this accelerator's name (used to name the MXNet context).
extern "C" std::string getAccName();

// Returns the compute function for the named operator, or nullptr if unsupported.
extern "C" FCompute* getFCompute(std::string);
myacc.cpp - example accelerator library implementation
#include "mxnet_acc.h"

std::string getAccName() {
  return std::string("myacc");
}

extern "C" FCompute* getFCompute(std::string) {
  // This stub supports no operators yet, so it always returns nullptr.
  return nullptr;
}
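Building on this stub, a vendor library would return a real compute function for the operators it supports. The sketch below is hypothetical: the "dot" operator name and the interpretation of FCompute's (int, void*) arguments are assumptions, since that contract is not yet specified.

#include "mxnet_acc.h"

// Hypothetical compute function for a "dot" operator. What the (int, void*)
// arguments carry is an assumption of this sketch.
static int myaccDot(int num_args, void* args) {
  // ... hand the work off to the accelerator's execution engine ...
  return 0;  // status code: 0 = success (assumed convention)
}

extern "C" FCompute* getFCompute(std::string op_name) {
  if (op_name == "dot")
    return myaccDot;   // this operator runs on the accelerator
  return nullptr;      // everything else falls back to MXNet's own backends
}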

Then, accelerator vendors compile their library into a shared object:

g++ -shared -fPIC myacc.cpp -o libmyacc.so -I ../../include/mxnet
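For context, here is a minimal sketch of how a host process can load such a shared object and resolve its symbols at runtime with dlopen/dlsym; the actual prototype implementation is linked in the section below, and error handling is abbreviated here.

#include <dlfcn.h>
#include <iostream>
#include <string>

// Matches the extern "C" declaration in mxnet_acc.h.
typedef std::string (GetAccName)();

int main() {
  void* handle = dlopen("/path/to/libmyacc.so", RTLD_LAZY);
  if (!handle) {
    std::cerr << "cannot load library: " << dlerror() << std::endl;
    return 1;
  }
  // Resolve the extern "C" symbol by name.
  GetAccName* getAccName =
      reinterpret_cast<GetAccName*>(dlsym(handle, "getAccName"));
  if (!getAccName) {
    std::cerr << "missing symbol: " << dlerror() << std::endl;
    dlclose(handle);
    return 1;
  }
  std::cout << "loaded accelerator: " << getAccName() << std::endl;
  dlclose(handle);
  return 0;
}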


Goals/Use Cases

As an ML user, I'd like a simple interface that allows me to use a custom accelerator for training and inference of deep learning models.

As an accelerator vendor, I'd like to create an MXNet interface for my accelerator without having to be an expert in how the MXNet backend works.

Open Questions

What should the set of APIs be for accelerators to hook into the MXNet backend?

Proposed Approach (Details)

A prototype of the front-end library loading that returns an MXNet context has been implemented here: https://github.com/samskalicky/incubator-mxnet/tree/accel_api

Accelerator libraries will implement the functions defined in the header file "mxnet_acc.h": https://github.com/samskalicky/incubator-mxnet/blob/95a7ab06b6ab30a014a497db0d98cf62fa35df84/include/mxnet/mxnet_acc.h

Here is an example library implementation: https://github.com/samskalicky/incubator-mxnet/blob/95a7ab06b6ab30a014a497db0d98cf62fa35df84/example/accel_api/myacc.cpp

For data allocation, MXNet already has an abstraction for managing storage, StorageManager. For this feature, we inherit from this class and forward the storage calls to functions from the accelerator library.

Memory Management (https://github.com/samskalicky/incubator-mxnet/blob/95a7ab06b6ab30a014a497db0d98cf62fa35df84/src/storage/acc_storage_manager.h#L68-L84):

  • Allocate on acc: extern "C" void* alloc(std::size_t size);
  • Free on acc: extern "C" void free(void*);
  • Direct-free on acc: extern "C" void directFree(void*);
  • Release all (free all): extern "C" void releaseAll();


For data movement, MXNet already has a templated Copy<to,from> mechanism. For this feature, we leverage it to call the corresponding functions from the accelerator library.

Data Movement (https://github.com/samskalicky/incubator-mxnet/blob/accel_api/src/ndarray/ndarray_function.cc#L54-L88):

  • Copy from host to acc: extern "C" int copyTo(void* dst, void* src, size_t size);
  • Copy from acc to host: extern "C" int copyFrom(void* dst, void* src, size_t size);
  • Copy within acc: extern "C" int copyBetween(void* dst, void* src, size_t size);


Addition of New APIs

Python - context.py
def load_acc(path_to_lib)

Inputs: path to accelerator library

Returns: a context for the first instance of the accelerator (device id 0)

C API - c_api.cc

int MXLoadAccLib(const char *path, int *id, char *name)

Inputs: path to the accelerator library; the integer pointed to by id (the dev_type) and the char array pointed to by name (the context name) are filled in by this function

Returns: success status of loading the library and initializing the context
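For illustration, a caller might use this C API as below; the buffer size and error handling are assumptions of this sketch.

#include <cstdio>

// Declaration of the proposed MXNet C API function (signature as above).
extern "C" int MXLoadAccLib(const char* path, int* id, char* name);

int main() {
  int dev_type = -1;
  char name[64];  // buffer size is an assumption of this sketch
  if (MXLoadAccLib("/path/to/libmyacc.so", &dev_type, name) != 0) {
    std::fprintf(stderr, "failed to load accelerator library\n");
    return 1;
  }
  std::printf("registered context '%s' with dev_type %d\n", name, dev_type);
  return 0;
}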


Backward compatibility

No issues; this is new functionality. Existing custom hardware backends (MKL/MKL-DNN/cuDNN/TensorRT) will continue working.

Performance Considerations

We will analyze the overhead introduced by using a dynamically loaded library by creating a test accelerator library that simply reuses the existing CPU and GPU operator implementations. Then we'll compare these "accelerators" against the current CPU and GPU contexts.

Test Plan

We will create a test accelerator library that simply reuses the existing CPU and GPU operator implementations and run all existing unit tests.

Alternative Approaches

Currently, custom accelerators like TensorRT must be implemented by modifying the MXNet backend and learning how MXNet works at the lowest level. The team that implemented TensorRT support in MXNet ran into many hurdles, and the lessons from that effort are being applied in this proposal.

Technical Challenges 

We'll need to version the MXNet operators against accelerator libraries so that, as operator implementations change, we catch mismatches with older accelerator libraries; one possible mechanism is sketched below.
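One plausible mechanism (an assumption, not part of the current prototype): the library exports the operator API version it was built against, and MXNet rejects mismatched libraries at load time.

#include <iostream>

// Version of the operator API this copy of MXNet implements.
// The constant and symbol names here are hypothetical.
const int kMXNetOpApiVersion = 3;

// Exported by the accelerator library: the API version it was built against.
extern "C" int getOpApiVersion();

bool checkLibVersion() {
  int lib_version = getOpApiVersion();
  if (lib_version != kMXNetOpApiVersion) {
    std::cerr << "accelerator library built for operator API v" << lib_version
              << ", but this MXNet implements v" << kMXNetOpApiVersion
              << std::endl;
    return false;  // refuse to load the mismatched library
  }
  return true;
}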

Milestones

TBD
