...

Currently, only a very limited number of operators (such as exp) support second or higher order gradient calculation. For other operators, if users try to compute the second order gradient, MXNet issues an error message such as "mxnet.base.MXNetError: [23:15:34] src/pass/gradient.cc:192: Operator _backward_XXX is non-differentiable because it didn't register FGradient attribute." This is because the MXNet backend does not implement the FGradient function for the backward node of the operator and therefore cannot support second (and higher) order gradients. See the example of the sin operator below: its gradient function is registered through a hidden operator, _backward_sin, but the _backward_sin operator itself does not register an FGradient function.
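The failure can also be reproduced from the Python frontend. The sketch below is illustrative and assumes the mxnet.autograd.grad API with create_graph; it fails with the error quoted above on any MXNet build where _backward_sin has not registered FGradient.

    import mxnet as mx
    from mxnet import autograd, nd

    x = nd.array([1.0, 2.0, 3.0])
    x.attach_grad()

    with autograd.record():
        y = nd.sin(x)
        # create_graph=True records the backward pass itself so that it
        # can be differentiated one more time.
        dy_dx = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]

    # Differentiating the first-order gradient raises the MXNetError quoted
    # above, because _backward_sin is non-differentiable (no FGradient).
    dy_dx.backward()
    print(x.grad)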

...

Higher order gradient calculation is required for many applications, such as adaptive learning rate optimization [1], the W-GAN network [2], and network architecture search [3]. Implementing higher order gradients can help unlock these applications and improve the usability and popularity of the Apache MXNet framework.

...

This project will implement a set of operators to support the adaptive learning rate optimization proposed in [1]. The following operators will be supported in this project. We will also keep an up-to-date document of the supported operators in the MXNet repo.

...

To support higher order gradients, we will replace the “_backward_sin” operator with an explicit cos(x) operator. Because the cos(x) operator has its own FGradient function, we can obtain derivatives of the sin(x) operator to many orders by calling the FGradient function recursively. A preliminary implementation as a proof of concept is here.
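For reference, here is a sketch of the user-facing behavior once this change is in place (illustrative only; it assumes the same mxnet.autograd.grad API as above). Because the recorded backward pass of sin(x) now consists of forward operators, it can be differentiated again:

    import mxnet as mx
    from mxnet import autograd, nd

    x = nd.array([0.5, 1.0, 1.5])
    x.attach_grad()

    with autograd.record():
        y = nd.sin(x)
        # The first-order gradient is now built from the forward cos(x)
        # operator, which registers its own FGradient, so it can be
        # differentiated one more time.
        dy_dx = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]

    dy_dx.backward()
    print(dy_dx)   # ~ cos(x)
    print(x.grad)  # ~ -sin(x), the second-order gradient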

...

  1. Replace the hidden “_backward_xxx” operator with a simple function composed of other forward operators, just as we did with the sin(x) operator. This needs an efficient representation of the analytical derivative function for each operator and requires a lot of expertise on a per-operator basis.
  2. Register an FGradient function inside the “_backward_xxx” operator. This is simpler in terms of implementation and less disruptive to the current code design pattern. However, it limits gradients to second order and cannot automatically extend to higher order gradients.

Case III: Operators implemented using third party libraries such as cuDNN and MKLDNN: many neural-network-specific operators use implementations directly from the cuDNN and/or MKLDNN libraries when available, such as batch normalization, convolution, pooling, etc. An example of the convolution operator using the MKLDNN library is shown below:

In such cases, we will have to depend on the third party library to provide higher order gradient support for these operators if users install MXNet with those third party packages. For the first phase of this project, since we are dealing with multilayer perceptron networks, we do not have to implement higher order derivatives for these operators. We will survey the related third party packages and incorporate their higher order gradient implementations in MXNet.

...

1) Use a finite-difference approximation to calculate second order gradients. The pro is that it is generic for almost all operators, and developers do not need to implement second order gradient calculation operator by operator. The con is that picking the finite-difference step size becomes another hyperparameter to tune and requires expertise from users; a minimal sketch of this approach is shown after these options.

2) Implement a code generation mechanism that makes it easier to create differentiable backward operators. PyTorch currently uses a code generation mechanism in which the backward pass is implemented with Variables and defined in a yaml file, and is thus fully differentiable. We need to think big and provide a scalable implementation that allows for easy extensibility and external contributions. Moreover, if implementing a differentiable backward pass is easy to code and review, it will multiply developer productivity and maintainability. A purely illustrative sketch of the idea is shown below.
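A minimal sketch of the finite-difference idea from option 1), assuming elementwise operators; the helper names and the step size eps are illustrative only:

    import mxnet as mx
    from mxnet import autograd, nd

    def first_order_grad(f, x):
        # First-order gradient of an elementwise f at x via autograd.
        x = x.copy()
        x.attach_grad()
        with autograd.record():
            y = f(x)
        y.backward()
        return x.grad

    def second_order_grad_fd(f, x, eps=1e-4):
        # Central finite difference of the autograd gradient.
        # eps is the extra hyperparameter the user has to tune.
        g_plus = first_order_grad(f, x + eps)
        g_minus = first_order_grad(f, x - eps)
        return (g_plus - g_minus) / (2 * eps)

    x = nd.array([0.5, 1.0, 1.5])
    print(second_order_grad_fd(nd.sin, x))  # ~ -sin(x)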
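And a purely illustrative sketch of the code-generation idea from option 2): derivatives are declared once as compositions of forward (hence differentiable) operators, and the backward functions are generated from that declaration. The table and helper names below are hypothetical and do not reflect PyTorch's or MXNet's actual mechanism.

    from mxnet import nd

    # Hypothetical declarative table: dy/dx expressed with forward operators only.
    DERIVATIVES = {
        "sin": lambda x, out_grad: out_grad * nd.cos(x),
        "cos": lambda x, out_grad: -out_grad * nd.sin(x),
        "exp": lambda x, out_grad: out_grad * nd.exp(x),
    }

    def make_backward(op_name):
        # Generate a backward function for op_name from the table.
        rule = DERIVATIVES[op_name]
        def backward(x, out_grad):
            return rule(x, out_grad)
        return backward

    backward_sin = make_backward("sin")
    x = nd.array([0.0, 1.0])
    print(backward_sin(x, nd.ones_like(x)))  # cos(x)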

...