...

Users want to take an FP32 model and convert it to a mixed precision model to run inference on it. They want to use the model zoo to convert pretrained models in Python and other frontends. In Gluon they can somewhat achieve FP16 inference by casting the inputs and the blocks, but mixed precision inference, with certain layers running in FP16 while others run in FP32, cannot be achieved in a trivial way. Also, this cannot be done easily for symbolic models (json and params). This proposes adding APIs to convert FP32 models to mixed precision models.
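For context, a minimal sketch of the all-FP16 approach that is possible in Gluon today (the model choice and input shape are placeholders, not part of the proposal):

Code Block
languagepy
import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Illustrative sketch: all-FP16 inference as it can be done in Gluon today.
net = vision.resnet50_v1(pretrained=True)
net.cast('float16')                       # cast all blocks and params to FP16
net.hybridize()

data = mx.nd.random.uniform(shape=(1, 3, 224, 224)).astype('float16')
out = net(data)                           # the whole network runs in FP16

This gives FP16 everywhere; there is no easy way to keep selected layers in FP32.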

There is some nice ongoing work to add automatic mixed precision (AMP) support for training to MXNet [1]. Among other things, it automatically adds cast layers for conversion to FP16 or FP32 based on the operator. Specific operator lists are maintained for ops that should always run in FP16, ops that should always run in FP32, and ops that should run in FP16 or FP32 depending on whichever is the widest type among their inputs. It also takes into account operators that should run in a specific precision only if a condition is met (for example, Activation with act_type set to softrelu). An illustrative sketch of the shape of these lists is shown below.
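The operator names in this sketch are examples only, not the actual lists maintained in MXNet; only the structure (and the conditional_fp32_ops format, which matches the API below) is the point:

Code Block
languagepy
# Illustrative sketch of the kinds of operator lists AMP maintains.
# The operator names are examples, not the real lists in MXNet.
TARGET_DTYPE_OPS = ['Convolution', 'FullyConnected']     # always run in FP16
FP32_OPS = ['softmax', 'BatchNorm']                       # always run in FP32
WIDEST_DTYPE_OPS = ['elemwise_add', 'Concat']             # follow the widest input dtype
# (op name, parameter name, parameter values that force FP32)
CONDITIONAL_FP32_OPS = [('Activation', 'act_type', ['softrelu'])]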

I think we can use some of the ideas from AMP to add an API for converting a model to a mixed precision model, and add it under the AMP namespace. The proposal is elaborated below: I start with the API additions and backend changes, followed by a small section on API changes in Gluon to make it easier to convert to mixed precision models.

API Addition


Code Block
languagepy
def convert_model(sym, arg_params, aux_params, target_dtype="float16", target_dtype_ops=None,
                  fp32_ops=None, widest_dtype_ops=None,
                  conditional_fp32_ops=None, excluded_sym_names=None):
    """API for converting an FP32 model to a mixed precision model.
    MXNet tries to convert the FP32 model to a mixed precision model by adding
    cast layers using amp_cast and amp_multicast operators. The decision on
    which cast layers to add is based on hardcoded lists for Automatic Mixed Precision
    in MXNet. These lists can be overridden by the user by providing their own lists
    using: target_dtype_ops, fp32_ops, widest_dtype_ops, conditional_fp32_ops

    Parameters
    ----------
    sym : str or Symbol
        Defines the structure of a neural network for FP32 types.
    arg_params : dict
        Dictionary of name to `NDArray`.
    aux_params : dict
        Dictionary of name to `NDArray`.
    target_dtype : str
        Currently only supports float16. The target dtype indicates to add cast layers
        when possible so that lower precision computation can be leveraged.
    target_dtype_ops : list of strs
        Override the list of operator names casted to target_dtype.
        If None, uses the framework's default list to be casted to target dtype.
    fp32_ops : list of strs
        Override the list of operator names casted to FP32.
        If None, uses the framework's default list to be casted to FP32.
    widest_dtype_ops : list of strs
        A list of op names provided by the user which should run in the widest precision
        among their inputs.
        If None, uses the framework's default list of widest dtype ops.
    conditional_fp32_ops : list of (string, string, list of string)
        Override the list of operators to be casted to FP32.
        The format of the list is
        (name of the function, name of the parameter,
         list of values of the parameter that make the operator to be casted to FP32)
    excluded_sym_names : list of strs
        A list of strings that represent the names of symbols that users want to exclude
        from being converted to lower precision.
    """



target_dtype should decide which lists need to be overridden.
For example, bfloat16 support may be added in the future, in which case operator lists for bfloat16 will also be added to AMP.
target_dtype will then allow users to choose the right dtype for the mixed precision model.
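For illustration, a minimal sketch of how convert_model could be used on a symbolic model; the exact module path under the AMP namespace and the checkpoint prefix are assumptions:

Code Block
languagepy
import mxnet as mx
# Assumption: the proposed API lives under the AMP namespace, e.g. mxnet.contrib.amp.
from mxnet.contrib import amp

# Load an existing FP32 symbolic model (prefix and epoch are placeholders).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet50_fp32', 0)

# Convert to a mixed precision model, overriding one list for illustration.
sym, arg_params, aux_params = amp.convert_model(sym, arg_params, aux_params,
                                                target_dtype='float16',
                                                fp32_ops=['softmax'])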

...

Code Block
languagepy
def convert_hybrid_block(block, target_dtype="float16", target_dtype_ops=None,
                         fp32_ops=None, widest_dtype_ops=None, conditional_fp32_ops=None,
                         excluded_sym_names=None, input_names=['data']):
    """Given a hybrid block/symbol block representing a neural network of data type FP32 and target_dtype,
    return a block which will addwith mixed precision support for the block

    Parameters
    ----------
    block : HybridBlock or SymbolBlock object
        FP32 HybridBlock or SymbolBlock object
    target_dtype : str or numpy
        currently only supports float16. The target dtype indicates to add cast layers
        when possible so that lower precision computation can be leveraged.
    target_precision_ops : list of strs
        Override the list of operator names casted to target_dtype.
        If None, uses the framework's default list to be casted to target dtype.
    fp32_ops : list of strs
        Override the lists of operator names casted to FP32.
        If None, uses the framework's default list to be casted to FP32.
    widest_precision_ops : list of strs
        Override the list of operator names which should run in widest precision among its
        input arguments.
        If None, uses the framework's default list of widest_precision_ops.
    conditional_fp32_ops : list of (string, string, list of string)
        Override the list of functions casted to FP32.
        The format of the list is
        (name of the function, name of the parameter,
         list of values of the parameter that make the operator to be casted to
        fp32)
    excluded_sym_names : list of strs
        A list of strings that represent the names of symbols that users want to exclude
        from being quantized.
    input_names : list of strs
        A list of strings representing the names of input variables
	"""

The user experience will be similar to the export API experience today: users will have to call hybridize followed by one forward pass before calling convert_hybrid_block.
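A minimal sketch of that flow; the module path under the AMP namespace and the model choice are assumptions:

Code Block
languagepy
import mxnet as mx
from mxnet.gluon.model_zoo import vision
# Assumption: the proposed API lives under the AMP namespace, e.g. mxnet.contrib.amp.
from mxnet.contrib import amp

net = vision.resnet18_v1(pretrained=True)                  # FP32 pretrained Gluon model (placeholder)
net.hybridize()
_ = net(mx.nd.random.uniform(shape=(1, 3, 224, 224)))      # one forward pass to build the graph

# Convert to a mixed precision block: FP32 inputs, FP16 compute where safe.
net_mixed = amp.convert_hybrid_block(net, target_dtype='float16')
out = net_mixed(mx.nd.random.uniform(shape=(1, 3, 224, 224)))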

...