Release Notes for Apache MXNet (incubating) 1.3.1

General Remarks

Apache MXNet (incubating) 1.3.1 is a maintenance release incorporating important bug fixes and performance improvements. All users of Apache MXNet (incubating) 1.3.0 or earlier are advised to upgrade. You can install Apache MXNet (incubating) 1.3.1 at the usual place. Please review these Release Notes to learn what is new in this version, as well as important remarks concerning known issues and their workarounds.

Bug fixes

  • [MXNET-953] Fix oob memory read (v1.3.x) / #13118
    Simple bugfix addressing an out-of-bounds memory read.

  • [MXNET-969] Fix buffer overflow in RNNOp (v1.3.x) / #13119
    This fixes a buffer overflow detected by ASAN.

  • CudnnFind() usage improvements (v1.3.x) / #13123
    This PR improves MXNet's use of cudnnFind() to address a few issues:

    1. With the Gluon imperative style, cudnnFind() is called during forward(), and so might have its timings perturbed by other GPU activity (including potentially other cudnnFind() calls).
    2. With some CUDA driver versions, care is needed to ensure that the large I/O and workspace cudaMalloc() calls performed by cudnnFind() are immediately released and available to MXNet.
    3. cudnnFind() makes both conv I/O and workspace allocations that must be covered by the GPU global memory headroom defined by MXNET_GPU_MEM_POOL_RESERVE (a sketch of setting this variable follows these lists). Per issue #12662, large convolutions can result in out-of-memory errors even when MXNet's storage allocator has free memory in its pool.

    This PR addresses these issues, providing the following benefits:

    1. Consistent algo choice for a given convolution type in a model, both for instances on the same GPU and on other GPUs in a multi-GPU training setting.
    2. Consistent algo choice from run to run, based on eliminating sources of interference with the cudnnFind() timing process.
    3. Consistent model global memory footprint, both because of the consistent algo choice (algos can have markedly different workspace requirements) and because of changes to MXNet's use of cudaMalloc.
    4. Increased training performance based on being able to consistently run with models that approach the GPU's full global memory footprint.
    5. Adds a unit test for issue #12662 and resolves it.
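
    A minimal sketch of controlling the headroom mentioned in point 3 above. MXNET_GPU_MEM_POOL_RESERVE is the documented percentage of GPU global memory that MXNet's pooled allocator leaves unreserved (default 5); the value 10 here is an arbitrary illustration, not a recommendation:

    ```python
    import os

    # Percentage of GPU global memory kept out of MXNet's memory pool, so that
    # cuDNN I/O and workspace allocations made during cudnnFind() can succeed.
    # Set it in the environment before the process allocates GPU memory.
    os.environ['MXNET_GPU_MEM_POOL_RESERVE'] = '10'

    import mxnet as mx  # the variable is picked up when GPU memory is first managed
    ```
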
  • Set correct update on kvstore flag in dist_device_sync mode (v1.3.x) / #13121
    Fixes the first part of issue #12713. In distributed mode, update_on_kvstore was being overridden based only on async mode and sparse weights/gradients. This fix handles "dist_device_sync" and no longer ignores an explicit kvstore_params['update_on_kvstore'] value. A test case is added. Credits to @eric-haibin-lin (eric-haibin-lin@856b503).
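    A hedged sketch of the affected pattern, assuming an already-initialized Gluon model `net` and a distributed launch (e.g. via tools/launch.py). With the fix, an explicit update_on_kvstore setting is honored in "dist_device_sync" mode rather than silently overridden:

    ```python
    import mxnet as mx
    from mxnet import gluon

    kv = mx.kv.create('dist_device_sync')
    # `net` is an already-initialized Gluon model (assumed, not shown here).
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.1},
                            kvstore=kv,
                            update_on_kvstore=True)  # now respected in dist mode
    ```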

  • [MXNET-922] Fix memleak in profiler (v1.3.x) / #13120
    Fix a memleak reported locally by ASAN during a normal inference test.

  • Fix lazy record io when used with dataloader and multi_worker > 0 (v1.3.x) / #13124
    Fixes the multi_worker data loader when a record file is used. The MXRecordIO instance needs to acquire a new file handle after fork so that it can be safely used by multiple worker processes simultaneously.

    This fix also safely supersedes the previous temporary fixes #12093 and #11370.
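
    A minimal sketch of the affected pattern (the .rec path is hypothetical). With the fix, each DataLoader worker reopens the record file after fork instead of sharing a single handle:

    ```python
    from mxnet import gluon

    # Record-file-backed dataset read by several worker processes.
    dataset = gluon.data.vision.ImageRecordDataset('data/train.rec')
    loader = gluon.data.DataLoader(dataset, batch_size=32, num_workers=4)

    for data, label in loader:  # previously failed or corrupted reads
        pass  # training step goes here
    ```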

  • fixed symbols naming in RNNCell, LSTMCell, GRUCell (v1.3.x) / #13158
    This fixes #12783 by assigning every node in hybrid_forward a unique name. Some operations were performed without the appropriate (time) prefix attached to their name, which made serialized graphs impossible to deserialize.
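
    A hedged sketch of the round trip this unblocks; the export file name is arbitrary. Unrolling a cell over time now yields uniquely named nodes, so the exported graph can be deserialized:

    ```python
    import mxnet as mx
    from mxnet import gluon

    cell = gluon.rnn.LSTMCell(hidden_size=20)
    cell.initialize()
    cell.hybridize()

    states = cell.begin_state(batch_size=8)
    for t in range(3):  # each step now gets a unique time-prefixed node name
        out, states = cell(mx.nd.random.uniform(shape=(8, 10)), states)

    cell.export('lstm_cell')  # writes lstm_cell-symbol.json / lstm_cell-0000.params
    ```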

  • Fixed __setattr__ method of _MXClassPropertyMetaClass (v1.3.x) / #13157
    Fixes the __setattr__ method.

  • allow foreach on input with 0 length (v1.3.x) / #13151
    Fixes #12470. With this change, the output shape can be inferred correctly even when the input has zero length.
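
    A minimal sketch of the now-working case, using the contrib foreach operator with a zero-length iteration axis:

    ```python
    import mxnet as mx

    def step(data, states):
        # Body returns (output, new_states) for one iteration.
        return data + states[0], [states[0] + 1]

    data = mx.nd.zeros((0, 4))           # zero-length iteration axis
    states = [mx.nd.zeros((4,))]
    outs, final_states = mx.nd.contrib.foreach(step, data, states)
    print(outs.shape)                    # (0, 4): inferred correctly
    ```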

  • Infer dtype in SymbolBlock import from input symbol (v1.3.x) / #13117
    Fix for issue #11849.
    Previously, Gluon's SymbolBlock could not import a symbol with a type other than fp32: all parameters were created as fp32, which caused failures when importing params of type fp16, fp64, etc.
    In this PR, we infer the type of the symbol being imported and create the SymbolBlock parameters with that inferred type.
    Tests were added.
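
    A hedged sketch (with hypothetical file names) of importing a model exported in fp16, which previously failed because the parameters were always created as fp32:

    ```python
    import mxnet as mx
    from mxnet import gluon

    # The dtype of the parameters is now inferred from the imported symbol.
    net = gluon.SymbolBlock.imports('model-symbol.json', ['data'],
                                    'model-0000.params', ctx=mx.cpu())
    ```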

Documentation fixes

  • Document the newly added env variable (v1.3.x) / #13156
    Documents the environment variable MXNET_ENFORCE_DETERMINISM, added in PR #12992.

  • fix broken links (v1.3.x) / #13155
    This PR fixes broken links on the website.

  • fix broken Python IO API docs (v1.3.x) / #13154
    Fixes #12854: Data Iterators documentation is broken

    This PR manually specifies members of the IO module so that the docs render as expected. This is a workaround in the docs for a bug introduced in the Python code/structure since v1.3.0. See the comments for more info.

    This PR also fixes another issue that may or may not be related. Cross references to same-named entities like name, shape, or type confuse Sphinx: it seems to link to whatever same-named entity it last processed, not the one in the current module. To fix this you have to be very specific. Don't use type, use np.type if that's what you want. Otherwise you might end up with mxnet.kvstore.KVStore.type. This is a known Sphinx issue, so it might be something we have to deal with for the time being.

    It is important that any future modules recognize this issue and make an effort to map the params and other elements explicitly, as in the docstring sketch below.
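
    An illustrative docstring (not taken from the MXNet source) showing the fully qualified style that keeps Sphinx from linking to the wrong same-named entity:

    ```python
    def reshape(self, shape):
        """Return a view of this array with a new shape.

        Parameters
        ----------
        shape : tuple of int
            The target shape.

        Returns
        -------
        mxnet.ndarray.NDArray
            Written fully qualified: a bare ``NDArray`` (or ``type``,
            ``name``, ``shape``) can resolve to an unrelated entity
            such as ``mxnet.kvstore.KVStore.type``.
        """
    ```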

  • add/update infer_range docs (v1.3.x) / #13153
    This PR adds or updates the docs for the infer_range feature.

    1. Clarifies the param in the C op docs.
    2. Clarifies the param in the Scala symbol docs.
    3. Adds the param to the Scala ndarray docs.
    4. Adds the param to the Python symbol docs.
    5. Adds the param to the Python ndarray docs.
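
    An illustrative symbolic use of the parameter, hedged against its documented intent: with stop=None and infer_range=True, the stop value of the range is inferred during shape inference instead of being given up front:

    ```python
    import mxnet as mx

    # Construct a range whose stop value is left to shape inference.
    r = mx.sym.arange(start=0, stop=None, step=1, infer_range=True, name='r')
    ```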

Other Improvements

  • [MXNET-1179] Enforce deterministic algorithms in convolution layers (v1.3.x) / #13152
    Some of the cuDNN convolution algorithms are non-deterministic (see issue #11341). This PR adds an environment variable, MXNET_ENFORCE_DETERMINISM, to enforce determinism in the convolution operators. If set to true, only deterministic cuDNN algorithms will be used; if no deterministic algorithm is available, MXNet will error out.
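
    A minimal sketch of turning the new flag on; set it in the environment before running MXNet (the value '1' is assumed to mean true, matching MXNet's other boolean environment variables):

    ```python
    import os

    # Restrict cuDNN convolutions to deterministic algorithms; MXNet errors
    # out if no deterministic algorithm exists for a given layer.
    os.environ['MXNET_ENFORCE_DETERMINISM'] = '1'

    import mxnet as mx  # the flag is read when convolutions are set up
    ```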

Submodule updates

  • update mshadow (v1.3.x) / #13122
    Updates mshadow to enable OpenMP acceleration when nvcc is not present.

Known issues

The test test_operator.test_dropout has issues and has been disabled on the branch:

  • Disable flaky test test_operator.test_dropout (v1.3.x) / #13200

For more information and examples, see the full release notes.

Installation Information

Installation instructions can be found at: http://mxnet.incubator.apache.org/install/index.html

Stay informed about MXNet (incubating)

The following resources could be useful for finding answers to questions:

You can also follow the project on Twitter.

How to build MXNet

Please follow the instructions at https://mxnet.incubator.apache.org/install/index.html

List of submodules used by Apache MXNet (Incubating) and when they were updated last

Submodule :: Last updated by MXNet :: Last update in submodule

  1. cub@ :: Nov 29 2017 :: Jul 31 2017
  2. dlpack@ :: Mar 21 2018 :: Aug 23 2018
  3. dmlc-core@ :: Aug 16 2018 :: Nov 9 2018
  4. googletest@ :: Dec 14 2017 :: Nov 8 2018
  5. mkldnn@ :: May 8 2018 :: Nov 8 2018
  6. mshadow@ :: Nov 7 2018 :: Nov 7 2018
  7. onnx-tensorrt@ :: Aug 22 2018 :: Nov 10 2018
  8. openmp@ :: Nov 22 2017 :: Nov 8 2018
  9. ps-lite@ :: Jun 11 2018 :: Oct 9 2018
  10. tvm@ :: Sep 3 2018 :: Nov 12 2018