...

Currently, MXNet supports a maximum tensor size of only about 4 billion elements (2^32). This is because uint32_t is used as the default data type for tensor sizes as well as for indexing variables. This limitation has caused many problems when larger tensors are used in a model.
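To make the limit concrete, the following minimal sketch shows what happens to an element count once it no longer fits in 32 bits. It uses Python's ctypes only to imitate the wraparound of a C uint32_t; the shape and the helper names (num_elements, as_uint32, as_int64) are illustrative assumptions, not MXNet code.

```python
import ctypes
from functools import reduce
from operator import mul


def num_elements(shape):
    """Exact number of elements in a tensor of the given shape."""
    return reduce(mul, shape, 1)


def as_uint32(n):
    """Value of n after being stored in a C uint32_t (wraps modulo 2^32)."""
    return ctypes.c_uint32(n).value


def as_int64(n):
    """Value of n after being stored in a C int64_t."""
    return ctypes.c_int64(n).value


# Hypothetical shape chosen only because it holds 5 billion elements.
shape = (50_000, 100_000)
exact = num_elements(shape)

print("exact element count:", exact)
print("stored as uint32_t: ", as_uint32(exact))  # wrapped, no longer the true size
print("stored as int64_t:  ", as_int64(exact))   # preserved
```

A size or index that wraps in this way silently refers to the wrong element rather than raising an error, which is why the problem tends to surface as incorrect results rather than as a clean failure.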
A naive solution to this problem is to replace every uint32_t in the MXNet backend source code with int64_t. This solution is not viable, however, for three reasons. First, many data structures use uint32_t as the data type for their members. Unnecessarily changing these variables to int64_t will increase memory consumption, creating another limitation. Second, MXNet has many submodule dependencies, so updating the variable types in the MXNet repository is not enough. We also need to make sure that the dependent libraries, such as MKLDNN and MShadow, support the int64_t integer data type. Third, many front-end APIs assume an unsigned 32-bit integer interface. Updating the interface only in C/C++ will cause all the language bindings to fail.
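The memory cost of a blanket type change can be estimated with a small sketch. It assumes a hypothetical record with three integer members (not an actual MXNet structure) and uses the struct module's standard sizes, where a uint32_t occupies 4 bytes and an int64_t occupies 8 bytes.

```python
import struct

# Hypothetical record with three integer members, for illustration only.
# "=" selects standard sizes with no padding: "I" is uint32_t, "q" is int64_t.
RECORD_UINT32 = struct.calcsize("=3I")
RECORD_INT64 = struct.calcsize("=3q")


def array_bytes(record_size, count):
    """Bytes needed to store `count` records of `record_size` bytes each."""
    return record_size * count


count = 10**9  # one billion records, an assumed workload
bytes_32 = array_bytes(RECORD_UINT32, count)
bytes_64 = array_bytes(RECORD_INT64, count)

print("bytes per record (uint32_t members):", RECORD_UINT32)
print("bytes per record (int64_t members): ", RECORD_INT64)
print("growth factor for the whole array:  ", bytes_64 / bytes_32)
```

Under these assumptions the storage for such records doubles, which is the reason to widen only the variables that actually hold tensor sizes or indices.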

Therefore, we need a systematic approach to enhance MXNet to support large tensors.

Deliverables

  1. Support commonly used operators in MXNet
  2. Investigate ways to evaluate and mitigate performance impact
  3. Add Benchmark tests for large tensor operators
  4. Write a blog post about memory allocation research

Operators to be supported

The goal of this project is to enable MXNet to support large tensors. It should also provide guidelines for future developers on the correct data type to choose when defining an integer variable in the MXNet backend. We should also provide a performance benchmark, at the operator level as well as the model level, comparing 64-bit and 32-bit integers. Moreover, we need to provide a mechanism to prevent future PRs from breaking this support.

The following JIRA epic keeps track of the operators that need to support large array operations:

...

Operators that have been supported and the ones still to be supported:

JIRA epic: MXNET-1184 (ASF JIRA)

If support for additional operators is required, please add a task to the JIRA epic.

Open Questions

...

  • How to verify that all operators support large tensors
  • Impact on GPU
  • MKLDNN support
  • CuDNN support

...

  • Add a nightly test that exercises all existing operators with tensor sizes over 5 billion. To test each operator in Python, we can leverage the existing check_speed() utility function.

...