JIRA link to this project: https://issues.apache.org/jira/browse/MXNET-1184

Problem

MXNet currently supports tensors of at most about 4 billion (2^32) elements, because uint32_t is the default data type for tensor sizes and for indexing variables. This limit causes problems whenever a model uses larger tensors.
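The wraparound behind this limit can be illustrated with a minimal sketch. It uses Python's ctypes module to emulate storing an element count in a 32-bit unsigned integer; it is not MXNet code, and the helper names element_count and as_uint32 as well as the example shape are hypothetical.

```python
import ctypes
from math import prod


def element_count(shape):
    """Exact number of elements in a tensor of the given shape."""
    return prod(shape)


def as_uint32(value):
    """Value that remains after storing `value` in a 32-bit unsigned integer.

    ctypes performs no overflow checking, so the value wraps modulo 2^32,
    which mirrors what happens to a uint32_t in C/C++.
    """
    return ctypes.c_uint32(value).value


# Hypothetical shape chosen so that the element count exceeds 2^32.
shape = (70000, 70000)
true_size = element_count(shape)   # 4,900,000,000 elements
stored_size = as_uint32(true_size)  # wrapped value seen by 32-bit code

print(true_size, stored_size)
```

Under these assumptions the stored size is far smaller than the true size, so any buffer allocation or loop bound derived from it would silently cover only part of the tensor.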
A naive solution is to replace every uint32_t in the MXNet backend source code with int64_t. This is not viable for three reasons. First, many data structures use uint32_t as the data type of their members; replacing these with int64_t where it is not needed increases memory consumption and creates another limitation. Second, MXNet has many submodule dependencies, so updating the variable types in the MXNet repository is not enough; we also need to make sure that libraries such as MKLDNN and MShadow support the int64_t data type. Third, many front-end APIs assume an unsigned 32-bit integer interface, so updating only the C/C++ interface will break all the language bindings.
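As a rough illustration of the memory concern, the sketch below compares the size of a buffer of indices stored as 32-bit and as 64-bit integers. The helper index_buffer_bytes and the count of 100 million indices are assumptions made for this example, not figures taken from MXNet.

```python
import ctypes


def index_buffer_bytes(num_indices, ctype):
    """Bytes needed to store `num_indices` values of the given C integer type."""
    return num_indices * ctypes.sizeof(ctype)


# Hypothetical data structure holding 100 million index values.
num_indices = 100_000_000
bytes_32 = index_buffer_bytes(num_indices, ctypes.c_uint32)
bytes_64 = index_buffer_bytes(num_indices, ctypes.c_int64)

print(bytes_32, bytes_64)
```

The 64-bit buffer is exactly twice as large, which is why widening every integer member regardless of need is unattractive.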

Therefore, we need a systematic approach to enhance MXNet to support large tensors.

Goals

The goal of this project is to enable MXNet to support large tensors. It should also give future developers a guideline for choosing the correct data type when defining integer variables in the MXNet backend. We should also provide performance benchmarks, at both the operator level and the model level, comparing 64-bit and 32-bit integers. Finally, we need a mechanism that prevents future PRs from breaking this support.

Open Questions

  • How to verify that all operators support large tensors
  • Impact on GPU
  • MKLDNN support
  • CuDNN support

Proposed Approach

To support large tensor operations in the MXNet backend, we need to update the following:
1) Support large tensor sizes in the NDArray data structure. We need to make sure that the data structure of a tensor can hold a sufficiently large number of elements.

2) Allow index loops to go beyond 2^31:
In the CPU operator implementations, the kernel always uses a Map() function to process each data element. The indexing variable needs to be of type int64_t.
A PR has been submitted to address a subset of the operators:
https://github.com/apache/incubator-mxnet/pull/13418

3) Update the API interfaces
This involves the API interfaces between the MXNet backend and the different front-end languages.

In addition to the native integer types, the MXNet backend uses two defined data types: index_t and dim_t. Earlier PRs have been submitted to use int64_t for index_t and dim_t:
https://github.com/apache/incubator-mxnet/pull/11742
https://github.com/dmlc/mshadow/pull/348
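To show why the indexing variable in step 2 needs 64 bits, here is a minimal sketch that computes a row-major flat index and then emulates storing it in a signed 32-bit integer. It is written in Python with ctypes purely for illustration; the helpers flat_index and as_int32 and the example shape are hypothetical and do not reflect the actual Map() implementation.

```python
import ctypes

INT32_MAX = 2**31 - 1


def flat_index(multi_index, shape):
    """Row-major offset of `multi_index` within a tensor of the given shape."""
    offset = 0
    for i, dim in zip(multi_index, shape):
        offset = offset * dim + i
    return offset


def as_int32(value):
    """Value that remains after storing `value` in a signed 32-bit integer.

    ctypes performs no overflow checking, so large values wrap around.
    """
    return ctypes.c_int32(value).value


# Hypothetical 50000 x 50000 tensor: 2.5 billion elements.
shape = (50000, 50000)
last = flat_index((49999, 49999), shape)  # offset of the final element
wrapped = as_int32(last)                  # what a 32-bit signed index would hold

print(last, last > INT32_MAX, wrapped)
```

In this sketch the offset of the final element exceeds the signed 32-bit maximum and wraps to a negative number, whereas a 64-bit index represents it exactly.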

Addition of New APIs

No new APIs will be added.

Backward compatibility

We should support all existing operators with uint32_t data types.

Performance Considerations

Since this only changes the data type of the indexing variables, not the data type of the elements themselves, we do not expect a noticeable performance impact on CPU. However, there may be a performance impact on GPU, which we need to verify.

Test Plan

  • Add a nightly test that runs all existing operators with tensor sizes over 5 billion elements. To test each operator in Python, we can leverage the existing check_speed() utility function.
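When sizing such a nightly test, it helps to know which shape crosses the threshold and how much memory one input needs. The sketch below is a planning aid only and does not call any MXNet API; the helpers large_test_shape and tensor_bytes, the column count of 1000, and the 4-byte element size (float32) are assumptions.

```python
def large_test_shape(min_elements, cols):
    """Smallest (rows, cols) shape with strictly more than `min_elements` elements."""
    rows = min_elements // cols + 1
    return (rows, cols)


def tensor_bytes(shape, itemsize=4):
    """Bytes needed for one dense tensor; itemsize=4 assumes float32 elements."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


# Threshold of 5 billion elements, as stated in the test plan.
shape = large_test_shape(5_000_000_000, cols=1000)
needed = tensor_bytes(shape)

print(shape, needed)
```

Under these assumptions a single float32 input already needs about 20 GB, so the nightly test hosts must be provisioned accordingly.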

Alternative Approaches

TBD

Technical Challenges

  • How to address this problem across all submodules
  • How to address this problem across all language bindings
  • GPU and MKLDNN support
