...

MXNet currently supports tensors of at most about 4 billion (2^32) elements. This is because uint32_t is the default data type for tensor sizes and for indexing variables. This limitation has caused many problems when models use larger tensors.
A naive solution to this problem is to replace every uint32_t in the MXNet backend source code with int64_t. This solution is not viable, however, for three reasons. First, many data structures use uint32_t as the data type of their members; unnecessarily changing these members to int64_t would increase memory consumption and create another limitation. Second, MXNet has many submodule dependencies, so updating the variable types in the MXNet repository is not enough: we also need to make sure that libraries such as MKLDNN and MShadow support the int64_t integer data type. Third, many front-end APIs assume an unsigned 32-bit integer interface, so updating only the C/C++ interface would cause all the language bindings to fail.

Therefore, we need a systematic approach to enhance MXNet to support large tensors.
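
The size limit and the memory trade-off described above can be illustrated with a small, self-contained Python sketch. It is not MXNet code: the shape and the two record types (Node32, Node64) are hypothetical, and ctypes is used only to mimic how a C uint32_t stores a value.

```python
import ctypes

UINT32_MAX = 2**32 - 1  # largest value a uint32_t can hold


def num_elements(shape):
    """Exact element count of a tensor shape (Python ints never overflow)."""
    total = 1
    for dim in shape:
        total *= dim
    return total


def as_uint32(value):
    """Value as a C uint32_t would store it: reduced modulo 2**32."""
    return ctypes.c_uint32(value).value


class Node32(ctypes.Structure):
    # Hypothetical record with two 32-bit members.
    _fields_ = [("begin", ctypes.c_uint32), ("end", ctypes.c_uint32)]


class Node64(ctypes.Structure):
    # The same record after blindly widening both members to 64 bits.
    _fields_ = [("begin", ctypes.c_int64), ("end", ctypes.c_int64)]


shape = (100_000_000, 50)            # 5 billion elements (illustrative)
true_size = num_elements(shape)
wrapped_size = as_uint32(true_size)  # what a uint32_t size field would hold

print(true_size > UINT32_MAX)        # True: the size does not fit in 32 bits
print(wrapped_size)                  # 705032704, silently wrong
print(ctypes.sizeof(Node32), ctypes.sizeof(Node64))  # 8 16
```

The wrapped value shows why a 32-bit size field fails silently rather than loudly, and the doubled record size shows why widening every member indiscriminately is costly.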

...

Deliverables

Support commonly used operators in MXNet
Investigate ways to evaluate and mitigate performance impact
Add benchmark tests for large tensor operators
Write a blog post about memory allocation research

Operators required by DGL

This project enables MXNet to support large tensors. It should also give future developers a guideline for choosing the correct data type when defining integer variables in the MXNet backend. We should also provide performance benchmarks, at both the operator level and the model level, comparing 64-bit and 32-bit integers. Moreover, we need to provide a mechanism that prevents future PRs from breaking this support.

The following spreadsheet tracks the operators that need to support large array operations:

Operators	Done	Test	Comments
ones	Y	test_large_array.py
zeros	Y	test_large_array.py
empty	Y	test_large_array.py
dot	Y	test_large_array.py
uniform	Y	test_large_array.py
broadcast_to	Y	test_large_array.py
clip	Y	test_large_array.py
take	Y	test_large_array.py
slice	Y	test_large_array.py
squeeze	Y	test_large_array.py
broadcast_div	Y	test_large_array.py
pick	Y	test_large_array.py
depth_to_space	PR under Review	test_large_array.py	PR: https://github.com/apache/incubator-mxnet/pull/14797
space_to_depth	PR under Review	test_large_array.py	PR: https://github.com/apache/incubator-mxnet/pull/14797
diag	N
pad	N
softmax	N
ravel_multi_index	N
unravel_index	N
topk	PR under Review		PR:
		https://github.com/zheng-da/incubator-mxnet/commit/bef7dffa8c90cb68a8f04aa8e88faf380c3fad2b	a list of operators currently not yet supported

Open Questions

How to verify all operators support large tensors
Impact to GPU
MKLDNN support
CuDNN support

...

Use index_t for indexing elements
Use dim_t for dimension size
Never use unsigned types, and avoid using uint32_t to declare a non-negative number even if it does not exceed 4 billion (see: https://google.github.io/styleguide/cppguide.html#Integer_Types)
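
The reasoning behind the guideline against unsigned types can be sketched in Python. The snippet below only mimics C integer arithmetic with ctypes; it assumes, for illustration, that the signed replacement type is 64 bits wide, which this page does not state explicitly. The helper names are hypothetical.

```python
import ctypes


def unsigned_diff(a, b):
    """a - b as computed in uint32_t arithmetic (wraps modulo 2**32)."""
    return ctypes.c_uint32(a - b).value


def signed_diff(a, b):
    """a - b as computed in int64_t arithmetic (assumed signed 64-bit type)."""
    return ctypes.c_int64(a - b).value


# Offset between two positions where the second is larger than the first.
print(unsigned_diff(3, 5))  # 4294967294: a huge bogus offset
print(signed_diff(3, 5))    # -2: the mathematically expected result

# A bounds check such as "offset >= 0" can never fail for an unsigned value,
# so the wrapped result above would pass it unnoticed.
print(unsigned_diff(3, 5) >= 0)  # True
```

With a signed type the negative intermediate result stays visible and can be caught by an ordinary range check.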

Addition of New APIs

...

Challenges

How to address this problem across all submodules
How to address this problem across all language bindings
GPU and MKLDNN support
This change potentially suffers from memory fragmentation. Allocating a very large contiguous chunk of memory is not easy for the OS, which uses virtual memory to achieve it. Virtual memory, however, is still limited by the page size the OS uses to allocate physical memory. If memory is so fragmented that the OS cannot allocate even a chunk the size of the smallest page, it returns a memory allocation error.
- Approach to mitigate this problem: introduce a new data structure based on the requirements of customers who need to create very large tensors. The new data structure can be optimized for either random access or contiguous access, and should address memory fragmentation by allocating memory in smaller chunks.
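
One possible shape for such a data structure is sketched below in Python. This is only an illustration of the chunking idea, not a proposed MXNet design: the class name ChunkedBuffer, its parameters, and the choice of float32 storage are all assumptions.

```python
from array import array


class ChunkedBuffer:
    """Hypothetical 1-D float32 buffer stored as fixed-size chunks.

    No single allocation exceeds `chunk_size` elements, so the allocator
    never has to find one huge contiguous region.
    """

    def __init__(self, size, chunk_size):
        if size < 0 or chunk_size <= 0:
            raise ValueError("size must be >= 0 and chunk_size > 0")
        self.size = size
        self.chunk_size = chunk_size
        self.chunks = []
        remaining = size
        while remaining > 0:
            n = min(chunk_size, remaining)
            self.chunks.append(array("f", [0.0]) * n)  # zero-filled chunk
            remaining -= n

    def _locate(self, index):
        """Map a flat index to (chunk number, offset inside that chunk)."""
        if not 0 <= index < self.size:
            raise IndexError(index)
        return divmod(index, self.chunk_size)

    def __getitem__(self, index):
        chunk, offset = self._locate(index)
        return self.chunks[chunk][offset]

    def __setitem__(self, index, value):
        chunk, offset = self._locate(index)
        self.chunks[chunk][offset] = value


buf = ChunkedBuffer(size=10, chunk_size=4)  # chunks of 4, 4 and 2 elements
buf[9] = 1.5
print(len(buf.chunks), buf[9])
```

Random access costs one extra division per lookup, while contiguous access within a chunk stays as cheap as in a flat array; which of the two to favor is the trade-off mentioned above.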

Backward compatibility

We should support all existing operators with uint32_t data types.

...

Add a nightly test that exercises all existing operators with tensor sizes over 5 billion. To test each operator in Python, we can leverage the existing check_speed() utility function.
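
Before such a test runs, it needs a shape that crosses the 32-bit boundary and a rough idea of how much memory the tensor takes, so that it can be skipped on small machines. The sketch below covers only that preparatory step in plain Python; the helper names and the column count of 50 are assumptions, and it does not call MXNet or check_speed().

```python
import math

UINT32_LIMIT = 2**32          # first element count a uint32_t cannot represent
TARGET_ELEMENTS = 5 * 10**9   # "over 5 billion", taken from the test plan


def large_2d_shape(min_elements, cols):
    """Return (rows, cols) with strictly more than min_elements elements."""
    rows = min_elements // cols + 1
    return (rows, cols)


def required_bytes(shape, itemsize):
    """Bytes needed to hold one dense tensor of this shape."""
    return math.prod(shape) * itemsize


shape = large_2d_shape(TARGET_ELEMENTS, cols=50)
size = math.prod(shape)
gib = required_bytes(shape, itemsize=4) / 2**30  # float32 tensor

print(shape, size > UINT32_LIMIT, round(gib, 1))
```

A single float32 tensor of this size already needs more than 18 GiB, which is why these tests belong in a nightly run on suitably sized hosts rather than in the per-PR suite.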

Alternative Approaches

TBD

References

...
