JIRA link to this project: https://issues.apache.org/jira/browse/MXNET-1184

Problem

Currently, MXNet only supports a maximal tensor size of around 4 billion elements (2^32). This is because uint32_t is used as the default data type for tensor size as well as for indexing variables. This limitation has created many problems when larger tensors are used in a model.
A naive solution to this problem is to replace all uint32_t in the MXNet backend source code with int64_t. This solution is not viable, however, for several reasons. First, many data structures use uint32_t as the data type for their members, and unnecessarily widening these variables to int64_t will increase memory consumption, creating another limitation. Second, MXNet has many submodule dependencies, so updating the variable types in the MXNet repository is not enough; we also need to make sure the different libraries, such as MKLDNN and mshadow, support the int64_t integer data type. Third, many front-end APIs assume an unsigned 32-bit integer interface, and updating only the C/C++ interface will cause all the language bindings to fail.

Therefore, we need a systematic approach to enhance MXNet to support large tensors.

Operators to be supported

This project is to enable MXNet to support large tensors. It should also provide a guideline for future developers on the correct data type to choose when defining integer variables in the MXNet backend. We should also provide performance benchmarks, at the operator level as well as the model level, comparing 64-bit and 32-bit integers. Moreover, we need to provide a mechanism to prevent future PRs from breaking this support.

The following epic keeps track of operators that have been supported and the ones still to be supported: MXNET-1184 (https://issues.apache.org/jira/browse/MXNET-1184).

If support for additional operators is required, please add a task to the JIRA epic.

Open Questions

  • MKLDNN support
  • CuDNN support

Challenges

  • How to address this problem across all submodules
    • mixed data types are used in several submodules, such as mshadow, dlpack, etc.
  • How to address this problem across all language bindings
    • C APIs are used by many language bindings
  • CUDNN and MKLDNN support
    • Need to make sure CUDNN and MKLDNN operators also support large indices
  • Performance impact
    • there are known performance differences between int32 and int64 operations. How do we make sure these differences do not cause severe performance regressions in model training and inference?
    • increased memory footprint
  • Backward compatibility

Proposed Approach

Due to the challenges mentioned above, we plan to take a staged approach to developing this feature.

Stage 1: Use a compiler flag to enable/disable int64 support. By default, this switch is off to prevent any performance impact.

Stage 2: Benchmark performance of int64 against int32; fix the performance differences that cause severe impact to training and inference. Turn on the compiler flag by default.

To support large tensor operations in MXNet, we need to update the following:
1) Support large tensor size in the NDArray data structure. We need to make sure the data structure of a tensor can hold a sufficiently large number of elements; see the sketch below.
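
One subtlety worth calling out: even when every individual dimension fits in 32 bits, the total element count can overflow. A minimal sketch of this point (the function name is illustrative, not the actual NDArray/TShape code):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: the total element count of a shape must be
// accumulated and returned in a 64-bit type, because the product of
// dimensions can exceed 2^32 even when each dimension fits in 32 bits.
int64_t NumElements(const std::vector<int64_t>& shape) {
  int64_t size = 1;  // a uint32_t accumulator would overflow here
  for (int64_t dim : shape) {
    size *= dim;
  }
  return size;
}

// Example: a shape of (5, 1000000000) has 5 billion elements,
// which does not fit in uint32_t (max ~4.29 billion).
```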

2) Allow index loops to go beyond 2^31:
In the CPU operator implementation, the kernel always uses a Map() function to process each data element. The indexing variable needs to use int64_t.
A PR has been submitted to address a subset of the operators:
https://github.com/apache/incubator-mxnet/pull/13418
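
A simplified sketch of this kernel pattern follows; the names here are illustrative stand-ins for the actual MXNet kernel classes. The key change is that the per-element index is int64_t, so the launcher can iterate past 2^31 elements:

```cpp
#include <cstdint>

// Simplified sketch of the Map() kernel pattern described above.
struct plus_one {
  // Processes the single element at index i. The index parameter is
  // int64_t rather than a 32-bit type.
  static inline void Map(int64_t i, float* out, const float* in) {
    out[i] = in[i] + 1.0f;
  }
};

template <typename OP, typename... Args>
void LaunchCPU(int64_t N, Args... args) {
  // The loop variable must also be 64-bit; a 32-bit int would
  // overflow once N exceeds 2^31 - 1.
  for (int64_t i = 0; i < N; ++i) {
    OP::Map(i, args...);
  }
}

// Usage: LaunchCPU<plus_one>(num_elements, out_ptr, in_ptr);
```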

3) Update different API interfaces
This involves the API interfaces between the MXNet backend and the different front-end languages.
E.g., on the language-binding side (a front-end language like Python), int64_t values need to be passed to the C++ backend (in Python, via `ctypes.c_int64`); otherwise, the value gets truncated on the language-binding side.


There are two defined data types used in the MXNet backend in addition to the native integer types: index_t and dim_t. Earlier PRs have been submitted to use int64_t for index_t and dim_t:
https://github.com/apache/incubator-mxnet/pull/11742
https://github.com/dmlc/mshadow/pull/348
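
Conceptually, the compile-time switch from Stage 1 gates these typedefs. A sketch of the idea, with the flag name borrowed from the mshadow PR above (treat the exact spelling and defaults here as assumptions, not the literal source):

```cpp
#include <cstdint>

// Sketch of the stage-1 compile-time switch for the index types.
#ifndef MSHADOW_INT64_TENSOR_SIZE
#define MSHADOW_INT64_TENSOR_SIZE 0  // off by default: no perf impact
#endif

#if MSHADOW_INT64_TENSOR_SIZE == 1
typedef int64_t index_t;  // indexing into a tensor's elements
typedef int64_t dim_t;    // size of a single dimension
#else
typedef int32_t index_t;
typedef int32_t dim_t;
#endif
```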

Since int32 is used by many language bindings, we will add extra 64-bit C APIs and use them in the Python bindings first. Other language bindings can choose to use the 64-bit APIs if they also plan to support large indices.
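
The shape of this parallel API might look like the following sketch; the function and type names are hypothetical stand-ins, not the actual MXNet C API:

```cpp
#include <cstdint>

// Hypothetical sketch of the parallel-API approach: keep the existing
// 32-bit C API for backward compatibility and add a 64-bit variant
// alongside it. Names are illustrative only.
typedef void* NDArrayHandle;

// Existing-style interface: shape values reported as 32-bit integers.
int MXNDArrayGetShape(NDArrayHandle handle,
                      int* out_dim,
                      const uint32_t** out_pdata);

// Added 64-bit interface: same semantics, but int64_t shape values,
// so bindings that opt in can see dimensions beyond 2^32.
int MXNDArrayGetShape64(NDArrayHandle handle,
                        int* out_dim,
                        const int64_t** out_pdata);
```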

Future Development Guideline

We should also document the guideline for future development:

Backward compatibility

All existing operators must continue to work with the uint32_t data types.

Performance Considerations

Since this only changes the data type of the indexing variables, not the data type of the elements themselves, we do not expect an obvious performance impact on CPU. However, there may be a performance impact on GPU, and we need to verify that.

Test Plan

  • Add a nightly test that tests all existing operators with tensor sizes over 5 billion elements. To test each operator in Python, we can leverage the existing check_speed() utility function.





6 Comments

  1. Thanks, Yuan. I'm interested in this new feature and it sounds very attractive.


    As I know, a common trick for getting better performance is to keep more temp buffers in MXNet; for example, the MKL-DNN engine keeps the allocated primitives, memory descriptors, and buffers from the first run and then reuses them in the following iterations.

    So, for very large tensors as inputs, maybe we need to re-evaluate the balance between memory usage and the performance improvements.


    Could you help provide more background or (proxy) test cases for this kind of application for the evaluation?

    Feel free to let me know if there is anything I can help with.


    Thanks,

    --Patric


  2. Hi Patric,

    Currently, I am mostly evaluating based on the nightly test at https://github.com/apache/incubator-mxnet/blob/master/tests/nightly/test_large_array.py

    We are also running an internal end-to-end training on resnet50 to evaluate runtime and memory change.

    Do you have any suggestions regarding tools/frameworks to better calibrate the runtime and memory at the operator level?


    Thanks,


    Lin

    1. The current test case is too simple to reflect the real situation. 

      Feel free to ping me if you encounter memory issues for large inputs, and then we can look into the details.

  3. The portable type for sizes and loops is size_t, not int64; if anything, uint32 should be changed to size_t.


    I doubt that the increase in memory size of data structures that hold element sizes would be a problem when changing to size_t where applicable.

  4. Thanks Pedro. size_t is a good option whenever it's applicable. 

    However, there are a few places where int64_t is a better option:

    1) as a data member inside a struct/class.

    The length of size_t is platform-dependent. This may make the size of a struct dependent on the platform.

    Please see more discussion in this PR: https://github.com/dmlc/mshadow/pull/348

    2) as a looping index

    OpenMP requires the looping index to be a signed integer. It would be inconsistent to use size_t for non-OMP blocks and int64_t for OMP blocks; see the sketch below.
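
    For illustration, a minimal sketch of the OpenMP constraint mentioned above (the function is hypothetical):

```cpp
#include <cstdint>

// Minimal illustration of the OpenMP point above: older OpenMP
// specifications require the parallel-for loop variable to be a
// signed integer type, so a signed int64_t is the portable 64-bit
// choice, while an unsigned size_t may be rejected by some compilers.
void scale(float* data, int64_t n, float k) {
  #pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {  // signed 64-bit loop index
    data[i] *= k;
  }
}
```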


  5. Maybe you can see if you can use this for benchmarking operator-level performance: https://github.com/sandeep-krishnamurthy/dl-operator-benchmark. It provides a one-line command to run benchmarks for all operators with different inputs (e.g., tensor shapes, dtype, ctx, etc.) and also provides defaults for all the operators. It supports all NDArray operators and Gluon blocks (e.g., Conv2D). It is still WIP; I will send more details to the dev@ list soon.