Background

cudaMalloc and cudaFree can be much more expensive than the equivalent C standard library malloc and free [1].

To keep this cost from becoming a bottleneck for scripts using MXNet, MXNet provides customers with strategies for maintaining a memory pool.

Here is how the memory pool manager works (a simplified sketch follows the list):

  1. Before allocating, it checks whether the memory pool already holds a pointer to memory of the size to be allocated.
  2. If not, it checks whether the requested size is less than the available unreserved memory and, if so, allocates memory of that size. If the requested size is greater than the available unreserved memory, the whole memory pool is freed first.
  3. If a pointer of that size is already present, it reuses the memory from the pool.
  4. When Free is called, the pointer is released back to the pool instead of being freed with cudaFree.
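
To make the flow concrete, here is a minimal Python sketch of the pooling logic described above. This is illustrative only: the real implementation is MXNet's C++ pooled GPU storage manager, and the names below (PooledAllocator, cuda_malloc, cuda_free) are placeholders.

class PooledAllocator:
    """Illustrative sketch of a size-bucketed GPU memory pool (not the real MXNet code)."""

    def __init__(self, total_gpu_bytes, reserve_bytes, cuda_malloc, cuda_free):
        self.pool = {}                       # size -> list of free device pointers
        self.held_bytes = 0                  # bytes obtained from the CUDA API so far
        self.total_gpu_bytes = total_gpu_bytes
        self.reserve_bytes = reserve_bytes   # memory left untouched for other applications
        self.cuda_malloc = cuda_malloc       # placeholder wrappers around cudaMalloc/cudaFree
        self.cuda_free = cuda_free

    def alloc(self, size):
        free_list = self.pool.get(size)
        if free_list:                        # steps 1 and 3: reuse a pooled pointer of this size
            return free_list.pop()
        unreserved = self.total_gpu_bytes - self.reserve_bytes - self.held_bytes
        if size > unreserved:                # step 2: not enough room, release the whole pool first
            self.release_all()
        self.held_bytes += size
        return self.cuda_malloc(size)        # fall back to the CUDA API

    def free(self, ptr, size):
        # step 4: the pointer goes back to the pool; cudaFree is not called here
        self.pool.setdefault(size, []).append(ptr)

    def release_all(self):
        for size, ptrs in self.pool.items():
            for ptr in ptrs:
                self.cuda_free(ptr)
                self.held_bytes -= size
        self.pool.clear()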

There are two types of memory pooling available for GPU: 1. Round 2. Naive. Naive allows any size to be added to the memory pool, while Round uses nearest power-of-2 (exponential) rounding and nearest-multiple linear rounding to help alleviate memory stress, which can be useful in RNN use cases.
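
As a rough illustration of the Round behaviour, here is a hypothetical sketch assuming power-of-2 rounding below the MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF cutoff and linear multiples above it (the round_size helper is not part of MXNet's API):

def round_size(size, linear_cutoff=24):
    """Hypothetical sketch of Round-style size rounding (not the actual MXNet code)."""
    if size < (1 << linear_cutoff):
        # below the cutoff: round up to the nearest power of 2
        return 1 << (size - 1).bit_length()
    # at or above the cutoff: round up to the nearest multiple of 2^linear_cutoff
    bucket = 1 << linear_cutoff
    return ((size + bucket - 1) // bucket) * bucket

print(round_size(5000))         # 8192
print(round_size(40_000_000))   # 50331648 (next multiple of 2^24)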


Problem

MXNet users who profile the memory of their scripts on GPU are often unsure why memory consumption stays high and does not change regardless of which ops execute. This is because memory may be reused from the memory pool. Customers want to visualize the free and in-use parts of the memory pool to decide whether to reserve more memory for the pool (MXNET_GPU_MEM_POOL_RESERVE) and to understand how it will behave on a host that is running other applications.


There are currently five env variables for the memory pool manager:

  1. MXNET_GPU_MEM_POOL_RESERVE
  2. MXNET_GPU_MEM_POOL_TYPE
  3. MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF
  4. MXNET_GPU_MEM_LARGE_ALLOC_ROUND_SIZE
  5. MXNET_GPU_MEM_POOL_PAGE_SIZE

You can find additional documentation on these variables here [2].
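
For example, these could be pinned at the top of a script before MXNet is imported (the values below are illustrative, not recommendations):

import os

# Illustrative values only; set these before MXNet creates its GPU storage manager.
os.environ['MXNET_GPU_MEM_POOL_TYPE'] = 'Round'
os.environ['MXNET_GPU_MEM_POOL_RESERVE'] = '5'                # percent of GPU memory kept out of the pool
os.environ['MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF'] = '24'   # pow2 rounding below 2^24 bytes
os.environ['MXNET_GPU_MEM_LARGE_ALLOC_ROUND_SIZE'] = '2097152'
os.environ['MXNET_GPU_MEM_POOL_PAGE_SIZE'] = '4096'

import mxnet as mx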

Setting these environment variables can be difficult, given that users have no data other than their code to decide which pool type to set, how big the pool should be, and so on.

For training use cases, visualizing the memory pool consumption of server and worker processes can give insights into performance improvements.

This can also be used for EIA: we can provide customers with an API to obtain profiling results on the accelerator, with detailed data for allocations from the pool, allocations from the CUDA API, used memory in the pool, and free memory in the pool.

It can also be used when customers run multiple MXNet processes on the same machine, split memory amongst the processes, and want to get the best performance out of each process.


Solution

Currently, when visualizing the profiler output with chrome tracing, one can see the memory allocated, but not the size of the memory pool. Here is an example:


As you can see from the screenshot above, the ndarray creation doesn't cause any change in GPU memory. This is because the memory is being allocated from the pool.

The proposed solution is to change the profiler code so that Occupied Pool Size, Free Pool Size, Memory Allocated from Pool, and Memory Allocated from CUDA API are recorded and made available for visualization with chrome tracing.

These four additional metrics should allow customers to make a better choice of memory pool type and of the amount of memory to reserve for the pool.
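
As an alternative to MXNET_PROFILER_AUTOSTART (used in the prototype below), a user could also enable the profiler programmatically; here is a sketch assuming the new pool counters are emitted into the same profile.json as the existing GPU memory counters:

import mxnet as mx

# Enable the existing profiler; with the proposed changes, Occupied Pool Size,
# Free Pool Size, Memory Allocated from Pool and Memory Allocated from CUDA API
# would appear alongside the current GPU memory counters.
mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile.json')
mx.profiler.set_state('run')

x = mx.nd.ones((10000, 10000), ctx=mx.gpu(0))
mx.nd.waitall()

mx.profiler.set_state('stop')
mx.profiler.dump()  # then load profile.json in chrome tracing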

Examples below:

Running the following simple script:


import mxnet as mx

# Around 40 GB of memory is allocated and released in total.
arrays = []
for i in range(100):
    print(i)
    x = mx.nd.ones((10000 + i, 10000 + i), ctx=mx.gpu(0))
    mx.nd.waitall()




First, I try with the following env variables:

export MXNET_GPU_MEM_POOL_RESERVE=99
export MXNET_GPU_MEM_POOL_TYPE=Round

Here is the profiler output:


From the above you can see the “free” and “used” pool sizes and the gaps (where memory is released and both used and free drop to 0).
You can see that the total memory allocated from the pool is 39 GB and the remaining 1 GB is from the CUDA API.

You can compare it with other configurations, for example:

export MXNET_GPU_MEM_POOL_RESERVE=99
export MXNET_GPU_MEM_POOL_TYPE=Naive



As you can see, the memory allocated from the pool is 36 GB and the remaining 3 GB is from the CUDA API.

The reason is that, in the example above, the shape is different for each of the 100 ndarrays. In such a scenario, Round maps many of the slightly different sizes into the same rounded bucket, allowing fewer CUDA memory allocations and more reuse from the pool, whereas Naive only reuses pointers of exactly matching sizes.
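
A quick back-of-the-envelope check of that claim, assuming the linear rounding granularity above the cutoff is 2^24 bytes as sketched earlier:

# Sizes of two consecutive ndarrays from the script above, in bytes (float32).
size_0 = 10000 * 10000 * 4   # 400,000,000
size_1 = 10001 * 10001 * 4   # 400,080,004

bucket = 1 << 24             # assumed linear rounding granularity above the cutoff
rounded_0 = -(-size_0 // bucket) * bucket   # ceiling division to the next bucket
rounded_1 = -(-size_1 // bucket) * bucket

print(rounded_0 == rounded_1)   # True: Round can hand back the pooled pointer,
                                # while Naive sees two distinct sizes and calls cudaMalloc again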


How to run initial prototype


git clone --recursive https://github.com/anirudh2290/mxnet
cd mxnet
git checkout memory_profiler_poc2
mkdir build
cd build && cmake VERBOSE=1 -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_OPENMP=ON -DCMAKE_BUILD_TYPE=Debug -DUSE_DIST_KVSTORE=0 -DUSE_OPENCV=0 -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DCUDNN_ROOT=/usr/local/cuda -DUSE_MKLDNN=1 -DUSE_MKL_IF_AVAILABLE=1 -DUSE_MKLML_MKL=1 -DUSE_ASAN=0 -GNinja -DUSE_OPERATOR_TUNING=1 -DUSE_CPP_PACKAGE=0 -DCUDA_ARCH_NAME=Auto ..
ninja
# Install the Python bindings, e.g. pip install -e ../python
export MXNET_PROFILER_AUTOSTART=1
# Run any script to profile the memory pool
python test_memory.py
# Open chrome tracing and load the generated profile.json




Additional Improvements

  1. Add profiling support for all cudnn layers.
  2. Add profiling support for MKLDNN memory and layers.
  3. Better profiling support for parameter server.

