...

cudaMalloc and cudaFree can be much more expensive than the equivalent C standard library malloc and free [1].

To keep this cost from becoming a bottleneck for scripts using MXNet, MXNet provides customers with strategies for maintaining a memory pool.

Here is how the memory pool manager works:

  1. Before allocating, check whether the pool already holds a pointer to memory of the requested size.
  2. If it does not, check whether the requested size is less than the available unreserved memory and, if so, allocate new memory of that size. If the requested size is greater than the available unreserved memory, the whole memory pool is freed.
  3. If a matching pointer is already present, reuse the memory from the pool.
  4. When Free is called, the pointer is released back to the pool.
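The steps above can be sketched in a few lines of Python. This is a simplified illustration, not MXNet's actual implementation: MemoryPool, _raw_alloc, and the fake pointers are hypothetical stand-ins for the real pool manager and cudaMalloc/cudaFree.

```python
from collections import defaultdict

class MemoryPool:
    """Toy model of the pooling strategy described above."""

    def __init__(self, capacity):
        self.capacity = capacity             # total memory available to the pool
        self.used = 0                        # bytes handed out or cached in the pool
        self.free_lists = defaultdict(list)  # size -> cached fake "pointers"
        self.next_ptr = 0

    def _raw_alloc(self, size):
        # Stand-in for cudaMalloc: hands back a fake pointer.
        ptr = self.next_ptr
        self.next_ptr += size
        return ptr

    def alloc(self, size):
        # Step 1/3: reuse a cached block of this exact size if one exists.
        if self.free_lists[size]:
            return self.free_lists[size].pop()
        # Step 2: if the request exceeds unreserved memory, flush the pool
        # (simplified: assumes no outstanding allocations at flush time).
        if size > self.capacity - self.used:
            self.free_lists.clear()          # stand-in for bulk cudaFree
            self.used = 0
            if size > self.capacity:
                raise MemoryError("request exceeds device capacity")
        self.used += size
        return self._raw_alloc(size)

    def free(self, ptr, size):
        # Step 4: release the pointer back to the pool, not to cudaFree.
        self.free_lists[size].append(ptr)

pool = MemoryPool(capacity=1 << 20)
p = pool.alloc(4096)
pool.free(p, 4096)
q = pool.alloc(4096)   # served from the pool, no new raw allocation
```

The second alloc of the same size returns the cached pointer, which is exactly why repeated allocations of identical shapes are cheap once the pool is warm.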

...

You can find additional documentation on these variables here [2].

Setting these environment variables can be difficult: customers have no data beyond their own code to decide which memory pool type to set, how large the pool should be, and so on.
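As a reminder of how these knobs are applied, the pool is configured from environment variables read at engine start-up, so they must be set before the first import of mxnet. The variable names below (MXNET_GPU_MEM_POOL_TYPE, MXNET_GPU_MEM_POOL_RESERVE) and their accepted values are my understanding of the documented options; the values chosen here are purely illustrative.

```python
import os

# Must run before `import mxnet`; MXNet reads these once at start-up.
os.environ["MXNET_GPU_MEM_POOL_TYPE"] = "Round"  # e.g. "Naive" or "Round"
os.environ["MXNET_GPU_MEM_POOL_RESERVE"] = "5"   # % of GPU memory kept unpooled

# import mxnet as mx  # the pool manager now uses the settings above
```

Without profiler data on pool behavior, choosing these values is guesswork, which is the gap this proposal addresses.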

For training use cases, visualizing the memory pool consumption for the server and worker can give insight into possible performance improvements.

This also applies to EIA: we can provide customers with an API to obtain profiling results on the accelerator, including detailed data for Allocations from Pool, Allocations from CUDA API, Used Memory in Pool, and Free Memory in Pool.

This can also help customers who run multiple MXNet processes on the same machine, need to split memory among those processes, and want the best performance out of each one.

...

As you can see from the figure above, the NDArray creation does not cause any change in GPU memory, because the memory is being allocated from the pool.

The proposed solution is to change the profiler code so that Occupied Pool Size, Free Pool Size, Memory allocated from Pool, and Memory allocated from CUDA API are recorded and made available for visualization with chrome tracing.
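To make the proposal concrete, here is one possible shape for the recorded data, using the Chrome tracing counter event format ("ph": "C") that chrome://tracing can plot as stacked time series. This is a sketch of the intended output, not existing MXNet profiler code; the event and function names are hypothetical.

```python
import json

def pool_counter_event(ts_us, occupied, free, from_pool, from_cuda):
    """One counter sample of the four proposed memory pool metrics."""
    return {
        "name": "gpu_memory_pool",
        "ph": "C",               # Chrome tracing counter event
        "pid": 0,
        "ts": ts_us,             # timestamp in microseconds
        "args": {
            "Occupied Pool Size": occupied,
            "Free Pool Size": free,
            "Memory allocated from Pool": from_pool,
            "Memory allocated from CUDA API": from_cuda,
        },
    }

# Two samples: before and after a 4 KB allocation served by cudaMalloc.
trace = {"traceEvents": [
    pool_counter_event(0, 0, 0, 0, 0),
    pool_counter_event(1000, 4096, 0, 0, 4096),
]}
print(json.dumps(trace, indent=2))  # save to a file, open in chrome://tracing
```

Emitting the metrics as counter events keeps them in the same trace file the profiler already produces, so no new visualization tooling is needed.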

These four additional metrics should allow customers to make a better choice of memory pool type and of the amount of memory to reserve for the pool.

...

As you can see, 36 GB of the memory allocations come from the pool and the remaining 3 GB come from the CUDA API.

The reason is that in the example above, the shapes differ for each of the 100 NDArrays, and in such a scenario Round allows fewer CUDA memory allocations and more reuse from the pool.
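A toy illustration of why rounding helps here, assuming (as I understand the Round strategy) that request sizes are rounded up to the nearest power of two: 100 distinct sizes collapse into a handful of pool buckets, so most requests hit a cached block instead of the CUDA API.

```python
def round_up_pow2(n):
    """Smallest power of two >= n (toy stand-in for Round's rounding)."""
    return 1 << (n - 1).bit_length()

# 100 distinct request sizes, mimicking 100 NDArrays of different shapes.
sizes = [1000 + 37 * i for i in range(100)]
buckets = {round_up_pow2(s) for s in sizes}

print(len(set(sizes)), "distinct sizes ->", len(buckets), "pool buckets")
```

With only a few bucket sizes, a freed block is very likely to match the size of a later request, which is the "more reuse from pool" effect described above.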

...