...
- Users have to track MXNet objects manually and remember to call `dispose`. This is not idiomatic Java and not user friendly. Quoting a user: "this feels like I am writing C++ code which I stopped ages ago".
- Memory leaks occur if `dispose` is not called.
- Many objects in MXNet-Scala are backed by native memory, so `dispose` must be called on them as well.
- Bloated code with `dispose()` calls everywhere.
- Memory leaks are hard to debug.
Goals of the project are:

- Provide MXNet JVM users automated memory management that releases native memory when there are no references to the corresponding JVM objects.
- Provide automated memory management for both GPU and CPU memory without performance degradation.

More details can be found here: JVM Memory Management
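A common JVM pattern for this kind of automated native-memory management is to register each wrapper object with a `PhantomReference` and release the native handle once the garbage collector reports the wrapper unreachable. The sketch below is illustrative only: `NativeResource`, `freeNative`, and the handle map are hypothetical names under that assumption, not MXNet's actual implementation.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: free native memory when the GC proves the JVM wrapper
// is unreachable, so the user never has to call dispose().
public class PhantomCleanupDemo {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    // Keep the PhantomReferences strongly reachable, mapped to native handles.
    static final Map<Reference<?>, Long> HANDLES = new ConcurrentHashMap<>();
    static final AtomicInteger FREED = new AtomicInteger();

    // Stand-in for the JNI call that would release the native memory.
    static void freeNative(long handle) { FREED.incrementAndGet(); }

    // Hypothetical wrapper around a native handle; registering it in the
    // constructor is all that is needed -- no manual dispose().
    static final class NativeResource {
        NativeResource(long handle) {
            HANDLES.put(new PhantomReference<>(this, QUEUE), handle);
        }
    }

    // Drain the queue: an enqueued reference means its wrapper has become
    // unreachable, so the associated native handle can be released safely.
    static void drainQueue() {
        Reference<?> ref;
        while ((ref = QUEUE.poll()) != null)
            freeNative(HANDLES.remove(ref));
    }

    public static void main(String[] args) throws InterruptedException {
        for (long h = 0; h < 100; h++)
            new NativeResource(h); // no references kept, no dispose() called
        // GC timing is not deterministic, so hint and poll a few times.
        for (int i = 0; i < 50 && FREED.get() < 100; i++) {
            System.gc();
            Thread.sleep(10);
            drainQueue();
        }
        System.out.println("freed " + FREED.get() + " of 100 native handles");
    }
}
```

In a real binding the queue would be drained by a background thread or piggybacked on allocation calls, rather than polled in `main` as above.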
Distributed Training with Horovod
Horovod is an open source distributed deep learning framework built on high performance communication primitives. It can significantly improve scaling efficiency when training in a distributed environment. Compared to the Parameter Server approach, training with Horovod does not need standalone instances to host parameter servers while achieving the same or even better performance, which can save costs for customers.
More design details can be found here: Horovod-MXNet Integration
Horovod PR to support MXNet: Horovod support for MXNet framework
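The communication primitive at Horovod's core is ring-allreduce. The following single-process simulation is a sketch of the data movement only, not Horovod's actual implementation (which exchanges chunks between hosts over MPI/NCCL): a reduce-scatter phase after which each worker owns the full sum of one chunk, followed by an allgather phase that circulates the reduced chunks around the ring.

```java
import java.util.Arrays;

// Sequential, single-process simulation of the ring-allreduce pattern.
public class RingAllReduceDemo {

    // Sum-allreduce: afterwards every row of `workers` holds the elementwise
    // sum of all rows. Assumes the vector length is divisible by the number
    // of workers.
    static void ringAllReduce(double[][] workers) {
        int n = workers.length;
        int chunk = workers[0].length / n;

        // Reduce-scatter: in step s, worker r passes chunk (r - s) mod n to
        // worker r+1, which accumulates it. After n-1 steps, worker r owns
        // the fully reduced chunk (r + 1) mod n.
        for (int s = 0; s < n - 1; s++) {
            for (int r = 0; r < n; r++) {
                int dst = (r + 1) % n;
                int c = ((r - s) % n + n) % n;
                for (int k = c * chunk; k < (c + 1) * chunk; k++)
                    workers[dst][k] += workers[r][k];
            }
        }

        // Allgather: circulate each fully reduced chunk around the ring so
        // every worker ends up with the complete sum.
        for (int s = 0; s < n - 1; s++) {
            for (int r = 0; r < n; r++) {
                int dst = (r + 1) % n;
                int c = ((r + 1 - s) % n + n) % n;
                System.arraycopy(workers[r], c * chunk,
                                 workers[dst], c * chunk, chunk);
            }
        }
    }

    public static void main(String[] args) {
        double[][] workers = { {1, 2, 3, 4}, {10, 20, 30, 40} };
        ringAllReduce(workers);
        for (double[] w : workers)
            System.out.println(Arrays.toString(w)); // both print [11.0, 22.0, 33.0, 44.0]
    }
}
```

Because each worker sends only one chunk per step, total traffic per worker is about 2(n-1)/n of the vector size regardless of worker count, which is why this pattern scales better than funneling all gradients through parameter servers.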
Topology-aware AllReduce (experimental)
...