...
- Users have to track MXNet objects manually and remember to call `dispose`. This is not idiomatic Java and not user friendly. Quoting a user: "this feels like I am writing C++ code which I stopped ages ago".
- Memory leaks occur if `dispose` is not called.
- Many objects in MXNet-Scala are backed by native memory, so `dispose` must be called on them as well.
- Bloated code with `dispose()` calls everywhere.
- Memory leaks are hard to debug.
Goals of the project are:

- Provide MXNet JVM users automated memory management that releases native memory when there are no references to the corresponding JVM objects.
- Provide automated memory management for both GPU and CPU memory without performance degradation.

More details can be found here: JVM Memory Management
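A common JVM pattern for this kind of automated native-memory management is to register each wrapper object with a `PhantomReference` and release the native handle once the garbage collector reports the wrapper unreachable. The sketch below is illustrative only: `NativeResource`, `freeNative`, and the handle map are hypothetical names under that assumption, not MXNet's actual implementation.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: free native memory when the GC proves the JVM wrapper
// is unreachable, so the user never has to call dispose().
public class PhantomCleanupDemo {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    // Keep the PhantomReferences strongly reachable, mapped to native handles.
    static final Map<Reference<?>, Long> HANDLES = new ConcurrentHashMap<>();
    static final AtomicInteger FREED = new AtomicInteger();

    // Stand-in for the JNI call that would release the native memory.
    static void freeNative(long handle) { FREED.incrementAndGet(); }

    // Hypothetical wrapper around a native handle; registering it in the
    // constructor is all that is needed -- no manual dispose().
    static final class NativeResource {
        NativeResource(long handle) {
            HANDLES.put(new PhantomReference<>(this, QUEUE), handle);
        }
    }

    // Drain the queue: an enqueued reference means its wrapper has become
    // unreachable, so the associated native handle can be released safely.
    static void drainQueue() {
        Reference<?> ref;
        while ((ref = QUEUE.poll()) != null)
            freeNative(HANDLES.remove(ref));
    }

    public static void main(String[] args) throws InterruptedException {
        for (long h = 0; h < 100; h++)
            new NativeResource(h); // no references kept, no dispose() called
        // GC timing is not deterministic, so hint and poll a few times.
        for (int i = 0; i < 50 && FREED.get() < 100; i++) {
            System.gc();
            Thread.sleep(10);
            drainQueue();
        }
        System.out.println("freed " + FREED.get() + " of 100 native handles");
    }
}
```

In a real binding the queue would be drained by a background thread or piggybacked on allocation calls, rather than polled in `main` as above.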
Distributed Training with Horovod
Horovod is an open source distributed deep learning framework built on high performance communication primitives. It can significantly improve scaling efficiency when training in a distributed environment. Compared to the Parameter Server approach, training with Horovod does not need standalone instances to host parameter servers while achieving the same or even better performance, which can save costs for customers.
More design details can be found here: Horovod-MXNet Integration
Horovod PR to support MXNet: Horovod support for MXNet framework
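The communication primitive at Horovod's core is ring-allreduce. The following single-process simulation is a sketch of the data movement only, not Horovod's actual implementation (which exchanges chunks between hosts over MPI/NCCL): a reduce-scatter phase after which each worker owns the full sum of one chunk, followed by an allgather phase that circulates the reduced chunks around the ring.

```java
import java.util.Arrays;

// Sequential, single-process simulation of the ring-allreduce pattern.
public class RingAllReduceDemo {

    // Sum-allreduce: afterwards every row of `workers` holds the elementwise
    // sum of all rows. Assumes the vector length is divisible by the number
    // of workers.
    static void ringAllReduce(double[][] workers) {
        int n = workers.length;
        int chunk = workers[0].length / n;

        // Reduce-scatter: in step s, worker r passes chunk (r - s) mod n to
        // worker r+1, which accumulates it. After n-1 steps, worker r owns
        // the fully reduced chunk (r + 1) mod n.
        for (int s = 0; s < n - 1; s++) {
            for (int r = 0; r < n; r++) {
                int dst = (r + 1) % n;
                int c = ((r - s) % n + n) % n;
                for (int k = c * chunk; k < (c + 1) * chunk; k++)
                    workers[dst][k] += workers[r][k];
            }
        }

        // Allgather: circulate each fully reduced chunk around the ring so
        // every worker ends up with the complete sum.
        for (int s = 0; s < n - 1; s++) {
            for (int r = 0; r < n; r++) {
                int dst = (r + 1) % n;
                int c = ((r + 1 - s) % n + n) % n;
                System.arraycopy(workers[r], c * chunk,
                                 workers[dst], c * chunk, chunk);
            }
        }
    }

    public static void main(String[] args) {
        double[][] workers = { {1, 2, 3, 4}, {10, 20, 30, 40} };
        ringAllReduce(workers);
        for (double[] w : workers)
            System.out.println(Arrays.toString(w)); // both print [11.0, 22.0, 33.0, 44.0]
    }
}
```

Because each worker sends only one chunk per step, total traffic per worker is about 2(n-1)/n of the vector size regardless of worker count, which is why this pattern scales better than funneling all gradients through parameter servers.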
Topology-aware AllReduce (experimental)
...