This wiki page is intended to help you select the epochs and num_iterations parameters for training multiple deep learning models in parallel with Apache MADlib and Greenplum Database.  Because Greenplum is a distributed database, the concept of a pass over the data differs from that in single-node systems.

For additional information, please refer to the section Deep Learning in the user documentation [1] and the Jupyter workbook examples [2].  A similar topic is discussed in the context of federated learning using parameter averaging in [5].



tl;dr

number of passes over the data = epochs * num_iterations

The Keras fit parameter epochs means the number of passes over the data in each Greenplum segment (worker) within an iteration, so Keras epochs actually refers to sub-epochs in MADlib/Greenplum.  The num_iterations parameter in the MADlib fit function is the outer loop.

If you increase epochs, training runs faster because there is less communication overhead, but there may be an impact on convergence.  If you set epochs=1, then num_iterations is logically the same as the number of epochs on a single-node system, but with an additional communication burden.
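For example, epochs=5 with num_iterations=10 makes 5 * 10 = 50 total passes over the data, the same as epochs=1 with num_iterations=50, but with one fifth the communication rounds.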

Model Configurations

First we define the model configurations that we want to train and load them into a model selection table; each configuration is the combination of a model architecture and a set of hyperparameters.  In the picture below there are three model configurations, represented by the three different purple shapes (a sketch of the load call follows the picture):

...
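As a concrete illustration, here is a minimal sketch of loading three model configurations with MADlib's load_model_selection_table function.  The table names and parameter values are assumptions for illustration; the function generates one configuration per combination of model ID, compile parameters, and fit parameters:

-- Assumes 'model_arch_library' already holds one architecture with model ID 1.
-- One architecture x three learning rates x one set of fit params = three configurations.
SELECT madlib.load_model_selection_table(
    'model_arch_library',    -- model architecture table
    'mst_table',             -- output: model selection table
    ARRAY[1],                -- model IDs to include from the architecture table
    ARRAY[
        $$ loss='categorical_crossentropy', optimizer='Adam(lr=0.01)', metrics=['accuracy'] $$,
        $$ loss='categorical_crossentropy', optimizer='Adam(lr=0.001)', metrics=['accuracy'] $$,
        $$ loss='categorical_crossentropy', optimizer='Adam(lr=0.0001)', metrics=['accuracy'] $$
    ],
    ARRAY[
        $$ batch_size=64, epochs=1 $$    -- note: epochs is set here, in the fit params
    ]
);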

Once we have model configurations in the model selection table, we call the fit function to train the models in parallel.  In the picture below, the three orange shapes represent the three models that have been trained:

The num_iterations parameter is discussed later on this page; a sketch of the fit call is shown below.

...
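For reference, here is a minimal sketch of the fit call.  It assumes the training data was packed into 'cifar10_train_packed' by MADlib's training preprocessor and that 'mst_table' is the model selection table loaded above (both names are illustrative):

-- Train all configurations in 'mst_table' in parallel for 10 iterations.
SELECT madlib.madlib_keras_fit_multiple_model(
    'cifar10_train_packed',   -- source table (preprocessed training data)
    'cifar10_models',         -- output table for the trained models
    'mst_table',              -- model selection table
    10,                       -- num_iterations: the outer loop
    FALSE                     -- use_gpus
);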

To determine how to set the epochs and num_iterations parameters above, it is useful to understand how MADlib trains multiple models at a time.

...

In this context, the Keras fit parameter epochs means the number of passes over the data in each Greenplum segment within each iteration, i.e., it is actually a sub-epoch in the picture above.  The num_iterations parameter in the MADlib fit function is the outer loop controlling the total number of iterations to run.  That is:

number of passes over the data = epochs * num_iterations

Increasing epochs while reducing num_iterations to keep the number of passes constant can make training significantly faster, because there is less model hopping between segments.  However, there may be an impact on convergence, because visiting the same examples more than once per iteration departs from logically sequential SGD, which in theory has the best convergence efficiency.

If you set epochs=1, then num_iterations is logically the same as the number of epochs on a single-node system.
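To make the tradeoff concrete, here is a sketch of two runs that each make 50 total passes over the data.  The table names are illustrative, and 'mst_epochs1' and 'mst_epochs50' are assumed to be model selection tables whose fit params specify epochs=1 and epochs=50 respectively:

-- Logically sequential SGD: 1 pass per iteration, 50 iterations.
-- Best convergence behavior, most communication.
SELECT madlib.madlib_keras_fit_multiple_model(
    'cifar10_train_packed', 'models_seq', 'mst_epochs1',
    50,     -- num_iterations
    FALSE);

-- Fewest model hops: 50 passes per iteration, 1 iteration.
-- Fastest training, but convergence may suffer.
SELECT madlib.madlib_keras_fit_multiple_model(
    'cifar10_train_packed', 'models_fast', 'mst_epochs50',
    1,      -- num_iterations
    FALSE);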

Note that the number of model configurations does not need to match the number of segments, as it does in the toy example above.  In fact, it usually will not.  If you have more model configurations than segments, some of the configurations wait in a queue while others are being trained; the queued ones are scheduled in a round-robin fashion.  Conversely, if you have fewer model configurations than segments, some segments will be idle part of the time, waiting for model configurations to train.

Example

Below are results from training on the well-known CIFAR-10 dataset using two different CNNs of 500K-1M weights and various hyperparameters.  In total, 16 model configurations were trained on a cluster of 16 segments.  (As mentioned above, the number of model configurations does not need to match the number of segments.)  The model configuration with the best validation accuracy is shown in the chart.

We used 50 passes over the data in total, made up of different combinations of epochs and num_iterations.  You can see that logically sequential SGD with epochs=1 and num_iterations=50 has the highest accuracy but takes the longest to train.  Conversely, the fastest training is with epochs=50 and num_iterations=1, but the validation accuracy is lower.

[Chart: validation accuracy and training time for different combinations of epochs and num_iterations]

You should do some experimentation on your own project to determine what tradeoff works best for your model architectures and dataset.

...

[4] Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems, DEEM'19, June 30, 2019, Amsterdam, Netherlands. https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf

[5] Communication-Efficient Learning of Deep Networks from Decentralized Data, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54, https://arxiv.org/pdf/1602.05629.pdf