
This wiki page is intended to help you select the epochs and number_of_iterations parameters for training multiple deep learning models in parallel with Apache MADlib and Greenplum Database.  Because Greenplum is a distributed database, the concept of passes over the data is different from that in single-node systems.

For additional information, please refer to the section Deep Learning in the user documentation [1] and the Jupyter workbook examples [2].

tl;dr

number of passes over the data = epochs * number_of_iterations

The Keras fit parameter epochs means the number of passes over the data in each Greenplum segment (worker) within an iteration, so a Keras epoch is actually a sub-epoch in MADlib/Greenplum.  The number_of_iterations parameter in the MADlib fit function is the outer loop.

If you increase epochs (and reduce number_of_iterations to keep the total number of passes the same), training will run faster because there is less model hopping, but there may be an impact on convergence.  If you set epochs=1, then number_of_iterations is logically the same as the number of epochs on a single-node system.
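As a quick illustration (plain Python arithmetic, not MADlib or Keras code), the relationship between the two parameters looks like this:

```python
# Plain Python arithmetic, not a MADlib call: the total number of passes over
# the full dataset is the product of the two parameters.
def total_passes(epochs, number_of_iterations):
    return epochs * number_of_iterations

print(total_passes(epochs=1, number_of_iterations=50))   # 50 passes, behaves like 50 single-node epochs
print(total_passes(epochs=5, number_of_iterations=10))   # 50 passes, but only 10 rounds of model hopping
```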

Model Configurations

First we define the model configurations that we want to train and load them into a model selection table.  Each configuration is a combination of a model architecture and a set of hyperparameters.  In the picture below there are three model configurations, represented by the three different purple shapes:

The epochs parameter is discussed later on this page.
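As a rough sketch of what a set of model configurations might look like, the snippet below enumerates the cross product of two hypothetical architectures and a small hyperparameter grid.  The architecture names and hyperparameter values are made up for illustration only; in practice the configurations are loaded into the MADlib model selection table as described in the user docs [1].

```python
from itertools import product

# Hypothetical architectures and hyperparameters; real values come from your
# Keras model definitions and tuning plan.
architectures = ["cnn_small", "cnn_large"]
optimizers = ["adam", "sgd"]
learning_rates = [0.001, 0.01]

# Each (architecture, optimizer, learning rate) triple is one model configuration.
model_configs = [
    {"architecture": a, "optimizer": o, "lr": lr}
    for a, o, lr in product(architectures, optimizers, learning_rates)
]
print(len(model_configs))  # 8 configurations to train in parallel
```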

Fit

Once we have the model configurations in the model selection table, we call the fit function to train the models in parallel.  In the picture below, the three orange shapes represent the three models that have been trained:

The number_of_iterations is discussed later on this page.
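As a hedged sketch (not copied from the MADlib docs), a fit call issued from Python might look roughly like the following.  The table names are placeholders, and the function name and argument order are assumptions that should be checked against the user docs [1]:

```python
import psycopg2

# Hedged sketch: call MADlib's fit function for multiple models from Python.
# Table names are placeholders; verify the exact function signature against
# the MADlib user docs [1] before use.
conn = psycopg2.connect("dbname=madlib_demo")
with conn.cursor() as cur:
    cur.execute("""
        SELECT madlib.madlib_keras_fit_multiple_model(
            'train_data_packed',   -- packed training data, distributed across segments
            'models_out',          -- output table for the trained models
            'model_selection_tbl', -- model selection table holding the configurations
            10,                    -- number_of_iterations (the outer loop)
            FALSE                  -- whether to use GPUs
        );
    """)
conn.commit()
conn.close()
```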

Parameter Selection

To determine how to set the epochs and number_of_iterations parameters above, it is useful to understand how MADlib trains multiple models at a time.  

We distribute the data in the usual way across the segments of Greenplum Database.  MADlib uses model hopper parallelism [3, 4], which involves locking the data in place on the segments and moving (hopping) the model state between the segments in a round-robin fashion, so that each model configuration visits all examples in the dataset in an iteration.  This way we can train many models at the same time.  The advantage of moving the model state rather than the training data is that the former is far smaller than the latter for most practical deep learning problems.

In the picture below, with three segments, each model state makes two hops to visit all of the data.  If the database cluster had 20 segments, there would be 19 hops in an iteration.
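To make the hopping concrete, here is a small, purely illustrative simulation (not MADlib code) of a round-robin hop schedule.  It just shows that with S segments, each model state visits every segment once per iteration, i.e., S - 1 hops:

```python
# Illustrative only: a round-robin placement of model states on segments.
# With S segments, each model visits all S segments during an iteration,
# which means S - 1 hops per model per iteration.
def hop_schedule(num_segments, num_models):
    schedule = []
    for step in range(num_segments):  # one sub-epoch of training per step
        placement = {m: (m + step) % num_segments for m in range(num_models)}
        schedule.append(placement)
    return schedule

for step, placement in enumerate(hop_schedule(num_segments=3, num_models=3)):
    print(f"step {step}: model -> segment {placement}")
# Each model lands on every segment exactly once: 2 hops per model on 3 segments.
```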

In this context, the Keras fit parameter epochs means the number of passes over the data in each Greenplum segment within each iteration, i.e., it is actually a sub-epoch in the picture above.  The number_of_iterations parameter in the MADlib fit function is the outer loop controlling the total number of iterations to run.  That is:

number of passes over the data = epochs * number_of_iterations

Increasing epochs while reducing number_of_iterations to maintain the same number of passes can result in significantly faster training because there is less hopping.  However, there may be an impact on convergence because visiting the same examples more than once per iteration violates logical sequential SGD, which theoretically has the best convergence efficiency.

If you set epochs=1, then number_of_iterations is logically the same as the number of epochs on a single-node system.
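One way to reason about the tradeoff is to fix a budget of total passes and look at how the choice of epochs changes the amount of hopping.  The snippet below is illustrative arithmetic only; the 50-pass budget and 16-segment cluster are assumptions chosen to match the example later on this page.

```python
# For a fixed budget of passes, larger epochs means fewer iterations and
# therefore fewer hops (less communication), at a possible cost to convergence.
total_passes = 50
num_segments = 16

for epochs in (1, 5, 10, 25, 50):
    number_of_iterations = total_passes // epochs
    hops_per_model = number_of_iterations * (num_segments - 1)
    print(f"epochs={epochs:2d}  number_of_iterations={number_of_iterations:2d}  "
          f"hops per model={hops_per_model}")
```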

Note that the number of model configurations does not need to be the same as the number of segments, as it is in the toy example above.  In fact, it usually will not be the same.  If you have more model configurations than segments, some of the configurations will be held in a queue while others are being trained; the queued configurations are scheduled in a round-robin fashion.  Conversely, if you have fewer model configurations than segments, some segments will not be busy 100% of the time, since they will be waiting for model configurations to train.
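For intuition only, the tiny sketch below (not MADlib's actual scheduler) shows how configurations might be split between segments that are training and a waiting queue when there are more configurations than segments; the counts are made up:

```python
from collections import deque

# Illustrative only: with more configurations than segments, only num_segments
# configurations train at once; the rest wait in a queue and are rotated in
# round-robin fashion.
def initial_assignment(num_configs, num_segments):
    queue = deque(range(num_configs))
    active = [queue.popleft() for _ in range(min(num_segments, num_configs))]
    return active, list(queue)

active, waiting = initial_assignment(num_configs=5, num_segments=3)
print("training now:", active)  # [0, 1, 2]
print("waiting:", waiting)      # [3, 4]
```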

Example

Below are results from training the well-known CIFAR-10 dataset using two different CNNs comprising 500K-1M weights and various hyperparameters.  In total there were 16 different model configurations trained on a cluster of 16 segments. (As mentioned above, the number of model configurations does not need to be the same as the number of segments.) The model configuration with the best validation accuracy is shown in the chart.

We used 50 passes over the data in total, comprising different combinations of epochs and number_of_iterations.  You can see that logical sequential SGD with epochs=1 and number_of_iterations=50 has the highest accuracy but takes the longest to train.  Conversely, the fastest training is with epochs=50 and number_of_iterations=1 but the validation accuracy is lower.

You should do some experimentation on your own project to determine what tradeoff works best for your model architectures and dataset.

References

[1] User docs for deep learning http://madlib.apache.org/docs/latest/group__grp__dl.html

[2] Jupyter workbook examples https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Deep-learning

[3] Cerebro: A Data System for Optimized Deep Learning Model Selection, https://adalabucsd.github.io/papers/TR_2020_Cerebro.pdf

[4] Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems, DEEM'19, June 30, 2019, Amsterdam, Netherlands https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf


