
This wiki page is intended to help you select the epochs and number_of_iterations parameters when training multiple deep learning models in parallel with Apache MADlib and Greenplum database. Because Greenplum is a distributed database, the concept of a pass over the data differs from that in single-node systems.

For additional information, please refer to the section Deep Learning in the user documentation [1] and the Jupyter workbook examples [2].

tl;dr

number of passes over the data  = epochs * number_of_iterations

The Keras fit parameter epochs is the number of passes over the data on each Greenplum segment within an iteration, so a Keras epoch actually refers to a sub-epoch in MADlib/Greenplum. The number_of_iterations parameter in the MADlib fit function is the outer loop.

Increasing epochs makes training run faster because there is less model hopping, but it may affect convergence. If you set epochs=1, then number_of_iterations is logically equivalent to the number of epochs on a single-node system.
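
For example, epochs=5 with number_of_iterations=10 and epochs=1 with number_of_iterations=50 both make 50 passes over the data in total; they differ in how often the models hop between segments, and therefore in runtime and convergence behavior, as discussed under Parameter Selection below.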

Model Configurations

First we define the model configurations that we want to train and load them into a model selection table. A model configuration is a combination of a model architecture and a set of hyperparameters. In the picture below, three model configurations are represented by the three purple shapes:

The epochs parameter is discussed later on this page.
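
As a concrete illustration, the sketch below loads a model selection table with MADlib's load_model_selection_table() function. The table names (model_arch_library, mst_table) and the specific compile and fit parameter strings are illustrative assumptions, not values from this page. Note that epochs is set inside the Keras fit params:

    -- Sketch only: table names and parameter values are assumptions.
    SELECT madlib.load_model_selection_table(
        'model_arch_library',  -- table of Keras model architectures
        'mst_table',           -- output: model selection table of configurations
        ARRAY[1, 2],           -- architecture ids to include
        ARRAY[                 -- compile params to vary (two learning rates)
            $$ loss='categorical_crossentropy', optimizer='Adam(lr=0.01)', metrics=['accuracy'] $$,
            $$ loss='categorical_crossentropy', optimizer='Adam(lr=0.001)', metrics=['accuracy'] $$
        ],
        ARRAY[                 -- Keras fit params; epochs here means sub-epochs
            $$ batch_size=64, epochs=1 $$
        ]
    );

The grid of 2 architectures x 2 compile param sets x 1 fit param set yields 4 model configurations in mst_table.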

Fit

Once we have model configurations in the model selection table, we call the fit function to train the models in parallel. In the picture below, the three orange shapes represent the three trained models:

The number_of_iterations is discussed later on this page.
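
A hedged sketch of the corresponding fit call is below. The packed training table and output table names are assumptions; in recent MADlib releases the multi-model fit function is madlib_keras_fit_multiple_model() and the outer-loop iteration count (called number_of_iterations on this page) is its fourth argument:

    -- Sketch only: table names are assumptions.
    SELECT madlib.madlib_keras_fit_multiple_model(
        'cifar10_train_packed',  -- training data, preprocessed (packed) for distribution
        'cifar10_models',        -- output table: one row per trained model
        'mst_table',             -- model selection table from the previous step
        10,                      -- number_of_iterations (the outer loop)
        FALSE                    -- use_gpus
    );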

Parameter Selection

To determine how to set the epochs and number_of_iterations parameters above, it is useful to understand how MADlib trains multiple models at a time.  

We distribute the data in the usual way across the segments of the Greenplum database. MADlib uses model hopper parallelism [3, 4], which involves locking the data in place on the segments and moving (hopping) the model state between the segments in round-robin fashion, so that each model configuration visits all examples in the dataset in an iteration. This way we can train many models at the same time. The advantage of moving the model state rather than the training data is that the former is far smaller than the latter for most practical deep learning problems.

In the picture below, with three segments, each model state makes two hops to visit all of the data. If the database cluster had 20 segments, there would be 19 hops in an iteration.

In this context, the Keras fit parameter epochs means the number of passes over the data in each Greenplum segment within each iteration, i.e., it is actually a sub-epoch in the picture above.  The number_of_iterations parameter in the MADlib fit function is the outer loop controlling the total number of iterations to run.  That is:

number of passes over the data  = epochs * number_of_iterations

Increasing epochs can result in significantly faster training because there is less hopping, but it may affect convergence: within an iteration you visit the same examples more than once before the model sees the rest of the data, which deviates from logically sequential SGD, the schedule with the best theoretical convergence efficiency.

If you set epochs=1, then number_of_iterations is logically equivalent to the number of epochs on a single-node system.
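
To make the tradeoff concrete, here is a hedged sketch of the two extreme settings used in the example below. The table names are the same illustrative assumptions as above; recall that epochs is set via the Keras fit params in the model selection table:

    -- (a) Logically sequential SGD: fit params contain epochs=1,
    --     so 1 * 50 = 50 total passes over the data.
    SELECT madlib.madlib_keras_fit_multiple_model(
        'cifar10_train_packed', 'models_seq', 'mst_table_epochs1',
        50,     -- number_of_iterations
        FALSE);

    -- (b) Fastest training: fit params contain epochs=50,
    --     so 50 * 1 = 50 total passes with no hopping between passes.
    SELECT madlib.madlib_keras_fit_multiple_model(
        'cifar10_train_packed', 'models_fast', 'mst_table_epochs50',
        1,      -- number_of_iterations
        FALSE);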

Example

Below are results from training the well-known CIFAR-10 dataset using two different CNNs comprising 500K-1M weights and various hyperparameters.  In total there were 16 different model configurations trained on a cluster of 16 segments. The model configuration with the best validation accuracy is shown in the chart.

We used 50 passes over the data in total, comprising different combinations of epochs and number_of_iterations. You can see that logically sequential SGD with epochs=1 and number_of_iterations=50 has the highest accuracy but takes the longest to train. Conversely, the fastest training is with epochs=50 and number_of_iterations=1, but the validation accuracy is lower.
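
To identify the best configuration, you can query the info table that the fit function creates alongside the model output table. A minimal sketch, assuming the output table was named cifar10_models so the companion table is cifar10_models_info, with the final validation accuracy in a validation_metrics_final column:

    -- Sketch only: the <model_output_table>_info naming and the column
    -- names follow MADlib conventions; table names here are assumptions.
    SELECT mst_key, model_id, compile_params, fit_params,
           validation_metrics_final
    FROM cifar10_models_info
    ORDER BY validation_metrics_final DESC
    LIMIT 1;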

You should do some experimentation on your own project to determine what tradeoff works for your model architectures and dataset.

References

[1] User documentation for deep learning, http://madlib.apache.org/docs/latest/group__grp__dl.html

[2] Jupyter workbook examples, https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Deep-learning

[3] Cerebro: A Data System for Optimized Deep Learning Model Selection, https://adalabucsd.github.io/papers/TR_2020_Cerebro.pdf

[4] Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems, DEEM'19, June 30, 2019, Amsterdam, Netherlands, https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf


