This wiki page is intended to help you select the epochs and number_of_iterations parameters when training multiple deep learning models in parallel with Apache MADlib and Greenplum database. Because Greenplum is a distributed database, the concept of passes over the data differs from that in single-node systems.
For additional information, please refer to the section Deep Learning in the user documentation [1] and the Jupyter workbook examples [2].
tl;dr
number of passes over the data = epochs * number_of_iterations
The Keras fit epochs parameter means the number of passes over the data in each Greenplum segment within an iteration, so epochs actually refers to sub-epochs. The number_of_iterations parameter in the MADlib fit function is the outer loop. If you increase epochs, training will run faster, but there may be an impact on convergence. If you set epochs=1, then number_of_iterations is logically the same as the number of epochs on single-node systems.
Model Configurations
First we define the model configurations that we want to train, that is, the combinations of model architectures and hyperparameters, and load them into a model selection table. In the picture below there are three model configurations, represented by the three different purple shapes:
The epochs parameter is discussed later on this page.
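As a rough illustration (plain Python with hypothetical architecture names and hyperparameter values, not the MADlib API), a set of model configurations like the one in the picture can be thought of as a cross product of architectures and hyperparameters:

```python
from itertools import product

# Hypothetical values for illustration only; in MADlib these combinations
# would be loaded into a model selection table.
architectures = ["cnn_small", "cnn_large"]   # assumed architecture ids
learning_rates = [0.001, 0.01]
batch_sizes = [32, 64]

# Each configuration pairs one architecture with one hyperparameter set.
configs = [
    {"arch": arch, "lr": lr, "batch_size": bs}
    for arch, lr, bs in product(architectures, learning_rates, batch_sizes)
]

print(len(configs))  # 2 architectures * 2 learning rates * 2 batch sizes = 8
```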
Fit
Once we have model combinations in the model selection table, we call the fit function to train the models in parallel. In the picture below, the three orange shapes represent the three models that have been trained:
The number_of_iterations parameter is discussed later on this page.
Parameter Selection
To understand how to set the epochs and number_of_iterations parameters above, it is useful to understand how MADlib trains multiple models at a time.
We distribute data in the usual way across the segments of Greenplum database. MADlib uses model hopper parallelism [3, 4], which involves locking the data in place on the segments and moving (hopping) the model state between the segments in a round robin fashion, so that each model configuration visits all examples in the dataset in an iteration. This way we can train many models at the same time. The advantage of moving the model state rather than the training data is that the former is far smaller than the latter for most practical deep learning problems.
In the picture below, with three segments this means there are two hops that each model state makes to visit all of the data. If the database cluster had 20 segments, there would be 19 hops in an iteration.
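The round-robin hopping can be sketched in a few lines of Python (a simulation only, not MADlib code; the function name and representation are ours):

```python
def hop_schedule(num_models, num_segments):
    """Segment assignment for each model at each step of one iteration.

    Models start on distinct segments and hop round robin; after
    num_segments - 1 hops every model has visited every segment.
    """
    return [
        [(model + step) % num_segments for model in range(num_models)]
        for step in range(num_segments)
    ]

# Three models on three segments: two hops per iteration (three placements).
schedule = hop_schedule(num_models=3, num_segments=3)
for model in range(3):
    visited = {placement[model] for placement in schedule}
    assert visited == {0, 1, 2}  # each model sees every segment's data

# On a 20-segment cluster there would be 19 hops in an iteration.
assert len(hop_schedule(3, 20)) - 1 == 19
```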
In this context, the Keras fit parameter epochs means the number of passes over the data in each Greenplum segment within each iteration, i.e., it is actually a sub-epoch in the picture above. The number_of_iterations parameter in the MADlib fit function is the outer loop controlling the total number of iterations to run. That is:
number of passes over the data = epochs * number_of_iterations
Increasing epochs can result in significantly faster training because there is less hopping, but there may be an impact on convergence: the same examples are visited more than once within an iteration, which deviates from logically sequential SGD, the schedule with theoretically the best convergence efficiency.
If you set epochs=1, then number_of_iterations is logically the same as the number of epochs on single-node systems.
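To make the tradeoff concrete, here is a small sketch (plain Python; the 16-segment cluster size and the schedules are illustrative) computing total passes over the data and total model hops for two schedules that do the same amount of training work:

```python
def total_passes(epochs, number_of_iterations):
    # Each iteration makes `epochs` sub-epoch passes over every segment's data.
    return epochs * number_of_iterations

def total_hops(number_of_iterations, num_segments):
    # Within one iteration each model makes num_segments - 1 hops.
    return number_of_iterations * (num_segments - 1)

# Two schedules with 50 total passes on an assumed 16-segment cluster:
print(total_passes(1, 50), total_hops(50, 16))  # 50 passes, 750 hops (logically sequential SGD)
print(total_passes(50, 1), total_hops(1, 16))   # 50 passes, 15 hops (fastest, may affect convergence)
```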
Example
Below are results from training the well-known CIFAR-10 dataset using two different CNNs comprising 500K-1M weights and various hyperparameters. In total there were 16 different model configurations trained on a cluster of 16 segments. The model configuration with the best validation accuracy is shown in the chart.
We used 50 passes over the data in total, comprising different combinations of epochs and number_of_iterations. You can see that logically sequential SGD with epochs=1 and number_of_iterations=50 has the highest accuracy but takes the longest to train. Conversely, the fastest training is with epochs=50 and number_of_iterations=1, but the validation accuracy is lower.
You should experiment on your own project to determine what tradeoff works best for your model architectures and dataset.
References
[1] User docs for deep learning http://madlib.apache.org/docs/latest/group__grp__dl.html
[2] Jupyter workbook examples https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Deep-learning
[3] Cerebro: A Data System for Optimized Deep Learning Model Selection, https://adalabucsd.github.io/papers/TR_2020_Cerebro.pdf
[4] Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems, DEEM’30, June 30, 2019, Amsterdam, Netherlands https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf