This wiki page is intended to help you select the epochs and num_iterations parameters for training multiple deep learning models in parallel with Apache MADlib and Greenplum Database. Because Greenplum is a distributed database, the concept of passes over the data differs from that in single-node systems.
For additional information, please refer to the Deep Learning section of the user documentation [1] and the Jupyter notebook examples [2]. A similar topic is discussed in the context of federated learning using parameter averaging in [5].
tl;dr
number of passes over the data = epochs * num_iterations
The Keras fit parameter epochs means the number of passes over the data in each Greenplum segment (worker) within an iteration, so Keras epochs actually refers to sub-epochs in MADlib/Greenplum. The num_iterations parameter in the MADlib fit function is the outer loop.

If you increase epochs, training will run faster, since there is less communication overhead, but there may be an impact on convergence. If you set epochs=1, then num_iterations is logically the same as the number of epochs on a single-node system, but with an additional communication burden.
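The relationship above can be written as a small helper. This is an illustrative sketch only; total_passes is a hypothetical name, not a MADlib function:

```python
def total_passes(epochs: int, num_iterations: int) -> int:
    """Total passes over the data on each Greenplum segment.

    epochs is the Keras fit parameter (sub-epochs per iteration in
    MADlib/Greenplum); num_iterations is the outer loop in the
    MADlib fit function.
    """
    return epochs * num_iterations

# epochs=1 with num_iterations=50 and epochs=50 with num_iterations=1
# both make 50 passes over the data, but trade communication overhead
# against convergence behavior.
print(total_passes(1, 50))   # 50
print(total_passes(50, 1))   # 50
```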
Model Configurations
First we define the model configurations that we want to train, that is, the combinations of model architectures and hyperparameters, and load them into a model selection table. In the picture below there are three model configurations represented by the three different purple shapes:
...
Once we have the model configurations in the model selection table, we call the fit function to train the models in parallel. In the picture below, the three orange shapes represent the three models that have been trained:
The num_iterations parameter is discussed later on this page.
...
To determine how to set the epochs and num_iterations parameters above, it is useful to understand how MADlib trains multiple models at a time.
...
In this context, the Keras fit parameter epochs means the number of passes over the data in each Greenplum segment within each iteration, i.e., it is actually a sub-epoch in the picture above. The num_iterations parameter in the MADlib fit function is the outer loop controlling the total number of iterations to run. That is:
number of passes over the data = epochs * num_iterations
Increasing epochs while reducing num_iterations to maintain the same number of passes can result in significantly faster training, because there is less model hopping between segments. However, there may be an impact on convergence, because visiting the same examples more than once per iteration violates logical sequential SGD, which theoretically has the best convergence efficiency.
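One way to picture the training loop is as a round-robin schedule in the style of model hopper parallelism (Cerebro [4]): within one iteration, every model visits every segment once, training for epochs sub-epochs on that segment's local data before hopping on. The sketch below is an illustration of that scheduling idea under this assumption, not MADlib's actual scheduler; hop_schedule is a hypothetical name:

```python
from typing import List, Tuple

def hop_schedule(num_models: int, num_segments: int,
                 num_iterations: int) -> List[List[Tuple[int, int]]]:
    """Round-robin (model, segment) pairings for each sub-step.

    Within each iteration there are num_segments sub-steps; at each
    sub-step every model trains on a distinct segment, so over one
    iteration every model sees every segment exactly once.
    """
    schedule = []
    for _ in range(num_iterations):
        for step in range(num_segments):
            pairings = [(m, (m + step) % num_segments)
                        for m in range(num_models)]
            schedule.append(pairings)
    return schedule

# 3 models on 3 segments, 1 iteration: 3 sub-steps, and each model
# visits each segment once.
for pairings in hop_schedule(3, 3, 1):
    print(pairings)
```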
If you set epochs=1, then num_iterations is logically the same as the number of epochs on a single-node system.
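For a fixed budget of total passes, the valid (epochs, num_iterations) settings are simply the factor pairs of that budget. A quick sketch of enumerating them (configurations is an illustrative name, not a MADlib function):

```python
def configurations(passes: int):
    """All (epochs, num_iterations) pairs giving `passes` total passes."""
    return [(e, passes // e) for e in range(1, passes + 1)
            if passes % e == 0]

# For 50 total passes: (1, 50) is logical sequential SGD (best expected
# convergence, most communication) and (50, 1) is the fastest to train.
print(configurations(50))
# [(1, 50), (2, 25), (5, 10), (10, 5), (25, 2), (50, 1)]
```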
...
We used 50 passes over the data in total, comprising different combinations of epochs and num_iterations. You can see that logical sequential SGD with epochs=1 and num_iterations=50 has the highest accuracy but takes the longest to train. Conversely, the fastest training is with epochs=50 and num_iterations=1, but the validation accuracy is lower.
...
[4] Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems, DEEM'19, June 30, 2019, Amsterdam, Netherlands. https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf
[5] Communication-Efficient Learning of Deep Networks from Decentralized Data, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. https://arxiv.org/pdf/1602.05629.pdf