Naming and ordering

See the wiki page enumerating argument and function names used across MADlib.

TODO: Go through the document and create a standardized convention for all functions in MADlib.

A common discrepancy not highlighted in the document above is `col` vs `column`: some argument names are of the form `*_column_name` while others are `*_col_name`. One of the two must be chosen and applied across the whole product (including internal source code).

Named parameters

Change the parameter lists to named parameters, as in scikit-learn, rather than the positional parameter lists currently used in MADlib, which cannot be supplied out of order.

scikit-learn

class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001,
batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None,
tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1,
beta_1=0.9, beta_2=0.999, epsilon=1e-08)

Named notation was introduced in PostgreSQL 9.0, which means Greenplum 4.x and 5.x would not support this without writing our own parser (they only support positional notation). Greenplum 6.x will support it, since it is based on a PostgreSQL version newer than 9.0.
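
For reference, this is what named notation looks like in PostgreSQL (example adapted from the PostgreSQL documentation; PostgreSQL 9.5+ uses the `=>` spelling, while 9.0-9.4 use the older `:=` form):

-- Function with a defaulted third parameter (from the PostgreSQL docs).
CREATE FUNCTION concat_lower_or_upper(a text, b text, uppercase boolean DEFAULT false)
RETURNS text AS $$
    SELECT CASE WHEN $3 THEN upper($1 || ' ' || $2)
                ELSE lower($1 || ' ' || $2) END;
$$ LANGUAGE SQL IMMUTABLE;

-- Positional notation: arguments must follow the declaration order.
SELECT concat_lower_or_upper('Hello', 'World', true);

-- Named notation: arguments may appear in any order and defaults may be omitted.
SELECT concat_lower_or_upper(uppercase => true, a => 'Hello', b => 'World');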

MADlib

mlp_classification(
    source_table,
    output_table,
    independent_varname,
    dependent_varname,
    hidden_layer_sizes,
    optimizer_params,
    activation,
    weights,
    warm_start,
    verbose,
    grouping_col
)
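
If MADlib accepted named parameters, optional arguments could be reordered or skipped entirely. A purely hypothetical sketch, reusing the parameter names above with placeholder values (this is not the current mlp_classification interface):

-- Hypothetical named-parameter call; today mlp_classification only accepts
-- positional arguments, so later optional arguments cannot be skipped or reordered.
SELECT madlib.mlp_classification(
    source_table        => 'iris_data',
    output_table        => 'mlp_model',
    independent_varname => 'ARRAY[sepal_length, sepal_width, petal_length, petal_width]',
    dependent_varname   => 'class',
    hidden_layer_sizes  => ARRAY[5, 5],
    verbose             => TRUE    -- remaining parameters skipped entirely
);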

Interfaces for Cross Validation

The general CV interface is clunky and difficult to use. Instead, we should aim to incorporate CV capabilities within each relevant function. For example, ElasticNetCV in scikit-learn includes a 'cv' parameter that selects the best estimator using cross-validation.

We have a similar setup in our own elastic net function: 'optimizer_params' contains multiple groups of parameters, including the following cross-validation parameters: n_folds, validation_result, lambda_value, n_lambdas, alpha.

Similarly, tree_train allows an 'n_folds' parameter within 'pruning_params' to perform CV over the candidate 'cp' values.
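
For illustration, a sketch of how the cross-validation keys listed above are passed through 'optimizer_params'. Table and column names are placeholders, and the positional argument order should be checked against the elastic_net_train documentation:

-- Sketch only: the point is the CV-related keys inside optimizer_params.
SELECT madlib.elastic_net_train(
    'houses',                     -- source table (placeholder)
    'houses_en_model',            -- output table (placeholder)
    'price',                      -- dependent variable
    'array[tax, bath, size]',     -- independent variables
    'gaussian',                   -- regression family
    0.5,                          -- alpha (elastic net mixing parameter)
    0.1,                          -- lambda_value
    TRUE,                         -- standardize
    NULL,                         -- grouping_col
    'fista',                      -- optimizer
    -- cross-validation is requested through the optimizer_params string:
    'n_folds=5, validation_result=houses_en_cv, n_lambdas=20, alpha={0, 0.5, 1}'
);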

The downside of algorithm-specific CV interfaces is that the user loses the flexibility to perform CV over arbitrary parameters of the function. For example, tree_train does not allow CV over 'max_depth'. Further, CV would be restricted to the functions that provide the functionality.

To overcome these limitations, we could retain the general interface, while providing specific interfaces for common use cases.

Interface for Classification vs Regression

There are multiple ways of choosing between classification and regression:

  1. Auto-detect: In this method, specific types are associated with classification and regression. Common types for classification are boolean and text; common types for regression are double precision and other numeric types.
    It is unclear which treatment is best for an integer response; we recommend treating it as a classification task, with the option to cast it to double precision if regression is desired (a sketch follows this list).
    Examples of this approach are decision tree and random forest.

  2. Separate functions: An alternative is to create separate functions for the classification and regression tasks. Examples of this approach are SVM and MLP. This method works best if the two forms have different sets of parameters.

  3. Parameter to differentiate: A parameter could be used to distinguish between the tasks, with the response variable cast to the appropriate form (e.g., boolean cast to integer for regression, or double precision cast to integer for classification). Currently MADlib does not include a function using this method.
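
A small sketch of the casting option mentioned in the auto-detect approach, using the decision tree interface. Table and column names are placeholders, and only the leading tree_train arguments are shown:

-- Auto-detect: an integer response column is treated as classification by default.
-- Casting it in the dependent-variable expression requests regression instead.
SELECT madlib.tree_train(
    'patients',                      -- source table (placeholder)
    'patients_tree',                 -- output table (placeholder)
    'id',                            -- id column
    'num_visits::double precision',  -- integer response cast to request regression
    'age, weight, smoker',           -- features
    NULL,                            -- features to exclude
    'mse'                            -- split criterion for regression
);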

Interface for Prediction

Input for prediction functions can be of the form:

SELECT *_predict(<model_table>,
                 <test_data_table>,
                 <point_id>,
                 <confidence>
                 [, algorithm-specific parameters]);

  • <point_id> should support expressions or multi-column input

  • 'Confidence' could include measures like probability, cost, or distance. More effort needs to be spent understanding the 'confidence' metrics in the various functions. More importantly, current functions distinguish between the actual prediction and the probability via a parameter whose name varies ('predict_type', 'pred_type', 'type').

Output for prediction should include:

  • ID column(s): present only if a 'point_id' parameter is provided.

  • A prediction for each tuple in 'test_data_table'.

  • 'Confidence', if requested. The sign of the confidence should be such that higher values are better (e.g., cost/distance would be reported as negative, while probability would be positive).

Another question: should we drop the requirement that the prediction data table be in the same format as the training table?
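
A concrete, purely hypothetical instantiation of the proposed prediction form above (tree_predict does not currently accept these arguments; names are placeholders):

-- Hypothetical call following the proposed interface; shown for a tree model only
-- to make the placeholders concrete.
SELECT madlib.tree_predict(
    'patients_tree',           -- <model_table>
    'patients_new',            -- <test_data_table>
    'patient_id, visit_date',  -- <point_id>: expression or multi-column input
    'probability'              -- <confidence> measure to report
);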

Misc

  • Independent variables should be allowed to be SQL expressions (including '*'). Further, an 'exclude' parameter could be provided to remove features from a '*' list. Columns used as 'id' or 'grouping' columns should be removed automatically. See 'tree_train' for an example.

  • Internal UDAs for simpler learning algorithms should be simple enough for external users to call directly in situations where a table output is not desired (see the sketch after this list).

  • Training functions, even though they store the model in an output table, should also return a string with relevant information. Examples of informative string elements include the output table name, the time taken for training, etc.

  • Each training function name should end in '_train'. Further, it should create an output table and a summary table whose formats are standardized across all learning algorithms.
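
As an example of the UDA point above, MADlib's linear regression aggregate can be invoked directly when no output table is wanted. A sketch with placeholder table and column names (the composite field names should be checked against the linregr return type):

-- Calling the internal linear-regression aggregate directly; no output table is created.
SELECT (madlib.linregr(price, ARRAY[1, tax, bath, size]::float8[])).coef
FROM houses;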

Output formats

In the first form, the model is stored as a composite type and is opaque to the user. We can provide introspection functions to understand the model (a hypothetical sketch follows the table below).

 

| group_col1 | group_col2 | model               |
|------------|------------|---------------------|
| u1         | v1         | <model for u1, v1>  |
| u1         | v2         | <model for u1, v2>  |
| ...        | ...        | ...                 |
| u2         | v1         | <model for u2, v1>  |
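
A hypothetical introspection call for the opaque model above ('model_inspect' is not an existing MADlib function; it only illustrates the idea):

-- Hypothetical: unpack the opaque composite model column for one group.
SELECT group_col1, group_col2, (madlib.model_inspect(model)).*
FROM mlp_model
WHERE group_col1 = 'u1' AND group_col2 = 'v1';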


In the second form, the model elements are stored as columns in the output table, exposing them directly to the user.

 

| group_col1 | group_col2 | coef               | std_err                | ... |
|------------|------------|--------------------|------------------------|-----|
| u1         | v1         | <coef for u1, v1>  | <std. err for u1, v1>  |     |
| u1         | v2         | <coef for u1, v2>  |                        |     |
| ...        | ...        | ...                |                        |     |
| u2         | v1         | <coef for u2, v1>  |                        |     |

Summary table could include:

- method  

- source_table    

- model_table

- dependent_varname   

- independent_varname

- <other algo specific parameters>

- grouping_col    

- optimizer_params    

- num_all_groups  

- num_failed_groups   

- total_rows_processed    

- total_rows_skipped 

- time_stamp_start

- time_stamp_end

- elapsed_time

- user_string_1 (user label or name)

- user_string_2 (user description)



 
