Naming and ordering

Wiki page enumerating argument and function names used across MADlib: xxx

...

A common discrepancy not highlighted in the document above is `col` vs `column`. Some argument names are of the form `*_column_name` while others are `*_col_name`. One of the two forms must be chosen and applied across the whole product (including internal source code).

Interfaces for Cross validation

The general CV interface is clunky and difficult to use. Instead, we should aim to incorporate CV capabilities within each relevant function. For example, the Elastic Net CV estimator in scikit-learn includes a ‘cv’ parameter and picks the best estimator using cross-validation.
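A minimal sketch of what a built-in ‘cv’ parameter could look like, written in plain Python rather than as a MADlib function. The names `ridge_train`, `fit_ridge_1d` and `mse`, and the one-feature ridge model itself, are hypothetical and chosen only to keep the example self-contained; they are not part of MADlib or scikit-learn.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope of a no-intercept ridge regression on one feature."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def ridge_train(xs, ys, lambdas, cv=3):
    """Train with k-fold cross-validation built into the training call.

    Row i is held out in fold (i % cv). The candidate lambda with the
    lowest mean held-out error is selected, then the model is refit on
    all rows with that lambda.
    """
    rows = list(zip(xs, ys))
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        fold_errors = []
        for k in range(cv):
            train = [r for i, r in enumerate(rows) if i % cv != k]
            held = [r for i, r in enumerate(rows) if i % cv == k]
            w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
            fold_errors.append(mse(w, [x for x, _ in held], [y for _, y in held]))
        err = sum(fold_errors) / cv
        if err < best_err:
            best_lam, best_err = lam, err
    return {
        "lambda": best_lam,
        "coef": fit_ridge_1d(xs, ys, best_lam),
        "cv_error": best_err,
    }


model = ridge_train([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12], [0.0, 10.0, 100.0], cv=3)
print(model["lambda"], model["coef"])
```

The caller passes only the candidate values and the number of folds; fold construction and model selection stay inside the training function.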

...

To overcome these limitations, we could retain the general interface, while providing specific interfaces for common use cases.

Interface for Classification vs Regression

There are multiple ways of choosing between classification and regression:

  1. Auto-detect: In this method, specific types are associated with classification and regression. Common types for classification are boolean and text. Common types for regression are double precision and other numeric types.
    The best treatment of an integer response is unclear. It is recommended to treat it as a classification task, with the option to cast it to double precision if regression is desired.
    An example of this method is tree_train.

  2. Separate functions: An alternative method is to create separate functions for the classification and regression tasks. The best example of this is ‘svm_classification’ and ‘svm_regression’. This method works best if the two forms of the function have different sets of parameters.

  3. Parameter to differentiate: A parameter could be used to distinguish between the tasks, with the response variable cast to the appropriate form (e.g. boolean cast to integer for regression, or double precision cast to integer for classification). Currently MADlib does not include a function using this method.
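The auto-detect rule in option 1 can be sketched as a lookup on the response column's SQL type. This is an illustration only, not how tree_train is implemented: the function name `detect_task` is hypothetical, and ‘real’ and ‘numeric’ are assumed stand-ins for "other numeric types".

```python
# Types named in the text, plus the recommended treatment of integer.
CLASSIFICATION_TYPES = {"boolean", "text", "integer"}
# 'real' and 'numeric' are assumed examples of "other numeric types".
REGRESSION_TYPES = {"double precision", "real", "numeric"}


def detect_task(response_type):
    """Return 'classification' or 'regression' for a SQL type name."""
    name = response_type.strip().lower()
    if name in CLASSIFICATION_TYPES:
        return "classification"
    if name in REGRESSION_TYPES:
        return "regression"
    raise ValueError("cannot infer task from response type: " + response_type)


print(detect_task("integer"))           # integer defaults to classification
print(detect_task("double precision"))  # after a cast, the same column is regression
```

Under this rule a user who wants regression on an integer response casts the column to double precision before training.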

Interface for Prediction

Input for prediction functions can be of the form:

...

  • ID column(s): These should be present only if a ‘point_id’ parameter is provided.

  • Prediction for each tuple in ‘test_data_table’

  • ‘Confidence’ if required. The sign of the confidence should be such that higher values are better (e.g. cost/distance would be negative while probability would be positive).
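The sign convention for ‘Confidence’ can be sketched as a small normalisation step. The helper name `to_confidence` and the set of score kinds are hypothetical; the only rule taken from the text is that costs and distances are negated while probabilities are kept as they are.

```python
# Whether a larger raw score already means a better prediction.
HIGHER_IS_BETTER = {"probability": True, "distance": False, "cost": False}


def to_confidence(score, kind):
    """Map a raw score to a confidence where higher values are better."""
    if kind not in HIGHER_IS_BETTER:
        raise ValueError("unknown score kind: " + kind)
    return score if HIGHER_IS_BETTER[kind] else -score


# The closest centroid (smallest distance) gets the highest confidence.
distances = [3.0, 1.0, 2.0]
confidences = [to_confidence(d, "distance") for d in distances]
print(confidences.index(max(confidences)))
```

With this convention a caller can always rank predictions by taking the maximum confidence, regardless of which model produced them.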

Misc

  • Independent variables should be allowed to be SQL expressions (including *). Further, an ‘exclude’ parameter could be provided to remove features from a ‘*’ list. Columns that are used in ‘id’ or ‘grouping’ should automatically be removed. See ‘tree_train’ for examples.

  • Internal UDAs for simpler learning algorithms should be simple enough for external users to call directly in situations where a table output is not desired.

  • Training functions, although they store the model in an output table, should also return a string with relevant information. Examples of informative string elements include the output table name, time taken for training, …

  • Each train function name should end in ‘_train’. Further, it should create an output table and a summary table, and the formats of both should be standard across all learning algorithms.
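The handling of ‘*’ and ‘exclude’ described in the first item above can be sketched as follows. This is an illustration under stated assumptions, not the tree_train implementation: the function name `expand_features` is hypothetical, features are passed as a list of expressions so that no SQL parsing is needed, and for simplicity the excluded, id and grouping names are dropped wherever they appear rather than only from the ‘*’ expansion.

```python
def expand_features(features, table_columns, exclude=(), id_cols=(), grouping_cols=()):
    """Expand '*' into the table's columns and drop unwanted names.

    features      : list of column names or SQL expressions; '*' means all columns
    table_columns : column names of the source table, in table order
    """
    dropped = set(exclude) | set(id_cols) | set(grouping_cols)
    expanded = []
    for item in features:
        item = item.strip()
        candidates = table_columns if item == "*" else [item]
        for name in candidates:
            # Keep first occurrence only, and skip anything marked for removal.
            if name not in dropped and name not in expanded:
                expanded.append(name)
    return expanded


columns = ["id", "region", "age", "income", "zip"]
print(expand_features(["*"], columns, exclude=["zip"],
                      id_cols=["id"], grouping_cols=["region"]))
```

Expressions such as `log(income)` pass through unchanged, so a caller can mix ‘*’ with derived features in a single list.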

Output formats

In the first form, the model is a composite type and is opaque to the user. We can provide introspection functions to help users understand the model.

...