...

  1. Auto-detect: In this method, specific response types are associated with classification and regression. Common types for classification are boolean and text. Common types for regression are double precision and other numeric types.
    The best treatment of an integer response type is unclear. The recommendation is to treat it as a classification task, with the option to cast it to double precision if regression is desired.
    Examples of this approach are decision tree and random forest.

  2. Separate functions: An alternative method is to create separate functions for the classification and regression tasks. Examples of this approach are SVM (‘svm_classification’ and ‘svm_regression’) and MLP. This method works best if the two forms of the function have different sets of parameters.

  3. Parameter to differentiate: A parameter could be used to distinguish between the tasks, with the response variable cast to the appropriate form (e.g. boolean cast to integer for regression, or double precision cast to integer for classification). Currently MADlib does not include a function using this method.
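
The auto-detect rule in item 1 can be sketched as a small lookup. This is only an illustration under stated assumptions: `infer_task` and the type sets are hypothetical names, not MADlib functions, and the exact list of numeric types is an assumption based on common PostgreSQL type names.

```python
# Hypothetical sketch of the auto-detect rule; not a MADlib API.
# The type sets below are assumptions drawn from the text above.
CLASSIFICATION_TYPES = {"boolean", "text"}
REGRESSION_TYPES = {"double precision", "real", "numeric"}
INTEGER_TYPES = {"smallint", "integer", "bigint"}


def infer_task(response_type, cast_integer_to_double=False):
    """Map a response column's SQL type name to a learning task.

    Integers default to classification; the caller can opt into
    regression, which corresponds to casting the column to double precision.
    """
    type_name = response_type.strip().lower()
    if type_name in CLASSIFICATION_TYPES:
        return "classification"
    if type_name in REGRESSION_TYPES:
        return "regression"
    if type_name in INTEGER_TYPES:
        return "regression" if cast_integer_to_double else "classification"
    raise ValueError("unsupported response type: " + response_type)


print(infer_task("boolean"))
print(infer_task("integer"))
print(infer_task("integer", cast_integer_to_double=True))
```

Keeping the integer case behind an explicit flag mirrors the recommendation above: the default stays classification, and regression requires a deliberate choice by the user.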

...

  • ID column(s): These should be present only if a ‘point_id’ parameter is provided.

  • Prediction for each tuple in ‘test_data_table’

  • ‘Confidence’ if required. The sign of the confidence should be such that higher values are better (e.g. a cost or distance would be reported as a negative value, while a probability would be positive).
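
The sign convention for ‘Confidence’ can be illustrated with a minimal sketch. The helper names `to_confidence` and `best_label` are hypothetical and exist only for this example; the distances are made-up inputs, not results from any MADlib function.

```python
# Hypothetical sketch of the "higher is better" confidence convention.
def to_confidence(score, higher_is_better):
    """Return the score unchanged if higher is already better, else negate it."""
    return score if higher_is_better else -score


def best_label(scores, higher_is_better):
    """Pick the label with the highest confidence after normalizing the sign."""
    confidence = {label: to_confidence(s, higher_is_better)
                  for label, s in scores.items()}
    return max(confidence, key=confidence.get)


# Distances: smaller is better, so they are negated before comparison.
distances = {"cluster_a": 2.5, "cluster_b": 0.7}
print(best_label(distances, higher_is_better=False))

# Probabilities: larger is better, so they are used as-is.
probabilities = {"yes": 0.8, "no": 0.2}
print(best_label(probabilities, higher_is_better=True))
```

With a single convention, downstream code can always rank predictions with a plain maximum, regardless of which algorithm produced them.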

Misc

...

Another question: should we remove the requirement that the data table for predict be in the same format as the training table?

Model Management

A proposed approach to model management and model versioning is given in this JIRA. It includes automatically saving models that have been run in the past, and adding more metadata to the summary table (e.g., time taken to train the model).

Misc

  • Independent variables should be allowed to be SQL expressions (including *). Further, an ‘exclude’ parameter could be provided to remove features from a ‘*’ list. Columns that are used in ‘id’ or ‘grouping’ should automatically be removed. See decision tree as an example.

  • Internal UDAs for simpler learning algorithms should be simple enough for external users to call directly in situations where a table output is not desired. Training functions, even though they store the model in an output table, should return a string with relevant information. Examples of informative string elements include the output table name, time taken for training, …

  • Each train function name should end in ‘_train’. Further, it should create an output table and a summary table, and the formats of both should be standard across all learning algorithms.

  • Should prediction metrics be part of the training output?
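
The handling of ‘*’ with an ‘exclude’ list described in the first bullet can be sketched as follows. This is a minimal illustration, not how any MADlib function is implemented: `expand_independent_vars` and its parameters are hypothetical, and any expression other than ‘*’ is assumed to be passed through unparsed.

```python
# Hypothetical sketch of '*' expansion with exclusions; not a MADlib API.
def expand_independent_vars(expression, table_columns,
                            exclude=(), id_cols=(), grouping_cols=()):
    """Return the select-list items to use as independent variables.

    For '*', every table column is used except those listed in `exclude`
    and those already serving as id or grouping columns. Any other SQL
    expression is returned untouched as a single item.
    """
    if expression.strip() != "*":
        return [expression.strip()]
    dropped = set(exclude) | set(id_cols) | set(grouping_cols)
    # Preserve the table's column order for a stable feature layout.
    return [col for col in table_columns if col not in dropped]


columns = ["id", "region", "x1", "x2", "x3", "y"]
features = expand_independent_vars(
    "*", columns, exclude=["y", "x3"], id_cols=["id"], grouping_cols=["region"])
print(features)
```

In this sketch the response column has to be listed in `exclude` explicitly; whether it should be dropped automatically, like the id and grouping columns, is left open here.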

Output formats

In the first form, the model is a composite type and opaque to the user. We can provide introspection functions to understand the model. This is related to the model management section above.

 

group_col1 | group_col2 | model
u1         | v1         | <model for u1, v1>
u1         | v2         | <model for u1, v2>
...        |            |
u2         | v1         | <model for u2, v1>


In the second form, the model elements are stored as columns in the output table, exposing the elements to the user.

...