...

  • class AvgVarTransitionState overloads the AnyType() conversion operator so that we can directly return state, an instance of AvgVarTransitionState, even though the function is declared to return an AnyType.

Bridging class

...

Below is the method that overloads the operator += for the bridging class AvgVarTransitionState:

Code Block
languagecpp
/**
 * @brief Merge with another State object
 *
 * We update the mean and variance in an online fashion
 * to avoid a large intermediate sum.
 */
template <class OtherHandle>
AvgVarTransitionState &operator+=(
        const AvgVarTransitionState<OtherHandle> &inOtherState) {

    if (mStorage.size() != inOtherState.mStorage.size())
        throw std::logic_error("Internal error: Incompatible transition "
                               "states");

    double avg_ = inOtherState.avg;
    double var_ = inOtherState.var;
    uint16_t numRows_ = static_cast<uint16_t>(inOtherState.numRows);
    double totalNumRows = static_cast<double>(numRows + numRows_);
    double p = static_cast<double>(numRows) / totalNumRows;
    double p_ = static_cast<double>(numRows_) / totalNumRows;
    double totalAvg = avg * p + avg_ * p_;
    double a = avg - totalAvg;
    double a_ = avg_ - totalAvg;

    numRows += numRows_;
    var = p * var + p_ * var_ + p * a * a + p_ * a_ * a_;
    avg = totalAvg;
    return *this;
}

Given the mean, variance, and size of two data sets, Welford's method computes the mean and variance of the two data sets combined; this is exactly what operator+= above implements.
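In symbols, writing $(n_1, \mu_1, \sigma_1^2)$ and $(n_2, \mu_2, \sigma_2^2)$ for the sizes, means and variances held by the two states, the merge above computes

$$\mu = p\,\mu_1 + p'\,\mu_2, \qquad \sigma^2 = p\,\sigma_1^2 + p'\,\sigma_2^2 + p\,(\mu_1 - \mu)^2 + p'\,(\mu_2 - \mu)^2,$$

where $p = n_1/(n_1+n_2)$ and $p' = n_2/(n_1+n_2)$. In the code these are the variables p, p_, a = $\mu_1 - \mu$ and a_ = $\mu_2 - \mu$.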

4. Register the C++ header files

The SQL functions defined in avg_var.sql_in need to be able to locate the actual implementations in the C++ files. This is done by adding the following line to the file declarations.hpp under ./src/modules:

Code Block
languagecpp
#include "hello_world/avg_var.hpp"

5. Running the new module

To use the new module we can, for example, launch a psql terminal and apply it to the patients dataset, described here. The result below shows that half of the 20 patients have had second heart attacks within 1 year (yes = 1).

Code Block
languagesql
SELECT madlib.avg_var(second_attack) FROM patients;

-- ************ --
--    Result    --
-- ************ --
+-------------------+
| avg_var           |
|-------------------|
| [0.5, 0.25, 20.0] |
+-------------------+
-- (average, variance, count) --
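As a quick sanity check, the same triple can be reproduced in plain Python. This is only a sketch, assuming (as stated above) that exactly 10 of the 20 patients have second_attack = 1:

Code Block
languagepy
# Reproduce avg_var's output for ten 1s and ten 0s.
values = [1] * 10 + [0] * 10
n = len(values)
avg = sum(values) / float(n)                      # 0.5
var = sum((v - avg) ** 2 for v in values) / n     # 0.25 (population variance)
print([avg, var, float(n)])                       # [0.5, 0.25, 20.0]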

 

...

The files for the above exercise can be found in the examples folder of the source code.

Adding an Iterative UDF

In this section we demonstrate a slightly more complicated example that requires invoking a UDA iteratively. Such cases can often be found in machine learning modules where the underlying optimization algorithm takes iterative steps towards the optimum of the objective. In this example we implement a simple logistic regression solver as an iterative UDF. In particular, the user will be able to type the following command in psql to train a logistic regression classifier:

Code Block
languagesql
SELECT madlib.logregr_simple_train('patients','logreg_mdl', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]');

and to see the results:

Code Block
languagesql
SELECT * FROM logreg_mdl;

Here the data is stored in a SQL TABLE called patients. The target for logistic regression is the column second_attack and the features are the columns treatment and trait_anxiety. The 1 entry in the ARRAY denotes an additional bias term in the model.

...

Compared to the steps presented in the previous section, here we do not need to modify the Modules.yml file because we are not creating a new module. Another difference is that we create an additional .py_in Python file along with the .sql_in file; that is where most of the iterative logic is implemented.

1. Overview

The overall logic is split into three parts. All the UDFs and UDAs are defined in simple_logistic.sql_in. The transition, merge and final functions are implemented in C++; together they constitute the UDA called __logregr_simple_step, which takes one step from the current state to decrease the logistic regression objective. Finally, in simple_logistic.py_in the plpy package is used to implement, in Python, a UDF called logregr_simple_train, which invokes __logregr_simple_step iteratively until convergence.

The SQL function logregr_simple_train is defined in simple_logistic.sql_in as follows:

Code Block
languagesql
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_simple_train (
    source_table        VARCHAR,
    out_table           VARCHAR,
    dependent_varname   VARCHAR,
    independent_varname VARCHAR,
    max_iter            INTEGER,
    tolerance           DOUBLE PRECISION,
    verbose             BOOLEAN
) RETURNS VOID AS $$
PythonFunction(hello_world, simple_logistic, logregr_simple_train)
$$ LANGUAGE plpythonu
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');

where PythonFunction(hello_world, simple_logistic, logregr_simple_train) denotes that the actual implementation is provided by a Python function logregr_simple_train inside the file simple_logistic in module hello_world, as shown below:

Code Block
languagepy
def logregr_simple_train(
        schema_madlib, source_table, out_table, dependent_varname,
        independent_varname, max_iter=None, tolerance=None,
        verbose=None, **kwargs):
    """
    Train logistic model

    @param schema_madlib Name of the MADlib schema, properly escaped/quoted
    @param source_table Name of relation containing the training data
    @param out_table Name of relation where the model will be outputted
    @param dependent_varname Name of the dependent column in the training data (of type BOOLEAN)
    @param independent_varname Name of the independent column in the training data (of type DOUBLE PRECISION[])
    @param max_iter The maximum number of iterations that are allowed
    @param tolerance The precision that the results should have
    @param kwargs We allow the caller to specify additional arguments (all of
           which will be ignored though). The purpose of this is to allow the
           caller to unpack a dictionary whose element set is a superset of
           the arguments required by this function.

    @return A composite value __logregr_simple_result, as defined in simple_logistic.sql_in
    """
    return __logregr_train_compute(
        schema_madlib, source_table, out_table, dependent_varname,
        independent_varname, max_iter, tolerance, verbose, **kwargs)

...

 

 

2. Iterative procedures in `plpy`

The iterative logic is implemented using the PL/Python procedural language. At the beginning of `simple_logistic.py_in` we import a Python module called `plpy` which provides several functions to execute database commands. Implementing the iterative logic using `plpy` is simple, as demonstrated below:

 

Code Block
languagepy
# Prepare a plan that runs one step of the UDA over the source table;
# $1 is the state carried over from the previous iteration.
update_plan = plpy.prepare(
    """
    SELECT {schema_madlib}.__logregr_simple_step(
        ({dep_col})::boolean,
        ({ind_col})::double precision[],
        ($1))
    FROM {tbl_source}
    """.format(
        tbl_output=tbl_output,
        schema_madlib=schema_madlib,
        dep_col=dep_col,
        ind_col=ind_col,
        tbl_source=tbl_source),
    ["double precision[]"])

# Feed the state returned by each iteration back into the next one.
state = None
for it in range(0, max_iter):
    res_tuple = plpy.execute(update_plan, [state])
    state = res_tuple[0].values()[0]

 

- __logregr_simple_step is a UDA defined in simple_logistic.sql_in and implemented using the transition, merge and final functions provided in the C++ files in ./src/modules/hello_world.

- __logregr_simple_step takes three arguments: the target, the features and the previous state.

- The state is initialized as None, which is interpreted as a null value in SQL by plpy.

- A more sophisticated iterative scheme for logistic regression would also involve optimality verification and convergence guarantee procedures, which are neglected on purpose here for simplicity; a sketch of a simple convergence check is given after this list.

- For a production-level implementation of logistic regression, refer to the module regress.
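To illustrate the kind of convergence check the module deliberately omits, below is a minimal sketch (not the module's actual code) that stops the loop once successive states are closer than tolerance. It assumes the state returned by __logregr_simple_step arrives in Python as a list of floats, and reuses update_plan, max_iter and tolerance from the surrounding function:

Code Block
languagepy
import math

state = None
for it in range(0, max_iter):
    res_tuple = plpy.execute(update_plan, [state])
    new_state = res_tuple[0].values()[0]
    if state is not None and new_state is not None:
        # Euclidean distance between successive states; both are
        # assumed to be plain Python lists of floats here.
        distance = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(new_state, state)))
        if distance < tolerance:
            state = new_state
            break
    state = new_state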

3. Running the new iterative module

The example below demonstrates the usage of madlib.logregr_simple_train on the patients table. The trained classification model is stored in the table called logreg_mdl and can be viewed using a standard SQL query.

 

Code Block
languagesql
SELECT madlib.logregr_simple_train( 
    'patients',                                 -- source table
    'logreg_mdl',                               -- output table
    'second_attack',                            -- labels
    'ARRAY[1, treatment, trait_anxiety]');      -- features
SELECT * FROM logreg_mdl;

-- ************ --
--    Result    --
-- ************ --
+--------------------------------------------------+------------------+
| coef                                             |   log_likelihood |
|--------------------------------------------------+------------------|
| [-6.27176619714, -0.84168872422, 0.116267554551] |         -9.42379 |
+--------------------------------------------------+------------------+
 

...

The files for the above exercise can be found in the examples folder of the source code.