...
- class `AvgVarTransitionState` overloads the `AnyType()` operator so that we can directly return `state`, an instance of `AvgVarTransitionState`, while the function is expected to return an `AnyType`.
Bridging class
...
Below is the method that overloads the operator `+=` for the bridging class `AvgVarTransitionState`:

```cpp
/**
 * @brief Merge with another State object
 *
 * We update mean and variance in an online fashion
 * to avoid an intermediate large sum.
 */
template <class OtherHandle>
AvgVarTransitionState &operator+=(
        const AvgVarTransitionState<OtherHandle> &inOtherState) {
    if (mStorage.size() != inOtherState.mStorage.size())
        throw std::logic_error("Internal error: Incompatible transition "
                               "states");

    double avg_ = inOtherState.avg;
    double var_ = inOtherState.var;
    uint16_t numRows_ = static_cast<uint16_t>(inOtherState.numRows);
    double totalNumRows = static_cast<double>(numRows + numRows_);
    double p = static_cast<double>(numRows) / totalNumRows;
    double p_ = static_cast<double>(numRows_) / totalNumRows;
    double totalAvg = avg * p + avg_ * p_;
    double a = avg - totalAvg;
    double a_ = avg_ - totalAvg;

    numRows += numRows_;
    var = p * var + p_ * var_ + p * a * a + p_ * a_ * a_;
    avg = totalAvg;
    return *this;
}
```
- Given the mean, variance, and size of two data sets, this update (Welford's method) computes the mean and variance of the two data sets combined.
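The merge formula can be sketched in plain Python (a hypothetical standalone check, not MADlib code; states are `(avg, var, n)` triples with population variance):

```python
def merge(state_a, state_b):
    # Mirror of the C++ operator+= above: combine two (avg, var, n)
    # summaries without revisiting the underlying rows.
    avg, var, n = state_a
    avg_, var_, n_ = state_b
    total = n + n_
    p, p_ = n / total, n_ / total
    total_avg = avg * p + avg_ * p_
    a, a_ = avg - total_avg, avg_ - total_avg
    total_var = p * var + p_ * var_ + p * a * a + p_ * a_ * a_
    return (total_avg, total_var, total)

# [1, 2] has mean 1.5 and variance 0.25; [3, 4] has mean 3.5 and
# variance 0.25; merged, [1, 2, 3, 4] has mean 2.5 and variance 1.25.
print(merge((1.5, 0.25, 2), (3.5, 0.25, 2)))  # (2.5, 1.25, 4)
```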
4. Register the C++ header files
The SQL functions defined in avg_var.sql_in
need to be able to locate the actual implementations from the C++ files. This is done by simply adding the following line to the file declarations.hpp
under `./src/modules/`:

```cpp
#include "hello_world/avg_var.hpp"
```
5. Running the new module
To use the new module we can, for example, launch a psql terminal and apply it to the `patients` dataset, described here. The result below shows that half of the 20 patients have had second heart attacks within 1 year (yes = 1).
```sql
SELECT madlib.avg_var(second_attack) FROM patients;

-- ************ --
--    Result    --
-- ************ --
+-------------------+
|      avg_var      |
|-------------------|
| [0.5, 0.25, 20.0] |
+-------------------+
-- (average, variance, count) --
```
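As a plain-Python sanity check (hypothetical helper names, not MADlib code), simulating the aggregate's transition and final steps over a column of 10 ones and 10 zeros reproduces this result:

```python
def transition(state, x):
    # state = (avg, var, n); fold one new value in online, so no
    # intermediate sum over all rows is ever materialized.
    avg, var, n = state
    n_new = n + 1
    avg_new = avg + (x - avg) / n_new
    var_new = (var * n + (x - avg) * (x - avg_new)) / n_new
    return (avg_new, var_new, n_new)

def final(state):
    avg, var, n = state
    return [avg, var, float(n)]

state = (0.0, 0.0, 0)
for x in [1.0] * 10 + [0.0] * 10:   # 10 "yes", 10 "no" patients
    state = transition(state, x)
print(final(state))  # approximately [0.5, 0.25, 20.0]
```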
...
The files for the above exercise can be found in the examples folder of the source code.
Adding An Iterative UDF
In this section we demonstrate a slightly more complicated example that requires invoking a UDA iteratively. Such cases can often be found in machine learning modules where the underlying optimization algorithm takes iterative steps towards the optimum of the objective. In this example we implement a simple logistic regression solver as an iterative UDF. In particular, the user will be able to type the following command in `psql` to train a logistic regression classifier:
```sql
SELECT madlib.logregr_simple_train(
    'patients',
    'logreg_mdl',
    'second_attack',
    'ARRAY[1, treatment, trait_anxiety]');
```
and to see the results:
```sql
SELECT * FROM logreg_mdl;
```
Here the data is stored in a SQL table called `patients`. The target for logistic regression is the column `second_attack`, and the features are the columns `treatment` and `trait_anxiety`. The `1` entry in the `ARRAY` denotes an additional bias term in the model.
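In Python terms, each row of `patients` contributes one feature vector whose leading entry is the constant 1 multiplying the intercept (bias) coefficient (the column values below are made up for illustration):

```python
# One hypothetical patient row from the patients table
row = {"treatment": 1, "trait_anxiety": 70}

# ARRAY[1, treatment, trait_anxiety] builds the per-row feature vector;
# the leading 1 corresponds to the bias term in the model.
x = [1, row["treatment"], row["trait_anxiety"]]
print(x)  # [1, 1, 70]
```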
...
Compared to the steps presented in the last section, here we do not need to modify the `Modules.yml` file because we are not creating a new module. Another difference is that we create an additional `.py_in` Python file along with the `.sql_in` file; that is where most of the iterative logic is implemented.
1. Overview
The overall logic is split into three parts. All the UDFs and UDAs are defined in `simple_logistic.sql_in`. The `transition`, `merge` and `final` functions are implemented in C++; together they constitute the UDA called `__logregr_simple_step`, which takes one step from the current state to decrease the logistic regression objective. Finally, in `simple_logistic.py_in` the `plpy` package is used to implement, in Python, a UDF called `logregr_simple_train` which invokes `__logregr_simple_step` iteratively until convergence.
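Conceptually, one invocation of the step UDA maps the current coefficient state to an improved one. The sketch below illustrates the idea with a plain gradient-ascent update on the log-likelihood (an assumption for illustration only; the actual C++ implementation may use a different update rule, such as Newton steps):

```python
import math

def logistic_step(coef, data, lr=0.5):
    # One pass over the data: accumulate the log-likelihood gradient
    # at the current coefficients, then take a single ascent step.
    grad = [0.0] * len(coef)
    for y, x in data:                  # y: boolean target, x: features
        z = sum(c * xi for c, xi in zip(coef, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, xi in enumerate(x):
            grad[j] += ((1.0 if y else 0.0) - p) * xi
    return [c + lr * g for c, g in zip(coef, grad)]

# Tiny example: a bias entry (the leading 1) plus one feature.
data = [(True, [1.0, 2.0]), (False, [1.0, -1.0])]
coef = [0.0, 0.0]
for _ in range(50):                    # the driver UDF iterates like this
    coef = logistic_step(coef, data)
```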
The SQL function `logregr_simple_train` is defined in `simple_logistic.sql_in` as follows:
```sql
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_simple_train (
    source_table            VARCHAR,
    out_table               VARCHAR,
    dependent_varname       VARCHAR,
    independent_varname     VARCHAR,
    max_iter                INTEGER,
    tolerance               DOUBLE PRECISION,
    verbose                 BOOLEAN
) RETURNS VOID AS $$
PythonFunction(hello_world, simple_logistic, logregr_simple_train)
$$ LANGUAGE plpythonu
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
```
where `PythonFunction(hello_world, simple_logistic, logregr_simple_train)` denotes that the actual implementation is provided by a Python function `logregr_simple_train` inside the file `simple_logistic` in module `hello_world`, as shown below:
```python
def logregr_simple_train(
        schema_madlib, source_table, out_table, dependent_varname,
        independent_varname, max_iter=None, tolerance=None,
        verbose=None, **kwargs):
    """
    Train logistic model

    @param schema_madlib Name of the MADlib schema, properly escaped/quoted
    @param source_table Name of relation containing the training data
    @param out_table Name of relation where model will be outputted
    @param dependent_varname Name of dependent column in training data
                             (of type BOOLEAN)
    @param independent_varname Name of independent column in training data
                               (of type DOUBLE PRECISION[])
    @param max_iter The maximum number of iterations that are allowed
    @param tolerance The precision that the results should have
    @param kwargs We allow the caller to specify additional arguments (all of
                  which will be ignored though). The purpose of this is to
                  allow the caller to unpack a dictionary whose element set
                  is a superset of the required arguments by this function.

    @return A composite value which is __logregr_simple_result defined in
            simple_logistic.sql_in
    """
    return __logregr_train_compute(
        schema_madlib, source_table, out_table, dependent_varname,
        independent_varname, max_iter, tolerance, verbose, **kwargs)
```
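The `**kwargs` parameter in the signature above lets a caller unpack a dictionary that carries more keys than the function needs; the extras are absorbed and ignored. A minimal illustration (hypothetical function, not MADlib code):

```python
def train(source_table, out_table, **kwargs):
    # Extra keyword arguments land in kwargs and are ignored.
    return (source_table, out_table)

# The dict holds a superset of the required arguments.
args = {"source_table": "patients", "out_table": "mdl", "verbose": True}
print(train(**args))  # ('patients', 'mdl')
```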
...
2.
...
Iterative
...
procedures
...
in
...
`plpy`
...
The
...
iterative
...
logic
...
is
...
implemented
...
using
...
the
...
...
...
...
. In the beginning of `simple_logistic.py_in` we import a Python module called `plpy` which provides several functions to execute database commands. Implementing the iterative logic using `plpy` is simple, as demonstrated below:
```python
update_plan = plpy.prepare(
    """
    SELECT {schema_madlib}.__logregr_simple_step(
        ({dep_col})::boolean,
        ({ind_col})::double precision[],
        ($1))
    FROM {tbl_source}
    """.format(
        tbl_output=tbl_output,
        schema_madlib=schema_madlib,
        dep_col=dep_col,
        ind_col=ind_col,
        tbl_source=tbl_source),
    ["double precision[]"])

state = None
for it in range(0, max_iter):
    res_tuple = plpy.execute(update_plan, [state])
    state = res_tuple[0].values()[0]
```
- `__logregr_simple_step` is a UDA defined in `simple_logistic.sql_in` and implemented using the `transition`, `merge` and `final` functions provided in C++ files in `./src/modules/hello_world`.
- `__logregr_simple_step` takes three arguments: the target, the features, and the previous state.
- The state is initialized as `None`, which is interpreted as a `NULL` value in SQL by `plpy`.
- A more sophisticated iterative scheme for logistic regression would also involve optimality verification and convergence-guarantee procedures, which are omitted here on purpose for simplicity.
- For a production-level implementation of logistic regression, refer to the module `regress`.
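For example, a convergence check that the simplified module deliberately omits could compare successive states (a hypothetical sketch, not part of `simple_logistic.py_in`):

```python
import math

def has_converged(old_coef, new_coef, tolerance):
    # Stop when the coefficient update is small in Euclidean norm.
    if old_coef is None:          # first iteration: nothing to compare
        return False
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(old_coef, new_coef)))
    return diff < tolerance

print(has_converged([1.0, 2.0], [1.0, 2.0 + 1e-9], 1e-6))  # True
print(has_converged(None, [0.0, 0.0], 1e-6))               # False
```

The driver loop would call such a check after each `plpy.execute` of the step UDA and break out early once it returns `True`.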
3. Running the new iterative module
The example below demonstrates the usage of `madlib.logregr_simple_train` on the `patients` table. The trained classification model is stored in the table called `logreg_mdl` and can be viewed using a standard SQL query.
```sql
SELECT madlib.logregr_simple_train(
    'patients',                             -- source table
    'logreg_mdl',                           -- output table
    'second_attack',                        -- labels
    'ARRAY[1, treatment, trait_anxiety]');  -- features

SELECT * FROM logreg_mdl;

-- ************ --
--    Result    --
-- ************ --
+--------------------------------------------------+------------------+
| coef                                             | log_likelihood   |
|--------------------------------------------------+------------------|
| [-6.27176619714, -0.84168872422, 0.116267554551] | -9.42379         |
+--------------------------------------------------+------------------+
```
...
The files for the above exercise can be found in the examples folder of the source code.