...
In this session we demonstrate a slightly more complicated example that requires invoking a UDA iteratively. Such cases are common in machine learning modules, where the underlying optimization algorithm takes iterative steps towards the optimum of the objective function. In this example we implement a simple logistic regression solver as an iterative UDF. In particular, the user will be able to type the following command in psql to train a logistic regression classifier:
...
Here the data are stored in a SQL table called `patients`. The target for logistic regression is the column `second_attack`, and the features are the columns `treatment` and `trait_anxiety`. The `1` entry in the `ARRAY` denotes an additional bias term in the model.
We add the solver to the hello_world
module created above. Here are the main steps to follow:
- In `./src/ports/postgres/modules/hello_world`:
  - create file `__init__.py_in`
  - create file `simple_logistic.py_in`
  - create file `simple_logistic.sql_in`
- In `./src/modules/hello_world`:
  - create file `simple_logistic.cpp`
  - create file `simple_logistic.hpp`
- In `./src/modules`:
  - modify file `declarations.hpp`, appending a new line `#include "hello_world/simple_logistic.hpp"` to the end
Compared to the steps presented in the last session, here we do not need to modify the `Modules.yml` file because we are not creating a new module. Another difference is that we create an additional `.py_in` Python file along with the `.sql_in` file. That is where most of the iterative logic will be implemented.
...
The overall logic is split into three parts. All the UDFs and UDAs are defined in `simple_logistic.sql_in`. The transition, merge and final functions are implemented in C++. Together these functions constitute the UDA called `__logregr_simple_step`, which takes one step from the current state to decrease the logistic regression objective. Finally, in `simple_logistic.py_in` the `plpy` package is used to implement in Python a UDF called `logregr_simple_train`, which invokes `__logregr_simple_step` iteratively until convergence.
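Conceptually, each invocation of `__logregr_simple_step` scans the data once and moves the coefficients toward the optimum. The sketch below imitates that in pure Python with a plain gradient-ascent step on the log-likelihood; the step size and update rule are illustrative assumptions, not necessarily what the C++ transition/merge/final functions compute:

```python
import math

def logregr_simple_step(data, state, step_size=0.1):
    """One pass over the data: a gradient-ascent step on the log-likelihood.

    data  -- list of (label, features) pairs; label is a bool, features is a
             list of floats (with the bias term already included)
    state -- current coefficient vector, or None on the first call
    """
    n_features = len(data[0][1])
    coef = state if state is not None else [0.0] * n_features
    grad = [0.0] * n_features
    for label, x in data:
        # predicted probability of a positive label under the current model
        z = sum(c * xi for c, xi in zip(coef, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, xj in enumerate(x):
            grad[j] += ((1.0 if label else 0.0) - p) * xj
    # move the coefficients in the direction of the gradient
    return [c + step_size * g for c, g in zip(coef, grad)]
```

Calling this function repeatedly, feeding each returned state back in, mirrors how the UDA is invoked iteratively from `simple_logistic.py_in`.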
The SQL function `logregr_simple_train` is defined in `simple_logistic.sql_in` as follows:
```sql
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_simple_train (
    source_table            VARCHAR,
    out_table               VARCHAR,
    dependent_varname       VARCHAR,
    independent_varname     VARCHAR,
    max_iter                INTEGER,
    tolerance               DOUBLE PRECISION,
    verbose                 BOOLEAN
) RETURNS VOID AS $$
PythonFunction(hello_world, simple_logistic, logregr_simple_train)
$$ LANGUAGE plpythonu
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
```
...
```python
def logregr_simple_train(
        schema_madlib, source_table, out_table, dependent_varname,
        independent_varname, max_iter=None, tolerance=None,
        verbose=None, **kwargs):
    """
    Train logistic model

    @param schema_madlib Name of the MADlib schema, properly escaped/quoted
    @param source_table Name of relation containing the training data
    @param out_table Name of relation where model will be outputted
    @param dependent_varname Name of dependent column in training data
           (of type BOOLEAN)
    @param independent_varname Name of independent column in training data
           (of type DOUBLE PRECISION[])
    @param max_iter The maximum number of iterations that are allowed
    @param tolerance The precision that the results should have
    @param kwargs We allow the caller to specify additional arguments (all of
           which will be ignored though). The purpose of this is to allow the
           caller to unpack a dictionary whose element set is a superset of
           the arguments required by this function.

    @return A composite value which is __logregr_simple_result defined in
            simple_logistic.sql_in
    """
    return __logregr_train_compute(
        schema_madlib, source_table, out_table, dependent_varname,
        independent_varname, max_iter, tolerance, verbose, **kwargs)
```
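The `**kwargs` parameter exists so a caller can unpack a dictionary whose keys are a superset of the function's required arguments, with the extras silently absorbed. A minimal standalone illustration of that pattern (the function and argument names here are hypothetical, not part of the module):

```python
def train(source_table, out_table, max_iter=None, **kwargs):
    # Extra keyword arguments land in kwargs and are ignored.
    return (source_table, out_table, max_iter)

args = {
    'source_table': 'patients',
    'out_table': 'logreg_mdl',
    'max_iter': 20,
    'verbose': True,   # not a named parameter of train(); absorbed by **kwargs
}
result = train(**args)  # works even though args has an extra key
```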
...
2. Iterative procedures in plpy

The iterative logic is implemented using the PL/Python procedural language. At the beginning of `simple_logistic.py_in` we import a Python module called `plpy`, which provides several functions to execute database commands. Implementing the iterative logic using `plpy` is simple, as demonstrated below:
```python
update_plan = plpy.prepare(
    """
    SELECT {schema_madlib}.__logregr_simple_step(
        ({dep_col})::boolean,
        ({ind_col})::double precision[],
        ($1))
    FROM {tbl_source}
    """.format(
        tbl_output=tbl_output,
        schema_madlib=schema_madlib,
        dep_col=dep_col,
        ind_col=ind_col,
        tbl_source=tbl_source),
    ["double precision[]"])

state = None
for it in range(0, max_iter):
    res_tuple = plpy.execute(update_plan, [state])
    state = res_tuple[0].values()[0]
```
...
- The `__logregr_simple_step` is a UDA defined in `simple_logistic.sql_in` and implemented using transition, merge and final functions provided in C++ files in `./src/modules/hello_world`.
- `__logregr_simple_step` takes three arguments: the target, the features and the previous state.
- The state is initialized as `None`, which is interpreted as a `null` value in SQL by `plpy`.
- A more sophisticated iterative scheme for logistic regression would also include optimality verification and convergence guarantee procedures, which are neglected on purpose here for simplicity. For a production-level implementation of logistic regression, refer to the module `regress`.
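As noted above, convergence checks are omitted on purpose. A tolerance-based stopping rule could be layered on top of the step loop as in the sketch below; the helper name and the infinity-norm criterion are assumptions for illustration, not part of the module:

```python
def iterate_until_convergence(step_fn, max_iter, tolerance):
    """Repeatedly apply step_fn until the coefficient change is small.

    step_fn   -- maps the previous state (or None) to a new coefficient list
    max_iter  -- hard cap on the number of iterations
    tolerance -- stop when the infinity norm of the update falls below this
    """
    state = None
    for it in range(max_iter):
        new_state = step_fn(state)
        if state is not None and \
                max(abs(a - b) for a, b in zip(new_state, state)) < tolerance:
            return new_state, it + 1   # converged early
        state = new_state
    return state, max_iter             # hit the iteration cap
```

In the real module the step function would wrap a `plpy.execute` call on the prepared plan, and the state would be the UDA's serialized output.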
3. Running the new iterative module
The example below demonstrates the usage of `madlib.logregr_simple_train` on the `patients` table we used earlier. The trained classification model is stored in the table called `logreg_mdl` and can be viewed using a standard SQL query.
...
```sql
SELECT madlib.logregr_simple_train(
    'patients',                             -- source table
    'logreg_mdl',                           -- output table
    'second_attack',                        -- labels
    'ARRAY[1, treatment, trait_anxiety]');  -- features

SELECT * FROM logreg_mdl;

-- ************ --
--    Result    --
-- ************ --
+--------------------------------------------------+------------------+
| coef                                             | log_likelihood   |
|--------------------------------------------------+------------------|
| [-6.27176619714, -0.84168872422, 0.116267554551] | -9.42379         |
+--------------------------------------------------+------------------+
```
...
The files for the above exercise can be found in the `hello_world` folder of the source code repository.