...

In this session we demonstrate a slightly more complicated example which requires invoking a UDA iteratively. Such cases are common in machine learning modules, where the underlying optimization algorithm takes iterative steps towards the optimum of the objective function. In this example we implement a simple logistic regression solver as an iterative UDF. In particular, the user will be able to type the following command in psql to train a logistic regression classifier:

...

Here the data is stored in a SQL table called patients. The target for logistic regression is the column second_attack and the features are the columns treatment and trait_anxiety. The 1 entry in the ARRAY denotes an additional bias term in the model.
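The table itself is not part of the module; the following is a minimal sketch of what it could look like. The schema and sample rows are illustrative assumptions that are merely consistent with the training call, not the canonical dataset:

Code Block
languagesql
-- Hypothetical schema for the patients table used throughout this session.
-- Column names match the training call; the sample rows are made up.
CREATE TABLE patients (
    id            SERIAL,
    second_attack INTEGER,          -- label (cast to boolean by the solver)
    treatment     INTEGER,          -- feature: 1 = received treatment
    trait_anxiety DOUBLE PRECISION  -- feature: anxiety score
);
INSERT INTO patients (second_attack, treatment, trait_anxiety) VALUES
    (1, 1, 70),
    (0, 0, 50);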

We add the solver to the hello_world module created above. Here are the main steps to follow:

  • In ./src/ports/postgres/modules/hello_world
    • create file __init__.py_in
    • create file simple_logistic.py_in
    • create file simple_logistic.sql_in
  • In ./src/modules/hello_world
    • create file simple_logistic.cpp
    • create file simple_logistic.hpp
  • In ./src/modules
    • modify file declarations.hpp: append a new line #include "hello_world/simple_logistic.hpp" to the end.

Compared to the steps presented in the last session, here we do not need to modify the Modules.yml file because we are not creating a new module. Another difference is that we create an additional .py_in Python file along with the .sql_in file. That is where most of the iterative logic will be implemented.

...

The overall logic is split into three parts. All the UDFs and UDAs are defined in simple_logistic.sql_in. The transition, merge and final functions are implemented in C++. Together those functions constitute the UDA called __logregr_simple_step, which takes one step from the current state to decrease the logistic regression objective. Finally, in simple_logistic.py_in the plpy package is used to implement in Python a UDF called logregr_simple_train, which invokes __logregr_simple_step iteratively until convergence.
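To make this division of labor concrete, the glue in simple_logistic.sql_in that ties the three C++ functions into the UDA follows the standard PostgreSQL CREATE AGGREGATE pattern, roughly as sketched below. The component function names and the state type here are assumptions based on common MADlib conventions, not the file's exact contents:

Code Block
languagesql
-- Sketch of how a transition, merge and final function form one UDA.
-- The component function names are illustrative placeholders.
CREATE AGGREGATE MADLIB_SCHEMA.__logregr_simple_step(
        BOOLEAN,              -- target (dependent variable)
        DOUBLE PRECISION[],   -- features (independent variables)
        DOUBLE PRECISION[]    -- previous state
) (
    STYPE = DOUBLE PRECISION[],
    SFUNC = MADLIB_SCHEMA.__logregr_simple_step_transition,
    FINALFUNC = MADLIB_SCHEMA.__logregr_simple_step_final
    -- On distributed backends such as Greenplum, a PREFUNC line naming the
    -- merge function would combine partial states computed on each segment.
);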

The SQL function logregr_simple_train is defined in simple_logistic.sql_in as follows:

Code Block
languagesql
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_simple_train(
    source_table        VARCHAR,
    out_table           VARCHAR,
    dependent_varname   VARCHAR,
    independent_varname VARCHAR,
    max_iter            INTEGER,
    tolerance           DOUBLE PRECISION,
    verbose             BOOLEAN
) RETURNS VOID AS $$
PythonFunction(hello_world, simple_logistic, logregr_simple_train)
$$ LANGUAGE plpythonu
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
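For reference, a call supplying every declared parameter explicitly might look like the following. The numeric values are illustrative, and the shorter four-argument call shown in section 3 presumably relies on overloads or defaults declared elsewhere in the file:

Code Block
languagesql
-- Illustrative call with all parameters spelled out; the values for
-- max_iter and tolerance are made up for this sketch.
SELECT madlib.logregr_simple_train(
    'patients',                             -- source_table
    'logreg_mdl',                           -- out_table
    'second_attack',                        -- dependent_varname
    'ARRAY[1, treatment, trait_anxiety]',   -- independent_varname
    20,                                     -- max_iter
    0.0001,                                 -- tolerance
    FALSE);                                 -- verbose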

...

Code Block
languagepy
def logregr_simple_train(
        schema_madlib, source_table, out_table, dependent_varname,
        independent_varname, max_iter=None,
        tolerance=None, verbose=None, **kwargs):
    """
    Train logistic model

    @param schema_madlib Name of the MADlib schema, properly escaped/quoted
    @param source_table Name of relation containing the training data
    @param out_table Name of relation where model will be outputted
    @param dependent_varname Name of dependent column in training data (of type BOOLEAN)
    @param independent_varname Name of independent column in training data (of type
                   DOUBLE PRECISION[])
    @param max_iter The maximum number of iterations that are allowed
    @param tolerance The precision that the results should have
    @param verbose Whether verbose output should be produced
    @param kwargs We allow the caller to specify additional arguments (all of
           which will be ignored though). The purpose of this is to allow the
           caller to unpack a dictionary whose element set is a superset of
           the required arguments by this function.

    @return A composite value which is __logregr_simple_result defined in simple_logistic.sql_in
    """
    return __logregr_train_compute(
        schema_madlib, source_table, out_table, dependent_varname,
        independent_varname, max_iter, tolerance,
        verbose, **kwargs)

...

 

2. Iterative procedures in plpy

The iterative logic is implemented using the PL/Python procedural language. At the beginning of simple_logistic.py_in we import a Python module called plpy, which provides several functions to execute database commands. Implementing the iterative logic using plpy is simple, as demonstrated below:

 

Code Block
languagepy
update_plan = plpy.prepare(
    """
    SELECT
        {schema_madlib}.__logregr_simple_step(
            ({dep_col})::boolean,
            ({ind_col})::double precision[],
            ($1))
    FROM {tbl_source}
    """.format(
        tbl_output=tbl_output,
        schema_madlib=schema_madlib,
        dep_col=dep_col,
        ind_col=ind_col,
        tbl_source=tbl_source), ["double precision[]"])

state = None
for it in range(0, max_iter):
    res_tuple = plpy.execute(update_plan, [state])
    state = res_tuple[0].values()[0]

 

...

  • The __logregr_simple_step is a UDA defined in simple_logistic.sql_in and implemented using transition, merge and final functions provided in C++ files in ./src/modules/hello_world.
  • __logregr_simple_step takes three arguments: the target, the features and the previous state.
  • The state is initialized as None, which is interpreted as a null value in SQL by plpy, as the single-step query sketched after this list illustrates.
  • A more sophisticated iterative scheme for logistic regression would also include optimality verification and convergence guarantee procedures, which are neglected on purpose here for simplicity. For a production-level implementation of logistic regression, refer to the module regress.
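As a concrete illustration of the None state, the first iteration of the loop above amounts to issuing a query of roughly the following shape, with a SQL NULL where Python passed None (table and column names as in the patients example):

Code Block
languagesql
-- One manual invocation of the step UDA with a NULL previous state,
-- mirroring what plpy.execute(update_plan, [None]) runs on the first pass.
SELECT madlib.__logregr_simple_step(
           (second_attack)::boolean,
           (ARRAY[1, treatment, trait_anxiety])::double precision[],
           (NULL)::double precision[])
FROM patients;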

3. Running the new iterative module

The example below demonstrates the usage of madlib.logregr_simple_train on the patients table we used earlier. The trained classification model is stored in the table called logreg_mdl and can be viewed using a standard SQL query.

...

Code Block
languagesql
SELECT madlib.logregr_simple_train( 
    'patients',                                 -- source table
    'logreg_mdl',                               -- output table
    'second_attack',                            -- labels
    'ARRAY[1, treatment, trait_anxiety]');      -- features
SELECT * FROM logreg_mdl;

-- ************ --
--    Result    --
-- ************ --
+--------------------------------------------------+------------------+
| coef                                             |   log_likelihood |
|--------------------------------------------------+------------------|
| [-6.27176619714, -0.84168872422, 0.116267554551] |         -9.42379 |
+--------------------------------------------------+------------------+
 

...

The files for the above exercise can be found in the hello_world folder of the source code repository.