This guide explains all of the elements needed to successfully develop and plug in a new MADlib® module.

Install MADlib by following the steps in the Installation Guide for MADlib or use the Docker image instructions below.

MADlib source code is organized such that the core logic of a machine learning or statistical module is located in a common location, and the database-port specific code is located in a ports folder. Since all currently supported databases are based on Postgres, the postgres port contains all the port-specific files, with greenplum and hawq inheriting from it. Before proceeding with this guide, it is recommended that you familiarize yourself with the MADlib module anatomy.

Docker Image

We provide a Docker image with the dependencies required to compile and test MADlib on PostgreSQL 9.6. You can view the dependency Dockerfile at ./tool/docker/base/Dockerfile_postgres_9_6. The image is hosted on Docker Hub at madlib/postgres_9.6:latest. Later we will provide a similar Docker image for Greenplum Database.

Some useful commands for working with the Docker image:

Code Block

language	text

## 1) Pull down the `madlib/postgres_9.6:latest` image from docker hub:
docker pull madlib/postgres_9.6:latest
## 2) Launch a container corresponding to the MADlib image, mounting the source code folder to the container:
docker run -d -it --name madlib -v (path to incubator-madlib directory):/incubator-madlib/ madlib/postgres_9.6
# where incubator-madlib is the directory where the MADlib source code resides.
############################################## * WARNING * ##################################################
# Please be aware that when mounting a volume as shown above, any changes you make in the "incubator-madlib" 
# folder inside the Docker container will be reflected on your local disk (and vice versa). This means that
# deleting data in the mounted volume from a Docker container will delete the data from your local disk also.
#############################################################################################################
## 3) When the container is up, connect to it and build MADlib:
docker exec -it madlib bash
mkdir /incubator-madlib/build-docker
cd /incubator-madlib/build-docker
cmake ..
make
make doc
make install
## 4) Install MADlib:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install
## 5) Several other madpack commands can now be run:
# Run install check, on all modules:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install-check
# Run install check, on a specific module, say svm:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install-check -t svm
# Run dev check, on all modules (more comprehensive than install check):
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres dev-check
# Run dev check, on a specific module, say svm:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres dev-check -t svm
# Reinstall MADlib:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres reinstall
## 6) Kill and remove containers (after exiting the container):
docker kill madlib
docker rm madlib

Adding A New Module

...

Register the module.
Define the SQL functions.
Implement the functions in C++.
Register the C++ header files.

The files for this exercise can be found in the hello world folder of the source code repository.

1. Register the module

Add the following line to the file called Modules.yml under ./src/config/

...

Code Block

language	cpp

    /**
     * @brief Update state with a new data point
     */
    AvgVarTransitionState &operator+=(const double x){
        double diff = (x - avg);
        double normalizer = static_cast<double>(numRows + 1);
        // online update mean
        this->avg += diff / normalizer;
        // online update variance
        double new_diff = (x - avg);
        double a = static_cast<double>(numRows) / normalizer;
        this->var = (var * a) + (diff * new_diff) / normalizer;
        // numRows is not updated here; the caller is assumed to increment it
        return *this;
    }
 
/**
 * @brief Merge with another State object
 *
 * We update the mean and variance in an online fashion
 * to avoid large intermediate sums.
 */
template <class OtherHandle>
AvgVarTransitionState &operator+=(
    const AvgVarTransitionState<OtherHandle> &inOtherState) {

    if (mStorage.size() != inOtherState.mStorage.size())
        throw std::logic_error("Internal error: Incompatible transition "
                               "states");
    double avg_ = inOtherState.avg;
    double var_ = inOtherState.var;
    uint64_t numRows_ = static_cast<uint64_t>(inOtherState.numRows);
    double totalNumRows = static_cast<double>(numRows + numRows_);
    double p = static_cast<double>(numRows) / totalNumRows;
    double p_ = static_cast<double>(numRows_) / totalNumRows;
    double totalAvg = avg * p + avg_ * p_;
    double a = avg - totalAvg;
    double a_ = avg_ - totalAvg;

    numRows += numRows_;
    var = p * var + p_ * var_ + p * a * a + p_ * a_ * a_;
    avg = totalAvg;
    return *this;
}
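
The two operators above can be checked outside the database with a minimal Python sketch. The names `transition`, `merge`, and `fold` are hypothetical stand-ins for illustration, not MADlib APIs, and the state is assumed to be a plain `(avg, var, numRows)` tuple. The sketch folds two chunks of a small sample separately, merges the partial states, and compares the result with a batch computation.

```python
import statistics


def transition(state, x):
    """Fold one value into (avg, var, num_rows) using the online update."""
    avg, var, num_rows = state
    diff = x - avg
    normalizer = num_rows + 1
    avg += diff / normalizer                      # online update of the mean
    new_diff = x - avg
    a = num_rows / normalizer
    var = var * a + diff * new_diff / normalizer  # online update of the variance
    return avg, var, num_rows + 1                 # the count is incremented here


def merge(state_a, state_b):
    """Combine two partial states without revisiting the rows."""
    avg_a, var_a, n_a = state_a
    avg_b, var_b, n_b = state_b
    total = n_a + n_b
    if total == 0:
        return state_a
    p_a, p_b = n_a / total, n_b / total
    total_avg = avg_a * p_a + avg_b * p_b
    d_a, d_b = avg_a - total_avg, avg_b - total_avg
    var = p_a * var_a + p_b * var_b + p_a * d_a * d_a + p_b * d_b * d_b
    return total_avg, var, total


def fold(values):
    """Run the transition over a chunk of rows, starting from an empty state."""
    state = (0.0, 0.0, 0)
    for x in values:
        state = transition(state, x)
    return state


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# Pretend the rows live on two segments, then merge the partial states.
merged = merge(fold(data[:3]), fold(data[3:]))
print(merged)
print(statistics.mean(data), statistics.pvariance(data))
```

Both code paths should agree with `statistics.pvariance`, that is, the variance tracked by this state divides by the row count rather than by the row count minus one.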

...

Now let's run an example using the new module. First, rebuild and reinstall MADlib according to the instructions in the Installation Guide. We use the patients dataset from the MADlib Quick Start Guide for Users for testing purposes. From the psql terminal, the result below shows that half of the 20 patients have had second heart attacks within 1 year (yes = 1):

Code Block

language	sql

SELECT madlib.avg_var(second_attack) FROM patients;

    -- ************ --
    --    Result    --
    -- ************ --
    +-------------------+
    | avg_var           |
    |-------------------|
    | [0.5, 0.25, 20.0] |
    +-------------------+
-- (average, variance, count) --
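
As a quick sanity check on the numbers above, the following sketch recomputes them in plain Python. It assumes only what the text states: 20 rows, half of them equal to 1. The list `second_attack` is a made-up stand-in for the column, and the row order is irrelevant to the result.

```python
import statistics

# Hypothetical stand-in for the second_attack column: 10 ones and 10 zeros.
second_attack = [1.0] * 10 + [0.0] * 10

avg = statistics.mean(second_attack)
var = statistics.pvariance(second_attack)        # divides by n
sample_var = statistics.variance(second_attack)  # divides by n - 1

print([avg, var, float(len(second_attack))])
print(sample_var)
```

The population variance matches the 0.25 in the result, while the sample variance is slightly larger. This is consistent with the online update dividing by the number of rows.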

...

Adding An Iterative UDF

...

Compared to the steps presented in the previous section, here we do not need to modify the Modules.yml file because we are not creating a new module. Another difference is that we create an additional .py_in Python file along with the .sql_in file. That is where most of the iterative logic will be implemented.

The files for this exercise can be found in the hello world folder of the source code repository. Please note that __init__.py_in is not included in this folder as an empty file will be sufficient for the purposes of this exercise.

1. Overview

The overall logic is split into three parts. All the UDFs and UDAs are defined in simple_logistic.sql_in. The transition, merge, and final functions are implemented in C++; together they constitute the UDA called __logregr_simple_step, which takes one step from the current state to decrease the logistic regression objective. Finally, in simple_logistic.py_in, the plpy package is used to implement a Python UDF called logregr_simple_train, which invokes __logregr_simple_step iteratively until convergence.
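
The division of labor described above can be sketched in plain Python, without a database or plpy. Everything here is an illustrative assumption: the function names, the state layout, the toy data, and in particular the optimizer. A fixed-step gradient ascent is swapped in as a stand-in because this overview does not say which update rule `__logregr_simple_step` uses.

```python
import math


def transition(state, x, y, coef):
    """Fold one row into (gradient, log-likelihood, row count)."""
    grad, ll, n = state
    z = sum(c * xi for c, xi in zip(coef, x))
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [g + (y - p) * xi for g, xi in zip(grad, x)]
    ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return grad, ll, n + 1


def merge(a, b):
    """Combine the partial states computed on two segments."""
    return [ga + gb for ga, gb in zip(a[0], b[0])], a[1] + b[1], a[2] + b[2]


def final(state, coef, step_size):
    """Turn the merged state into new coefficients (gradient ascent stand-in)."""
    grad, ll, n = state
    return [c + step_size * g / n for c, g in zip(coef, grad)], ll


def step(segments, coef, step_size=0.5):
    """Plays the role of the aggregate: transition per row, merge, then final."""
    partials = []
    for rows in segments:
        state = ([0.0] * len(coef), 0.0, 0)
        for x, y in rows:
            state = transition(state, x, y, coef)
        partials.append(state)
    total = partials[0]
    for other in partials[1:]:
        total = merge(total, other)
    return final(total, coef, step_size)


def train(segments, n_features, tol=1e-8, max_iter=10000):
    """Plays the role of the Python driver: call step until convergence."""
    coef = [0.0] * n_features
    history = []  # log-likelihood of the coefficients going into each step
    for _ in range(max_iter):
        coef, ll = step(segments, coef)
        converged = bool(history) and abs(ll - history[-1]) < tol
        history.append(ll)
        if converged:
            break
    return coef, history


# Toy rows of ([intercept, feature], label), split across two "segments".
SEGMENTS = [
    [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)],
    [([1.0, 2.0], 1), ([1.0, -1.0], 1), ([1.0, 1.0], 0)],
]
coef, history = train(SEGMENTS, 2)
print(coef, history[-1], len(history))
```

The driver never touches individual rows. It only holds the current coefficients and decides when to stop, which is the part that is awkward to express in SQL alone.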

...

The example below demonstrates the usage of madlib.logregr_simple_train on the patients table we used earlier. The trained classification model is stored in the table called logreg_mdl and can be viewed using a standard SQL query.

Code Block

language	sql

SELECT madlib.logregr_simple_train( 
    'patients',                                 -- source table
    'logreg_mdl',                               -- output table
    'second_attack',                            -- labels
    'ARRAY[1, treatment, trait_anxiety]');      -- features
SELECT * FROM logreg_mdl;

-- ************ --
--    Result    --
-- ************ --
+--------------------------------------------------+------------------+
| coef                                             |   log_likelihood |
|--------------------------------------------------+------------------|
| [-6.27176619714, -0.84168872422, 0.116267554551] |         -9.42379 |
+--------------------------------------------------+------------------+

...
