This guide explains all of the elements needed to successfully develop and plug in a new MADlib® module.
- Prerequisites
- Docker Image
- Adding a New Module
- Adding an Iterative UDF
...
The MADlib source code is organized so that the core logic of a machine learning or statistical module lives in a common location, while the database-port-specific code lives in a ports folder. Since all currently supported databases are based on Postgres, the postgres port contains all the port-specific files, with greenplum and hawq inheriting from it. Before proceeding with this guide, it is recommended that you familiarize yourself with the MADlib module anatomy.
...
...
Docker Image
We provide a Docker image with the dependencies required to compile and test MADlib on PostgreSQL 9.6. You can view the Dockerfile at ./tool/docker/base/Dockerfile_postgres_9_6. The image is hosted on Docker Hub at madlib/postgres_9.6:latest. A similar Docker image for Greenplum Database will be provided later.
...
Code Block |
---|
|
## 1) Pull down the `madlib/postgres_9.6:latest` image from docker hub:
docker pull madlib/postgres_9.6:latest
## 2) Launch a container corresponding to the MADlib image, mounting the source code folder to the container:
docker run -d -it --name madlib -v (path to incubator-madlib directory):/incubator-madlib/ madlib/postgres_9.6
# where incubator-madlib is the directory where the MADlib source code resides.
############################################## * WARNING * ##################################################
# Please be aware that when mounting a volume as shown above, any changes you make in the "incubator-madlib"
# folder inside the Docker container will be reflected on your local disk (and vice versa). This means that
# deleting data in the mounted volume from a Docker container will delete the data from your local disk also.
#############################################################################################################
## 3) When the container is up, connect to it and build MADlib:
docker exec -it madlib bash
mkdir /incubator-madlib/build-docker
cd /incubator-madlib/build-docker
cmake ..
make
make doc
make install
## 4) Install MADlib:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install
## 5) Several other madpack commands can now be run:
# Run install check, on all modules:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install-check
# Run install check, on a specific module, say svm:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install-check -t svm
# Run dev check, on all modules (more comprehensive than install check):
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres dev-check
# Run dev check, on a specific module, say svm:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres dev-check -t svm
# Reinstall MADlib:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres reinstall
## 6) Kill and remove containers (after exiting the container):
docker kill madlib
docker rm madlib |
...
Code Block |
---|
|
/**
* @brief Update state with a new data point
*/
AvgVarTransitionState &operator+=(const double x){
double diff = (x - avg);
double normalizer = static_cast<double>(numRows + 1);
// online update mean
this->avg += diff / normalizer;
// online update variance
double new_diff = (x - avg);
double a = static_cast<double>(numRows) / normalizer;
this->var = (var * a) + (diff * new_diff) / normalizer;
// note: this operator leaves numRows unchanged
return *this;
}
/**
* @brief Merge with another State object
*
* We update mean and variance in an online fashion
* to avoid large intermediate sums.
*/
template <class OtherHandle>
AvgVarTransitionState &operator+=(
const AvgVarTransitionState<OtherHandle> &inOtherState) {
if (mStorage.size() != inOtherState.mStorage.size())
throw std::logic_error("Internal error: Incompatible transition "
"states");
double avg_ = inOtherState.avg;
double var_ = inOtherState.var;
uint64_t numRows_ = static_cast<uint64_t>(inOtherState.numRows);
double totalNumRows = static_cast<double>(numRows + numRows_);
double p = static_cast<double>(numRows) / totalNumRows;
double p_ = static_cast<double>(numRows_) / totalNumRows;
double totalAvg = avg * p + avg_ * p_;
double a = avg - totalAvg;
double a_ = avg_ - totalAvg;
numRows += numRows_;
var = p * var + p_ * var_ + p * a * a + p_ * a_ * a_;
avg = totalAvg;
return *this;
} |
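The two updates above can be tried outside the database. The following is a minimal sketch, not the MADlib class itself: `RunningStats`, `accumulate_rows`, and `check_merge_matches_sequential` are hypothetical names, the state uses plain members instead of handle-backed storage, and the per-row operator increments `numRows` itself (an assumption made here so that the sketch is self-contained).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the transition state: plain members instead of
// the array-backed storage used by the real class.
struct RunningStats {
    double avg = 0.0;      // running mean
    double var = 0.0;      // running population variance
    uint64_t numRows = 0;  // number of rows seen so far

    // Fold one data point into the state. Unlike the snippet above, this
    // sketch also increments numRows here.
    RunningStats &operator+=(double x) {
        double diff = x - avg;
        double normalizer = static_cast<double>(numRows + 1);
        avg += diff / normalizer;               // online update of the mean
        double new_diff = x - avg;
        double a = static_cast<double>(numRows) / normalizer;
        var = var * a + (diff * new_diff) / normalizer;  // online variance
        ++numRows;
        return *this;
    }

    // Merge a state that was computed on another portion of the data.
    RunningStats &operator+=(const RunningStats &other) {
        if (other.numRows == 0) return *this;   // nothing to merge
        double total = static_cast<double>(numRows + other.numRows);
        double p = static_cast<double>(numRows) / total;
        double p_ = static_cast<double>(other.numRows) / total;
        double totalAvg = avg * p + other.avg * p_;
        double a = avg - totalAvg;
        double a_ = other.avg - totalAvg;
        numRows += other.numRows;
        var = p * var + p_ * other.var + p * a * a + p_ * a_ * a_;
        avg = totalAvg;
        return *this;
    }
};

// Accumulate a whole vector of values into a fresh state.
RunningStats accumulate_rows(const std::vector<double> &xs) {
    RunningStats s;
    for (double x : xs) s += x;
    return s;
}

// Self-check: merging two partial states should match one sequential pass.
void check_merge_matches_sequential() {
    RunningStats all = accumulate_rows({1.0, 2.0, 3.0, 4.0});
    RunningStats left = accumulate_rows({1.0, 2.0});
    left += accumulate_rows({3.0, 4.0});
    assert(left.numRows == all.numRows);
    assert(std::fabs(left.avg - all.avg) < 1e-12);
    assert(std::fabs(left.var - all.var) < 1e-12);
}
```

Calling `check_merge_matches_sequential()` from a test or a `main` function exercises both operators. The property it checks is the one a merge function needs: combining partial states gives the same result as a single pass over all rows.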
...
Code Block |
---|
|
SELECT madlib.avg_var(second_attack) FROM patients;
-- ************ --
-- Result --
-- ************ --
+-------------------+
| avg_var |
|-------------------|
| [0.5, 0.25, 20.0] |
+-------------------+
-- (average, variance, count) -- |
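As a quick sanity check on this result, and assuming second_attack is a 0/1 column (it is used as a class label later in this guide), the population variance of a binary column follows directly from its mean, because $x_i^2 = x_i$ for every row:

```latex
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2
         = \bar{x} - \bar{x}^2
         = \bar{x}\,(1 - \bar{x})
         = 0.5 \times (1 - 0.5)
         = 0.25
```

This agrees with the reported variance, which is consistent with the online update computing the population variance (dividing by $n$) rather than the sample variance (dividing by $n - 1$).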
...
Adding an Iterative UDF
...
The example below demonstrates the usage of madlib.logregr_simple_train on the patients table we used earlier. The trained classification model is stored in the table called logreg_mdl and can be viewed using a standard SQL query.
Code Block |
---|
|
SELECT madlib.logregr_simple_train(
'patients', -- source table
'logreg_mdl', -- output table
'second_attack', -- labels
'ARRAY[1, treatment, trait_anxiety]'); -- features
SELECT * FROM logreg_mdl;
-- ************ --
-- Result --
-- ************ --
+--------------------------------------------------+------------------+
| coef | log_likelihood |
|--------------------------------------------------+------------------|
| [-6.27176619714, -0.84168872422, 0.116267554551] | -9.42379 |
+--------------------------------------------------+------------------+ |
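To illustrate how the coef column can be read, here is a minimal sketch that turns a coefficient vector into a predicted probability. It assumes the standard logistic model, in which the probability of the positive class is the sigmoid of the dot product between the coefficients and the feature vector. The function names `sigmoid` and `predict_prob` are hypothetical and are not part of MADlib.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic (sigmoid) function.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Predicted probability of the positive class for one feature vector.
// `coef` and `features` must have the same length and the same ordering,
// e.g. [1, treatment, trait_anxiety] as in the training call above.
double predict_prob(const std::vector<double> &coef,
                    const std::vector<double> &features) {
    assert(coef.size() == features.size());
    double z = 0.0;
    for (std::size_t i = 0; i < coef.size(); ++i) {
        z += coef[i] * features[i];  // linear predictor
    }
    return sigmoid(z);
}
```

For example, `predict_prob({-6.27176619714, -0.84168872422, 0.116267554551}, {1.0, 1.0, 70.0})` scores a made-up patient with treatment = 1 and trait_anxiety = 70. The leading 1.0 matches the intercept term in the feature array.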
...