Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

MADlib source code is organized such that the core logic of a machine learning or statistical module is located in a common location and the database-port specific code is located in a ports folder. Since  Since all currently supported databases are based on Postgres, the postgres port contains all the port-specific files, with greenplum and hawq inheriting from it. Before  Before proceeding with this guide, it is recommended that you familiarize yourself with the [[Module anatomy| MADlib Module Anatomy]] MADlib architecture.

Anchor
Adding New Module
Adding New Module
Adding A New Module

Let us add a new module called hello_world. Inside this module we implement a User-Defined SQL Aggregate (UDA), called avg_var which computes the mean and variance for a given numerical column of a table. We We'll implement a distributed version of Welford's online algorithm for computing the mean and variance.

Unlike an ordinary UDA in PostgreSQL, avg_var will also work on a distributed database and take advantage of the underlying distributed network for parallel computations. The  The usage of avg_var is very simple: users simply run the following command in psql:

Code Block
languagesql
sql select avg_var(bath) from houses

which will print three numbers on the screen: the mean, variance and number of rows in column bath of table table houses.

Below are the main steps we will go through in this guide:

  1. register Register the module.
  2. define Define the sql SQL functions.
  3. implement Implement the functions in C++.
  4. register Register the C++ header files.

1. Register the module

Add the following line to the file called Modules.yml under ./src/config/ yaml

...

Create file avg_var.sql_in under folder ./src/ports/postgres/modules/hello_world. Inside  Inside this file we define the aggregate function and other helper functions for computing mean and variance. The actual implementations of those functions will be in separate C++ files which we will describe in the next section.

At the beginning of file avg_var.sql_in the command m4_include(SQLCommon.m4')` is necessary to run the m4 macro processor. M4  M4 is used to add platform-specific commands in the SQL definitions and is run while deploying MADlib to the database.

We define the aggregate function avg_var using built-in PostgreSQL command CREATE AGGREGATE.

Code Block
languagesql
DROP AGGREGATE IF EXISTS MADLIB_SCHEMA.avg_var(DOUBLE PRECISION);
CREATE AGGREGATE MADLIB_SCHEMA.avg_var(DOUBLE PRECISION) (
    SFUNC=MADLIB_SCHEMA.avg_var_transition,
    STYPE=double precision[],
    FINALFUNC=MADLIB_SCHEMA.avg_var_final,
    m4_ifdef(`__POSTGRESQL__', `', `PREFUNC=MADLIB_SCHEMA.avg_var_merge_states,')
    INITCOND='{0, 0, 0}'
);

...