This guide explains all of the elements needed to successfully develop and plug in a new MADlib module.
...
Install MADlib by following the steps in the Installation Guide for MADlib.
MADlib source code is organized such that the core logic of a machine learning or statistical module is located in a common location, and the database-port specific code is located in a ports folder. Since all currently supported databases are based on Postgres, the postgres port contains all the port-specific files, with greenplum and hawq inheriting from it. Before proceeding with this guide, it is recommended that you familiarize yourself with the MADlib architecture.
...
Let's add a new module called hello_world. Inside this module we implement a User-Defined SQL Aggregate (UDA) called avg_var, which computes the mean and variance for a given numerical column of a table. We'll implement a distributed version of Welford's online algorithm for computing the mean and variance.
Unlike an ordinary UDA in PostgreSQL, avg_var will also work on a distributed database and take advantage of the underlying distributed network for parallel computations. Using avg_var is simple; users run the following command in psql:
...
Below are the main steps we will go through in this guide:
- Register the module.
- Define the SQL functions.
- Implement the functions in C++.
- Register the C++ header files.
...
Add the following line to the file called Modules.yml under ./src/config/

    - name: hello_world
...
    DROP AGGREGATE IF EXISTS MADLIB_SCHEMA.avg_var(DOUBLE PRECISION);
    CREATE AGGREGATE MADLIB_SCHEMA.avg_var(DOUBLE PRECISION) (
        SFUNC=MADLIB_SCHEMA.avg_var_transition,
        STYPE=double precision[],
        FINALFUNC=MADLIB_SCHEMA.avg_var_final,
        m4_ifdef(`__POSTGRESQL__', `', `prefunc=MADLIB_SCHEMA.avg_var_merge_states,')
        INITCOND='{0, 0, 0}'
    );
We also define parameters passed to CREATE AGGREGATE:
...