...
MADlib source code is organized such that the core logic of a machine learning or statistical module is located in a common location and the database-port specific code is located in a ports
folder. Since Since all currently supported databases are based on Postgres, the postgres
port contains all the port-specific files, with greenplum
and hawq
inheriting from it. Before Before proceeding with this guide, it is recommended that you familiarize yourself with the [[Module anatomy| MADlib Module Anatomy]] MADlib architecture.
Anchor | ||||
---|---|---|---|---|
|
Let us add a new module called hello_world. Inside this module we implement a User-Defined SQL Aggregate (UDA), called avg_var which computes the mean and variance for a given numerical column of a table. We We'll implement a distributed version of Welford's online algorithm for computing the mean and variance.
Unlike an ordinary UDA in PostgreSQL, avg_var
will also work on a distributed database and take advantage of the underlying distributed network for parallel computations. The The usage of avg_var
is very simple: users simply run the following command in psql:
Code Block | ||
---|---|---|
| ||
sql select avg_var(bath) from houses |
which will print three numbers on the screen: the mean, variance and number of rows in column bath
of table table houses
.
Below are the main steps we will go through in this guide:
- register Register the module.
- define Define the sql SQL functions.
- implement Implement the functions in C++.
- register Register the C++ header files.
1. Register the module
Add the following line to the file called Modules.yml
under ./src/config/
yaml
...
Create file avg_var.sql_in
under folder ./src/ports/postgres/modules/hello_world
. Inside Inside this file we define the aggregate function and other helper functions for computing mean and variance. The actual implementations of those functions will be in separate C++ files which we will describe in the next section.
At the beginning of file avg_var.sql_in
the command m4_include(
SQLCommon.m4')` is necessary to run the m4 macro processor. M4 M4 is used to add platform-specific commands in the SQL definitions and is run while deploying MADlib to the database.
We define the aggregate function avg_var
using built-in PostgreSQL command CREATE AGGREGATE
.
Code Block | ||
---|---|---|
| ||
DROP AGGREGATE IF EXISTS MADLIB_SCHEMA.avg_var(DOUBLE PRECISION); CREATE AGGREGATE MADLIB_SCHEMA.avg_var(DOUBLE PRECISION) ( SFUNC=MADLIB_SCHEMA.avg_var_transition, STYPE=double precision[], FINALFUNC=MADLIB_SCHEMA.avg_var_final, m4_ifdef(`__POSTGRESQL__', `', `PREFUNC=MADLIB_SCHEMA.avg_var_merge_states,') INITCOND='{0, 0, 0}' ); |
...