Introduction
Naming Conventions
- 2.1. Schema
- 2.2. Functions & and Aggregates
Functions and Languages
Function Name Overloading
Guide to Driver UDFs
- 5.1. Input Definition
- 5.2. Output Definition
- 5.3. Logging
- 5.4. Parameter Validation
- 5.5. Multi-User and Multi-Session Execution
Support Modules
- 6.1. DB connectivity moduleConnectivity: plpy.py
- 6.2. Python abstraction layerAbstraction Layer

Anchor
Introduction
Introduction
1. Introduction

...

This is applicable to all database objects (tables, views, functions, function parameters, data types, operators, etc).

Anchor
Schema
Schema
2.1. Schema

All database objects should be created in the default MADlib schema. Use MADLIB_SCHEMA as the schema prefix for your tables/views/functions/etc. in any scripts. This literal will be replaced during the installation with the target schema name (configured by the user in Config.yml). Code examples below use prefix madlib for illustration purposes only.

Anchor
Functions and Aggregates
Functions and Aggregates
2.2. Functions

...

and Aggregates

All non-user facing routines should be named with a "__" (double underscore) prefix to make the catalog easier to read.

...

Code Block

language	sql

madlib.argmax (integer key, float8 value)

Anchor
Functions and Languages
Functions and Languages
3. Functions and Languages

To simplify this guide, we'd like to introduce three categories of user-defined functions:

UDAs - user-defined aggregates, which perform a single scan of the data source and return an aggregated value for a group of rows. All UDA component functions should be written in PL/C (C/C++) for performance and portability reasons.
Row Level UDFs - functions that operate on their arguments only and do not dispatch any SQL statements. These functions generate a result for each argument set, or for each tuple they are executed on. Recommended language is the same as for UDAs.
Driver UDFs - functions that usually drive an execution of an algorithm, and may perform multiple SQL operations including data modification. In order to make this part of the code portable we suggest using PL/Python wrapper functions based on plain Python modules. The DB access inside the Python modules should be implemented using "classic" PyGreSQL interface (http://www.pygresql.org/pg.html).

Anchor
Function Name Overloading
Function Name Overloading
4. Function Name Overloading

The suggestions below on name overloading apply to all the above-mentioned types of user-defined functions.

...

If ANYELEMENT/ANYARRAY functionality is not available or not feasible, function name overloading can be used for different argument data types. For example, function F1 from module M1 can have the following versions:

TEXT data type example:

Code Block

language	sql

madlib.m1_f1( arg1 TEXT)

...

Code Block

language	sql

madlib.m1_f1( reqarg1, ..., reqargN, optarg1, ..., optargN)

Anchor

...

Guide to Driver UDFs
Guide to Driver UDFs
5. Guide to Driver UDFs

Should follow the Should follow the naming conventions described in Section 2.
Should follow the function overloading rules as described in Section 4. On Greenplum and PostgreSQL this can be achieved via PL/Python wrapper UDFs based on the same main Python code.

Anchor
Input Definition
Input Definition
5.1. Input Definition

Parameters of the execution should be supplied directly in the function call (as opposed to passing a reference ID to a parameter-set stored in a table). For example:

...

The input relation and its attributes needed by the function should be validated using primitive functions from the helper.py module. See Section 5.4 for more information.

Anchor
Output Definition
Output Definition
5.2. Output Definition

Returning Simple Results or Models

...

Returning a model (Linear Regression):

Code Block

language	sql

SELECT mregr_coef(price, array[1, bedroom, bath, size]) from houses;
             mregr_coef             
------------------------------------
 {27923.4,-35524.8,2269.34,130.794}

...

If a particular method returns a complex model that is represented in multiple rows it should be saved into a table with a pre-defined structure. The name of this table (including target schema) should be passed in the function call as an argument.

Example ([[Decision Tree|http://doc.madlib.net/groupgrpdectree.html]]):

...

Code Block

language	sql

SELECT * FROM madlib.dtree_train( 'user_schema.user_table', 3, 10);  
 output_table 
------------------------
 user_schema.user_table

...

Code Block

language	sql

CREATE OR REPLACE FUNCTION madlib.kmeans_dummy()
  RETURNS SETOF madlib.kmeans_result
AS $$
  return ( [ 10, 100.0, 'my_schema', 'my_centroids', 'my_points' ]);
$$ LANGUAGE plpythonu;

Anchor
Logging
Logging
5.3. Logging

ERROR

If a function encounters a problem it should raise an error using the plpy.error( message) function (see section 6.1). This will ensure the proper end of the execution and error propagation to the calling environment.

...

Code Block

language	sql

SQL> select madlib.kmeans_run( 'my_schema.data_set_1', 10, 1, 'run1', 'my_schema', 1);
INFO: Parameters:
INFO:  * k = 10 (number of centroids)
INFO:  * input_table = my_schema.data_set_1
INFO:  * goodness = 1 (GOF test on)
INFO:  * run_id = run1
INFO:  * output_schema = my_schema
INFO:  * verbose = 1 (on)
INFO: Seeding 10 centroids...
INFO: Using sample data set for analysis... (9200 out of 10000 points)
INFO: ...Iteration 1
INFO: ...Iteration 2
INFO: Exit reason: fraction of reassigned nodes is smaller than the limit: 0.001
INFO: Expanding cluster assignment to all points...
INFO: Calculating goodness of fit...
...

...

Anchor
Parameter Validation
Parameter Validation
5.4. Parameter Validation

Parameter validation should be performed in each function to avoid any preventable errors.

...

For table/view and column arguments please see Section 6.2 (describing usage of the helper.py module).

Anchor
Multi-User and Multi-Session Execution
Multi-User and Multi-Session Execution
5.5. Multi-User and Multi-Session Execution

In order to avoid unpleasant situations of over-writing or deleting results, MADlib functions should be ready for execution in multi-session or multi-user environment. Hence the following requirements should be met:

Input relations (tables or views) should be used for read only purposes.
Any user output table given as an argument must not overwrite an existing database relation. In such case an error should be returned.
Any execution specific tables should be locked in EXCLUSIVE MODE after creation. This functionality will be implemented inside the Python abstraction layer. There is no need to release LOCKS as they will persist anyway until the end of the main UDF.

Anchor
Support Modules
Support Modules
6. Support Modules

A set of Python modules to make MADlib development easier.

Anchor
DB Connectivity
DB Connectivity
6.1. DB

...

Connectivity: plpy.py

This module serves as the database access layer. Even though currently not used this module will provide easy portability between various MADlib platforms and interfaces. To clarify: PostrgreSQL PL/Python language currently uses an internal plpy.py module to implement seamless DB access (using "classic" PyGreSQL interface - see http://www.pygresql.org/pg.html). By adding a MADlib version of plpy.py we'll be able to more easily port code written for MADlib.

Currently implemented functionality:

Code Block

language	py

   def connect ( dbname, host, port, user, passwd)
   def close()
   def execute( sql)
   def info( msg)
   def error( msg)

Anchor
Python Abstraction Layer
Python Abstraction Layer
6.2. Python

...

Abstraction Layer

This module consists of a set of functions to support common data validation and database object management tasks.

...

Page tree

Versions Compared

Old Version 4

New Version Current

Key

Anchor
Introduction
Introduction
1. Introduction

Anchor
Schema
Schema
2.1. Schema

Anchor
Functions and Aggregates
Functions and Aggregates
2.2. Functions

and Aggregates

Anchor
Functions and Languages
Functions and Languages
3. Functions and Languages

Anchor
Function Name Overloading
Function Name Overloading
4. Function Name Overloading

Anchor

Guide to Driver UDFs
Guide to Driver UDFs
5. Guide to Driver UDFs

Anchor
Input Definition
Input Definition
5.1. Input Definition

Anchor
Output Definition
Output Definition
5.2. Output Definition

Anchor
Logging
Logging
5.3. Logging

Anchor
Parameter Validation
Parameter Validation
5.4. Parameter Validation

Anchor
Multi-User and Multi-Session Execution
Multi-User and Multi-Session Execution
5.5. Multi-User and Multi-Session Execution

Anchor
Support Modules
Support Modules
6. Support Modules

Anchor
DB Connectivity
DB Connectivity
6.1. DB

Connectivity: plpy.py

Anchor
Python Abstraction Layer
Python Abstraction Layer
6.2. Python

Abstraction Layer

Page tree

Page History

Versions Compared

Old Version 4

New Version Current

Key

AnchorIntroductionIntroduction1. Introduction

AnchorSchemaSchema2.1. Schema

AnchorFunctions and AggregatesFunctions and Aggregates2.2. Functions

and Aggregates

AnchorFunctions and LanguagesFunctions and Languages3. Functions and Languages

AnchorFunction Name OverloadingFunction Name Overloading4. Function Name Overloading

Anchor

Guide to Driver UDFsGuide to Driver UDFs5. Guide to Driver UDFs

AnchorInput DefinitionInput Definition5.1. Input Definition

AnchorOutput DefinitionOutput Definition5.2. Output Definition

AnchorLoggingLogging5.3. Logging

AnchorParameter ValidationParameter Validation5.4. Parameter Validation

AnchorMulti-User and Multi-Session ExecutionMulti-User and Multi-Session Execution5.5. Multi-User and Multi-Session Execution

AnchorSupport ModulesSupport Modules6. Support Modules

AnchorDB ConnectivityDB Connectivity6.1. DB

Connectivity: plpy.py

AnchorPython Abstraction LayerPython Abstraction Layer6.2. Python

Abstraction Layer

Anchor
Introduction
Introduction
1. Introduction

Anchor
Schema
Schema
2.1. Schema

Anchor
Functions and Aggregates
Functions and Aggregates
2.2. Functions

Anchor
Functions and Languages
Functions and Languages
3. Functions and Languages

Anchor
Function Name Overloading
Function Name Overloading
4. Function Name Overloading

Guide to Driver UDFs
Guide to Driver UDFs
5. Guide to Driver UDFs

Anchor
Input Definition
Input Definition
5.1. Input Definition

Anchor
Output Definition
Output Definition
5.2. Output Definition

Anchor
Logging
Logging
5.3. Logging

Anchor
Parameter Validation
Parameter Validation
5.4. Parameter Validation

Anchor
Multi-User and Multi-Session Execution
Multi-User and Multi-Session Execution
5.5. Multi-User and Multi-Session Execution

Anchor
Support Modules
Support Modules
6. Support Modules

Anchor
DB Connectivity
DB Connectivity
6.1. DB

Anchor
Python Abstraction Layer
Python Abstraction Layer
6.2. Python