...

  1. Install MADlib
  2. Module Anatomy Explained
  3. Adding a New Module
  4. Adding an Iterative UDF

Installation

...

Code Block
languagecpp
AnyType
avg_var_transition::run(AnyType& args) {
    // get current state value
    AvgVarTransitionState<MutableArrayHandle<double> > state = args[0];

    // get current row value
    double x = args[1].getAs<double>();
    double d = (x - state.avg);

    // online update mean
    state.avg += d / static_cast<double>(state.numRows + 1);
    double new_d = (x - state.avg);
    double a = static_cast<double>(state.numRows) / static_cast<double>(state.numRows + 1);

    // online update variance
    state.var = state.var * a + d * new_d / static_cast<double>(state.numRows + 1);
    state.numRows++;
    return state;
}

  • There are two arguments for avg_var_transition, as specified in avg_var.sql_in. The first is an array of SQL double type holding the current mean, variance, and number of rows traversed; the second is a double representing the current tuple value.
  • We will describe class AvgVarTransitionState later. Basically, it takes args[0], a SQL double array, converts the data to the appropriate C++ types, and stores them in the state instance.
  • Both the mean and the variance are updated in an online manner to avoid accumulating a large intermediate sum.
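
To see the online update in isolation, here is a minimal standalone sketch that applies the same arithmetic to a plain struct. RunningStats, update, and summarize are hypothetical names introduced only for illustration; they are not part of MADlib, and the sketch assumes the variance being tracked is the population variance (dividing by n), which is what the update above appears to compute.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical plain-struct stand-in for the transition state;
// this is not MADlib's AvgVarTransitionState.
struct RunningStats {
    double avg = 0.0;
    double var = 0.0;            // population variance (divides by n)
    std::uint64_t numRows = 0;
};

// One transition step, following the same update order as the function above.
inline void update(RunningStats& s, double x) {
    const double n1 = static_cast<double>(s.numRows + 1);
    const double d = x - s.avg;           // deviation from the old mean
    s.avg += d / n1;                      // online mean update
    const double new_d = x - s.avg;       // deviation from the new mean
    const double a = static_cast<double>(s.numRows) / n1;
    s.var = s.var * a + d * new_d / n1;   // online variance update
    ++s.numRows;
}

// Convenience helper: fold every value of a vector through update().
inline RunningStats summarize(const std::vector<double>& xs) {
    RunningStats s;
    for (double x : xs) update(s, x);
    return s;
}
```

Calling summarize({2, 4, 4, 4, 5, 5, 7, 9}) walks the values one at a time and never stores a running sum of squares, which is the point of the online formulation.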

Merge function

Code Block
languagecpp
AnyType
avg_var_merge_states::run(AnyType& args) {
    AvgVarTransitionState<MutableArrayHandle<double> > stateLeft = args[0];
    AvgVarTransitionState<ArrayHandle<double> > stateRight = args[1];

    // Merge states together and return
    stateLeft += stateRight;
    return stateLeft;
}
  • Again, the arguments contained in AnyType& args are defined in avg_var.sql_in.
  • The details are hidden in a method of class AvgVarTransitionState, which overloads operator+=.

Final function

 
Code Block
languagecpp
AnyType
avg_var_final::run(AnyType& args) {
    AvgVarTransitionState<MutableArrayHandle<double> > state = args[0];

    // If we haven't seen any data, just return Null. This is the standard
    // behavior of aggregate function on empty data sets (compare, e.g.,
    // how PostgreSQL handles sum or avg on empty inputs)
    if (state.numRows == 0)
        return Null();

    return state;
}
  • Class AvgVarTransitionState overloads the AnyType() conversion operator, so we can directly return state, an instance of AvgVarTransitionState, even though the function is expected to return an AnyType.
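
The mechanism behind returning state directly is an ordinary C++ user-defined conversion operator. The sketch below shows the idea with toy types; ToyAny, ToyState, and finalize are hypothetical names that only mimic the shape of the pattern and do not reflect how MADlib's AnyType is actually implemented.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a generic return type; not MADlib's AnyType.
struct ToyAny {
    std::vector<double> values;
};

// Hypothetical state class with a user-defined conversion to ToyAny.
class ToyState {
public:
    ToyState(double avg, double var, double numRows)
        : storage_{avg, var, numRows} {}

    // Conversion operator: lets a ToyState be used where a ToyAny is expected.
    operator ToyAny() const { return ToyAny{storage_}; }

private:
    std::vector<double> storage_;
};

// Declared to return ToyAny, yet the body returns the state object directly.
inline ToyAny finalize(const ToyState& state) {
    return state;  // implicit conversion through operator ToyAny()
}
```

The caller of finalize only ever sees a ToyAny; the conversion happens at the return statement without an explicit cast.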

...

Code Block
languagecpp
/**
 * @brief Merge with another State object
 *
 * We update mean and variance in an online fashion
 * to avoid a large intermediate sum.
 */
template <class OtherHandle>
AvgVarTransitionState &operator+=(
    const AvgVarTransitionState<OtherHandle> &inOtherState) {

    if (mStorage.size() != inOtherState.mStorage.size())
        throw std::logic_error("Internal error: Incompatible transition "
                               "states");
    double avg_ = inOtherState.avg;
    double var_ = inOtherState.var;
    // use a 64-bit count: a 16-bit cast would truncate above 65535 rows
    uint64_t numRows_ = static_cast<uint64_t>(inOtherState.numRows);
    double totalNumRows = static_cast<double>(numRows + numRows_);
    // weights of the two states in the combined data set
    double p = static_cast<double>(numRows) / totalNumRows;
    double p_ = static_cast<double>(numRows_) / totalNumRows;
    double totalAvg = avg * p + avg_ * p_;
    // offsets of each state's mean from the combined mean
    double a = avg - totalAvg;
    double a_ = avg_ - totalAvg;

    numRows += numRows_;
    var = p * var + p_ * var_ + p * a * a + p_ * a_ * a_;
    avg = totalAvg;
    return *this;
}

  • Given the mean, variance, and size of two data sets, Welford’s method computes the mean and variance of the two data sets combined.
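
The merge arithmetic can be checked outside the database with a small standalone sketch. Summary is a hypothetical plain struct used only for illustration, not MADlib's state class; the guard for two empty inputs is an addition of this sketch and is not present in the code above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical plain-struct stand-in; not MADlib's AvgVarTransitionState.
struct Summary {
    double avg;
    double var;              // population variance
    std::uint64_t numRows;

    // Pairwise merge using the same weighted formulas as the operator above.
    Summary& operator+=(const Summary& other) {
        const std::uint64_t total = numRows + other.numRows;
        if (total == 0) return *this;  // both empty: guard added by this sketch
        const double n = static_cast<double>(total);
        const double p = static_cast<double>(numRows) / n;
        const double p_ = static_cast<double>(other.numRows) / n;
        const double totalAvg = avg * p + other.avg * p_;
        const double a = avg - totalAvg;         // offset of this mean
        const double a_ = other.avg - totalAvg;  // offset of the other mean
        var = p * var + p_ * other.var + p * a * a + p_ * a_ * a_;
        avg = totalAvg;
        numRows = total;
        return *this;
    }
};
```

For example, merging the summary of {2, 4, 4, 4} (mean 3.5, variance 0.75) with that of {5, 5, 7, 9} (mean 6.5, variance 2.75) gives mean 5 and variance 4, matching a direct computation over all eight values.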

...