Abstract
MADlib® is an open-source library (licensed under 2-clause BSD license) for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. The MADlib mission is to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open source development.
MADlib occupies a unique niche in the realm of data science and machine learning libraries since its SQL APIs can allow it to work on a wide range of data stores and SQL engines.
Proposal
The current open source community behind MADlib feels that aligning itself with HAWQ's community, governance model, infrastructure and roadmap will allow the project to accelerate adoption and community growth. Given HAWQ's trajectory of entering the Apache Software Foundation family as an Incubating project, we feel that the best course of action for MADlib is to follow a similar route.
MADlib and HAWQ are complementary technologies in that MADlib in-database analytical functions can run within the HAWQ execution engine. (MADlib also runs on Greenplum Database and PostgreSQL today.) It is expected that contributors to MADlib will be cognizant of the HAWQ ASF project and may contribute to it as well. In short, collaboration between the two communities will make both projects more vibrant and advance the respective technologies in potentially novel directions.
Contributors may also look at the HAWQ project as a starting point for ports to other parallel database engines. This proposal strongly encourages this type of work, as it would help to further realize the original cross-platform goal of MADlib as envisioned by its originators.
Thus, the goal of this proposal is to bring the existing MADlib open source community into ASF, change the project's governance model to the "Apache Way" and transition the project's codebase and infrastructure into ASF INFRA. The community has agreed to transfer the brand name "MADlib" to Apache Software Foundation as well.
Pivotal Inc., on behalf of the MADlib open source community, is submitting this proposal to transition source code and associated artifacts (documentation, web site content, wiki, etc.) to the Apache Software Foundation Incubator under the Apache License, Version 2.0, and is asking the Incubator PMC to establish a MADlib incubating project.
Background
MADlib grew out of discussions between database engine developers, data scientists, IT architects and academics interested in new approaches to scalable, sophisticated in-database analytics. These discussions were written up in a paper in VLDB 2009 that coined the term “MAD Skills” for data analysis (http://dl.acm.org/citation.cfm?id=1687576). The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and data scientists at Pivotal (formerly EMC/Greenplum).
The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the University of Wisconsin, and the University of Florida. The project was publicly documented in a paper at VLDB 2012 (http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf). Today MADlib has contributors from around the world including both individuals and institutions. For example, recent contributions have come from Pivotal, Stanford University, and the University of Illinois at Chicago.
MADlib was conceived from the outset as a free, open source library for all to use and contribute to. Since its inception, the community has steadily added new methods in the areas of mathematics, statistics, machine learning, and data transformation. The current library includes over 30 principal algorithms as well as many additional operators and utility functions.
The methods in MADlib are designed both for in-core and out-of-core execution, and for the shared-nothing, scale-out parallelism offered by modern parallel database engines, ensuring that computation is done close to the data. The core functionality is written in declarative SQL statements, which orchestrate data movement to and from disk, and across networked machines. Single-node inner loops take advantage of SQL extensibility to call out to high performance math libraries in user-defined scalar and aggregate functions. At the highest level, tasks that require iteration and/or structure definition are coded in Python driver routines, which are used only to kick off the data-rich computations that happen within the database engine.
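The layering described above can be illustrated with a minimal sketch. It is not MADlib code and does not use MADlib's API: it assumes SQLite (through Python's standard sqlite3 module) as a stand-in for a parallel database engine, and the names GradAgg, grad_agg, fit_slope and points are hypothetical. A user-defined aggregate plays the role of the inner loop, a SQL statement drives the scan over the data, and a Python driver routine only handles the iteration.

```python
import sqlite3


class GradAgg:
    """Hypothetical user-defined aggregate: mean gradient of the
    squared error 0.5 * (w*x - y)^2 with respect to w."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def step(self, x, y, w):
        # Inner loop: called by the engine once per row.
        self.total += (w * x - y) * x
        self.count += 1

    def finalize(self):
        # Called once after the last row; returns the aggregate value.
        return self.total / self.count if self.count else 0.0


def fit_slope(conn, iterations=200, learning_rate=0.05):
    """Driver routine: iterates in Python, but every pass over the
    data is a SQL aggregate query executed inside the engine."""
    w = 0.0
    for _ in range(iterations):
        # Only the scalar w goes in and one scalar gradient comes out;
        # the rows themselves never leave the database.
        (grad,) = conn.execute(
            "SELECT grad_agg(x, y, ?) FROM points", (w,)
        ).fetchone()
        w -= learning_rate * grad
    return w


conn = sqlite3.connect(":memory:")
conn.create_aggregate("grad_agg", 3, GradAgg)  # register the aggregate
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],  # toy data on the line y = 2x
)
slope = fit_slope(conn)
print(slope)
```

In this sketch the aggregate is written in Python for brevity. In the architecture described above, that layer would instead call out to high performance math libraries, and the engine would evaluate the aggregate in parallel across segments.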
The first platforms supported by MADlib were Greenplum Database and PostgreSQL. With the development of HAWQ SQL-on-Hadoop technology by Pivotal, MADlib offers a way to perform predictive analytics on very large data sets stored on a Hadoop cluster.
Today, MADlib is in active development and is deployed on a wide variety of industry and academic projects across many different verticals.
Rationale
Enterprises today are seeing the value of landing very large quantities of data in Hadoop clusters with the goal of improving their products and processes. With the proliferation of increasingly sophisticated SQL-on-Hadoop technologies such as HAWQ, analysts can use the familiar SQL language to query this data at scale. This effectively opens the door to Hadoop in the enterprise.
Adding SQL-based predictive analytics like MADlib to the equation enables organizations to reason across large data sets without resorting to sampling, which has been a traditional approach when confronted with scale problems. Operating on all of the data with MADlib results in more robust and accurate models.
Because MADlib exposes a SQL-based interface and SQL skills are ubiquitous in today's enterprises, organizations do not need to re-train their teams on an unfamiliar programming language.
Given the high velocity of innovation happening in the underlying Hadoop ecosystem, any SQL-based predictive analytics technology that plays in this ecosystem must be commensurately agile to keep up with the community. We strongly believe that in the Big Data space, this can be optimally achieved through a vibrant, diverse, self-governed community collectively innovating around a single codebase while at the same time cross-pollinating with various other data management communities. Apache Software Foundation is the ideal place to meet those ambitious goals.
Initial Goals
Our initial goals are to bring MADlib into the ASF, transition the engineering and governance processes to be in accordance with the "Apache Way" and foster a collaborative development model closely aligned with that of HAWQ.
Another important goal is encouraging efforts to port to other execution engines.
The MADlib project will continue developing new functionality in an open, community-driven way. We envision accelerating innovation under ASF governance, in order to meet the requirements of a wide variety of predictive analytics use cases.
We will also require transitioning of existing project infrastructure (source code, JIRA, mailing list) to the ASF infrastructure.
Current Status
Currently, the project is available at http://madlib.net/. The codebase is licensed under a 2-clause BSD license. Our current governance model could be described as a "benevolent dictator" one. As stated above, the existing MADlib community feels that closer alignment with the HAWQ community, infrastructure and governance model, as it is being proposed to the ASF, will allow the MADlib project to thrive far more than it would in relative isolation from HAWQ.
Meritocracy
Our proposed list of initial committers includes the current MADlib R&D team at Pivotal and existing active members of the open source project. This group will form a base for the broader community we will invite to collaborate on the codebase. We intend to radically expand the initial developer and user community by running the project in accordance with the "Apache Way". Users and new contributors will be treated with respect and welcomed. By participating in the community and providing quality patches/support that move the project forward, they will earn merit. They also will be encouraged to provide non-code contributions (documentation, events, community management, etc.) and will gain merit for doing so. Those with a proven support and quality track record will be encouraged to become committers.
Community
If MADlib is accepted for incubation, the primary initial goal will be transitioning the core community towards embracing the Apache Way of project governance. We would solicit major existing contributors to become committers on the project from the start.
Core Developers
MADlib core developers are skilled in working as part of openly governed communities. That said, most of the core developers are currently NOT affiliated with the ASF and would require new ICLAs before committing to the project.
Alignment
The following existing ASF projects can be considered when reviewing the MADlib proposal:
The Apache Mahout project's goal is to build an environment for quickly creating scalable, performant machine learning applications. Apache Mahout is, perhaps, the oldest machine learning library in the Hadoop ecosystem. The three major components of Mahout are an environment for building scalable algorithms, many new Scala + Spark (H2O in progress) algorithms, and Mahout's mature Hadoop MapReduce algorithms. We see the two projects benefiting from each other's experience of implementing similar classes of algorithms and look forward to a fruitful exchange of ideas between the two communities.
Apache Spark is a fast engine for processing large datasets, typically from a Hadoop cluster, and performing batch, streaming, interactive, or machine learning workloads. Recently, Apache Spark has embraced SQL-like APIs around DataFrames at its core. Because of that, we would expect a level of collaboration between the two projects. The Spark project also contains MLlib, Apache Spark's scalable machine learning library, which is the closest cousin to MADlib. We see the two projects benefiting from each other's experience of implementing similar classes of algorithms and look forward to a fruitful exchange of ideas between the two communities.
Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. We see a potential for MADlib to leverage Hive as a backend the same way it currently leverages PostgreSQL-derived SQL backends. This could be especially useful for longer running algorithms.
Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and Cloud Storage. We see a potential for MADlib to leverage Drill as a backend the same way it currently leverages PostgreSQL-derived SQL backends. This could be especially useful for analyzing data coming from heterogeneous sources and federated by the Drill engine.
Known Risks
Development has been sponsored mostly by a single company (or its predecessors) thus far and coordinated mainly by the core Pivotal R&D team.
So far, the project's governance model has explicitly been a "benevolent dictator" one. For the project to fully transition to the "Apache Way", development must shift towards the meritocracy-centric model of growing a community of contributors balanced with the needs for extreme stability and core implementation coherency.
Orphaned products
The community proposing MADlib for incubation is an independent open source community. Even though Pivotal happens to be the biggest corporate sponsor of the project (by means of employing the core team), the community goes beyond those affiliated with Pivotal. On top of that, Pivotal is fully committed to maintaining its position as one of the leading providers of SQL-based analytics aimed squarely at data scientists. MADlib is the only game in town that can leverage SQL APIs ranging from traditional RDBMS technology all the way to data warehousing (Pivotal Greenplum Database) and into SQL-on-Hadoop (HAWQ). Moreover, Pivotal has a vested interest in making MADlib succeed by driving its close integration with sister ASF projects. We expect this to further reduce the risk of orphaning the product.
Even in the absence of support by a particular vendor such as Pivotal, and in a worst-case scenario where HAWQ and Greenplum Database fail to gain traction in OSS, the existence of an established PostgreSQL OSS project means there will still be a working stack for MADlib.
Inexperience with Open Source
MADlib has been an open source project from the outset. All developers working on the project (regardless of their employment affiliation) did so completely in the open. While the governance model of MADlib has been more of a benevolent dictator model, the project has always been receptive to accepting contributions from all sources and including them in future releases based on thorough code review, testing, and compliance with the project’s coding best practices.
Homogeneous Developers
While most of the initial committers are employed by Pivotal, there's still a healthy level of interest coming from academia. On top of that we expect to spark curiosity in sister ASF projects and attract developers unaffiliated with Pivotal. Finally, MADlib is being used extensively whenever Pivotal engages with customers on data science projects. This typically means that the skills remain within a customer organization which further increases the chance of turning customer data scientists into MADlib contributors.
Reliance on Salaried Developers
A large percentage of the contributors are paid to work in the Big Data space. While they might wander from their current employers, they are unlikely to venture far from their core expertise and thus will continue to be engaged with the project regardless of their current employers. In addition, the project is still enjoying popularity in academic circles and we hope that will help mitigate reliance on salaried developers as well.
Relationships with Other Apache Products
As mentioned in the Alignment section, MADlib may consider various degrees of integration and code exchange with Apache Spark (MLlib), Apache Mahout, Apache Hive and Apache Drill projects. We expect integration points to be inside and outside the project. We look forward to collaborating with these communities as well as other communities under the Apache umbrella.
An Excessive Fascination with the Apache Brand
While we intend to leverage the Apache "brand" when talking to other projects as a testament to our project’s neutrality, we have no plans to use the Apache brand in press releases or to post billboards advertising acceptance of MADlib into the Apache Incubator.
Documentation
The documentation is currently available at: http://madlib.net/documentation/
The documentation is currently licensed under 2-clause BSD license.
Initial Source
Initial source code is available at:
* MADlib: https://github.com/madlib/madlib
* Testsuite: https://github.com/madlib/testsuite
* Contributors: https://github.com/madlib/contrib
The code is currently licensed under 2-clause BSD license.
Source and Intellectual Property Submission Plan
As soon as MADlib is approved to join the Incubator, the source code will be transitioned via the Software Grant Agreement onto ASF infrastructure and in turn made available under the Apache License, version 2.0. We know of no legal encumbrances that would inhibit the transfer of source code to the ASF.
External Dependencies
Runtime dependencies:
* boost-1.47.0 (Boost Software License)
* _m_widen_init (MIT for this subcomponent of GCC)
* python-argparse-1.2.1 (PSF LICENSE AGREEMENT FOR PYTHON 2.7.1)
* pyyaml-3.10 (MIT license)
* cern_root-5.34 (LGPL, however this dependency will be removed since the 2 cern modules used are being entirely re-written in MADlib)
* eigen-3.2.2 (Mozilla Public License)
* pyxb-1.2.4 (Apache license version 2)
* python (Python Software Foundation License Version 2)
* mathjax-2.5 (Apache license version 2)
Build only dependencies:
* doxypy-0.4.2 (GPL)
* cmake-2.8.4 (BSD 3-clause License)
* doxygen >= 1.8.4 (GPL)
* flex >= 2.5.33 (BSD)
* bison >= 2.4 (GPL)
* latex (LaTeX Project Public License)
* TikZ-UML (no license information)
Cryptography
* N/A
Required Resources
Mailing lists
* private@madlib.incubator.apache.org (moderated subscriptions)
* commits@madlib.incubator.apache.org
* dev@madlib.incubator.apache.org
* issues@madlib.incubator.apache.org
* user@madlib.incubator.apache.org
Git Repository
https://git-wip-us.apache.org/repos/asf/incubator-madlib.git
Issue Tracking
JIRA Project MADlib (MADLIB)
We will also request migration of our current JIRA available at http://jira.madlib.net/
Other Resources
Setting up regular builds for MADlib on builds.apache.org will require integration with Docker support.
Initial Committers
* Anirudh Kondaveeti
* Caleb Welton
* Frank McQuillan
* Gang Xiong
* Gautam Muralidhar
* Hitoshi Harada
* Hulya Emir-farinas
* Ian Huston
* KeeSiong Ng
* Noel Sio
* Rahul Iyer
* Rashmi Raghu
* Regunathan Radhakrishnan
* Ronert Obst
* Samuel Ziegler
* Sarah Aerni
* Srivatsan Ramanujam
* Woo Jae Jung
* Xixuan Feng
* Yu Yang
* Atri Sharma
* Greg Chase
* Chloe Jackson
* Roman Shaposhnik
* Vaibhav Gumashta
* Ted Dunning
* Konstantin Boudnik
Affiliations
* Hortonworks: Vaibhav Gumashta
* MapR: Ted Dunning
* WANDisco: Konstantin Boudnik
* Barclays: Atri Sharma
* Pivotal: everyone else on this proposal
Sponsors
Champion
Roman Shaposhnik
Nominated Mentors
The initial mentors are listed below:
* Ted Dunning - Apache Member, MapR
* Konstantin Boudnik - Apache Member, WANDisco
* Roman Shaposhnik - Apache Member, Pivotal
Sponsoring Entity
We would like to propose the Apache Incubator as the sponsor of this project.