
Abstract

Submarine is a project which allows infra engineers and data scientists to build deep learning applications (TensorFlow, PyTorch, etc.) end to end on cluster management platforms such as YARN.

A number of integrations between Submarine and other projects, such as Apache Zeppelin, TonY, and Azkaban, are finished or in progress. The next step is for Submarine to integrate with more projects such as Apache Arrow, Redis, and MLflow; to handle end-to-end machine learning use cases such as model serving, notebook management, and advanced training optimizations (auto parameter tuning, memory-cache optimizations for large training datasets, etc.); and to run on other platforms such as Kubernetes, or natively on the cloud.

The Hadoop community believes that further development of Submarine can be done better as a separate project, as discussed on the Hadoop mailing lists (submarine-dev@hadoop.apache.org).

Proposal

Although Submarine was originally developed inside Apache Hadoop, several forces are encouraging it to move to a separate project:

First, projects both inside and outside the Apache Software Foundation wish to use Submarine or integrate with it, but do not want to depend on Hadoop and its large list of dependencies.

Second, Hadoop, as a community project, targets computation and storage on commodity hardware. The Submarine project was created to solve deep learning computation on the Hadoop platform, but it has since grown into an end-to-end machine learning solution covering model training (computation), model management, and notebook management.

Third, moving out of Hadoop will also allow Submarine to support other languages in the future (Go, Python, R, etc.), release on a faster cycle than Hadoop, and develop an independent community focused on machine learning.

The traditional path at Apache would have been to create an incubator project, but the code is already being released by Apache and most of the developers are familiar with Apache rules and guidelines. In particular, the proposed PMC has [4] Apache members and Incubator PMC members from three companies, who will provide oversight and guidance for the developers who are less experienced in the Apache Way. Therefore, the Submarine project would like to propose becoming a Top Level Project at Apache.

Overview of Submarine

Deep learning is useful for enterprise tasks in fields such as speech recognition, image classification, AI chatbots, and machine translation, to name a few. To train deep learning/machine learning models, frameworks such as TensorFlow, MXNet, PyTorch, Caffe, and XGBoost can be leveraged.

To make distributed deep learning/machine learning applications easy to launch, manage, and monitor, the Hadoop community initiated the Submarine project in 2018, along with other improvements such as first-class GPU support, Docker container support, container DNS support, and scheduling improvements.

These improvements make running distributed deep learning/machine learning applications on Apache Hadoop YARN or other resource management platforms as simple as running them locally, letting machine learning engineers focus on algorithms instead of worrying about the underlying infrastructure. With Submarine, users can run deep learning workloads alongside ETL/streaming jobs on the same cluster, which gives easy access to data on that cluster and better resource utilization.

Current Status

Meritocracy

Submarine has been developed as part of Apache Hadoop and thus has been operating as a meritocracy. Many of the developers of Submarine are active Hadoop PMC members, committers or contributors. The Submarine project plans to continue adding new PMC and committers as the project continues to develop.

Community

Submarine’s development team seeks to foster the development and user communities. We feel that becoming a separate project will improve both communities: Submarine will be smaller and more focused than Hadoop, and it can integrate more tightly with various Apache projects and other open source projects that either do not want to or cannot accept Hadoop's large list of dependencies.

Core Developers

Hadoop Submarine is primarily being developed by Cloudera, NetEase, LinkedIn, JD, Dahua, and Ke.com.

Alignment

The ASF is a natural host for Submarine given that it is already the home of Hadoop, Spark, Hive, Arrow, and other emerging distributed computing software projects. Submarine was designed to offer an improved user experience for deep learning/machine learning model training, serving, and management; it can be part of a big data pipeline and leverage the power of Apache Spark, Apache Arrow, Apache Zeppelin, etc.

Known Risks

Orphaned Products

The core developers of the Submarine team are actively working on the project and plan to continue. There is very little risk of Submarine becoming orphaned, since large companies are already using it to train their machine learning models. For example, NetEase uses Submarine in production to run machine learning jobs on a cluster of 250 GPU nodes and to serve notebooks for its developers and data scientists (https://www.infoq.cn/article/C11ef0aa1EfSb*6CE9ce).

Inexperience with Open Source

The potential PMC of the new project has extensive experience with Apache projects and includes [TODO number] Apache members and Incubator PMC members. The Submarine PMC and the more experienced committers will be responsible for mentoring the committers who are less familiar with the Apache Way.

Homogeneous Developers

The developers include employees from Cloudera, NetEase, Alibaba TODO. Apache projects encourage an open and diverse meritocratic community, and the Submarine team is very motivated to increase the size and diversity of the development team.

Reliance on Salaried Developers

Most of the work on Submarine has been done by salaried developers, but the hope is that making Submarine a separate project will make it more approachable for new developers, including non-salaried ones.

Relationships with Other Apache Products

Submarine has a strong relationship and integration with Apache Hadoop and Zeppelin. Being independent of Hadoop will allow other projects to depend on Submarine directly without incurring the cost of depending on Hadoop's large list of dependencies.

Submarine would like to encourage integration with additional Apache projects:

- Apache Arrow: a cross-language development platform for in-memory data caching

- Apache Spark: data processing/preprocessing and feature engineering

- Apache Zeppelin: interactive development of algorithms through notebooks

As far as we know, there’s no similar project in Apache positioned to be an end-to-end machine learning platform.

An Excessive Fascination with the Apache Brand

Submarine wants to become an Apache project in order to help efforts to diversify the committer base, not to capitalize on the Apache brand. The Submarine project is already in production use inside several large companies and is already released as part of Apache Hadoop. As such, the Submarine project is not seeking to use the Apache brand as a marketing tool.

Documentation

The primary documentation for Submarine is located at https://hadoop.apache.org/submarine/docs/0.2.0/Index/

There have also been presentations on Submarine:

- QCon 2019, Beijing (https://qcon.infoq.cn/2019/beijing/presentation/1440)

- DataWorks Summit 2019, Barcelona, Spain (https://dataworkssummit.com/barcelona-2019/session/hadoop-submarine-project-running-deep-learning-workloads-on-yarn/)

Initial Source

Submarine has been under development as part of Apache Hadoop since 2018. The original inclusion into Hadoop was via YARN-8135 (now SUBMARINE-2).

Submarine sub-modules

CORE

Runs machine learning workloads on Kubernetes or YARN, taking advantage of Kubernetes for online services and YARN for offline computation.

Submarine Portal

Provides server resource management, user role management, algorithm development, and model management, as well as interactive data analysis and algorithm development through notebooks.

Submarine ML Plugin interface

Submarine provides a plug-in interface for machine learning frameworks, enabling various machine learning frameworks to be integrated into Submarine easily; see the sketch below.
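
To make the idea of such a contract concrete, here is a minimal sketch in Java. Every name in it (MLFrameworkPlugin, JobSpec, LaunchSpec, TensorFlowPlugin) is hypothetical and invented for illustration; this is not Submarine's actual plug-in API.

```java
// Minimal illustrative sketch of a framework plug-in contract.
// All names here are hypothetical and do NOT reflect Submarine's actual API.

import java.util.ArrayList;
import java.util.List;

/** A framework-agnostic description of a training job. */
class JobSpec {
    final String name;
    final String dockerImage;
    final int numWorkers;
    final String launchCmd;

    JobSpec(String name, String dockerImage, int numWorkers, String launchCmd) {
        this.name = name;
        this.dockerImage = dockerImage;
        this.numWorkers = numWorkers;
        this.launchCmd = launchCmd;
    }
}

/** The concrete commands a runtime (YARN or Kubernetes) should launch. */
class LaunchSpec {
    final List<String> workerCommands;

    LaunchSpec(List<String> workerCommands) {
        this.workerCommands = workerCommands;
    }
}

/** Each supported framework (TensorFlow, PyTorch, ...) implements this contract. */
interface MLFrameworkPlugin {
    /** Framework identifier, e.g. "tensorflow" or "pytorch". */
    String frameworkName();

    /** Translates a framework-agnostic job into concrete launch commands. */
    LaunchSpec buildLaunchSpec(JobSpec job);
}

/** Example plug-in: emits one launch command per distributed worker. */
class TensorFlowPlugin implements MLFrameworkPlugin {
    @Override
    public String frameworkName() {
        return "tensorflow";
    }

    @Override
    public LaunchSpec buildLaunchSpec(JobSpec job) {
        List<String> commands = new ArrayList<>();
        for (int i = 0; i < job.numWorkers; i++) {
            commands.add(job.launchCmd + " --task_index=" + i);
        }
        return new LaunchSpec(commands);
    }
}
```

With a contract like this, adding support for a new framework would amount to shipping one more plug-in implementation, without touching the core job-submission path.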

Submarine Toolkit / SDK

Submarine provides development libraries for Python, Java, and Scala that can be used directly in machine learning algorithms, covering MLflow integration, memory cache, and metrics in the Submarine runtime environment; a usage sketch follows.
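
As a flavor of what using such an SDK from Java might look like, here is a hypothetical sketch of reporting metrics from a training loop. The MetricsClient class and its log method are assumptions made for illustration, not a published Submarine API.

```java
// Hypothetical usage sketch of a metrics API in a training loop.
// MetricsClient and its log() method are illustrative assumptions,
// not a published Submarine SDK API.
public class TrainingLoopExample {

    /** Stand-in metrics client so the sketch is self-contained and runnable. */
    static class MetricsClient {
        void log(String name, double value, long step) {
            // A real SDK would report to the Submarine runtime; here we just print.
            System.out.printf("step=%d %s=%.4f%n", step, name, value);
        }
    }

    public static void main(String[] args) {
        MetricsClient metrics = new MetricsClient();
        for (int step = 1; step <= 5; step++) {
            double loss = 1.0 / step;  // stand-in for a real training step
            metrics.log("loss", loss, step);
        }
    }
}
```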

Major dependencies

Build tools

- Apache Maven

Other Dependencies

Apache

- Log4j

- Hadoop 2.7.x+

Non-Apache

- JDK 1.8+

- Protobuf

- TonY (BSD 2-clause, https://github.com/linkedin/TonY)


Cryptography

Submarine does not currently support encryption.


Required Resources

Mailing Lists

  • private@submarine for private PMC discussions (with moderated subscriptions)
  • dev@submarine
  • user@submarine
  • commits@submarine

Version Control

Git is the preferred source control system.

Issue Tracking

Submarine already has a separate JIRA project (issue prefix SUBMARINE-) to track issues.

Other Resources

The existing code already has unit tests so we will make use of existing Apache continuous testing infrastructure. The resulting load should not be very large.

PMC/Committers

Initial PMC

Wangda Tan (wangda at apache dot org) (Hadoop PMC)

Xun Liu (liuxun at apache dot org) (Zeppelin Committer)

Sunil Govind (sunilg at apache dot org) (Hadoop PMC)

Zhankun Tang (ztang at apache dot org) (Hadoop Committer)

Zac Zhou (yuan.zac.zq at gmail dot com)

Keqiu Hu (khu at linkedin dot com)

Vinod Kumar Vavilapalli (vinodkv at apache dot org) (Hadoop VP, Incubator PMC, ASF Member, Ambari PMC, Atlas PMC, Crunch PMC, Lens PMC, Metron PMC, Tez PMC)

We’d like to propose Wangda Tan as the initial VP for the Submarine project.

Initial Committers

(All initial PMC members are also initial committers.)

Szilard Nemeth (snemeth at apache dot org) (Hadoop Committer)

Jeff Zhang (zjffdu at apache dot org) (ASF Member, Incubator PMC, Livy Committer, Pig Committer, Tez PMC, Zeppelin PMC)

Yanbo Liang (yliang at apache dot org) (Spark PMC)

Affiliations

The initial PMC members are employed by Cloudera, NetEase, and LinkedIn.

The initial committers are employed by Cloudera, NetEase, LinkedIn, Alibaba, and Facebook.

If you would like to be included in this list, please let us know publicly during the proposal voting period.
