Discussion thread
Vote thread
JIRA


Release: 1.3

Motivation

Apache Flink’s documentation has grown a lot over the years. Some features are better documented than others, and some content is outdated or has become irrelevant over time. With this FLIP we would like to coordinate the efforts to restructure, improve, and update our documentation.

Goals

Top-Level Structure with Clear Separation of Concerns

With this FLIP we would like to clean up the top-level structure of Flink’s documentation. The top-level sections should all have dedicated goals and a clear separation of concerns (see below).

Expand and Restructure Existing Concepts Section

The Apache Flink documentation already contains a Concepts section, but it a) is incomplete, b) lacks an overall structure and reading flow, and c) describes Flink as the community presented it 2-3 years ago. Moreover, “concepts” content is spread over the development & operations documentation without references to the Concepts section.

Align Application Development with current State and Flink Roadmap

The structure of the Application Development section of the documentation does not align well with the overall direction of the project. Broadly speaking, while we are moving in the direction of having two unified APIs for batch and stream processing (DataStream API & Table API / SQL), the documentation is centered around the DataStream and DataSet APIs.

Improve documentation on Deployment & Operations

The operations and monitoring documentation is arguably the weakest part of the Apache Flink documentation. While it is comprehensive and often goes into a lot of detail, it lacks an overall structure and does not address the common, overarching concerns of operations teams in an efficient way.

Improve getting-started experience

The current documentation has a few getting-started guides, but these only cover some of the APIs and are not consistent with each other. It would be good to have guides that address the different aspects: APIs (job development / SQL queries) and operations (job submission, management via UI/REST). Some of these could be realized with Docker Compose environments.

Use of a common terminology

Some terms in the documentation are used interchangeably to describe the same thing, and sometimes a single term is used to describe two different things. Examples include operator, task, instance, partition, record, event, function, and transformation. It would be great to have a reference page that defines the meaning of these terms and to aim to use them in a consistent way.

Add a documentation style guide to contribution guide

Similar to the Checkstyle rules used to keep Flink’s source code consistent, we should have a common style for all pages of the documentation. Doing so will allow full community involvement in growing the documentation while maintaining a consistent visual structure. Examples include how and when to use note and attention tags, where the table of contents should appear on a page, etc.

Proposed Change

We propose a complete overhaul of the structure of our documentation, expanding, improving, and bringing all sections up to date.

Top-Level Structure


Current (moved to):

  • Concepts (Concepts)
  • Tutorials (Getting Started)
  • Examples (Getting Started)
  • Application Development (APIs, Libraries, Connectors)
  • Deployment & Operations (Deployment, Operations)
  • Metrics & Monitoring (Deployment, Operations)
  • Flink Development (Flink Project Page)
  • Internals (Concepts or removed)

Target:

  • Getting Started
  • Concepts
  • APIs
  • Libraries
  • Connectors
  • Testing
  • Deployment
  • Operations


Getting Started is targeted at new users who want to get up and running quickly. It will contain quickstarts and example walkthroughs for the two main APIs: DataStream & Table API / SQL. Interactive Docker Compose playgrounds would be great to have as well: 1) a SQL CLI setup (like the SQL training), 2) Flink + Kafka + a job JAR to submit, savepoint, restart, etc. via Web UI / REST.

The Concepts section introduces the user to Flink’s fundamental concepts for batch as well as stream processing.

The APIs section explains in detail how to write Flink applications with the DataStream API and Table API / SQL and, for the time being, also the DataSet API. The APIs section references Concepts whenever conceptual background is needed.

The Libraries section documents the usage of our libraries on top of Flink’s high-level APIs, namely for Complex Event Processing, Graph Processing, and Machine Learning.

The new top-level Connectors section makes it easy for new users to quickly evaluate how Flink applications integrate with their existing infrastructure. This section discusses the existing connectors and also provides a guide to custom sources & sinks.

The Deployment section covers all topics related to the installation & setup of a Flink cluster or application, including metrics, logging, and security concerns.

In contrast to the Deployment section, the Operations section will focus on “Day 2 operations”, namely monitoring, upgrades, troubleshooting & tuning of Flink applications & clusters.

Getting Started Section

Getting Started is targeted at new users who want to get up and running quickly. We propose the following structure:

  • Flink Overview (~ two pages)
  • Project Setup
  • Quickstarts
    • Example Walkthrough - Table API / SQL
    • Example Walkthrough - DataStream API
  • Docker Playgrounds
    • Flink Cluster Playground
    • Flink Interactive SQL Playground


The Example Walkthroughs will be structured in multiple steps, similar to how it is done in Apache Beam: simple WordCount, windowed WordCount. The Example Walkthroughs should also cover deployment to a minimal local Flink cluster.

  1. Writing a program: tell the story of how to evolve batch into streaming:
    1. Start with a simple word count (batch)
    2. Windowed word count (batch; grouped by word and time window)
    3. Continuous windowed word count (different source), firing by watermark or periodic processing-time timers (two ways of producing output); a minimal sketch follows below
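
For step 3, a minimal sketch with the DataStream API could look like the following (the socket source, host/port, and window size are placeholder choices, and this shows the processing-time variant; the walkthrough would elaborate on the event-time/watermark variant):

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;

    public class ContinuousWindowedWordCount {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.socketTextStream("localhost", 9999)      // unbounded source (placeholder)
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        // split each line into (word, 1) pairs
                        for (String word : line.toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) {
                                out.collect(Tuple2.of(word, 1));
                            }
                        }
                    }
                })
                .keyBy(0)                                // group by word (tuple field 0)
                .timeWindow(Time.seconds(10))            // tumbling 10-second windows
                .sum(1)                                  // count occurrences per word and window
                .print();

            env.execute("Continuous Windowed WordCount");
        }
    }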


The playgrounds are Docker Compose environments that come with Flink and Kafka. The SQL CLI playground can be a stripped-down version of the SQL training. The Flink Ops playground could come with one (or more) reasonably complex streaming applications that can be submitted for execution. We could provide some basic REST commands (taking a savepoint, killing a job, requesting metrics, …) and let users play with the WebUI. I think this could be a nice starting experience.
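
To make the REST part of the Ops playground concrete, the interactions could look like the following sketch (assuming the playground exposes Flink’s REST endpoint on localhost:8081; the job ID is a placeholder and error handling is omitted):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class PlaygroundRestCommands {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            String base = "http://localhost:8081"; // assumed REST address of the playground
            String jobId = args[0];                // job ID, e.g. taken from the WebUI

            // trigger a savepoint: POST /jobs/:jobid/savepoints
            HttpRequest savepoint = HttpRequest.newBuilder()
                .uri(URI.create(base + "/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"cancel-job\": false}"))
                .build();
            System.out.println(client.send(savepoint, HttpResponse.BodyHandlers.ofString()).body());

            // list available job metrics: GET /jobs/:jobid/metrics
            HttpRequest metrics = HttpRequest.newBuilder()
                .uri(URI.create(base + "/jobs/" + jobId + "/metrics"))
                .build();
            System.out.println(client.send(metrics, HttpResponse.BodyHandlers.ofString()).body());
        }
    }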

Concepts Section

One of the primary goals of this FLIP is to create a better, more prominent Concepts section. The purpose of this section is to introduce Flink users to the fundamental concepts of stream & batch processing with Apache Flink. Each subsection should cover both stream and batch processing.

We propose the following structure for this section:

  • Stream Processing
    • A Unified System for Batch & Stream Processing
    • DAGs, Parallel DAGs, Partitioning (Random, Keyed, Broadcasted), Functions (wrapped in operators)
  • Stateful Stream Processing & Persistent State
    • Introduction
      • What is State?
    • State in Stream & Batch Processing
    • State Types (Keyed State, Broadcast State)
    • State Persistence
      • Asynchronous Barrier Snapshots
      • Recovery
      • State Backends (Working Data Structure vs Checkpoint Storage; not describing the existing backends though)
    • Delivery Guarantees & “Exactly Once Semantics”
      • End-to-end exactly once semantics
  • Time & The Latency-Completeness Trade-Off
    • Latency & Completeness
      • Latency vs. Completeness in Batch & Stream Processing
    • Notions of Time
      • Event Time
        • Different Sources of Time (Event, Ingestion, Storage)
      • Processing Time
    • Making Progress: Watermarks, Processing Time, Lateness
      • Propagation of watermarks
    • Windowing
  • Flink Architecture
    • Flink Applications and Flink Sessions
    • Anatomy of a Flink Cluster
      • Client, Flink Master (Dispatcher, ResourceManager, JobManagers), TaskManagers
  • Glossary


The Table API has the additional concept of “Dynamic Tables”. Since this concept is only relevant to that API, we propose to introduce it in APIs -> Table API / SQL.
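
For illustration, a continuous query over a dynamic table might look like the following sketch (import paths and the exact bridging API differ between Flink versions, so treat this as indicative):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    public class DynamicTableExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // a (here bounded, conceptually unbounded) stream of (user, clicks) records
            DataStream<Tuple2<String, Integer>> clicks =
                env.fromElements(Tuple2.of("alice", 1), Tuple2.of("bob", 1), Tuple2.of("alice", 1));

            // interpret the stream as a dynamic table and run a continuous query over it
            Table clicksTable = tEnv.fromDataStream(clicks);
            Table counts = tEnv.sqlQuery("SELECT f0, SUM(f1) FROM " + clicksTable + " GROUP BY f0");

            // a grouped aggregation emits updates, hence the retract stream
            tEnv.toRetractStream(counts, Row.class).print();
            env.execute();
        }
    }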

APIs



Current:

  • Project Build Setup
  • Basic API Concepts
  • Streaming (DataStream API)
  • Batch (DataSet API)
  • Table API & SQL
  • Data Types & Serialization
  • Managing Execution
  • Libraries
  • Best Practices
  • API Migration Guides

Target:

  • DataStream API
  • Table API / SQL
  • DataSet API (deprecated)


Remarks:

  • Project Build Setup is removed from APIs and moved to Getting Started -> Project Setup.
  • Libraries is moved top-level.
  • Basic API Concepts & Data Types & Serialization are mostly moved to DataStream API.
  • Best Practices can be removed completely.

Libraries Section

This corresponds to the current Libraries section under Application Development, with subsections for Complex Event Processing (CEP), Graph Processing (Gelly), and Machine Learning.

Connectors Section

The Connectors section will document the existing connectors of Apache Flink as well as instructions and best practices for building your own connectors. The section should cover the Table API/SQL as well as the DataStream API and, for the time being, the DataSet API.

The content currently mainly resides in

  • Application Development -> Streaming -> Connectors
  • Application Development -> Batch -> Connectors
  • Application Development -> Table API /SQL -> Connect to External Systems


and only needs to be moved and restructured.

Suggested Structure

  • Overview
    • Stream Sources, Stream Sinks, Table Sources, Table Sinks (Relationship between Stream Sources/Sink and Table Sources Sinks)
  • Existing Connectors
    • Overview
    • FileSystem
    • Apache Kafka
    • Elasticsearch
    • ……
  • Existing Encodings
    • CSV
    • Avro
    • JSON
    • Parquet
    • ORC
    • ….
  • Stream Sources (to be added after the Source Interface rework)
    • How to write a Stream Source
  • Stream Sinks
    • How to write a Stream Sink (a minimal sketch follows after this list)
  • Table Sources and Sinks
    • How to build Table Sources/Sinks on top of stream sources and sinks
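
To give a flavor of what the “How to write a Stream Sink” page could cover, a minimal custom sink might look like the following. This is only a sketch (the class name is made up; the real page would also discuss fault tolerance and exactly-once sinks):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    // A minimal custom sink that writes every record to standard out.
    // A real connector would manage its client/connection in open()/close().
    public class PrintlnSink extends RichSinkFunction<String> {

        @Override
        public void open(Configuration parameters) {
            // establish connections to the external system here
        }

        @Override
        public void invoke(String value, Context context) {
            System.out.println(value);
        }

        @Override
        public void close() {
            // release connections here
        }
    }

Such a sink is attached to a stream with stream.addSink(new PrintlnSink()).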

Testing Section

The Testing section will document best practices for testing across APIs and along the testing pyramid.

  • Unit Testing
    • Stateless Functions (see the sketch after this list)
    • Stateful Functions
      • How to use the StreamOperatorTestHarness
    • Table API UDFs
  • Integration Testing
    • Running embedded Flink clusters
      • MiniClusterWithClientResource
    • Injecting Faults
      • Running the local cluster with savepoints to test recovery
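
As a sketch of the stateless-function case (assuming JUnit 4 and a hypothetical word-splitting Tokenizer; stateful functions would use the test harnesses listed above instead):

    import static org.junit.Assert.assertEquals;

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.util.Collector;
    import org.junit.Test;

    public class TokenizerTest {

        // hypothetical stateless function under test: splits a line into words
        static class Tokenizer implements FlatMapFunction<String, String> {
            @Override
            public void flatMap(String line, Collector<String> out) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        out.collect(word);
                    }
                }
            }
        }

        @Test
        public void splitsLinesIntoWords() throws Exception {
            final List<String> collected = new ArrayList<>();
            // stateless functions need no harness: invoke them with a simple Collector
            new Tokenizer().flatMap("To be, or not to be", new Collector<String>() {
                @Override public void collect(String record) { collected.add(record); }
                @Override public void close() {}
            });
            assertEquals(Arrays.asList("to", "be", "or", "not", "to", "be"), collected);
        }
    }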

Deployment Section


The Deployment section is restructured and follows a top-down approach, starting from a reference architecture that explains the different components and their roles in a production setup, indicating “optional” components. The different components are then discussed in the following subsections, including the different resource managers, state backends & Flink Master failover. The subsection concludes with three reference setups for typical Flink installations.

The rest of this section is straightforward, covering how to set up metrics & logging (not the metrics themselves), how to secure a Flink environment, and a reference of all Flink configuration options.

  • Apache Flink Reference Architecture
    • Overview
      • Components: Resource Manager, Flink, Checkpoint Storage, High Availability Services Provider (aka Zookeeper)
    • Resource Manager
      • K8s
        • Flink Application on K8s
        • Flink Cluster on K8s
      • YARN
        • Flink Application on Yarn
        • Flink Cluster on Yarn
      • Mesos
        • Flink Application on Mesos
        • Flink Cluster on Mesos
      • Standalone
        • (Flink Application on Standalone)
        • Flink Cluster on Standalone
      • ...
    • Checkpointing (a configuration sketch follows after this outline)
      • Configuration
      • State Backends
        • Filesystem
        • RocksDB
    • Filesystems
    • Flink Master Failover (formerly High Availability)
    • Reference Architectures
      • A Reference Architecture on YARN
        • HDFS, ZooKeeper, YARN, Kerberos(?)
      • A Reference Architecture for Hosted Kubernetes
        • Hosted Kubernetes, Object Storage, Zookeeper/Zetcd?
      • A Reference Architecture for Standalone on VMs
        • Standalone, standby Flink Masters, standby TaskManagers, NFS, ZooKeeper, init.d for process supervision
  • Metrics & Logging
    • Metrics Reporter
    • Logging Configuration
  • Apache Flink Security
    • TLS
      • Flink REST API
      • Intra-Cluster Communication
    • Kerberos
  • Configuration Reference Guide
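
For the Checkpointing subsection above, a minimal programmatic configuration could look like this sketch (the same settings can live in flink-conf.yaml; the RocksDB backend assumes the flink-statebackend-rocksdb dependency, and the checkpoint path is a placeholder):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // take a checkpoint every 60 seconds with exactly-once guarantees
            env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

            // keep working state in RocksDB; persist checkpoints to a durable file system
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

            // ... define and execute the job ...
        }
    }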


Operations Section

In contrast to the Deployment section, the Operations section focuses on “Day 2 operations”, namely monitoring, upgrades, troubleshooting & tuning of Flink applications & clusters. This distinction is not always clear-cut, though.

  • Apache Flink Interfaces
    • Apache Flink WebUI
    • CLI
    • REST API
    • SQL Client
    • Scala Shell
  • Upgrading Apache Flink Applications & Flink Sessions
    • Savepoints
    • Application State Compatibility
  • Monitoring
    • Metrics Reference
  • Tuning (a sketch follows after this outline)
    • Memory
    • Latency
    • Slots, Threading Model
    • Slot Sharing
    • Task Chaining
    • Object Reuse
  • Debugging & Troubleshooting
    • Strategies
    • “Common Issues”
      • Issue A
      • Issue B
      • ...

New or Changed Public Interfaces


N/A

Migration Plan and Compatibility


At this point we only have a very rough migration plan. The implementation of this FLIP will be realized in three phases.

1st Phase

In the first phase, we simply implement a few prerequisites for such a large documentation change.

  • Set up a redirect so that 404s on existing links to Flink’s “stable” or “master” documentation are redirected to the Apache Flink homepage.
  • Create a style guide for the documentation (see Goals)
  • Create a first version of Concepts -> Glossary


2nd Phase

The second phase already consists of some significant rework and restructuring, but it is made up of independent projects and does not require major changes to the existing documentation.

  • Update Getting Started section & replace current Tutorials & Examples sections.
  • Update Concepts section and move “Concepts” content from Application Development to Concepts
  • Remove Flink Development & Internals from documentation
    • Flink Development can partially be dropped or migrated into the contribution guide on the Flink project homepage
    • Internals can be removed. Most content is outdated and can partially be reused for the updated Concepts section.

3rd Phase

Afterwards, we tackle the other two larger parts of the documentation rework. These depend on the new Concepts section, but realistically the Concepts section will evolve significantly over the course of the whole implementation phase of this FLIP.

  • Restructure Application Development and split it up into the APIs, Libraries, and Connectors sections
  • Rework and expand the Deployment & Operations and Metrics & Monitoring sections


Rejected Alternatives


None

Related Tickets


All non-translation documentation tickets:

project = FLINK AND component = Documentation and status = Open and component != chinese-translation

Already finished translation efforts:

project = FLINK AND component = Documentation and resolutionDate !=null and component = chinese-translation

Tickets that are independent of, but related to, this FLIP/documentation, which came up in the discussion:


Existing tickets subsumed by this FLIP


Alignment with Chinese Translation FLIP (FLIP-35)

We will coordinate with the community on the concurrent implementation of FLIP-35. In FLIP-35 we will first focus on those parts of the documentation which won’t fundamentally change as part of this FLIP. The fewest changes will probably be in the following areas:

  • Libraries -> Gelly
  • Libraries -> CEP
  • Application Development -> Streaming -> Connectors


The rest of Application Development -> Streaming will probably be restructured quite a bit, but the fundamental documentation of Flink’s API will of course stay around and can be translated as well.

In addition, any documentation added in the course of this FLIP can of course be translated as well. The first parts will be Getting Started, Concepts, and the Flink Documentation Guideline.

Open Questions

This is just an additional list of smaller & larger open questions.

  • Where to put “Best Practices”?

In the corresponding sections:

  • “Best Practices” for cluster sizing -> Deployment
  • “Best Practices” for choosing a state backend -> probably Deployment
  • “Best Practices” for using different state types (List, Value, Map) -> APIs
  • ...
  • Add top-level “Releases” section for Migration Guides (API as well as Operations side)?
  • For the Application Development restructuring and the Deployment & Operations restructuring/rewrite, it might be useful to have many smaller PRs, which don’t go onto master directly but onto a feature branch.