A good long-term objective for the PMC is to drop RabbitMQ in
favor of Pulsar (third parties could package their own components using
RabbitMQ if they wish).

This means:

Solve the bugs that were found during the Pulsar MailQueue review
The Pulsar MailQueue needs to allow listing blobs in order to be
deduplication friendly.
Provide an event bus based on Pulsar
Provide a task manager based on Pulsar
Package a distributed server backed by Pulsar, deprecate then replace
the current one.
(optionally) support mail queue priorities

While contributions would of course be welcomed on this topic, we could
offer it as part of GSOC 2022, and we could co-mentor it with mentors of
the Pulsar community (see [3])

[3] https://lists.apache.org/thread/y9s7f6hmh51ky30l20yx0dlz458gw259

Would such a plan gain traction around here?

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Benoit Tellier, mail: btellier (at) apache.org

Project Devs, mail: dev (at) james.apache.org

APISIX

Apache APISIX: Redesign/ADD Apache APISIX Plugin icons

Background: Apache APISIX plugins have become very popular, and people are now publishing tutorials on how to use them. To enrich the user experience, we should add plugin icons.

Task: The intern should evaluate different possible icon designs, and add or update the existing designs in agreement with the mentor.

References:

Existing icons

Who is a Potential Mentor: Ayush Das, email: ayush24das@gmail.com
Github id - https://github.com/iamayushdas

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Bobur Umurzokov, mail: bumurzokov (at) apache.org

Project Devs, mail: dev (at) apisix.apache.org

Apache APISIX: Multiple programming languages SDK support

Project title:

Multiple programming languages client SDK support with OpenAPI generator.

Apache APISIX is a dynamic, real-time, high-performance API gateway.

It provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, and more.

Page: https://apisix.apache.org/

Github: https://github.com/apache/apisix

Background:

OpenAPI Generator allows the generation of API client libraries (SDK generation), server stubs, documentation, and configuration automatically given an OpenAPI Spec.

We can use it to provide Apache APISIX Admin and Control API SDKs in multiple programming languages. In the future, we may integrate the Java SDK into the Spring Framework and a Spring Boot starter, or even integrate with ASP.NET.

Task:

Generate SDKs in multiple programming languages from OpenAPI specification definition files, using the OpenAPI Generator tool to produce client SDKs for the Admin and Control APIs.
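As a rough illustration of the task, the sketch below assembles one `openapi-generator-cli generate` invocation per API and target language. The spec file names (`admin-api.yaml`, `control-api.yaml`) and the output layout are assumptions made for this example, not files that exist in the APISIX repository today; the `-i`, `-g` and `-o` options and the `java`, `python` and `go` generator names are those of the OpenAPI Generator CLI.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical spec file names: the project would first have to write them.
SPECS = {"admin": "admin-api.yaml", "control": "control-api.yaml"}
# Generator names passed to `openapi-generator-cli generate -g`.
GENERATORS = ["java", "python", "go"]


def build_commands(spec_dir, out_dir):
    """Return one generator command (as an argv list) per API and language."""
    commands = []
    for api, spec in SPECS.items():
        for generator in GENERATORS:
            commands.append([
                "openapi-generator-cli", "generate",
                "-i", str(PurePosixPath(spec_dir) / spec),
                "-g", generator,
                "-o", str(PurePosixPath(out_dir) / api / generator),
            ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for command in build_commands("specs", "sdk"):
        print(shlex.join(command))
```

Printing the commands rather than executing them keeps the sketch independent of whether the generator is installed; a real build would run them from CI or a Makefile.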

Difficulty: Normal
Project size: ~350 hours.

References:

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Bobur Umurzokov, mail: bumurzokov (at) apache.org

Project Devs, mail: dev (at) apisix.apache.org

Apache APISIX: Elasticsearch plugin

Apache APISIX is a dynamic, real-time, high-performance API gateway.

It provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, and more.

Page: https://apisix.apache.org/

Github: https://github.com/apache/apisix

Background: Elasticsearch is a widespread search engine based on Apache Lucene. It allows users to index, store, and search for data via a REST API. Data going through APISIX are good candidates to be transferred to Elasticsearch for later analysis.

Task: The intern should evaluate different possible designs, analyze their pros and cons, and implement at least one in agreement with the mentor.

In particular, the intern should investigate ES requirements for writing data (amount of data, frequency, etc.) prior to any development.
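One design worth evaluating is to buffer log entries and send them in batches through the Elasticsearch `_bulk` endpoint, which accepts newline-delimited JSON. The sketch below only builds the request body; the index name, the batch size and the `LogBatcher` class are hypothetical, and an actual APISIX plugin would be written in Lua rather than Python.

```python
import json


def build_bulk_body(index, entries):
    """Serialize log entries as an Elasticsearch _bulk request body (NDJSON)."""
    lines = []
    for entry in entries:
        # One action line, then the document itself, each a JSON object on its own line.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(entry))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


class LogBatcher:
    """Hypothetical helper: buffers entries and emits one bulk body per full batch."""

    def __init__(self, index, max_entries=100):
        self.index = index
        self.max_entries = max_entries
        self._buffer = []

    def add(self, entry):
        """Buffer an entry; return a bulk body once the batch is full, else None."""
        self._buffer.append(entry)
        if len(self._buffer) >= self.max_entries:
            return self.flush()
        return None

    def flush(self):
        """Return a bulk body for the buffered entries, or None if the buffer is empty."""
        if not self._buffer:
            return None
        body = build_bulk_body(self.index, self._buffer)
        self._buffer = []
        return body
```

Batching trades a little latency for far fewer HTTP requests; the right batch size and flush interval are exactly the kind of write requirements the intern is asked to investigate first.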

Difficulty: Normal
Project size: ~175 hours.

References:

Potential Mentor: ZhengSong Tu, https://github.com/tzssangglass

Difficulty: Major

Project size: ~175 hour (medium)

Potential mentors:

Bobur Umurzokov, mail: bumurzokov (at) apache.org

Project Devs, mail: dev (at) apisix.apache.org

Apache APISIX: Support local file and data center configuration conversion, import and export

Apache APISIX is a dynamic, real-time, high-performance API gateway.

It provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, and more.

Page: https://apisix.apache.org/

Github: https://github.com/apache/apisix

Project title:

Datacenter and local file configuration conversion, export and import are supported via Apache APISIX CLI.

Background:

Apache APISIX supports running in standalone mode. In this mode, Apache APISIX relies on the local configuration file `conf/apisix.yaml` for routing and policy settings.

If the Apache APISIX CLI supported the conversion, import and export of data center and local file configuration data, Apache APISIX would be easier to switch and apply between different environments and scenarios.

Task:

Add two commands `bin/apisix conf_export` and `bin/apisix conf_import` to Apache APISIX CLI, and complete the conversion, import and export of remote data center and local file configuration data through the above commands.
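The export direction can be pictured as grouping the data center's key/value pairs by resource type, which matches the top-level sections (`routes`, `upstreams`, ...) of the standalone file. The sketch below assumes the data has already been read into a dict of key to JSON string and that keys look like `/apisix/routes/1`; both the function name and that input shape are assumptions for illustration, and the real commands would be implemented in the APISIX CLI itself.

```python
import json


def etcd_to_standalone(kvs, prefix="/apisix"):
    """Group `<prefix>/<resource>/<id>` entries into {resource: [objects]}."""
    config = {}
    for key, raw in sorted(kvs.items()):
        if not key.startswith(prefix + "/"):
            continue  # not an APISIX key
        parts = key[len(prefix) + 1:].split("/")
        if len(parts) != 2 or not parts[1]:
            continue  # skip directory placeholders and nested keys
        resource, obj_id = parts
        obj = json.loads(raw)
        # Standalone entries carry their id inline instead of in the key.
        obj.setdefault("id", obj_id)
        config.setdefault(resource, []).append(obj)
    return config
```

Serializing the resulting mapping to YAML and writing it to `conf/apisix.yaml` is deliberately left out here; the import command would perform the inverse mapping.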

Difficulty: Normal
Project size: ~350 hours.

References:

https://github.com/apache/apisix/blob/master/docs/en/latest/stand-alone.md

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

JinChao Shuai, mail: shuaijinchao (at) apache.org

Project Devs, mail: dev (at) apisix.apache.org

Apache APISIX: Introduce a Storage abstraction

Background:

Some plugins require storing data. For example, limit-count needs to keep track of originators of requests to limit how many requests the same client can send.

The plugin provides several data stores: local, Redis single node, and Redis cluster.

Now, other plugins that need to store data would also need to provide such configuration. Moreover, what if users want to store the data in MongoDB, Hazelcast, or in a plain SQL database?

Tasks:

Introduce a Storage abstraction, on the same level as Upstream
Create Storage concretions for local, Redis single node, and Redis cluster
Migrate the limit-count plugin to use this abstraction
If time allows, create a new plugin that uses this abstraction
If time allows, create a new Storage implementation
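To make the idea concrete, here is a minimal sketch of what such an abstraction could look like, written in Python for brevity even though APISIX plugins are written in Lua. The `Storage` interface, its single `incr` operation and the `LimitCount` class are all hypothetical; the point is only that the plugin depends on the interface and not on a particular data store.

```python
import time
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical storage interface shared by all plugins."""

    @abstractmethod
    def incr(self, key, ttl):
        """Increment the counter for `key` (expiring after `ttl` seconds); return its value."""


class LocalStorage(Storage):
    """In-memory implementation; Redis variants would implement the same method."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (count, expires_at)

    def incr(self, key, ttl):
        now = self._clock()
        count, expires_at = self._data.get(key, (0, 0.0))
        if now >= expires_at:
            # Window is over (or key is new): start a fresh window.
            count, expires_at = 0, now + ttl
        count += 1
        self._data[key] = (count, expires_at)
        return count


class LimitCount:
    """Rate limiter that only knows about the Storage interface."""

    def __init__(self, storage, count, time_window):
        self.storage = storage
        self.count = count
        self.time_window = time_window

    def allow(self, client_key):
        """Return True while the client is within its quota for the current window."""
        return self.storage.incr(client_key, self.time_window) <= self.count
```

With this split, supporting MongoDB or a SQL database would mean adding another `Storage` subclass, without touching the plugin.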

Who is a Potential Mentor: Bozhong Yu, email: imbozhong@gmail.com and https://github.com/zaunist

Difficulty: Normal
Project size: ~350 hours.

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Bobur Umurzokov, mail: bumurzokov (at) apache.org

Project Devs, mail: dev (at) apisix.apache.org

Apache APISIX: Java Plugin Runner Improvement

Background:

At the moment, the Java plugin runner requires developers to start from an existing template project and change it according to their needs.

Task:

Improve developer experience on the existing Java plugin runner so that we can attract and increase the number of users from the Java community.

Limitations:

The architecture doesn’t manage multiple plugins. All need to be set in the same project
The standard Java unit of deployment is the JAR.
The plugin doesn’t allow for other widespread JVM-based languages (e.g., Scala, Kotlin, Clojure, Groovy). Though it would be technically feasible, we would need to change the template’s language

Requirements:

The new plugin runner:

MUST use the JAR as the unit of deployment
MUST NOT require the usage of a project template
MAY require the plugin to follow a certain class hierarchy (i.e., extends JavaPlugin)
MAY use a more specific format to enforce a structure
MUST allow multiple plugins to be deployed

MUST use an isolated classloader for each plugin
MUST allow any JVM-compatible bytecode to run, whatever the language it was generated from
MAY allow hot reloading of Java plugins
MAY require a single JAR per plugin (to ease the classpath management of shared libraries)
MUST define a minimum JVM version

Difficulty: Normal
Project size: ~350 hours.

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Bobur Umurzokov, mail: bumurzokov (at) apache.org

Project Devs, mail: dev (at) apisix.apache.org

Apache APISIX: Refactoring Dashboard plugin orchestration

Apache APISIX is a dynamic, real-time, high-performance API gateway.

It provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, and more.

Page: https://apisix.apache.org/

Github: https://github.com/apache/apisix

Project title: Refactoring Dashboard plugin orchestration

Background:

Apache APISIX Dashboard currently supports plugin orchestration, which supports designing the execution flow of plugins through a visual flow editor and finally generating Lua code that can be executed by Apache APISIX.

This feature currently suffers from several problems: poor usability, no automatic filling of default configuration fields, poor support for multi-stage plugins, poor usability of the generated code, etc.

Task:

Refactor the frontend and backend modules to improve the experience of using the visual editor and the quality of code generation. Code generators written in Lua need to be ported to other languages to achieve better code readability and maintainability and reduce black boxes.

Skills:

Golang
JavaScript / TypeScript
Lua

Difficulty: Hard
Project size: ~350 hours.
Potential Mentor: Zeping Bai, bzp2010@apache.org, https://github.com/bzp2010

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Zeping Bai, mail: bzp2010 (at) apache.org

Project Devs, mail: dev (at) apisix.apache.org

SkyWalking

[SkyWalking] Log outlier detection

Currently, Apache SkyWalking can collect logs from various sources such as user agents and Envoy access logs. It also provides a log analysis language to analyze the logs and produce metrics; with those metrics, users can configure rules to trigger alerts and react to abnormal or exceptional logs.

But in reality, exceptional logs in a production environment are not known in advance, and users can't enumerate all possible exceptional logs.

This task aims to add an algorithm that can identify outlier logs among massive volumes of logs and draw the users' attention to check whether there is an error in the system.

The algorithm should be able to learn from both historical logs and streaming logs, and adjust itself to increase accuracy.
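One simple baseline, shown here only to illustrate the shape of the problem and not as the algorithm the project should adopt, is to reduce each log line to a template by masking variable tokens, count how often each template occurs, and flag lines whose template is rare. The class name, thresholds and masking rule below are assumptions of this sketch.

```python
import re
from collections import Counter

# Mask numbers, dotted numbers (e.g. IP addresses) and hex literals.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_template(line):
    """Replace variable-looking tokens so similar lines share one template."""
    return _VARIABLE.sub("<*>", line.strip())


class RareTemplateDetector:
    """Flags log lines whose template has rarely been seen so far."""

    def __init__(self, min_seen=100, max_ratio=0.01):
        self.min_seen = min_seen    # warm-up: never flag before this many lines
        self.max_ratio = max_ratio  # templates below this share are outliers
        self.counts = Counter()
        self.total = 0

    def learn(self, line):
        """Update the template statistics (usable for historical logs)."""
        self.counts[to_template(line)] += 1
        self.total += 1

    def is_outlier(self, line):
        """Score a line against what has been learned so far."""
        if self.total < self.min_seen:
            return False
        return self.counts[to_template(line)] / self.total < self.max_ratio

    def observe(self, line):
        """Streaming entry point: score the line first, then learn from it."""
        outlier = self.is_outlier(line)
        self.learn(line)
        return outlier
```

Historical logs can be replayed through `learn`, while live traffic goes through `observe`, so the same statistics serve both sources. A production-grade approach would need smarter template extraction and some form of decay so that old patterns fade out.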

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Zhenxu Ke, mail: kezhenxu94 (at) apache.org

Project Devs, mail: dev (at) skywalking.apache.org

Apache SkyWalking: Add the webapp of banyandb

BanyanDB, as an observability database, aims to ingest, analyze and store Metrics, Tracing, and Logging data. It's designed to handle observability data generated by Apache SkyWalking.

We need a web-based application to:

Query the data from BanyanDB's data nodes
Monitor the performance of the backend
Render the topology of server nodes

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Hongtao Gao, mail: hanahmily (at) apache.org

Project Devs, mail: dev (at) skywalking.apache.org

ShardingSphere

Apache ShardingSphere: Develop an external tool to convert YAML configuration into DistSQL scripts

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

Since version 5.0.0, ShardingSphere has provided its own management language, DistSQL, which makes it much easier for users to manage distributed databases.
Many users now want to convert their legacy YAML configuration to DistSQL, and we want to design a tool to help them (for ShardingSphere-Proxy only).

More details:
https://shardingsphere.apache.org/document/current/en/concepts/distsql/

Task

Design and implement a command line tool that allows the user to enter a path to a YAML configuration file and output a DistSQL script file.
This means that executing the generated DistSQL script should produce a configuration equivalent to the one described by the YAML file.

We have provided a DistSQL statement for exporting schema configuration, which is related to this issue and may help you understand it.

The tool should convert both datasources and rule configuration in YAML to corresponding DistSQL RDL

The tool needs to run independently, but it can depend on the jar package of ShardingSphere.
When the tool starts, it is best to prompt the currently applicable ShardingSphere version.
It is best to use the Java language, so that the jar package provided by ShardingSphere can be reused
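The core of the conversion can be pictured as a mapping from parsed YAML entries to RDL statements. The sketch below is in Python purely for brevity (the task recommends Java) and handles data sources only. It assumes the YAML has already been parsed into a dict with `url`, `username` and `password` keys, and it assumes an `ADD RESOURCE name (URL=..., USER=..., PASSWORD=...)` statement shape; the exact RDL syntax differs between ShardingSphere versions and must be checked against the DistSQL documentation for the targeted release.

```python
def datasources_to_distsql(data_sources):
    """Render one resource statement per entry of a parsed `dataSources` mapping.

    ASSUMPTION: the statement shape below is illustrative; verify it against
    the DistSQL RDL syntax of the ShardingSphere version being targeted.
    """
    statements = []
    for name, props in data_sources.items():
        statements.append(
            'ADD RESOURCE {name} (URL="{url}", USER="{user}", PASSWORD="{password}");'.format(
                name=name,
                url=props["url"],
                user=props["username"],
                password=props.get("password", ""),
            )
        )
    return "\n".join(statements)
```

Rule configuration (sharding, read/write splitting and so on) would need one such renderer per rule type, which is where reusing the ShardingSphere jar for parsing becomes valuable.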

Notice:

There is currently no suitable module in the ShardingSphere repository for standalone tools, so a new module needs to be added.

Relevant Skills

1. Proficiency in the Java language
2. Understand the schema configurations of ShardingSphere-Proxy
3. Understand DistSQL RDL

Mentor

Longtao Jiang, Committer of Apache ShardingSphere, jianglongtao@apache.org
Chengxiang Lan, Committer of Apache ShardingSphere, lanchengxiang@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Longtao Jiang, mail: jianglongtao (at) apache.org

Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere: Solve unsupported Postgres sql about statements that start with 'c' for ShardingSphere Parser

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere parser engine helps users parse a SQL to get the AST (Abstract Syntax Tree) and visit this tree to get SQLStatement (Java Object). At present, this parser engine can handle SQLs for `MySQL`, `PostgreSQL`, `SQLServer`, `openGauss` and `Oracle`, which means we have to understand different database dialect SQLs.

More details:
https://shardingsphere.apache.org/document/current/en/reference/sharding/parse/

Task

This issue is to add support for the following PostgreSQL statements, which are listed as unsupported in this file:

CALL
CHECKPOINT
CLOSE
CLUSTER
COMMENT
COPY
CREATE ACCESS METHOD
CREATE AGGREGATE
CREATE CAST
CREATE COLLATION
CREATE EVENT TRIGGER
CREATE FOREIGN DATA WRAPPER
CREATE FOREIGN TABLE
CREATE GROUP
CREATE MATERIALIZED VIEW
CREATE OPERATOR
CREATE POLICY
CREATE PUBLICATION

You can learn more here.
You may need to find out why a statement is not supported (is it the antlr4 grammar, or a missing visit method?). You can use antlr4 plugins to help you analyze it, and you may need to consult the official documentation to check the grammar.

After you fix it, remember to add a corresponding SQL case in SQL Cases and the expected parsed result in Expected Statement XML.

Run SQLParserParameterizedTest and UnsupportedSQLParserParameterizedTest to make sure there are no exceptions.

Note: these issues can serve as good examples.
support alter foreign table for pg/og
support alter materialized view for pg/og.

Relevant Skills

1. Proficiency in the Java language
2. Have a basic understanding of ANTLR g4 files
3. Be familiar with PostgreSQL SQL statements

Target files

1. Postgres SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-postgresql/src/main/antlr4/org/apache/shardingsphere/sql/parser/autogen/PostgreSQLStatement.g4

Mentor

Zhengqiang Duan, Committer of Apache ShardingSphere, duanzhengqiang@apache.org
Haoran Meng, PMC of Apache ShardingSphere, menghaoran@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Zhengqiang Duan, mail: duanzhengqiang (at) apache.org

Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere: Solve unsupported Postgres sql about alter statement for ShardingSphere Parser

Apache ShardingSphere
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere parser engine helps users parse a SQL to get the AST (Abstract Syntax Tree) and visit this tree to get SQLStatement (Java Object). At present, this parser engine can handle SQLs for `MySQL`, `PostgreSQL`, `SQLServer`, `openGauss` and `Oracle`, which means we have to understand different database dialect SQLs.
More details:
https://shardingsphere.apache.org/document/current/en/reference/sharding/parse/

Task

This issue is to add support for the following PostgreSQL ALTER statements, which are listed as unsupported in this file:

ALTER OPERATOR
ALTER POLICY
ALTER PUBLICATION
ALTER ROUTINE
ALTER RULE
ALTER SCHEMA
ALTER SEQUENCE
ALTER SERVER
ALTER STATISTICS
ALTER SUBSCRIPTION
ALTER TABLE
ALTER TEXT SEARCH
ALTER TRIGGER
ALTER TYPE
ALTER VIEW

You can learn more here.
You may need to find out why a statement is not supported (is it the antlr4 grammar, or a missing visit method?). You can use antlr4 plugins to help you analyze it, and you may need to consult the official documentation to check the grammar.

After you fix it, remember to add a corresponding SQL case in SQL Cases and the expected parsed result in Expected Statement XML.

Run SQLParserParameterizedTest and UnsupportedSQLParserParameterizedTest to make sure there are no exceptions.

Note: these issues can serve as good examples.
support alter foreign table for pg/og
support alter materialized view for pg/og.

Relevant Skills

1. Proficiency in the Java language
2. Have a basic understanding of ANTLR g4 files
3. Be familiar with PostgreSQL SQL statements

Target files

1. Postgres SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-postgresql/src/main/antlr4/org/apache/shardingsphere/sql/parser/autogen/PostgreSQLStatement.g4

Mentor

Trista Pan, PMC of Apache ShardingSphere, https://tristazero.github.io

Zhengqiang Duan, Committer of Apache ShardingSphere, https://github.com/strongduanmu

Difficulty: Major

Project size: ~175 hour (medium)

Potential mentors:

Juan Pan, mail: panjuan (at) apache.org

Project Devs, mail: dev (at) shardingsphere.apache.org

ShenYu

Apache ShenYu: add logging-elasticsearch plugin

Apache ShenYu (incubating)

A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream framework systems, supports hot plugging, lets users customize development, meets users' current and future needs in a variety of scenarios, and has been proven in large-scale scenarios.

Website: https://shenyu.apache.org
GitHub: https://github.com/apache/incubator-shenyu
Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2896

Description

Add a logging-elasticsearch plugin that uses Elasticsearch to store ShenYu's logs.
Take the ShenYu gateway log information, write it to Elasticsearch and display it.
A module can be added like this:

shenyu-plugin
------ shenyu-plugin-logging-elasticsearch

Task

Add the shenyu-plugin-logging-elasticsearch module and implement writing logs to Elasticsearch
Complete the unit tests for this module
Complete the integration for this module
Complete the documentation for this module on the ShenYu website

Recommended Skills

Familiar with Java and Reactor
Know how to use the ShenYu plugin ecosystem
Know how to use the Elasticsearch Java client
Have some knowledge about Docker

Mentor

XiaoYu, PPMC of Apache ShenYu, https://github.com/yu199195, xiaoyu@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Xiao Yu, mail: xiaoyu (at) apache.org

Project Devs, mail: dev (at) shenyu.apache.org

Apache ShenYu: Improve integration test and deployment methods

Apache ShenYu (incubating)

A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream framework systems, supports hot plugging, lets users customize development, meets users' current and future needs in a variety of scenarios, and has been proven in large-scale scenarios.

Website: https://shenyu.apache.org

GitHub: https://github.com/apache/incubator-shenyu

Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2890

Background

ShenYu does not yet support Helm deployment, so we need to write charts for it and then complete the integration test.
ShenYu already has a relatively complete integration testing framework, but some plugins have not been tested, and some tests are incomplete.

Task

Write a Helm chart for Apache ShenYu
Complete the integration test of deploying Apache ShenYu with Helm in Kubernetes
Write documentation for Helm deployment
Complete the integration test of the OAuth2 plugin
Improve the integration tests of other existing plugins

Recommended Skills

Familiar with Java

Know the usage of spring-framework

Have some knowledge about Kubernetes and Docker

Mentor

Kunshuai Zhu, Committer of Apache ShenYu, https://github.com/JooKS-me, jooks@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Kunshuai Zhu, mail: jooks (at) apache.org

Project Devs, mail: dev (at) shenyu.apache.org

Apache ShenYu: add logging-kafka plugin

Apache ShenYu (incubating)

A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream framework systems, supports hot plugging, lets users customize development, meets users' current and future needs in a variety of scenarios, and has been proven in large-scale scenarios.

Website: https://shenyu.apache.org
GitHub: https://github.com/apache/incubator-shenyu
Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2917

Description

Add a logging-kafka plugin that uses Kafka to store ShenYu's logs.
Take the ShenYu gateway log information, write it to Kafka and display it.
A module can be added like this:
shenyu-plugin
shenyu-plugin-logging-kafka

Task

Add the shenyu-plugin-logging-kafka module and implement writing logs to Kafka
Complete the unit tests for this module
Complete the integration for this module
Complete the documentation for this module on the ShenYu website

Recommended Skills

Familiar with Java
Know how to use the ShenYu plugin ecosystem
Know how to use the Kafka Java client
Have some knowledge about Docker

Mentor

Zhang Yonglun, PPMC of Apache ShenYu, https://github.com/tuohai666, zhangyonglun@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Yonglun Zhang, mail: zhangyonglun (at) apache.org

Project Devs, mail: dev (at) shenyu.apache.org

TrafficControl

GSOC: Varnish Cache support in Apache Traffic Control

Background
Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

There are multiple aspects to this project:

Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.

Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.

Testing: Add automated tests for the new code
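For the configuration generation aspect, the output is ultimately VCL text rendered from data that Traffic Ops already knows about. The sketch below, in Python for brevity (the project itself calls for Go), renders a minimal VCL 4.1 file with backend definitions; the function names and the choice of a `(name, host, port)` tuple as input are assumptions of this example, not the Traffic Control data model.

```python
def render_backend(name, host, port):
    """Render a single VCL backend definition."""
    return (
        f"backend {name} {{\n"
        f'    .host = "{host}";\n'
        f'    .port = "{port}";\n'
        "}\n"
    )


def render_vcl(backends):
    """Render a minimal VCL file declaring the given (name, host, port) backends."""
    blocks = [render_backend(name, host, port) for name, host, port in backends]
    # Every VCL file starts with a version marker.
    return "vcl 4.1;\n\n" + "\n".join(blocks)


if __name__ == "__main__":
    print(render_vcl([("origin1", "192.0.2.10", 8080)]))
```

A real generator would also have to emit routing and caching logic (for example in `vcl_recv` and `vcl_backend_response`) derived from delivery service settings, which is where most of the design work lies.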

Skills:

Proficiency in Go is required
A basic knowledge of HTTP and caching is preferred, but not required for this project.

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Eric Friedrich, mail: friede (at) apache.org

Project Devs, mail: dev (at) trafficcontrol.apache.org

RocketMQ

GSOC : Support connect to Doris in Apache RocketMQ Streams

Apache RocketMQ™ is a unified messaging engine and lightweight data processing platform.

Apache RocketMQ

Apache RocketMQ Streams is a lightweight streaming project for RocketMQ, which can be deployed separately or in cluster mode.
It supports various types of data input and output: the source supports RocketMQ, while the sink supports databases, RocketMQ, etc.

Apache RocketMQ Streams

Apache Doris is an MPP-based interactive SQL data warehouse for reporting and analysis. Its original name was Palo, and it was developed at Baidu. After being donated to the Apache Software Foundation, it was renamed Doris.

Apache Doris

Doris provides high-concurrency, low-latency point query performance, as well as high-throughput ad-hoc analytical queries.
Doris provides batch data loading and real-time mini-batch data loading.
Doris provides high availability, reliability, fault tolerance, and scalability.

The main advantages of Doris are the simplicity (of developing, deploying and using) and meeting many data serving requirements in a single system. For details, refer to Overview.

The Apache Doris Sink in RocketMQ allows moving data from RocketMQ to Doris. It writes data from topics in RocketMQ to tables in Doris.

So, in this project, you need to implement a sink based on the RocketMQ Streams API, which will be executed on the RocketMQ Streams runtime.

You should learn before applying for this topic

Mentor

tigerlee@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Li Wei, mail: tigerlee (at) apache.org

Project Devs, mail: dev (at) rocketmq.apache.org

GSOC : Support connect to Clickhouse in Apache RocketMQ Connect

Apache RocketMQ™ is a unified messaging engine and lightweight data processing platform.

Apache RocketMQ

Apache RocketMQ Streams is a lightweight streaming project for RocketMQ, which can be deployed separately or in cluster mode.
It supports various types of data input and output: the source supports RocketMQ, while the sink supports databases, RocketMQ, etc.

Apache RocketMQ Streams

ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).

ClickHouse

True Column-Oriented Database Management System
Data Compression
Disk Storage of Data
Parallel Processing on Multiple Cores
Distributed Processing on Multiple Servers
SQL Support
Vector Computation Engine
Real-time Data Updates
Primary Index
Secondary Indexes
Suitable for Online Queries
Support for Approximated Calculations
Adaptive Join Algorithm
Data Replication and Data Integrity Support
Role-Based Access Control
Features that Can Be Considered Disadvantages

The ClickHouse Sink in RocketMQ allows moving data from RocketMQ to ClickHouse. It writes data from topics in RocketMQ to tables in ClickHouse.

So, in this project, you need to implement a sink based on the RocketMQ Streams API, which will be executed on the RocketMQ Streams runtime.

You should learn before applying for this topic

Mentor

tigerlee@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Li Wei, mail: tigerlee (at) apache.org

Project Devs, mail: dev (at) rocketmq.apache.org

Community Development

Apache EventMesh: Support Knative as Eventing Infra

Apache EventMesh (incubating)

EventMesh is a dynamic event-driven application runtime used to decouple the application and backend middleware layer, which supports a wide range of use cases that encompass complex multi-cloud, widely distributed topologies using diverse technology stacks.

Website: https://eventmesh.apache.org

GitHub: https://github.com/apache/incubator-eventmesh

Linked GitHub Issue: https://github.com/apache/incubator-eventmesh/issues/790

Background

EventMesh already supports the CloudEvents protocol, and we need to use it to integrate with Knative.

Task

Get to know the CloudEvents spec
Run Knative and get familiar with the Knative communication protocol
Implement the Knative-Connector module and deliver events to Knative via EventMesh
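As background for the first task, the sketch below builds a CloudEvents 1.0 event and maps it to the HTTP binary content mode, where context attributes travel as `ce-` prefixed headers and the payload as the request body. It is written in Python for brevity (the connector itself would be Java); the helper names and the example source and type values are hypothetical, and the HTTP POST to a Knative sink is intentionally left out.

```python
import json
import uuid


def make_event(source, event_type, data):
    """Build a CloudEvents 1.0 event with the four required context attributes."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "datacontenttype": "application/json",
        "data": data,
    }


def to_binary_http(event):
    """Return (headers, body) for the CloudEvents HTTP binary content mode."""
    headers = {
        f"ce-{name}": str(value)
        for name, value in event.items()
        if name not in ("data", "datacontenttype")
    }
    # In binary mode, datacontenttype is carried by the Content-Type header.
    headers["content-type"] = event.get("datacontenttype", "application/json")
    body = json.dumps(event["data"]).encode("utf-8")
    return headers, body
```

A connector would then send these headers and body in a POST request to the address of the target Knative service or broker.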

Recommended Skills

Familiar with Java

Know the principles of CloudEvents and Knative

Have some knowledge about Kubernetes and Docker

Mentor

Easonc Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Xue Weiming, mail: mikexue (at) apache.org

Project Devs, mail:

Apache IoTDB: integration with gRPC

Background:

Apache IoTDB uses Thrift as its RPC layer. However, some voices in the community are asking: do we need to support gRPC?

We noticed:

Thrift has to allocate memory for each RPC call (get data from the network into a byte array, and then convert the bytes to objects), and it is hard to control the overall memory cost for large RPCs.
Thrift connections may break when there are too many concurrent connections.
Thrift does not support stream mode

So, we'd like to know whether gRPC is better.

Tasks:

Implement IoTDB's RPC layer using gRPC.
- including the sync/async mode
- sub-tasks: the C++, C#, and Python API wrappers are also desired.
Run a performance test
- throughput, memory cost, jitter, etc.
Write a report to compare them

References:

iotdb's current thrift RPC specification:

https://github.com/apache/iotdb/tree/master/thrift
there are some on-going thrift apis: thrift-datanode, thrift-confignode, thrift-cluster, thrift-sync

Difficulty: Major

Project size: ~175 hour (medium)

Potential mentors:

Xiangdong Huang, mail: hxd (at) apache.org

Project Devs, mail:

Apache EventMesh: EventMesh supports dashboard

Apache EventMesh (incubating)

EventMesh is a dynamic event-driven application runtime used to decouple the application and backend middleware layer, which supports a wide range of use cases that encompass complex multi-cloud, widely distributed topologies using diverse technology stacks.

Website: https://eventmesh.apache.org

GitHub: https://github.com/apache/incubator-eventmesh

Linked GitHub Issue: https://github.com/apache/incubator-eventmesh/issues/700

Background

Currently, there is no console page for EventMesh. We hope the community can contribute a visual control page based on EventMesh.

Task

Get familiar with the EventMesh
Support and implement more interfaces under the admin controller module

Recommended Skills

Familiar with Java, HTML and CSS; a framework such as Vue.js may also be needed

Know the RESTful API specifications

Know the basics of HTTP communication

Mentor

Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Xue Weiming, mail: mikexue (at) apache.org

Project Devs, mail:

DolphinScheduler

GSOC: Support etcd as registry

Apache DolphinScheduler

Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces, dedicated to solving complex job dependencies in the data pipeline and providing various types of jobs available out of box.

Website: https://dolphinscheduler.apache.org/en-us/index.html

GitHub: https://github.com/apache/dolphinscheduler

Linked GitHub Issue: https://github.com/apache/dolphinscheduler/issues/8975

Background

Right now, we use zookeeper as registry, and we also use zookeeper to store some metadata of master and worker.

We have already implemented the registry plug-in architecture, it's needed to support Etcd as a new registry plugin choose. This can help user who only familiar with Etcd to use DolphinScheduler.

Task

This task aims to support etcd as a registry.

Recommended Skills

Familiar with Java
Know how to use Etcd

Mentors

Wenjun Ruan, wenjun@apache.org

ShunFeng Cai, caishunfeng@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Wenjun Ruan, mail: wenjun (at) apache.org

Project Devs, mail: dev (at) dolphinscheduler.apache.org

GSoC: Python API CLI enhancement

About pydolphinscheduler

PyDolphinScheduler is the Python API for Apache DolphinScheduler, which allows you to define your workflow in Python code, aka workflow-as-code. You can find more detail about PyDolphinScheduler in its documentation [4]. All the source code is kept as a submodule in the DolphinScheduler main codebase [5].

The Goal

Make pydolphinscheduler's CLI more powerful, so that it can operate on the DolphinScheduler model, run pydolphinscheduler code, and visualize its DAG graph in the terminal.

Detail

Up to now, the Apache DolphinScheduler Python API has a CLI with only a limited set of commands, and our community wishes it to become a more powerful tool that supports as many commands as possible (unless a command has security issues).

It only supports `version` and `config` for now; see [1] for more detail.

Basically, we think the following commands would be helpful for the CLI, and you could add other commands if they are needed (but make sure to discuss them in the community first):

`run <DAG name> [--example]`: Run a local workflow DAG file or a built-in example
`users`: User operations, CRUD
`projects`: Project operations, CRUD, grant to other users
`tenants`: Tenant operations, CRUD
`workflow`: Workflow operations, CRUD; a name change should also change the local Python file name
`visualize`: Show the task graph in the terminal.
etc...

Besides the functional additions, we should also consider the output of the CLI, to make it clearer and more appealing. We may consider using the following (we should also look for other interesting packages):

rich [2]: for highlighting our output, possibly through an existing rich plugin like `click-rich`
tabulate [3]: for rendering tables in the terminal
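
The command list above could be prototyped roughly as follows. This is only a sketch under stated assumptions: it uses the standard-library `argparse` instead of click so that it runs without extra packages, the program name, `EXAMPLE_DAG`, `render_dag`, and `main` are hypothetical names, and the `run` command only parses its arguments instead of talking to DolphinScheduler.

```python
import argparse
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical example workflow: task name -> set of upstream task names.
EXAMPLE_DAG = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def render_dag(dag):
    """Return one text line per task, listed in dependency order."""
    lines = []
    for task in TopologicalSorter(dag).static_order():
        upstream = ", ".join(sorted(dag[task])) or "-"
        lines.append(f"{task} <- {upstream}")
    return "\n".join(lines)


def build_parser():
    parser = argparse.ArgumentParser(prog="pydolphinscheduler-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("visualize", help="show the example task graph")
    run = sub.add_parser("run", help="run a local workflow DAG file")
    run.add_argument("dag_name")
    run.add_argument("--example", action="store_true")
    return parser


def main(argv=None):
    """Parse argv and return the text the command would print."""
    args = build_parser().parse_args(argv)
    if args.command == "visualize":
        return render_dag(EXAMPLE_DAG)
    # "run" is only parsed here; submitting a workflow is out of scope.
    source = "built-in example" if args.example else "local file"
    return f"would run {args.dag_name} ({source})"
```

Calling `print(main(["visualize"]))` prints each task with its upstream tasks; a real implementation would probably render this with rich or tabulate instead of plain text.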

What Can You Learn

We hope everyone joining GSoC learns something from the project. When you finish this project, you will have learned:

How to write production-level Python code and docs: you can improve your Python syntax and learn how to write tests with `pytest` and `tox`, how to write documentation with `sphinx` and its related plugins, and how to format and lint your Python code
More about task scheduling systems: what they are, what they focus on, and how they run

If You Are Interested in It

If you want to take this ticket, you should have:

(Must) Python skills, especially with packages such as click and pytest.
A little knowledge of task scheduling systems.
(Optional) Basic Java knowledge, which helps because the Apache DolphinScheduler core is written in Java and you may need to add some functional code to it.

Mentors

Calvin Kirs: Committer of Apache {DolphinScheduler, SeaTunnel, Wayang}, DolphinScheduler PMC and SeaTunnel PPMC
Jiajie Zhong: Committer of Apache {Airflow, DolphinScheduler, SeaTunnel}, SeaTunnel PPMC

[1]: https://dolphinscheduler.apache.org/python/cli.html

[2]: https://github.com/Textualize/rich

[3]: https://github.com/astanin/python-tabulate

[4]: https://dolphinscheduler.apache.org/python/index.html

[5]: https://github.com/apache/dolphinscheduler/tree/dev/dolphinscheduler-python/pydolphinscheduler

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jiajie Zhong, mail: zhongjiajie (at) apache.org

Project Devs, mail: dev (at) dolphinscheduler.apache.org

Commons Statistics

GSoC 2022

Placeholder for tasks that could be undertaken in this year's GSoC.

Ideas:

Design an updated summary statistics API for use with Java 8 streams based on the summary statistic implementations in the Commons Math stat.descriptive package including moments, rank and summary sub-packages.
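
As a rough illustration of what a stream-friendly summary statistic needs, here is a minimal sketch written in Python purely for brevity (the project itself is Java). It assumes a running-moments accumulator that can be merged, so partial results computed on separate chunks of a stream can be combined the way a parallel collector would combine them. The names `Moments`, `accept`, `combine`, and `summarize` are hypothetical and are not part of Commons Math or Commons Statistics.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Moments:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def accept(self, x):
        """Add one value (Welford's online update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def combine(self, other):
        """Merge another accumulator into this one (pairwise update)."""
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        return self

    @property
    def variance(self):
        """Sample variance; NaN when fewer than two values were seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarize(values):
    """Fold an iterable of numbers into a single Moments accumulator."""
    return reduce(Moments.accept, values, Moments())
```

For example, `summarize([1, 2, 3, 4]).combine(summarize([5, 6, 7, 8]))` yields the same mean and variance as summarizing all eight values in one pass.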

Difficulty: Minor

Project size: ~350 hour (large)

Potential mentors:

Alex Herbert, mail: aherbert (at) apache.org

Project Devs, mail:

Commons Numbers

GSoC 2022

Placeholder for tasks that could be undertaken in this year's GSoC.

Ideas:

Update the support for complex numbers in the complex package to allow operations to be performed on lists of complex numbers. This requires abstracting the representation of multiple complex numbers into a list structure storing real and imaginary parts that can be efficiently iterated to apply all the operations supported by the Complex class. Operations should modify the numbers in place allowing efficient, zero allocation complex number math to be performed on large datasets.
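
The idea of storing real and imaginary parts separately and updating them in place can be sketched as below. This is an illustration in Python rather than Java, and `ComplexList`, `multiply_in_place`, and `conj_in_place` are hypothetical names, not the Commons Numbers API. Python still allocates temporary float objects inside the loops, so the sketch shows the data layout and the in-place semantics, not true zero-allocation behaviour.

```python
from array import array


class ComplexList:
    """Complex numbers stored as two parallel arrays of doubles."""

    def __init__(self, values):
        values = [complex(v) for v in values]
        self.re = array("d", (v.real for v in values))
        self.im = array("d", (v.imag for v in values))

    def __len__(self):
        return len(self.re)

    def __getitem__(self, i):
        return complex(self.re[i], self.im[i])

    def multiply_in_place(self, factor):
        """Multiply every element by `factor`, overwriting the stored parts."""
        factor = complex(factor)
        fr, fi = factor.real, factor.imag
        re, im = self.re, self.im
        for i in range(len(re)):
            a, b = re[i], im[i]
            re[i] = a * fr - b * fi
            im[i] = a * fi + b * fr
        return self

    def conj_in_place(self):
        """Replace every element by its complex conjugate."""
        im = self.im
        for i in range(len(im)):
            im[i] = -im[i]
        return self
```

Returning `self` from each operation allows calls to be chained, e.g. `numbers.multiply_in_place(2j).conj_in_place()`.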

Difficulty: Minor

Project size: ~350 hour (large)

Potential mentors:

Alex Herbert, mail: aherbert (at) apache.org

Project Devs, mail: dev (at) commons.apache.org

Commons Math

GSoC 2022

Placeholder for tasks that could be undertaken in this year's GSoC.

Ideas (extracted from the "dev" ML):

Redesign and modularize the "ml" package
-> main goal: enable multi-thread usage.
Abstract the linear algebra utilities
-> main goal: allow switching to alternative implementations.
Redesign and modularize the "random" package
-> main goal: general support of low-discrepancy sequences.
Refactor and modularize the "special" package
-> main goals: ensure accuracy and performance and better API, add other functions.
Upgrade the test suite to JUnit 5
-> additional goal: collect a list of "odd" expectations.

Other suggestions welcome, as well as

delineating additional and/or intermediate goals,
signalling potential pitfalls and/or alternative approaches to the intended goal(s).

Difficulty: Minor

Project size: ~350 hour (large)

Potential mentors:

Gilles Sadowski, mail: erans (at) apache.org

Project Devs, mail: dev (at) commons.apache.org

Commons Geometry

GSoC 2022

Placeholder for tasks that could be undertaken in this year's GSoC.

Ideas:

Examine and potentially redesign the API and algorithms in the commons-geometry-enclosing module. The goal here is to make consistent use of the newer geometry API and ensure that the algorithms are sound.
Examine and potentially redesign the API and algorithms in the commons-geometry-hull module. The goal here is to make consistent use of the newer geometry API and ensure that the algorithms are sound (see GEOMETRY-144).
Design and implement a parser/writer for the PLY file format in the commons-geometry-io-euclidean module.
Design an API for advanced 3D mesh data structures (e.g. halfedge meshes) and operations (e.g. surface subdivision, smoothing, etc). This may end up being another module, e.g. commons-geometry-mesh.
Create a series of user guides and/or tutorials demonstrating best-practice use of the library.
other ideas ... ?

Difficulty: Minor

Project size: ~350 hour (large)

Potential mentors:

Matt Juntunen, mail: mattjuntunen (at) apache.org

Project Devs, mail:

CloudStack

GSoC Idea 2022 - Bypass Secondary Storage (Direct Download) on VMware &/or XenServer

Background

The default way of registering / downloading templates in CloudStack involves caching them on the secondary store and then during VM deployment, the template is copied to the primary store. However, from ACS version 4.11.1 onward, a feature was added for KVM hypervisor to enable direct download to primary store. This massively reduces the usage of secondary store and also quickens the entire VM deployment process, as there is no need to copy the template from secondary to primary store.

Requirement

We would like to propose an idea to extend this feature of direct download of templates onto primary store to other hypervisors, namely VMware and XenServer. This would greatly benefit end-users by letting them use the secondary storage efficiently and save overall VM deployment time on the respective hypervisors.

Relevant Skills:

Java
MySQL
Vue.js

Difficulty:

175 hours (Only VMware)
350 hours (VMware & XenServer)

Potential Mentors:

Abhishek Kumar (abhishek.mrt22@gmail.com)
Pearl Dsilva (pearl1594@gmail.com)

References

https://www.shapeblue.com/how-to-deploy-templates-without-using-secondary-storage-on-kvm/
https://www.shapeblue.com/cloudstack-feature-first-look-direct-download-agnostic-of-the-storage-provider/
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Bypass+Secondary+Storage+%28Direct+Download%29+on+KVM
https://www.youtube.com/watch?v=SwepUTfGiKc

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Pearl Dsilva, mail: pearl11594 (at) apache.org

Project Devs, mail: dev (at) cloudstack.apache.org

GSoC 2022: CloudStack Terraform Provider - Add datasources for the existing resources

Background

Terraform is an Infrastructure as Code (IaC) tool that provides a consistent CLI workflow to manage resources in many cloud services. The CloudStack Terraform provider integrates with CloudStack to aid in managing and automating the deployment of resources in CloudStack. We have recently made the first release of the CloudStack Terraform Provider, v0.4.0: https://github.com/apache/cloudstack-terraform-provider

Requirement

Terraform defines a datasource as, "something that allows Terraform to use the information defined outside of Terraform, defined by another separate Terraform configuration, or modified by functions". Most resources offer data sources alongside their set of resource types. However, the CloudStack Terraform Provider currently has only one datasource, for templates. Hence, we propose an idea for students to get involved in enhancing the CloudStack Terraform Provider by adding support for datasources.

If the students are enjoying the project, the scope can be extended to support adding further resources in Terraform such that the CloudStack Terraform Provider may become a de-facto tool for automating CloudStack deployments.

The current set of resources the CloudStack Terraform provider supports is listed at:
https://registry.terraform.io/providers/cloudstack/cloudstack/latest/docs , whereas its counterpart Ansible boasts a more evolved list of resources ~~https://docs.ansible.com/ansible/latest/collections/ngine_io/cloudstack/index.html~~, mainly zones, clusters, accounts, domains, etc. It would be great if interested students want to go a step further and help add support for these too.

Relevant Skills:

GoLang (basic)

Difficulty:

Medium

Potential Mentors:

Harikrishna Patnala
Pearl Dsilva

Example and references

https://registry.terraform.io/providers/cloudstack/cloudstack/latest/docs : check Resources and Data Sources section under CloudStack Provider
Depends on CloudStack Go SDK - https://github.com/apache/cloudstack-go

Github issue: https://github.com/apache/cloudstack/issues/6016

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Harikrishna Patnala, mail: harikrishna (at) apache.org

Project Devs, mail: dev (at) cloudstack.apache.org

GSoC 2022 Idea: CloudStack Edge Zones

Background

Over recent years, Edge computing has been gaining popularity as it defines a model that brings compute and storage closer to where they are consumed by the end-user. By being closer to the end-user, a better experience can be provided, with reduced overall latency, lower bandwidth requirements, lower TCO, and a more flexible hardware/software model, while also ensuring security and reliability. To align ACS with this evolving cloud computing model, we would like to propose the idea of supporting Edge Zones in CloudStack, which can also be seen as lightweight zones with minimal resources.

Requirement

Today, when a Zone is set up in CloudStack, it comes up by default with a secondary storage VM (SSVM) and a console proxy VM (CPVM). As part of this project, we would need to define a new zone type and decide on the workflow changes required to ensure that a CPVM and SSVM aren't spawned by default. Basic characteristics of an Edge zone include:

no need for Secondary Storage
no Secondary Storage VM
no Console Proxy VM
Local storage only, as an edge device typically comprises a single compute node (host)
Support for L2 and Isolated networks

A high-level view of an edge zone would look something like: [figure: high-level view of an edge zone]

Relevant Skills:

Java
MySQL
Vue.js (Basic)

Difficulty:

Medium

Project Duration:

175 hours

Potential Mentors:

Alex Mattioli
Nicolas Vazquez
Pearl Dsilva

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Pearl Dsilva, mail: pearl11594 (at) apache.org

Project Devs, mail: dev (at) cloudstack.apache.org

View Logs in the UI

As of now, when an admin encounters an issue or error in CloudStack, the most information they can immediately get is the API failure response, which provides a reason for the failure. At times this might not be sufficient to diagnose the error, and the admin has to investigate the CloudStack logs. That requires the admin or the sysadmin to log into the VM running CloudStack and either view or export the logs, and then dive into identifying the issue. This idea aims to eliminate that step.

The goal of this is to provide admins the ability to view the logs directly in the UI. This would make diagnosing failures and RCAs much quicker.

Provide the ability to display the logs in the UI

Add API / WebSocket (and UI) support to:

View the logs
Live follow the logs (similar to 'tail -f')
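
One possible shape for the "live follow" part is an offset-based read that a polling API or a WebSocket handler could call repeatedly. The sketch below is only an illustration in Python: `read_new_lines` and `follow` are hypothetical helpers, not CloudStack code, and log rotation and access control are deliberately ignored.

```python
import os
import time


def read_new_lines(path, offset=0, max_bytes=65536):
    """Return (complete_lines, next_offset) for data found after `offset`."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read(max_bytes)
    # Only hand out complete lines; a partial last line is re-read next time.
    end = chunk.rfind(b"\n") + 1
    if end == 0 and len(chunk) == max_bytes:
        end = len(chunk)  # a single over-long line: emit it in pieces
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def follow(path, poll_interval=0.5, should_stop=lambda: False):
    """Yield lines appended to `path` from now on, similar to 'tail -f'."""
    offset = os.path.getsize(path)
    while not should_stop():
        lines, offset = read_new_lines(path, offset)
        yield from lines
        if not lines:
            time.sleep(poll_interval)
```

Keeping the offset on the caller's side means the server does not have to hold per-client state between requests, which suits a plain API; a WebSocket handler could instead loop over `follow` and push each line.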

Duration

175 hours

Potential Mentors

David Jumani

References

https://github.com/apache/cloudstack/issues/6011

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

David Jumani, mail: davidjumani (at) apache.org

Project Devs, mail: dev (at) cloudstack.apache.org

Add the ability to Safely Shutdown / restart CloudStack

Shutting down / restarting CloudStack is a necessary step in upgrades, system maintenance, etc. As of now, there is no way to safely shut down or restart CloudStack; it is terminated directly via systemd. As a result, any asynchronous job or background task is abruptly terminated and can fail. CloudStack currently maintains a list of asynchronous jobs within its database, along with their status.

This idea aims to provide a way to safely shutdown CloudStack. It involves two parts :

Prevent new asynchronous jobs from being added to CloudStack when a safe shutdown is triggered
Check the status of the async jobs and Shut down CloudStack when all the jobs have been completed
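
The two parts above amount to a "drain" pattern, which can be sketched as follows. This is an illustration in Python with threads standing in for async jobs; `AsyncJobManager`, `submit`, and `safe_shutdown` are hypothetical names and do not correspond to CloudStack classes, which track job status in the database.

```python
import threading


class AsyncJobManager:
    def __init__(self):
        self._cond = threading.Condition()
        self._running = 0
        self._draining = False

    def submit(self, job):
        """Start `job` in a thread unless a safe shutdown has been triggered."""
        with self._cond:
            if self._draining:
                raise RuntimeError("shutdown in progress; job rejected")
            self._running += 1
        thread = threading.Thread(target=self._run, args=(job,))
        thread.start()
        return thread

    def _run(self, job):
        try:
            job()
        finally:
            with self._cond:
                self._running -= 1
                self._cond.notify_all()

    def safe_shutdown(self, timeout=None):
        """Reject new jobs, then wait for running ones.

        Returns True once no job is running, False if `timeout` expired first.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._running == 0, timeout)
```

A forced shutdown, listed as optional below, would correspond to calling `safe_shutdown` with a timeout and terminating anyway when it returns `False`.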

Provide the ability to safely shutdown CloudStack

Add API (and/or UI) support to :

Trigger a safe shutdown
(Optional) Support restarts
(Optional) Support a forced shutdown when CloudStack will quit even if there are async jobs running

Duration

Some Experience : 175 hours
Newbie : 350 hours

Potential Mentors

David Jumani

References

https://github.com/apache/cloudstack/issues/6021

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

David Jumani, mail: davidjumani (at) apache.org

Project Devs, mail: dev (at) cloudstack.apache.org

CloudStack Terraform Provider - Add support for Kubernetes Clusters

As of now the CloudStack Terraform Provider does not support managing CKS clusters

This proposal aims to add support to the CloudStack Terraform Provider to manage CKS clusters

This would involve supporting the following actions on CKS clusters :

Create
Stop / Start
Scale
Upgrade
Delete

[Optional]
Support the following actions on the binary ISOs :

Register
Enable / Disable
Delete

Duration

175 hours

Potential Mentors

Harikrishna Patnala
David Jumani

References

https://github.com/apache/cloudstack/issues/6040

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

David Jumani, mail: davidjumani (at) apache.org

Project Devs, mail: dev (at) cloudstack.apache.org

Cassandra

Produce and verify BoundedReadCompactionStrategy as a unified general purpose compaction algorithm

The existing compaction strategies have a number of drawbacks that make all three unsuitable as a general-purpose compaction strategy. For example, STCS creates giant files that are hard to back up, hurt read performance and the page cache, and led to many of the early re-open bugs. LCS improved dramatically on this but has issues of its own, e.g. the lack of a performant full compaction, and problems caused by strict leveling, such as bulk loading when writes exceed the rate at which the L0 to L1 promotion can happen.

In this talk I proposed a novel compaction strategy that aims to expose a single tunable that the user can control for the read amplification. Raise the min_threshold_levels and you trade off read/space performance for write performance. Since then, a proof-of-concept patch has been published along with some rudimentary documentation, but this is still not tracked in Jira.

The remaining work here is

1. Validate that the algorithm is correct via test suites, performance testing, stress testing, and benchmarking with OSS tools (e.g. cassandra-stress, tlp-stress, or ndbench). When issues are found (there likely will be issues, as the patch is a PoC), devise how to adjust the algorithm and implementation appropriately. The key metric of success is that we can run Cassandra stably for more than 24 hours while applying sustained load, with minimal compaction load (and with compaction keeping up).

2. Do more in depth experiments measuring performance across a wide range of workloads (e.g. write heavy, read heavy, balanced, time series, register update, etc ...) and in comparison with LCS (leveled), STCS (size tiered), and TWCS (time window). Key metrics of success are establishing that as we tune max_read_per_read we should get more predictable read latency under low system load (ρ < 30%) while not degrading at high system load (ρ > 70%), and we should match LCS performance under low load while doing better at high load.

Once this is validated a Cassandra blog post reporting on the findings (positive or negative) may be advisable.

Difficulty: Normal

Project size: ~350 hour (large)

Potential mentors:

, mail: (at) apache.org

Project Devs, mail: dev (at) cassandra.apache.org

Add support for EXPLAIN statements

We should provide users with a way to understand how their query will be executed, along with some information on the amount of work that will be performed.
EXPLAIN statements are the most common way to do that.
A CEP draft has been opened for this: (DRAFT) CEP-4: Explain. The draft proposes adding support for EXPLAIN and EXPLAIN ANALYZE, but I believe that we should split the work into 2 parts, because a simple EXPLAIN would already provide relevant information.

To complete this work I believe that the following steps will be required:

Rework and submit the CEP
Add missing statistics
Implement the logic behind the EXPLAIN statements

Difficulty:

Project size: ~350 hour (large)

Potential mentors:

, mail: (at) apache.org

Project Devs, mail: dev (at) cassandra.apache.org

Beam

Runner Comparison / Capability Matrix revamp

Discussion: https://lists.apache.org/thread.html/8aff7d70c254356f2dae3109fb605e0b60763602225a877d3dadf8b7@%3Cdev.beam.apache.org%3E

Summarizing that discussion, we have a lot of issues/wishes. Some can be addressed as one-off and some need a unified reorganization of the runner comparison.

Basic corrections:

Remove rows that are impossible not to support (ParDo)
Remove rows where "support" doesn't really make sense (Composite transforms)
Deduplicate rows that are actually the same model feature (all non-merging windowing / all merging windowing)
Clearly separate rows that represent optimizations (Combine)
Correct rows in the wrong place (Timers are actually a "what...?" row)
Separate or remove rows for features that have not been designed ([Meta]Data driven triggers, retractions)
Rename rows with names that appear nowhere else (Timestamp control, which is called a TimestampCombiner in Java)
Switch to a more distinct color scheme for full/partial support (currently just solid/faded colors)
Switch to something clearer than "~" for partial support, versus ✘ and ✓ for none and full.
Correct Gearpump support for merging windows (see ~~BEAM-2759~~)
Correct Spark support for non-merging and merging windows (see BEAM-2499)

Minor rewrites:

Lump all the basic stuff (ParDo, GroupByKey, Read, Window) into one row
Make sections as users see them, like "ParDo" / "side Inputs" not "What?" / "side inputs"
Add rows for non-model things, like portability framework support, metrics backends, etc

Bigger rewrites:

Add versioning to the comparison, as in BEAM-166
Find a way to fit in a plain English summary of runner's support in Beam. It should come first, as it is what new users need before getting to details.
Find a way to describe production readiness of runners and/or testimonials of who is using it in production.
Have a place to compare non-model differences between runners

Changes requiring engineering efforts:

Gather and add quantitative runner metrics, perhaps Nexmark results for mid-level, smaller benchmarks for measuring aspects of specific features, and larger end-to-end benchmarks to get an idea how it might actually perform on a use case
Tighter coupling of the matrix portion of the comparison with tags on ValidatesRunner tests

If you care to address some aspect of this, please reach out and/or just file a subtask and address it.

Difficulty: P3

Project size: ~350 hour (large)

Potential mentors:

Kenneth Knowles, mail: kenn (at) apache.org

Project Devs, mail: dev (at) beam.apache.org

A Complex Event Processing (CEP) library/extension for Apache Beam

Apache Beam [1] is a unified and portable programming model for data processing jobs. The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

Complex Event Processing [5] lets you match patterns of events in streams to detect important patterns in data and react to them.

Some examples of uses of CEP are fraud detection for example by detecting unusual behavior (patterns of activity), e.g. network intrusion, suspicious banking transactions, etc. Also trend detection is another interesting use case in the context of sensors and IoT.

The goal of this issue is to implement an efficient pattern matching library inspired by [6] and existing libraries like Apache Flink CEP [7] using the Apache Beam Java SDK and the Beam style guides [8]. Because of the time constraints of GSoC we will probably try to cover first simple patterns of the ‘a followed by b followed by c’ kind, and then if there is still time try to cover more advanced ones e.g. optional, atLeastOne, oneOrMore, etc.
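
To make the 'a followed by b followed by c' kind of pattern concrete, here is a deliberately small sketch in plain Python. It is not Beam or Flink CEP code; `match_sequence` is a hypothetical helper that keeps a single partial match and skips events that do not advance it, whereas a real library would track many partial matches (typically with an NFA), per key and with time windows.

```python
def match_sequence(events, predicates):
    """Return the lists of events that satisfy `predicates` in order.

    Events that do not match the next expected step are skipped, and a new
    match can only start after the previous one has completed.
    """
    if not predicates:
        return []
    matches, partial = [], []
    for event in events:
        if predicates[len(partial)](event):
            partial.append(event)
            if len(partial) == len(predicates):
                matches.append(partial)
                partial = []
    return matches


def is_type(name):
    """Build a predicate for events shaped like (type, id) tuples."""
    return lambda event: event[0] == name
```

For a stream of `(type, id)` tuples, `match_sequence(stream, [is_type("a"), is_type("b"), is_type("c")])` returns one list per completed a, b, c sequence.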

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
[5] https://en.wikipedia.org/wiki/Complex_event_processing
[6] https://people.cs.umass.edu/~yanlei/publications/sase-sigmod08.pdf
[7] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/cep.html
[8] https://beam.apache.org/contribute/ptransform-style-guide/

Difficulty: P3

Project size: ~350 hour (large)

Potential mentors:

Ismaël Mejía, mail: iemejia (at) apache.org

Project Devs, mail: dev (at) beam.apache.org

A Beam runner for Ray

Ray (https://ray.io) is a framework to develop distributed applications. There is a push to develop several libraries to support various forms of AI/ML analytics with Ray. There is an opportunity to develop a Beam runner for Ray.

https://docs.google.com/document/u/1/d/1vt78s48Q0aBhaUCHrVrTUsProJSP8-EBqDDRGTPEr0Y/edit

Difficulty: P2

Project size: ~350 hour (large)

Potential mentors:

Pablo Estrada, mail: pabloem (at) apache.org

Project Devs, mail: dev (at) beam.apache.org

Run code in examples in Beam's Pydoc

We have the Beam Pydoc set up, and some functions have examples written into their documentation; however, we do not run the examples that we express in Pydoc.

This work item consists in improving the Pydoc for Apache Beam to run examples, adding some examples, and reformatting any existing examples / existing Pydoc that needs to be better expressed for Beam.
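
One common way to run examples embedded in docstrings is Python's standard-library `doctest` module; whether Beam would adopt it is an open question for the project, so the following is only a sketch. `word_lengths` and `run_docstring_examples` are hypothetical names used for illustration.

```python
import doctest


def word_lengths(words):
    """Map each word to its length.

    >>> word_lengths(["beam", "runner"])
    {'beam': 4, 'runner': 6}
    """
    return {word: len(word) for word in words}


def run_docstring_examples(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    failed = attempted = 0
    # Pass the globals explicitly so the examples can call `func` by name.
    globs = {func.__name__: func}
    for test in finder.find(func, name=func.__name__, globs=globs):
        result = runner.run(test)
        failed += result.failed
        attempted += result.attempted
    return failed, attempted
```

A test suite could call `run_docstring_examples` for every documented function and fail the build when the first number is non-zero.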

Difficulty: P2

Project size: ~175 hour (medium)

Potential mentors:

Pablo Estrada, mail: pabloem (at) apache.org

Project Devs, mail: dev (at) beam.apache.org

A generic Beam IO Sink for Java

It would be desirable to develop a Beam Sink that supports all of the 'best practices' like throttling, auto-sharding, exactly-once capabilities, etc.

A design proposal is here: https://docs.google.com/document/d/1UIWv6wnD86GYAkeqbVWCG3mx4dTZ9WstUUThPWQmcFM/edit#heading=h.smc16ifdre2

A prototype for the API and parts of implementation is here: https://github.com/apache/beam/pull/16763

Contact Pablo Estrada on dev@beam.apache.org if you have questions, or comment here.

Difficulty: P2

Project size: ~350 hour (large)

Potential mentors:

Pablo Estrada, mail: pabloem (at) apache.org

Project Devs, mail: dev (at) beam.apache.org

Apache Nemo

Efficient Dynamic Reconfiguration in Stream Processing

Stream processing offers many methods for this, from primitive checkpoint-and-replay to more sophisticated reconfiguration and reinitiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Wonook, mail: wonook (at) apache.org

Project Devs, mail: dev (at) nemo.apache.org

Application structure-aware caching on Nemo

Nemo has a policy layer that allows powerful optimization with configurable runtime modules. In terms of caching, it is possible to identify frequently used data and decide to cache them in-memory ahead of execution, without user annotation.

Implementation would need:

On the policy layer, build a compile-time pass that identifies reused data and marks it as cached
On the runtime, design and implement a caching mechanism that manages per-executor cached data and discards it when it is no longer used.
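
The two steps above can be sketched on a toy DAG as follows. This is plain Python, not Nemo's IR or pass API: it assumes "reused" means "read by more than one downstream vertex", and `consumer_counts`, `mark_cached`, and `ExecutorCache` are hypothetical names.

```python
from collections import Counter


def consumer_counts(edges):
    """Count how many downstream vertices read each vertex's output."""
    return Counter(src for src, _dst in edges)


def mark_cached(edges):
    """Compile-time step: vertices whose output is consumed more than once."""
    return {v for v, count in consumer_counts(edges).items() if count > 1}


class ExecutorCache:
    """Runtime step: keep reused outputs until their last consumer has read them."""

    def __init__(self, counts):
        self._remaining = {v: c for v, c in counts.items() if c > 1}
        self._data = {}

    def put(self, vertex, value):
        if vertex in self._remaining:  # only outputs marked as reused
            self._data[vertex] = value

    def consume(self, vertex):
        value = self._data[vertex]
        self._remaining[vertex] -= 1
        if self._remaining[vertex] == 0:  # last consumer: discard
            del self._data[vertex]
            del self._remaining[vertex]
        return value

    def cached(self):
        return set(self._data)
```

Counting consumers at compile time is what lets the runtime discard data without user annotation: the cache knows exactly how many reads to expect.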

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jeongyoon Eo, mail: jeongyoon (at) apache.org

Project Devs, mail: dev (at) nemo.apache.org

Implement spill mechanism on Nemo

Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) or GC issues when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in-memory.

We need to spill in-memory data to secondary storage when there is not enough memory in the executor.
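
A minimal sketch of the spilling idea, in Python rather than Nemo's Java runtime: blocks are kept serialized in memory up to a byte budget, and the least recently used ones are written to files once the budget is exceeded. `SpillableStore` and its methods are hypothetical names, block ids are assumed to be safe file names, and a real implementation would react to actual memory pressure rather than a fixed budget.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillableStore:
    def __init__(self, memory_budget_bytes, spill_dir=None):
        self.budget = memory_budget_bytes
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="spill-")
        self._memory = OrderedDict()  # block_id -> serialized bytes, LRU order
        self._used = 0
        self._on_disk = {}  # block_id -> file path

    def put(self, block_id, value):
        old = self._memory.pop(block_id, None)
        if old is not None:
            self._used -= len(old)
        data = pickle.dumps(value)
        self._memory[block_id] = data
        self._used += len(data)
        while self._used > self.budget and self._memory:
            self._spill_oldest()

    def _spill_oldest(self):
        block_id, data = self._memory.popitem(last=False)
        path = os.path.join(self.spill_dir, f"{block_id}.bin")
        with open(path, "wb") as handle:
            handle.write(data)
        self._on_disk[block_id] = path
        self._used -= len(data)

    def get(self, block_id):
        if block_id in self._memory:
            self._memory.move_to_end(block_id)  # mark as recently used
            return pickle.loads(self._memory[block_id])
        with open(self._on_disk[block_id], "rb") as handle:
            return pickle.loads(handle.read())

    def spilled(self):
        return set(self._on_disk)
```

Reads stay transparent to the caller: `get` returns the same value whether the block is still in memory or has been spilled.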

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jeongyoon Eo, mail: jeongyoon (at) apache.org

Project Devs, mail: dev (at) nemo.apache.org

Efficient Caching and Spilling on Nemo

In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs them.

Identify and persist frequently used data, and unpersist it when it is no longer used
Spill in-memory data to disk upon memory pressure

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jeongyoon Eo, mail: jeongyoon (at) apache.org

Project Devs, mail: dev (at) nemo.apache.org

Apache Dubbo

GSoC2022: Metrics and Observability

Please read the Observability proposal here first to know about the ultimate goal behind this issue.

If you are interested in this project and the objective described in the proposal, please leave comments on the corresponding GitHub issue below so we can further exchange information on the tasks that need to be done.

https://github.com/apache/dubbo/issues/9886

Difficulty: Major

Project size: ~175 hour (medium)

Potential mentors:

Jun Liu, mail: chicken (at) apache.org

Project Devs, mail:

GSoC2022: Task demo demonstrating the usage of Dubbo3

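Goal
First, gain a high-level, usage-oriented understanding of Dubbo and the concepts of microservice governance. On that basis, design a series of demo applications and, based on them, a series of microservice governance Tasks. Each Task covers one or more of Dubbo's service governance capabilities and guides users step by step through a thoroughly described use case, helping them learn what Dubbo can do.

Please discuss the details at https://github.com/apache/dubbo/issues/9887.

Task description
Dubbo has a rich set of governance rules, such as service discovery, load balancing, and routing policies (tag routing, condition routing), but these rules are somewhat difficult to use, and it is hard for users to see intuitively which scenarios they apply to. Dubbo therefore hopes to have scenario-based use cases that showcase its governance capabilities and help users carry the governance rules over into real business scenarios.

This is a relatively challenging task. The difficulty lies not in the coding itself but in the need for a fairly comprehensive grasp of Dubbo and the microservice ecosystem as a whole. Completing it successfully will greatly broaden the participant's overall perspective. Participants can work together with their mentors to complete it.

Reference:
The bookinfo application in Istio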

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jun Liu, mail: chicken (at) apache.org

Project Devs, mail:

GSoC2022: Sidecar Mesh support

Please read the detailed proposal of Dubbo Sidecar Mesh or Thin SDK here first to know about the ultimate goal behind this issue.

The details of this project will be posted on the following GitHub issue; please follow the discussion there.

https://github.com/apache/dubbo/issues/9885

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jun Liu, mail: chicken (at) apache.org

Project Devs, mail:

GSoC2022: Proxyless Mesh support

Please read the detailed proposal of Dubbo Proxyless Mesh here first to know about the ultimate goal behind this issue.

The details of this project will be posted on the following GitHub issue; please follow the discussion there.

https://github.com/apache/dubbo/issues/9884

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jun Liu, mail: chicken (at) apache.org

Project Devs, mail:

GSoC2022: Metrics and Observability for Dubbo-go

Description

Please read the Observability proposal here first to know about the ultimate goal behind this issue.

If you are interested in this project and the objective described in the proposal, please leave comments on the corresponding GitHub issue below so we can further exchange information on the tasks that need to be done.

https://github.com/apache/dubbo-go/issues/1807

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Zhixin Li, mail: laurence (at) apache.org

Project Devs, mail:

GSoC2022: Sidecar Mesh support for Dubbo-go

Please read the detailed proposal of Dubbo Sidecar Mesh or Thin SDK here first to know about the ultimate goal behind this issue.

The details of this project will be posted on the following GitHub issue; please follow the discussion there.

https://github.com/apache/dubbo-go/issues/1809

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Zhixin Li, mail: laurence (at) apache.org

Project Devs, mail:

GSoC2022: Proxyless Mesh support for Dubbo-go

Please read the detailed proposal of Dubbo Proxyless Mesh here first to know about the ultimate goal behind this issue.

The details of this project will be posted on the following GitHub issue; please follow the discussion there.

https://github.com/apache/dubbo-go/issues/1808

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Zhixin Li, mail: laurence (at) apache.org

Project Devs, mail:

Apache AsterixDB

Interactive Hyracks Job Viewer

We will use the ngx-graph library, similarly to the interactive query plan viewer (~~ASTERIXDB-2863~~), to display an interactive query plan that supports DAGs.

Features:

Colored nodes (by operator)
Zoom out to fit whole plan
Zoom and drag through the plan
Traverse the nodes or jump to nodes in a Depth First Search (DFS) fashion
Detail number of locations for execution
Detailed mode (contains more information per node)
- Search using string match
Clear all selections and reset the interactive plan

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Preston Carman, mail: prestonc (at) apache.org

Project Devs, mail:

Airavata

Provide meta scheduling capabilities within Airavata

As discussed on the architecture mailing list [1] and summarized at [2], Airavata will need to develop a metascheduler. In the short term, a user request (demeler, gobert) is to have Airavata throttle jobs to resources. In the future, more informed scheduling strategies need to be integrated. Hopefully, the actual scheduling algorithms can be borrowed from third-party implementations.

[1] - http://markmail.org/message/tdae5y3togyq4duv
[2] - https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
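
As a rough illustration of the short-term request (throttling jobs to resources), the sketch below caps the number of concurrently running jobs per resource and queues the rest in arrival order. It is written in Python for brevity and is not Airavata code: the class name, method names, and resource names are all hypothetical, and a real metascheduler would also need persistence, failure handling, and per-user policies.

```python
from collections import defaultdict, deque


class ResourceThrottle:
    """Hypothetical per-resource limit on concurrently running jobs."""

    def __init__(self, limits):
        self.limits = dict(limits)          # resource name -> max running jobs
        self.running = defaultdict(int)     # resource name -> running job count
        self.waiting = defaultdict(deque)   # resource name -> queued job ids

    def submit(self, job_id, resource):
        """Return True if the job may start now; otherwise queue it."""
        # Assumption: a resource without a configured limit accepts no jobs.
        if self.running[resource] < self.limits.get(resource, 0):
            self.running[resource] += 1
            return True
        self.waiting[resource].append(job_id)
        return False

    def complete(self, resource):
        """Free one slot; return the next queued job id to start, or None."""
        self.running[resource] -= 1
        if self.waiting[resource]:
            self.running[resource] += 1
            return self.waiting[resource].popleft()
        return None


throttle = ResourceThrottle({"cluster-a": 2})
started = [throttle.submit(job, "cluster-a") for job in ("j1", "j2", "j3")]
next_job = throttle.complete("cluster-a")
print(started, next_job)
```

More informed strategies would replace the first-in, first-out queue with a pluggable policy, which is where third-party scheduling algorithms could be slotted in.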

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Suresh Marru, mail: smarru (at) apache.org

Project Devs, mail: dev (at) airavata.apache.org

Apache Ratis

Support linearizable read from followers

Apache Ratis is a highly customizable Raft protocol implementation in Java. Raft is an easily understandable consensus algorithm for managing replicated state. Apache Ratis can be used in any Java application where state should be replicated between multiple instances.

The Raft algorithm not only allows linearizable reads through Read Index or Lease Read, but also allows linearizable reads on follower nodes, which can increase read throughput linearly with the number of nodes. For the algorithm details, see section 6.4 of the Raft thesis.

In our survey, sofa-jraft, etcd, tikv-rs and other well-known consensus libraries already support linearizable follower reads. As the only consensus algorithm library under the Apache Foundation, Ratis should support this feature as well.
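
The Read Index idea behind follower reads can be sketched in a few lines. This is a simplified, single-process illustration in Python, not the Ratis API: the class and method names are hypothetical, and the leadership confirmation step (a round of heartbeats acknowledged by a majority) is assumed to have succeeded rather than implemented. The follower asks the leader for its current commit index, and serves the read from its own state machine only once it has applied entries up to that index.

```python
class StaleFollowerError(Exception):
    """Raised when the follower has not yet applied up to the read index."""


class Leader:
    def __init__(self):
        self.log = []  # committed (key, value) entries; entry i has index i + 1

    def commit(self, key, value):
        self.log.append((key, value))
        return len(self.log)

    def read_index(self):
        # Assumption: the leader has already confirmed its leadership with a
        # heartbeat quorum, so its commit index is safe to hand out.
        return len(self.log)


class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.applied_index = 0
        self.state = {}

    def replicate(self, up_to):
        """Apply committed entries from the leader's log up to index `up_to`."""
        while self.applied_index < up_to:
            key, value = self.leader.log[self.applied_index]
            self.state[key] = value
            self.applied_index += 1

    def linearizable_read(self, key):
        read_index = self.leader.read_index()
        if self.applied_index < read_index:
            # A real implementation would wait here instead of failing.
            raise StaleFollowerError(
                f"applied {self.applied_index}, need {read_index}"
            )
        return self.state.get(key)


leader = Leader()
follower = Follower(leader)
leader.commit("x", 1)
follower.replicate(1)
first = follower.linearizable_read("x")

leader.commit("x", 2)  # follower has not applied this entry yet
try:
    follower.linearizable_read("x")
    stale_detected = False
except StaleFollowerError:
    stale_detected = True

follower.replicate(2)
second = follower.linearizable_read("x")
print(first, stale_detected, second)
```

The read itself never touches the leader's state machine; the leader only supplies an index, which is why adding followers can add read capacity.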

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Tsz-Wo Sze, mail: szetszwo@gmail.com
Project Devs, mail: dev@ratis.apache.org
