A good long-term objective for the PMC is to drop RabbitMQ in
favor of Pulsar (third parties could package their own components using
RabbitMQ if they wish...)

This means:

Solve the bugs that were found during the Pulsar MailQueue review
Pulsar MailQueue needs to allow listing blobs in order to be
deduplication friendly.
Provide an event bus based on Pulsar
Provide a task manager based on Pulsar
Package a distributed server backed by Pulsar, deprecate then replace
the current one.
(optionally) support mail queue priorities

While contributions would of course be welcomed on this topic, we could
offer it as part of GSoC 2022, and we could co-mentor it with mentors of
the Pulsar community (see [3]).

[3] https://lists.apache.org/thread/y9s7f6hmh51ky30l20yx0dlz458gw259

Would such a plan gain traction around here?

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Benoit Tellier, mail: btellier (at) apache.org

Project Devs, mail: dev (at) james.apache.org

ShenYu

Apache ShenYu: add logging-elasticsearch plugin for agent

Apache ShenYu (incubating)

A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream frameworks, supports hot plugging, lets users customize development to meet current and future needs in a variety of scenarios, and has been proven in large-scale production scenarios.

Website: https://shenyu.apache.org
GitHub: https://github.com/apache/incubator-shenyu
Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2896

Description

Apache ShenYu uses a Java agent and bytecode enhancement technology to achieve seamless embedding, so that users can access third-party observability systems without introducing dependencies, and obtain Traces, Metrics and Logging.
The goal is to take the ShenYu gateway log information, write it to Elasticsearch, and display it.
A module can be added like this:

shenyu-agent
------ shenyu-agent-plugin-logging
----------------shenyu-agent-plugin-logging-elasticsearch

Task

Add the shenyu-agent-plugin-logging-elasticsearch module and implement writing the logs to Elasticsearch
Complete the unit tests for this module
Complete the integration for this module
Complete the documentation for this module on the ShenYu website

Recommended Skills

Familiar with Java
Know how to use Java agents and Byte Buddy
Know how to use the Elasticsearch Java client
Have some knowledge of Docker

Mentor

XiaoYu, PPMC of Apache ShenYu, https://github.com/yu199195, xiaoyu@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Xiao Yu, mail: xiaoyu (at) apache.org

Project Devs, mail: dev (at) shenyu.apache.org

Apache ShenYu: Improve integration test and deployment methods

Apache ShenYu (incubating)

A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream frameworks, supports hot plugging, lets users customize development to meet current and future needs in a variety of scenarios, and has been proven in large-scale production scenarios.

Website: https://shenyu.apache.org

GitHub: https://github.com/apache/incubator-shenyu

Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2890

Background

ShenYu does not yet support Helm deployment, so we need to write charts for it and then complete the integration tests.
ShenYu already has a relatively complete integration testing framework, but some plugins have not been tested, and some tests are incomplete.

Task

Write a Helm chart for Apache ShenYu
Complete the integration test of deploying Apache ShenYu with Helm in Kubernetes
Write documentation for the Helm deployment
Complete the integration test of the OAuth2 plugin
Improve the integration tests of the other existing plugins

Recommended Skills

Familiar with Java

Know how to use the Spring Framework

Have some knowledge about Kubernetes and Docker

Mentor

Kunshuai Zhu, Committer of Apache ShenYu, https://github.com/JooKS-me, jooks@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Kunshuai Zhu, mail: jooks (at) apache.org

Project Devs, mail: dev (at) shenyu.apache.org

Apache ShenYu: add logging-kafka plugin for agent

Apache ShenYu (incubating)

A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream frameworks, supports hot plugging, lets users customize development to meet current and future needs in a variety of scenarios, and has been proven in large-scale production scenarios.

Website: https://shenyu.apache.org
GitHub: https://github.com/apache/incubator-shenyu
Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2917

Description

Apache ShenYu uses a Java agent and bytecode enhancement technology to achieve seamless embedding, so that users can access third-party observability systems without introducing dependencies, and obtain Traces, Metrics and Logging.
The goal is to take the ShenYu gateway log information, write it to Kafka, and display it.
A module can be added like this:

shenyu-agent
------ shenyu-agent-plugin-logging
----------------shenyu-agent-plugin-logging-kafka

Task

Add the shenyu-agent-plugin-logging-kafka module and implement writing the logs to Kafka
Complete the unit tests for this module
Complete the integration for this module
Complete the documentation for this module on the ShenYu website

Recommended Skills

Familiar with Java
Know how to use Java agents and Byte Buddy
Know how to use the Kafka Java client
Have some knowledge of Docker

Mentor

Zhang Yonglun, PPMC of Apache ShenYu, https://github.com/tuohai666, zhangyonglun@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Yonglun Zhang, mail: zhangyonglun (at) apache.org

Project Devs, mail: dev (at) shenyu.apache.org

ShardingSphere

Apache ShardingSphere: Solve unsupported Postgres sql about alter statement for ShardingSphere Parser

Apache ShardingSphere
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The ShardingSphere parser engine helps users parse a SQL statement to get the AST (Abstract Syntax Tree) and visit this tree to get a SQLStatement (Java object). At present, this parser engine can handle SQL for `MySQL`, `PostgreSQL`, `SQLServer`, `openGauss` and `Oracle`, which means we have to understand the SQL dialects of different databases.
More details:
https://shardingsphere.apache.org/document/current/en/reference/sharding/parse/

Task

This issue is to solve the unsupported Postgres SQL about ALTER statements in this file:

ALTER OPERATOR

ALTER POLICY

ALTER PUBLICATION

ALTER ROUTINE

ALTER RULE

ALTER SCHEMA

ALTER SEQUENCE

ALTER SERVER

ALTER STATISTICS

ALTER SUBSCRIPTION

ALTER TABLE

ALTER TEXT SEARCH

ALTER TRIGGER

ALTER TYPE

ALTER VIEW

You can learn more here.
You may need to find out why a statement is not supported (is the ANTLR4 grammar missing, or is the visit method not implemented?). ANTLR4 plugins can help you analyze this, and you may need to consult the official documentation to check the grammar.

After you fix it, remember to add a new corresponding SQL case in SQL Cases and the expected parsed result in Expected Statement XML.

Run SQLParserParameterizedTest and UnsupportedSQLParserParameterizedTest to make sure there are no exceptions.

Note: these issues can serve as good examples.
support alter foreign table for pg/og
support alter materialized view for pg/og

Relevant Skills

1. Master the Java language
2. Have a basic understanding of ANTLR g4 files
3. Be familiar with Postgres SQL

Target files

1. Postgres SQL g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-postgresql/src/main/antlr4/org/apache/shardingsphere/sql/parser/autogen/PostgreSQLStatement.g4

Mentor

Trista Pan, PMC of Apache ShardingSphere, https://tristazero.github.io

Zhengqiang Duan, Committer of Apache ShardingSphere, https://github.com/strongduanmu

Difficulty: Major

Project size: ~175 hour (medium)

Potential mentors:

Juan Pan, mail: panjuan (at) apache.org

Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere: Solve unsupported Postgres sql about statements that start with 'c' for ShardingSphere Parser

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The ShardingSphere parser engine helps users parse a SQL statement to get the AST (Abstract Syntax Tree) and visit this tree to get a SQLStatement (Java object). At present, this parser engine can handle SQL for `MySQL`, `PostgreSQL`, `SQLServer`, `openGauss` and `Oracle`, which means we have to understand the SQL dialects of different databases.

More details:
https://shardingsphere.apache.org/document/current/en/reference/sharding/parse/

Task

This issue is to solve the unsupported Postgres SQL statements that start with 'c' in this file:

CALL

CHECKPOINT

CLOSE

CLUSTER

COMMENT

COPY

CREATE ACCESS METHOD

CREATE AGGREGATE

CREATE CAST

CREATE COLLATION

CREATE EVENT TRIGGER

CREATE FOREIGN DATA WRAPPER

CREATE FOREIGN TABLE

CREATE GROUP

CREATE MATERIALIZED VIEW

CREATE OPERATOR

CREATE POLICY

CREATE PUBLICATION

You can learn more here.
You may need to find out why a statement is not supported (is the ANTLR4 grammar missing, or is the visit method not implemented?). ANTLR4 plugins can help you analyze this, and you may need to consult the official documentation to check the grammar.

After you fix it, remember to add a new corresponding SQL case in SQL Cases and the expected parsed result in Expected Statement XML.

Run SQLParserParameterizedTest and UnsupportedSQLParserParameterizedTest to make sure there are no exceptions.

Note: these issues can serve as good examples.
support alter foreign table for pg/og
support alter materialized view for pg/og

Relevant Skills

1. Master the Java language
2. Have a basic understanding of ANTLR g4 files
3. Be familiar with Postgres SQL

Target files

1. Postgres SQL g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-postgresql/src/main/antlr4/org/apache/shardingsphere/sql/parser/autogen/PostgreSQLStatement.g4

Mentor

Zhengqiang Duan, Committer of Apache ShardingSphere, duanzhengqiang@apache.org
Haoran Meng, PMC of Apache ShardingSphere, menghaoran@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Zhengqiang Duan, mail: duanzhengqiang (at) apache.org

Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere: Develop an external tool to convert YAML configuration into DistSQL scripts

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

Since version 5.0.0, ShardingSphere provides its own management language, DistSQL, which makes it much easier for users to manage distributed databases.
Many users now want to convert their legacy YAML configuration to DistSQL, and we want to design a tool to help them. (For ShardingSphere-Proxy only)

More details:
https://shardingsphere.apache.org/document/current/en/concepts/distsql/

Task

Design and implement a command line tool that takes the path to a YAML configuration file as input and outputs a DistSQL script file.
This means that running the generated DistSQL script should produce a configuration equivalent to the YAML file.

We have provided a DistSQL for exporting schema configuration, which is related to this issue, to help you understand this issue.

The tool should convert both the data sources and the rule configuration in YAML to the corresponding DistSQL RDL.

The tool needs to run independently, but it can depend on the jar package of ShardingSphere.
When the tool starts, it should preferably display the ShardingSphere version it applies to.
It is best to use the Java language, so that the jar package provided by ShardingSphere can be reused.

Notice:

There is currently no suitable module in the ShardingSphere repository for standalone tools, so a new module needs to be added.
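
As a rough illustration of the conversion step described in the task, here is a minimal sketch that assumes the YAML file has already been parsed into a map of data source names to connection settings. The class, field and method names are hypothetical, and the `ADD RESOURCE` statement shape is an assumption based on the 5.1.x-era DistSQL RDL; it should be checked against the ShardingSphere version the tool targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch: renders parsed data source settings as one DistSQL statement. */
final class YamlToDistSqlSketch {

    /** Hypothetical holder for one data source entry read from the YAML file. */
    static final class DataSource {
        final String url;
        final String username;
        final String password;

        DataSource(String url, String username, String password) {
            this.url = url;
            this.username = username;
            this.password = password;
        }
    }

    /**
     * Builds a single ADD RESOURCE statement covering every data source.
     * The statement shape is an assumption; verify it for the target version.
     */
    static String toDistSql(Map<String, DataSource> dataSources) {
        if (dataSources.isEmpty()) {
            return ""; // nothing to register, so emit no statement
        }
        StringJoiner statement = new StringJoiner(",\n", "ADD RESOURCE ", ";");
        for (Map.Entry<String, DataSource> entry : dataSources.entrySet()) {
            DataSource ds = entry.getValue();
            statement.add(String.format("%s (URL=\"%s\", USER=%s, PASSWORD=%s)",
                    entry.getKey(), ds.url, ds.username, ds.password));
        }
        return statement.toString();
    }

    public static void main(String[] args) {
        // In a real tool this map would come from parsing the YAML file given on the command line.
        Map<String, DataSource> dataSources = new LinkedHashMap<>();
        dataSources.put("ds_0", new DataSource("jdbc:mysql://127.0.0.1:3306/demo_ds_0", "root", "root"));
        dataSources.put("ds_1", new DataSource("jdbc:mysql://127.0.0.1:3306/demo_ds_1", "root", "root"));
        System.out.println(toDistSql(dataSources));
    }
}
```

Rule configuration (sharding, read/write splitting and so on) would need one such renderer per rule type; the sketch covers data sources only.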

Relevant Skills

1. Master the Java language
2. Understand the schema configurations of ShardingSphere-Proxy
3. Understand DistSQL RDL

Mentor

Longtao Jiang, Committer of Apache ShardingSphere, jianglongtao@apache.org
Chengxiang Lan, Committer of Apache ShardingSphere, lanchengxiang@apache.org

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Longtao Jiang, mail: jianglongtao (at) apache.org

Project Devs, mail: dev (at) shardingsphere.apache.org

SkyWalking

Apache SkyWalking: Add the webapp of banyandb

BanyanDB, as an observability database, aims to ingest, analyze and store Metrics, Tracing, and Logging data. It's designed to handle observability data generated by Apache SkyWalking.

We need a web-based application to

Query the data from BanyanDB's data nodes
Monitor the performance of the backend
Render the topology of server nodes

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Hongtao Gao, mail: hanahmily (at) apache.org

Project Devs, mail: dev (at) skywalking.apache.org

TrafficControl

GSOC: Varnish Cache support in Apache Traffic Control

Background
Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

There are multiple aspects to this project:

Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.

Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.

Testing: Add automated tests for new code.

Skills:

Proficiency in Go is required
A basic knowledge of HTTP and caching is preferred, but not required for this project.

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Eric Friedrich, mail: friede (at) apache.org

Project Devs, mail: dev (at) trafficcontrol.apache.org

Commons Math

GSoC 2022

Placeholder for tasks that could be undertaken in this year's GSoC.

Ideas (extracted from the "dev" ML):

Redesign and modularize the "ml" package
-> main goal: enable multi-thread usage.
Abstract the linear algebra utilities
-> main goal: allow switching to alternative implementations.
Redesign and modularize the "random" package
-> main goal: general support of low-discrepancy sequences.
Refactor and modularize the "special" package
-> main goals: ensure accuracy and performance and better API,
add other functions.
Upgrade the test suite to JUnit 5
-> additional goal: collect a list of "odd" expectations.

Other suggestions welcome, as well as

delineating additional and/or intermediate goals,
signalling potential pitfalls and/or alternative approaches to the intended goal(s).

Difficulty: Minor

Project size: ~350 hour (large)

Potential mentors:

Gilles Sadowski, mail: erans (at) apache.org

Project Devs, mail: dev (at) commons.apache.org

Cassandra

Produce and verify BoundedReadCompactionStrategy as a unified general purpose compaction algorithm

The existing compaction strategies have a number of drawbacks that make all three unsuitable as a general-purpose compaction strategy. For example, STCS creates giant files that are hard to back up, interfere with read performance and the page cache, and led to many of the early re-open bugs. LCS improved dramatically on this but has issues of its own, e.g. the lack of a performant full compaction, or problems caused by strict leveling, such as during bulk loading when writes exceed the rate at which the L0 to L1 promotion can be done.

In this talk I proposed a novel compaction strategy that aims to expose a single tunable that the user can control for the read amplification. Raise the min_threshold_levels and you tradeoff read/space performance for write performance. Since then a proof of concept patch has been published along with some rudimentary documentation but this is still not tracked in Jira.

The remaining work here is

1. Validate that the algorithm is correct via test suites, performance testing, stress testing, and benchmarking with OSS tools (e.g. cassandra-stress, tlp-stress, or ndbench). When issues are found (there likely will be, as the patch is a PoC), devise how to adjust the algorithm and implementation appropriately. The key metric of success is that Cassandra runs stably for more than 24 hours under sustained load, with minimal compaction load (and with compaction keeping up).

2. Do more in-depth experiments measuring performance across a wide range of workloads (e.g. write heavy, read heavy, balanced, time series, register update, etc.) and in comparison with LCS (leveled), STCS (size tiered), and TWCS (time window). The key metrics of success are establishing that, as we tune max_read_per_read, we get more predictable read latency under low system load (ρ < 30%) while not degrading at high system load (ρ > 70%), and that we match LCS performance under low load while doing better at high load.

Once this is validated a Cassandra blog post reporting on the findings (positive or negative) may be advisable.

Difficulty: Normal

Project size: ~350 hour (large)

Potential mentors:

, mail: (at) apache.org

Project Devs, mail: dev (at) cassandra.apache.org

Beam

A generic Beam IO Sink for Java

It would be desirable to develop a Beam Sink that supports all of the 'best practices' like throttling, auto-sharding, exactly-once capabilities, etc.

A design proposal is here: https://docs.google.com/document/d/1UIWv6wnD86GYAkeqbVWCG3mx4dTZ9WstUUThPWQmcFM/edit#heading=h.smc16ifdre2

A prototype for the API and parts of implementation is here: https://github.com/apache/beam/pull/16763

Contact Pablo Estrada on dev@beam.apache.org if you have questions, or comment here.

Difficulty: P2

Project size: ~350 hour (large)

Potential mentors:

Pablo Estrada, mail: pabloem (at) apache.org

Project Devs, mail: dev (at) beam.apache.org

Apache Nemo

Efficient Dynamic Reconfiguration in Stream Processing

Stream processing offers many methods for this, from primitive checkpoint-and-replay to more sophisticated reconfiguration and reinitiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Wonook, mail: wonook (at) apache.org

Project Devs, mail: dev (at) nemo.apache.org

Application structure-aware caching on Nemo

Nemo has a policy layer that allows powerful optimization with configurable runtime modules. In terms of caching, it is possible to identify frequently used data and decide to cache them in-memory ahead of execution, without user annotation.

Implementation would need:

On the policy layer, build a compile-time pass that identifies reused data and marks it as cached
On the runtime, design and implement a caching mechanism that manages per-executor cached data and discards it when it is no longer used.
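
One possible shape for the runtime side is a reference-counted cache: the compile-time pass tells the executor how many downstream reads to expect, and the entry is discarded after the last one. The sketch below is a simplified, single-threaded illustration under that assumption; the class and method names are hypothetical and are not part of Nemo.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-executor cache that drops an entry after its last expected read. */
final class RefCountedCacheSketch<K, V> {
    private final Map<K, V> data = new HashMap<>();
    private final Map<K, Integer> remainingReads = new HashMap<>();

    /** Caches a value together with the number of reads the compile-time pass predicted. */
    void put(K key, V value, int expectedReads) {
        if (expectedReads <= 0) {
            return; // nothing will reuse this data, so do not cache it
        }
        data.put(key, value);
        remainingReads.put(key, expectedReads);
    }

    /** Returns the cached value, or null if absent; discards it once no reads remain. */
    V read(K key) {
        V value = data.get(key);
        if (value == null) {
            return null;
        }
        int left = remainingReads.merge(key, -1, Integer::sum);
        if (left <= 0) {
            data.remove(key);
            remainingReads.remove(key);
        }
        return value;
    }

    /** Number of entries currently held in memory. */
    int size() {
        return data.size();
    }

    public static void main(String[] args) {
        RefCountedCacheSketch<String, String> cache = new RefCountedCacheSketch<>();
        cache.put("partition-0", "reused data", 2);
        System.out.println(cache.read("partition-0") + ", entries left: " + cache.size());
        System.out.println(cache.read("partition-0") + ", entries left: " + cache.size());
    }
}
```

A real implementation would also need thread safety and a fallback for mispredicted read counts, which this sketch leaves out.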

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jeongyoon Eo, mail: jeongyoon (at) apache.org

Project Devs, mail: dev (at) nemo.apache.org

Implement spill mechanism on Nemo

Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) or GC when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in-memory.

We need to spill in-memory data to secondary storage when there is not enough memory in the executor.
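
To make the idea concrete, here is a minimal sketch of a block store that keeps blocks in memory up to a byte budget and writes the oldest blocks to temporary files once the budget is exceeded. It is an illustration only: the class and method names are hypothetical, block ids are assumed to be unique, and it is not based on Nemo's actual executor code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical block store that spills the oldest blocks to disk under memory pressure. */
final class SpillingBlockStoreSketch {
    private final long memoryBudgetBytes;
    private long bytesInMemory = 0;
    // LinkedHashMap iterates in insertion order, so the first entry is the oldest block.
    private final LinkedHashMap<String, byte[]> inMemory = new LinkedHashMap<>();
    private final Map<String, Path> spilled = new HashMap<>();

    SpillingBlockStoreSketch(long memoryBudgetBytes) {
        this.memoryBudgetBytes = memoryBudgetBytes;
    }

    /** Stores a block, then spills the oldest blocks until the budget is respected. */
    void put(String blockId, byte[] block) {
        if (inMemory.containsKey(blockId) || spilled.containsKey(blockId)) {
            throw new IllegalArgumentException("duplicate block id: " + blockId);
        }
        inMemory.put(blockId, block);
        bytesInMemory += block.length;
        Iterator<Map.Entry<String, byte[]>> oldestFirst = inMemory.entrySet().iterator();
        while (bytesInMemory > memoryBudgetBytes && oldestFirst.hasNext()) {
            Map.Entry<String, byte[]> entry = oldestFirst.next();
            String id = entry.getKey();
            byte[] bytes = entry.getValue();
            try {
                Path file = Files.createTempFile("spill-", ".bin");
                file.toFile().deleteOnExit();
                Files.write(file, bytes);
                spilled.put(id, file);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            oldestFirst.remove();
            bytesInMemory -= bytes.length;
        }
    }

    /** Reads a block from memory if present, otherwise from its spill file; null if unknown. */
    byte[] get(String blockId) {
        byte[] block = inMemory.get(blockId);
        if (block != null) {
            return block;
        }
        Path file = spilled.get(blockId);
        if (file == null) {
            return null;
        }
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isSpilled(String blockId) {
        return spilled.containsKey(blockId);
    }

    public static void main(String[] args) {
        SpillingBlockStoreSketch store = new SpillingBlockStoreSketch(10);
        store.put("a", new byte[6]);
        store.put("b", new byte[6]); // 12 bytes exceed the 10-byte budget, so "a" is spilled
        System.out.println("a spilled: " + store.isSpilled("a") + ", b spilled: " + store.isSpilled("b"));
    }
}
```

Spilling the oldest block first is just one policy; a production design would likely pick victims by expected reuse and track serialized sizes rather than raw array lengths.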

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jeongyoon Eo, mail: jeongyoon (at) apache.org

Project Devs, mail: dev (at) nemo.apache.org

Efficient Caching and Spilling on Nemo

In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs both.

Identify and persist frequently used data, and unpersist it when it is no longer used
Spill in-memory data to disk upon memory pressure

Difficulty: Major

Project size: ~350 hour (large)

Potential mentors:

Jeongyoon Eo, mail: jeongyoon (at) apache.org

Project Devs, mail: dev (at) nemo.apache.org
