
Code Insights for Apache StreamPipes

Apache StreamPipes

Apache StreamPipes is a self-service (Industrial) IoT toolbox that enables non-technical users to connect, analyze, and explore IoT data streams. StreamPipes offers several modules, including StreamPipes Connect to easily ingest data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines, and several visualization modules for live and historic data exploration. Under the hood, StreamPipes follows an event-driven microservice paradigm of standalone, so-called analytics microservices, making the system easy to extend for individual needs.

Background

StreamPipes has grown significantly in recent years. We were able to introduce a lot of new features and attracted both users and contributors. Putting the cherry on the cake, we graduated to an Apache top-level project in December 2022. We will of course continue developing new features and never rest in making StreamPipes even more amazing. However, as we steam at full speed toward our `1.0` release, we also want the project to become more mature. Therefore, we want to address one of our Achilles' heels: our test coverage.

Don't worry, this issue is not about implementing myriads of tests for our code base. As a first step, we would like to make the status quo transparent. That means we want to measure our code coverage consistently across the whole codebase (backend, UI, Python library) and report the coverage to Codecov. Furthermore, to benchmark ourselves and motivate ourselves to provide tests with every contribution, we would like to lock the current test coverage in as a lower threshold that we always want to meet (meaning CI builds fail if coverage drops below it). Over time, we can then raise the required coverage level step by step.

Beyond monitoring our test coverage, we also want to invest in better and cleaner code. Therefore, we would like to adopt SonarCloud for our repository.

Tasks

  • [ ] calculate test coverage for all main parts of the repo
  • [ ] send coverage to Codecov
  • [ ] determine a coverage threshold and let CI fail if coverage is below it
  • [ ] include SonarCloud in the CI setup
  • [ ] include an automatic coverage report in PR validation (see an example here) -> optional
  • [ ] include an automatic SonarCloud report in PR validation -> optional
  • [ ] whatever comes to your mind 💡 further ideas are always welcome
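If we go with Codecov, the locked threshold from the task list above can already be expressed in a `codecov.yml` at the repository root. This is only a sketch; the numbers are placeholders to be replaced by the measured status quo:

```yaml
coverage:
  status:
    project:
      default:
        target: 60%      # locked lower threshold; the status fails below this
        threshold: 0.5%  # tolerated drop relative to the base commit
```

With the failing Codecov status marked as required in the branch protection rules, CI effectively fails whenever coverage drops below the threshold.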


❗Important Note❗

Do not create any account on behalf of Apache StreamPipes on SonarCloud or CodeCov, or use the name of Apache StreamPipes for any account creation. Your mentor will take care of this.


Relevant Skills

  • basic knowledge about GitHub workflows

Learning Material


References

You can find our corresponding issue on GitHub here


Name and Contact Information

Name: Tim Bossenmaier

email:  bossenti[at]apache.org

community: dev[at]streampipes.apache.org

website: https://streampipes.apache.org/

Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Tim Bossenmaier, mail: bossenti (at) apache.org
Project Devs, mail: dev (at) streampipes.apache.org

SkyWalking

[GSOC] [SkyWalking] AIOps Log clustering with Flink (Algorithm Optimization)

Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation with a revised architecture, and this task will require the student to focus on algorithm optimization for the clustering technique.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yihao Chen, mail: yihaochen (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

[GSOC] [SkyWalking] Python Agent Performance Enhancement Plan

Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This task is about enhancing the Python agent's performance; the tracking issue can be seen here: https://github.com/apache/skywalking/issues/10408

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yihao Chen, mail: yihaochen (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

[GSOC] [SkyWalking] AIOps Log clustering with Flink (Flink Integration)

Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation with a revised architecture, and this task will require the student to focus on Flink and its integration with the SkyWalking OAP.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yihao Chen, mail: yihaochen (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

[GSOC] [SkyWalking] Self-Observability of the query subsystem in BanyanDB

Background

SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store Metrics, Tracing and Logging data.

Objectives

  1. Support EXPLAIN[1] for both measure queries and stream queries
  2. Add self-observability, including traces and metrics, for the query subsystem
  3. Support EXPLAIN in the client SDK & CLI and add query plan visualization in the UI

[1]: EXPLAIN in MySQL
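For context, MySQL's EXPLAIN simply prefixes a statement and returns the execution plan as rows; the goal here is an analogous facility for BanyanDB's measure and stream queries. A MySQL example (table and column names are illustrative only):

```SQL
EXPLAIN SELECT service_id, AVG(latency)
FROM service_metrics
GROUP BY service_id;
-- returns plan rows with columns such as id, select_type, table, type, key, rows
```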

Recommended Skills

  1. Familiar with Go
  2. Have a basic understanding of database query engine
  3. Have experience with Apache SkyWalking or other APMs

Mentor

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Jiajing Lu, mail: lujiajing (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

[GSOC] [SkyWalking] Unify query planner and executor in BanyanDB

Background

SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store Metrics, Tracing and Logging data.

Objectives

  1. Fully unify/merge the query planner and executor for Measure and TopN

Recommended Skills

  1. Familiar with Go
  2. Have a basic understanding of database query engine
  3. Have experience with Apache SkyWalking

Mentor

Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Jiajing Lu, mail: lujiajing (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

ShardingSphere

Apache ShardingSphere Enhance SQLNodeConverterEngine to support more MySQL SQL statements

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The ShardingSphere SQL federation engine provides support for complex SQL statements, and it supports cross-database join queries, subqueries, aggregation queries and other statements well. An important part of the SQL federation engine is converting the SQL statement parsed by ShardingSphere into SqlNode, so that Calcite can be used to implement SQL optimization and federated query.

Task

This issue is to solve the MySQL exception that occurs during SQLNodeConverterEngine conversion. The specific case list is as follows.

  • select_char
  • select_extract
  • select_from_dual
  • select_from_with_table
  • select_group_by_with_having_and_window
  • select_not_between_with_single_table
  • select_not_in_with_single_table
  • select_substring
  • select_trim
  • select_weight_string
  • select_where_with_bit_expr_with_ampersand
  • select_where_with_bit_expr_with_caret
  • select_where_with_bit_expr_with_div
  • select_where_with_bit_expr_with_minus_interval
  • select_where_with_bit_expr_with_mod
  • select_where_with_bit_expr_with_mod_sign
  • select_where_with_bit_expr_with_plus_interval
  • select_where_with_bit_expr_with_signed_left_shift
  • select_where_with_bit_expr_with_signed_right_shift
  • select_where_with_bit_expr_with_vertical_bar
  • select_where_with_boolean_primary_with_comparison_subquery
  • select_where_with_boolean_primary_with_is
  • select_where_with_boolean_primary_with_is_not
  • select_where_with_boolean_primary_with_null_safe
  • select_where_with_expr_with_and_sign
  • select_where_with_expr_with_is
  • select_where_with_expr_with_is_not
  • select_where_with_expr_with_not
  • select_where_with_expr_with_not_sign
  • select_where_with_expr_with_or_sign
  • select_where_with_expr_with_xor
  • select_where_with_predicate_with_in_subquery
  • select_where_with_predicate_with_regexp
  • select_where_with_predicate_with_sounds_like
  • select_where_with_simple_expr_with_collate
  • select_where_with_simple_expr_with_match
  • select_where_with_simple_expr_with_not
  • select_where_with_simple_expr_with_odbc_escape_syntax
  • select_where_with_simple_expr_with_row
  • select_where_with_simple_expr_with_tilde
  • select_where_with_simple_expr_with_variable
  • select_window_function
  • select_with_assignment_operator
  • select_with_assignment_operator_and_keyword
  • select_with_case_expression
  • select_with_collate_with_marker
  • select_with_date_format_function
  • select_with_exists_sub_query_with_project
  • select_with_function_name
  • select_with_json_value_return_type
  • select_with_match_against
  • select_with_regexp
  • select_with_schema_name_in_column_projection
  • select_with_schema_name_in_shorthand_projection
  • select_with_spatial_function
  • select_with_trim_expr
  • select_with_trim_expr_from_expr

You need to compare the differences between the actual and expected SqlNode, and then correct the logic in the SQLNodeConverterEngine so that the actual output is consistent with the expected output.

After you make changes, remember to add the case to SUPPORTED_SQL_CASE_IDS to ensure it is tested.

Notice, this issue can be a good example:
https://github.com/apache/shardingsphere/pull/14492

Relevant Skills

1. Master the Java language

2. Have a basic understanding of Antlr g4 files

3. Be familiar with MySQL and Calcite SqlNode

Target files

SQLNodeConverterEngineIT

https://github.com/apache/shardingsphere/blob/master/test/it/optimizer/src/test/java/org/apache/shardingsphere/test/it/optimize/SQLNodeConverterEngineIT.java 

Mentor

Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhengqiang Duan, mail: duanzhengqiang (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

[GSOC][SkyWalking] Add Terraform provider for Apache SkyWalking

Currently the deployment methods for SkyWalking are limited: we only have a Helm Chart for users to deploy on Kubernetes, so users that are not on Kubernetes have to do all the housekeeping themselves to set up SkyWalking on, for example, VMs.

This issue aims to add a Terraform provider, so that users can conveniently spin up a cluster for demonstration or testing. We should evolve the provider, allow users to customize it to their needs, and finally enable users to use it in their production environments.

In this task, we will mainly focus on support for AWS. In the Terraform provider, users provide their access key / secret key, and the provider does the rest: create VMs, create the database (OpenSearch or RDS), download the SkyWalking tarballs, configure SkyWalking, start the SkyWalking components (OAP/UI), create public IPs / domain names, etc.
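As a rough sketch of what using such a provider could look like (purely hypothetical: the provider, resource, and attribute names below do not exist yet and are only meant to illustrate the idea):

```hcl
provider "skywalking" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region     = "us-east-1"
}

resource "skywalking_cluster" "demo" {
  oap_replicas = 2
  ui_replicas  = 1
  storage      = "opensearch" # or an RDS-backed database
}
```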

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

[SkyWalking] Build the OAP into GraalVM native image

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

[GSOC] [SkyWalking] Add Overview page in BanyanDB UI

Background

SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store Metrics, Tracing and Logging data.

The BanyanDB UI is a web interface provided by the BanyanDB server. It's developed with Vue3 and Vite3.

Objectives

The UI should have a user-friendly Overview page.
The Overview page must display a list of nodes running in a cluster.
For each node in the list, the following information must be shown:

  • Node ID or name
  • Uptime
  • CPU usage (percentage)
  • Memory usage (percentage)
  • Disk usage (percentage)
  • Ports (gRPC and HTTP)

The web app must automatically refresh the node data at a configurable interval to show the most recent information.

Recommended Skills

  1. Familiar with Vue and Vite
  2. Have a basic understanding of RESTful APIs
  3. Have experience with Apache SkyWalking
Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Hongtao Gao, mail: hanahmily (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

ShardingSphere

Apache ShardingSphere Enhance ComputeNode reconciliation

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere 

Background

There is a proposal about the new CRDs Cluster and ComputeNode, as below:

Currently we are trying to promote ComputeNode as the major CRD to represent a special ShardingSphere Proxy deployment, and plan to use Cluster to indicate a special ShardingSphere Proxy cluster.

Task

This issue is to enhance ComputeNode reconciliation availability. The specific case list is as follows.

  •  Add IT test case for Deployment spec volume
  •  Add IT test case for Deployment spec template init containers
  •  Add IT test case for Deployment spec template spec containers
  •  Add IT test case for Deployment spec volume mounts
  •  Add IT test case for Deployment spec container ports
  •  Add IT test case for Deployment spec container image tag
  •  Add IT test case for Service spec ports
  •  Add IT test case for ConfigMap data serverconfig
  •  Add IT test case for ConfigMap data logback
     
    Notice, these issues can be a good example.
  • chore: add more Ginkgo tests for ComputeNode #203

Relevant Skills

  1. Master Go language, Ginkgo test framework
  2. Have a basic understanding of Apache ShardingSphere Concepts
  3. Be familiar with Kubernetes Operator, kubebuilder framework

Targets files

ComputeNode IT -


https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/reconcile/computenode/compute_node_test.go

Mentor

Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Chuxin Chen, mail: tuichenchuxin (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Add the feature of switching logging framework

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere provides two adapters: ShardingSphere-JDBC and ShardingSphere-Proxy.

Now, ShardingSphere uses logback for logging, but consider the following situations:

  • Users may need to switch the logging framework to meet special needs; for example, log4j2 can provide better asynchronous performance;
  • When using the JDBC adapter, the user application may not use logback, which may cause conflicts.


Why doesn't the log facade suffice? Because ShardingSphere provides users with clustered logging configurations (such as changing the log level online), which requires dynamic construction of loggers; this cannot be achieved with only a log facade.

Task

1. Design and implement a logging SPI to support multiple logging frameworks (such as logback and log4j2)
2. Allow users to choose which logging framework to use through the logging rule

Relevant Skills

1. Master the Java language

2. Basic knowledge of logback and log4j2

3. Maven

Mentor

Longtao Jiang, Committer of Apache ShardingSphere, jianglongtao@apache.org

Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Longtao Jiang, mail: jianglongtao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Support mainstream database metadata table query

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere has designed its own metadata database to simulate metadata queries that support various databases.

More details:

https://github.com/apache/shardingsphere/issues/21268
https://github.com/apache/shardingsphere/issues/22052

Task

  • Support PostgreSQL and openGauss `\d tableName`
  • Support PostgreSQL and openGauss `\d+`
  • Support PostgreSQL and openGauss `\d+ tableName`
  • Support PostgreSQL and openGauss `\l`
  • Support query for MySQL metadata `TABLES`
  • Support query for MySQL metadata `COLUMNS`
  • Support query for MySQL metadata `schemata`
  • Support query for MySQL metadata `ENGINES`
  • Support query for MySQL metadata `FILES`
  • Support query for MySQL metadata `VIEWS`

Notice, these issues can be a good example.

https://github.com/apache/shardingsphere/pull/22053
https://github.com/apache/shardingsphere/pull/22057/
https://github.com/apache/shardingsphere/pull/22166/
https://github.com/apache/shardingsphere/pull/22182

Relevant Skills

  •  Master the Java language
  •  Have a basic understanding of Zookeeper
  •  Be familiar with MySQL/Postgres SQLs


Mentor

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Chuxin Chen, mail: tuichenchuxin (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Add ShardingSphere Kafka source connector

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere


Background

The community recently added a CDC (change data capture) feature. After logging in, the change feed is published on the created network connection, and it can then be consumed.

Since Kafka is a popular distributed event streaming platform, it's useful to import the change feed into Kafka for later processing.

Task

  1. Get familiar with ShardingSphere CDC client usage: create a publication and subscribe to the change feed.
  2. Get familiar with Kafka connector development: develop a source connector, integrate it with ShardingSphere CDC, and persist the change feed to Kafka topics properly.
  3. Add unit tests and an E2E integration test.

Relevant Skills

  1. Java language
  2. Basic knowledge of CDC and Kafka
  3. Maven


Local Test Steps

  1. Modify `conf/server.yaml` and uncomment `cdc-server-port: 33071` to enable CDC. (Refer to step 2)
  2. Configure the proxy: refer to `Prerequisites` and `Procedure` in build to configure the proxy (a newer version could be used too; the current stable version is 5.3.1).
  3. Start the proxy server; it will start the CDC server too.
  4. Download the ShardingSphere source code from https://github.com/apache/shardingsphere , then modify and run `org.apache.shardingsphere.data.pipeline.cdc.client.example.Bootstrap`. It prints `records:` by default in `Bootstrap`.
  5. Execute some INSERT/UPDATE/DELETE SQL statements in the proxy to generate a change feed, and then check the `Bootstrap` console.
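Step 5 can be as simple as the following statements against a table already configured in the proxy (the table name is illustrative):

```SQL
INSERT INTO t_order (order_id, user_id, status) VALUES (1, 1, 'OK');
UPDATE t_order SET status = 'PAID' WHERE order_id = 1;
DELETE FROM t_order WHERE order_id = 1;
-- each statement should produce change-feed records in the Bootstrap console
```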

Mentor

Hongsheng Zhong, PMC of Apache ShardingSphere, zhonghongsheng@apache.org

Xinze Guo, Committer of Apache ShardingSphere, azexin@apache.org


Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Hongsheng Zhong, mail: zhonghongsheng (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Write a converter to generate DistSQL

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere

Background


Currently we are trying to promote StorageNode as the major CRD to represent a set of storage units for ShardingSphere.

Task

The elementary task is that the storage node controller could manage the lifecycle of a set of storage units, like PostgreSQL, in Kubernetes.

We don't want to reinvent a wheel like pg-operator, so consider using a predefined parameter group to generate the target CRD.

  • [ ] Generate DistSQL according to the Golang struct `EncryptionRule`
  • [ ] Generate DistSQL according to the Golang struct `ShardingRule`
  • [ ] Generate DistSQL according to the Golang struct `ReadWriteSplittingRule`
  • [ ] Generate DistSQL according to the Golang struct `MaskRule`
  • [ ] Generate DistSQL according to the Golang struct `ShadowRule`

Relevant Skills

1. Master the Go language

2. Have a basic understanding of Apache ShardingSphere Concepts and DistSQL

Target files

DistSQL Converter - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/distsql/converter.go, etc.

Example

A struct defined as below:

```golang
type EncryptRule struct{}

// ToDistSQL builds the DistSQL statement from the rule's fields.
func (t EncryptRule) ToDistSQL() string {
	return "CREATE ENCRYPT RULE t_encrypt (...."
}
```

Invoking ToDistSQL() generates a DistSQL statement for the EncryptRule, such as:

```SQL
CREATE ENCRYPT RULE t_encrypt (....
```


Mentor

Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Liyao Miao, mail: miaoliyao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Introduce new CRD as StorageNode for better usability

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere has designed its own metadata database to simulate metadata queries that support various databases.

More details:

https://github.com/apache/shardingsphere/issues/21268
https://github.com/apache/shardingsphere/issues/22052

Task

  • Support PostgreSQL and openGauss `\d tableName`
  • Support PostgreSQL and openGauss `\d+`
  • Support PostgreSQL and openGauss `\d+ tableName`
  • Support PostgreSQL and openGauss `\l`
  • Support query for MySQL metadata `TABLES`
  • Support query for MySQL metadata `COLUMNS`
  • Support query for MySQL metadata `schemata`
  • Support query for MySQL metadata `ENGINES`
  • Support query for MySQL metadata `FILES`
  • Support query for MySQL metadata `VIEWS`

Note that these issues can serve as good examples.


Relevant Skills

  •  Master Java language
  •  Have a basic understanding of Zookeeper
  •  Be familiar with MySQL/Postgres SQLs 

Mentor

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Chuxin Chen, mail: tuichenchuxin (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Add ShardingSphere Kafka source connector

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page:   https://shardingsphere.apache.org/
Github:  https://github.com/apache/shardingsphere 

Background

The community added the CDC (change data capture) feature recently. After a client logs in, the change feed is published over the established network connection and can then be consumed.

Since Kafka is a popular distributed event streaming platform, it is useful to import the change feed into Kafka for later processing.

Task

  1. Get familiar with ShardingSphere CDC client usage; create a publication and subscribe to the change feed.
  2. Get familiar with Kafka connector development; develop a source connector and integrate it with ShardingSphere CDC. Persist the change feed to Kafka topics properly.
  3. Add unit test and E2E integration test.
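The core of step 2 is mapping each CDC event to a Kafka topic, key, and value. The following dependency-free Go sketch illustrates that routing decision; `ChangeEvent`, `KafkaRecord`, and `toKafkaRecord` are hypothetical names (the real CDC client delivers protobuf messages, and an actual connector would be written against the Kafka Connect Java API).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChangeEvent is a hypothetical, simplified shape of one ShardingSphere CDC
// record; the real client delivers richer, protobuf-based messages.
type ChangeEvent struct {
	Database string         `json:"database"`
	Table    string         `json:"table"`
	Type     string         `json:"type"` // INSERT / UPDATE / DELETE
	Row      map[string]any `json:"row"`
}

// KafkaRecord is what a source connector would hand to the Kafka producer.
type KafkaRecord struct {
	Topic string
	Key   string
	Value []byte
}

// toKafkaRecord routes each change event to a per-table topic and keys it by
// database.table, so events for one table land on the same partition in order.
func toKafkaRecord(e ChangeEvent) (KafkaRecord, error) {
	value, err := json.Marshal(e)
	if err != nil {
		return KafkaRecord{}, err
	}
	return KafkaRecord{
		Topic: fmt.Sprintf("cdc.%s.%s", e.Database, e.Table),
		Key:   e.Database + "." + e.Table,
		Value: value,
	}, nil
}

func main() {
	rec, _ := toKafkaRecord(ChangeEvent{
		Database: "sharding_db", Table: "t_order", Type: "INSERT",
		Row: map[string]any{"order_id": 1},
	})
	fmt.Println(rec.Topic)
}
```

The topic-naming and keying scheme here is one common convention, not a requirement; the actual design should be agreed with the mentors.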

Relevant Skills

  1. Java language
  2. Basic knowledge of CDC and Kafka
  3. Maven

References

  • https://github.com/apache/shardingsphere/issues/22500
  • https://kafka.apache.org/documentation/#connect_development
  • https://github.com/apache/kafka/tree/trunk/connect/file/src
  • https://github.com/confluentinc/kafka-connect-jdbc

Local Test Steps

  1. Modify `conf/server.yaml`, uncomment `cdc-server-port: 33071` to enable CDC.
  2. Configure the proxy: refer to `Prerequisites` and `Procedure` in the build documentation (a newer version can be used too; the current stable version is 5.3.1).
  3. Start the proxy server; it will start the CDC server too.
  4. Download the ShardingSphere source code from https://github.com/apache/shardingsphere , then modify and run `org.apache.shardingsphere.data.pipeline.cdc.client.example.Bootstrap`. It prints `records:` by default.
  5. Execute some INSERT/UPDATE/DELETE SQLs in the proxy to generate a change feed, and then check the `Bootstrap` console.

Mentor

Hongsheng Zhong, PMC of Apache ShardingSphere, zhonghongsheng@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Hongsheng Zhong, mail: zhonghongsheng (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Introduce JVM chaos to ShardingSphere

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere

Background

There is a proposal about the background of ChaosEngineering as below:

Introduce ChaosEngineering for ShardingSphere #32

We also proposed a generic controller for ShardingSphereChaos:

[GSoC 2023] Introduce New CRD ShardingSphereChaos #272

The ShardingSphereChaos controller is aimed at different chaos tests. This JVMChaos is an important one.

Task

Write several scripts to implement different JVMChaos for the main features of ShardingSphere. The specific case list is as follows.

  • Add scripts injecting chaos to DataSharding
  • Add scripts injecting chaos to ReadWritingSplitting
  • Add scripts injecting chaos to DatabaseDiscovery
  • Add scripts injecting chaos to Encryption
  • Add scripts injecting chaos to Mask
  • Add scripts injecting chaos to Shadow

Basically, these scripts will cause unexpected behaviour while executing the related DistSQL.

Relevant Skills

  • Master Go language, Ginkgo test framework
  • Have a deep understanding of Apache ShardingSphere concepts and practices.
  • JVM bytecode mechanisms like ByteMan, ByteBuddy.

Targets files

JVMChaos Scripts - https://github.com/apache/shardingsphere-on-cloud/chaos/jvmchaos/scripts/

Mentor

Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Liyao Miao, mail: miaoliyao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Introduce New CRD ShardingSphereChaos

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere

Background

There is a proposal about the background of ChaosEngineering as below:

The ShardingSphereChaos controller is aimed at different chaos tests.

Task

Propose a generic controller for ShardingSphereChaos, which reconciles the ShardingSphereChaos CRD and prepares, executes and verifies tests.

  • [ ] Support common ShardingSphere features, prepare test rules and dataset
  • [ ] Generate the chaos type according to the backend implementation
  • [ ] Verify testing results with DistSQL or other tools

Relevant Skills

1. Master Go language, Ginkgo test framework
2. Have a deep understanding of Apache ShardingSphere concepts and practices.
3. Kubernetes operator pattern, kube-builder

Targets files

ShardingSphereChaos Controller - https://github.com/apache/shardingsphere-on-cloud/shardingsphere-operator/pkg/controllers/chaos_controller.go, etc.

Mentor

Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Xinze Guo, Committer of Apache ShardingSphere, azexin@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Liyao Miao, mail: miaoliyao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache ShardingSphere Write a converter to generate DistSQL

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

    Page: https://shardingsphere.apache.org/
    Github: https://github.com/apache/shardingsphere 

    Background

    Currently we try to promote StorageNode as major CRD to represent a set of storage units for ShardingSphere.

    Task


    • [ ] Generate DistSQL according to the Golang struct `EncryptionRule`
    • [ ] Generate DistSQL according to the Golang struct `ShardingRule`
    • [ ] Generate DistSQL according to the Golang struct `ReadWriteSplittingRule`
    • [ ] Generate DistSQL according to the Golang struct `MaskRule`
    • [ ] Generate DistSQL according to the Golang struct `ShadowRule`

      Relevant Skills

    1. Master Go language, Ginkgo test framework
    2. Have a basic understanding of Apache ShardingSphere Concepts and DistSQL

    Targets files

    DistSQL Converter - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/distsql/converter.go, etc.

    Example

    A struct defined as below:

    ```golang
    type EncryptRule struct{}
    func (t EncryptRule) ToDistSQL() string {}
    ```
    Invoking ToDistSQL() will generate a DistSQL statement for an EncryptRule, such as:

    ```SQL
    CREATE ENCRYPT RULE t_encrypt (....
    ```

    References:

    Mentor
    Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

    Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Liyao Miao, mail: miaoliyao (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    ShenYu

    Apache ShenYu Gsoc 2023 - Support for Kubernetes Service Discovery

    Background

    Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.

    Tasks

    1. Support the registration of microservices deployed in K8s Pod to shenyu-admin and use K8s as the register center.
    2. Discuss with mentors, and complete the requirements design and technical design of Shenyu K8s Register Center.
    3. Complete the initial version of Shenyu K8s Register Center.
    4. Complete the CI test of Shenyu K8s Register Center, verify the correctness of the code.
    5. Write the necessary documentation, deployment guides, and instructions for users to connect microservices running inside the K8s Pod to ShenYu

    Relevant Skills

    1. Know the use of Apache ShenYu, especially the register center
    2. Familiar with Java and Golang
    3. Familiar with Kubernetes and can use Java or Golang to develop

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yonglun Zhang, mail: zhangyonglun (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu Gsoc 2023 - Design and implement shenyu ingress-controller in k8s

    Background

    Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.

    Tasks

    1. Discuss with mentors, and complete the requirements design and technical design of shenyu-ingress-controller.
    2. Complete the initial version of shenyu-ingress-controller, implement the reconcile of k8s ingress api, and make ShenYu as the ingress gateway of k8s.
    3. Complete the ci test of shenyu-ingress-controller, verify the correctness of the code.

    Relevant Skills

    1. Know the use of Apache ShenYu
    2. Familiar with Java and Golang
    3. Familiar with Kubernetes and can use java or golang to develop Kubernetes Controller

    Description

    Issues : https://github.com/apache/shenyu/issues/4438
    website : https://shenyu.apache.org/

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yu Xiao, mail: xiaoyu (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu Gsoc 2023 - ShenYu End-To-End SpringCloud plugin test case

    Background:

    Shenyu is a native API gateway for service proxy, protocol translation and API governance, but Shenyu lacks End-To-End tests.

    Relevant skills:

    1.Understand the architecture of ShenYu

    2.Understand SpringCloud micro-service and ShenYu SpringCloud proxy plugin.

    3.Understand ShenYu e2e framework and architecture.

    How to code

    1. Please refer to org.apache.shenyu.e2e.testcase.plugin.DividePluginCases

    How to test

    1. Start shenyu admin in docker

    2. Start shenyu bootstrap in docker

    3. Run the test case org.apache.shenyu.e2e.testcase.plugin.PluginsTest#testDivide

    Task List

    1. Develop e2e tests of the SpringCloud plugin.

    2. Write shenyu e2e SpringCloud plugin documentation on the shenyu-website.

    3. Refactor the existing plugin test cases.


    Links:

    website: https://shenyu.apache.org/

    issues: https://github.com/apache/shenyu/issues/4474


    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Fengen He, mail: hefengen (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShardingSphere Introduce new CRD as StorageNode for better usability

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

    Page: https://shardingsphere.apache.org/
    Github: https://github.com/apache/shardingsphere 

    Background

    There is a proposal about new CRD Cluster and ComputeNode as belows:

    • #167
    • #166

    Currently we try to promote StorageNode as major CRD to represent a set of storage units for ShardingSphere.

    Task

    The elementary task is that the storage node controller could manage the lifecycle of a set of storage units, like PostgreSQL, in kubernetes.

    We don't hope to create another wheel like pg-operator. So consider using a predefined parameter group to generate the target CRD.

    • [ ] Create a PostgreSQL cluster while a StorageNode with pg parameters is created
    • [ ] Update the PostgreSQL cluster while updated StorageNode
    • [ ] Delete the PostgreSQL cluster while deleted StorageNode. Notice this may need a deletion strategy.
    • [ ] Reconciling StorageNode according to the status of PostgreSQL cluster.
    • [ ] The status of StorageNode would be consumed by common storage units related DistSQLs
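The checklist above amounts to a classic level-triggered reconcile decision: compare the desired StorageNode with the observed PostgreSQL cluster and pick an operation. A dependency-free Go sketch follows; the types and the `reconcile` helper are hypothetical, and a real controller would be built on kubebuilder/client-go instead.

```go
package main

import "fmt"

// StorageNode is a hypothetical, simplified view of the desired-state CRD.
type StorageNode struct {
	Name    string
	Deleted bool              // set when the resource is being deleted
	Params  map[string]string // predefined parameter group, e.g. pg version
}

// PostgresCluster is a simplified view of the observed cluster state.
type PostgresCluster struct {
	Name   string
	Params map[string]string
}

type Action string

const (
	Create Action = "create"
	Update Action = "update"
	Delete Action = "delete"
	None   Action = "none"
)

// reconcile decides which operation brings the observed PostgreSQL cluster
// in line with the desired StorageNode, mirroring the checklist above.
func reconcile(desired StorageNode, observed *PostgresCluster) Action {
	switch {
	case desired.Deleted:
		if observed != nil {
			return Delete // a deletion strategy would be applied here
		}
		return None
	case observed == nil:
		return Create
	case !equal(desired.Params, observed.Params):
		return Update
	default:
		return None
	}
}

func equal(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if b[k] != v {
			return false
		}
	}
	return true
}

func main() {
	node := StorageNode{Name: "pg-node", Params: map[string]string{"version": "14"}}
	fmt.Println(reconcile(node, nil)) // no observed cluster yet: create it
}
```

The last checklist item (exposing status to DistSQL) would hang off the same loop: after each action, the controller writes the observed cluster state back into the StorageNode status subresource.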

    Relevant Skills

    1. Master Go language, Ginkgo test framework
    2. Have a basic understanding of Apache ShardingSphere Concepts
    3. Be familiar with Kubernetes Operator, kubebuilder framework

    Targets files

    StorageNode Controller - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/controllers/storagenode_controller.go

    Mentor

    Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

    Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Liyao Miao, mail: miaoliyao (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache ShenYu Gsoc 2023 - ShenYu WasmPlugin

    Background

    Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good scalability in the Java language. However, ShenYu's support for multiple languages is still relatively weak.

    The wasm bytecode is designed to be encoded in a size- and load-time-efficient binary format. WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms.

    The goal of WasmPlugin is to be able to run wasm bytecode (wasmer-java is a good choice; if you find a better choice, please discuss with us), so that ShenYu plugins can be written in other languages (such as Rust/Golang/C++), as long as they can be compiled into wasm bytecode.

    More documents on wasm and WASI are as follows:
    https://github.com/WebAssembly/design

    https://github.com/WebAssembly/WASI

    Relevant Skills

    Know the use of Apache ShenYu, especially the plugin
    Familiar with Java and another language which can be compiled into wasm bytecode

    Task List

    1. Develop shenyu-wasm-plugin.

    2. Write integrated tests for shenyu-wasm-plugin.

    3. Write wasm plugin documentation on the shenyu-website.


    Links:

    website: https://shenyu.apache.org/

    issues: https://github.com/apache/shenyu/issues/4492

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    ZiCheng Zhang, mail: zhangzicheng (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org


    ShenYu Gsoc 2023 - Shenyu-Admin Internationalization

    Background

    Shenyu is a native API gateway for service proxy, protocol translation and API governance. It can manage and maintain the API through Shenyu-admin, and supports internationalization in Chinese and English. Unfortunately, Shenyu-admin is only internationalized on the front end; the message prompts returned by the back-end interface are still in English. Therefore, we need to implement internationalization support for the back-end interface. This will lay a good foundation for shenyu to move towards more language support.

    Relevant skills

    • Related skills spring resources
    • Spring Internationalization
    • Front-end react framework

    API reference

                java.util.Locale;
                org.springframework.context.MessageSource;
                org.springframework.context.support.ResourceBundleMessageSource; 

    Interface effect example

                ## zh request example
                POST http://localhost:9095/plugin
                Content-Type: application/json
                Location: cn-zh
                X-Access-Token: xxx
                {
                "name": "test-create-plugin",
                "role": "test-create-plugin",
                "enabled": true,
                "sort": 100
                }
                Response
                {
                "code": 600,
                "message": "未登录"
                }
                
                ### en request example
                POST http://localhost:9095/plugin
                Content-Type: application/json
                Location: en
                X-Access-Token: xxx
                {
                "name": "test-create-plugin",
                "role": "test-create-plugin",
                "enabled": true,
                "sort": 100
                }
                Response
                {
                "code": 600,
                "message": "token is error"
                } 

    Task List

    • Discuss with the mentor how to achieve the internationalization of the shenyu-admin backend
    • Translate some prompt messages
    • Integrate with the front-end internationalization: obtain the client region information through the HTTP protocol and support the language of the corresponding region.
    • Leave an extension interface for other multi-language internationalization support, to facilitate localization by subsequent users.
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Keguo Li, mail: likeguo (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu Gsoc 2023 - Design license scanning function

    Background

    At present, shenyu needs to manually check whether the license is correct one by one when releasing the version.

    Tasks

    1. Discuss with the tutor to complete the requirement design and technical design of the scanning license.
    2. Finished scanning the initial version of the license.
    3. Complete the corresponding test.

    Relevant Skills

    1. Familiar with Java.
    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    SiYing Zheng, mail: impactcn (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    TrafficControl

    GSOC Varnish Cache support in Apache Traffic Control

    Background

    Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

    Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

    There are multiple aspects to this project:

    • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
    • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
    • Testing: Adding automated tests for new code

    Skills:

    • Proficiency in Go is required
    • A basic knowledge of HTTP and caching is preferred, but not required for this project.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Eric Friedrich, mail: friede (at) apache.org
    Project Devs, mail: dev (at) trafficcontrol.apache.org

    Add server indicator if a server is a cache

    Difficulty: Trivial
    Potential mentors:
    Brennan Fieck, mail: ocket8888 (at) apache.org
    Project Devs, mail: dev (at) trafficcontrol.apache.org

    Doris

    [GSoC][Doris]Page Cache Improvement

    Apache Doris
    Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
    Page: https://doris.apache.org

    Github: https://github.com/apache/doris

    Background

    Apache Doris accelerates high-concurrency queries utilizing page cache, where the decompressed data is stored.
    Currently, the page cache in Apache Doris uses a simple LRU algorithm, which reveals a few problems: 

    • Hot data will be phased out in large queries
    • The page cache configuration is immutable and does not support GC.
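The first problem can be seen concretely in a minimal LRU page-cache sketch (illustrative Go, not Doris code; names are assumptions): a single large scan inserts enough cold pages to evict the hot one.

```go
package main

import (
	"container/list"
	"fmt"
)

// lruCache is a minimal page-cache sketch showing the weakness of plain LRU:
// one large scan evicts all hot pages.
type lruCache struct {
	cap   int
	order *list.List // front = most recently used
	pages map[string]*list.Element
}

func newLRU(capacity int) *lruCache {
	return &lruCache{cap: capacity, order: list.New(), pages: map[string]*list.Element{}}
}

// Get reports whether the page is cached, refreshing its recency on a hit.
func (c *lruCache) Get(key string) bool {
	if el, ok := c.pages[key]; ok {
		c.order.MoveToFront(el)
		return true
	}
	return false
}

// Put inserts a page, evicting the least recently used page when full.
func (c *lruCache) Put(key string) {
	if el, ok := c.pages[key]; ok {
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() == c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.pages, oldest.Value.(string))
	}
	c.pages[key] = c.order.PushFront(key)
}

func main() {
	cache := newLRU(3)
	cache.Put("hot-page")
	// A large scan touches many cold pages and pushes the hot page out.
	for i := 0; i < 3; i++ {
		cache.Put(fmt.Sprintf("scan-page-%d", i))
	}
	fmt.Println(cache.Get("hot-page")) // false: the hot page was evicted
}
```

Scan-resistant policies (e.g. segmented LRU or 2Q-style designs, which admit a page into the protected region only on a second access) are one direction Phase Two could explore.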

    Task

    • Phase One: Identify the impacts on queries when the decompressed data is stored in memory and SSD, respectively, and then determine whether full page cache is required.
    • Phase Two: Improve the cache strategy for Apache Doris based on the results from Phase One.

    Learning Material

    Page: https://doris.apache.org
    Github: https://github.com/apache/doris

    Mentor

    Zhijing Lu, luzhijing@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhijing Lu, mail: luzhijing (at) apache.org
    Project Devs, mail: dev (at) doris.apache.org

    Apache ShenYu Gsoc 2023 - ShenYu WasmPlugin

    Background

    Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good scalability in the Java language. However, ShenYu's support for multiple languages is still relatively weak.

    The wasm bytecode is designed to be encoded in a size- and load-time-efficient binary format. WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms.

    The goal of WasmPlugin is to be able to run wasm bytecode (wasmer-java is a good choice; if you find a better one, please discuss it with the mentor), so that ShenYu plugins can be written in other languages (such as Rust/Go/C++) as long as they can be compiled into wasm bytecode.

    More documents on wasm and WASI are as follows:
    https://github.com/WebAssembly/design
    https://github.com/WebAssembly/WASI

    Relevant Skills

    Know the use of Apache ShenYu, especially the plugin
    Familiar with Java and another language which can be compiled into wasm bytecode

    Task List

    1. Develop shenyu-wasm-plugin.

    2. Write integration tests for shenyu-wasm-plugin.

    3. Write wasm plugin documentation on the shenyu-website.

    Links:

    website: https://shenyu.apache.org/

    issues:  https://github.com/apache/shenyu/issues/4492

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    ZiCheng Zhang, mail: zhangzicheng (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    [GSoC][Doris] Supports BigQuery/Apache Kudu/Apache Cassandra/Apache Druid in Federated Queries

    Apache Doris
    Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
    Page: https://doris.apache.org
    Github: https://github.com/apache/doris

    Background

    Apache Doris supports acceleration of queries on external data sources to meet users' needs for federated queries and analysis.
    Currently, Apache Doris supports multiple external catalogs including those from Hive, Iceberg, Hudi, and JDBC. Developers can connect more data sources to Apache Doris based on a unified framework.

    Task

    Phase One:

    • Get familiar with the Multi-Catalog structure of Apache Doris, including the metadata synchronization mechanism in FE and the data reading mechanism of BE.
    • Investigate how metadata should be acquired and how data access works regarding the picked data source(s); produce the corresponding design documentation.

    Phase Two:

    • Develop connections to the picked data source(s) and implement access to metadata and data.

    Learning Material

    Page: https://doris.apache.org
    Github: https://github.com/apache/doris

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhijing Lu, mail: luzhijing (at) apache.org
    Project Devs, mail: dev (at) doris.apache.org

    [GSoC][Doris] Dictionary Encoding Acceleration

    Apache Doris
    Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
    Page: https://doris.apache.org

    Github: https://github.com/apache/doris

    Background

    In Apache Doris, dictionary encoding is performed during data writing and compaction. Dictionary encoding is applied to string data types by default, and the dictionary of a column within one segment is at most 1 MB. Dictionary encoding accelerates queries on strings by converting them into integers, for example.
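
    The idea can be illustrated with a short sketch (Python for illustration only; Doris implements this inside its C++ storage layer, and the function names below are made up): strings in a column are replaced by small integer codes, and an equality predicate is evaluated against the codes instead of the strings.

```python
def dict_encode(column):
    """Dictionary-encode a string column: return (dictionary, integer codes)."""
    dictionary = {}          # string -> code
    codes = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)   # assign the next code
        codes.append(dictionary[value])
    return dictionary, codes

column = ["beijing", "shanghai", "beijing", "shenzhen", "beijing"]
dictionary, codes = dict_encode(column)

# An equality predicate is translated once into a code lookup,
# then evaluated with cheap integer comparisons:
target = dictionary.get("beijing")
matches = [i for i, code in enumerate(codes) if code == target]
print(matches)  # [0, 2, 4]
```

    Evaluating predicates on fixed-width integers instead of variable-length strings is what makes the encoding pay off at query time.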
     

    Task

    • Phase One: Get familiar with the implementation of dictionary encoding in Apache Doris and learn how it accelerates queries.
    • Phase Two: Evaluate the effectiveness of full dictionary encoding and figure out how to optimize memory usage in such a case.

    Learning Material

    Page: https://doris.apache.org
    Github: https://github.com/apache/doris

    Mentor

    • Mentor: Chen Zhang, Apache Doris Committer, zhangchen@apache.org
    • Mentor: Zhijing Lu, Apache Doris Committer, luzhijing@apache.org
    • Mailing List: dev@doris.apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhijing Lu, mail: luzhijing (at) apache.org
    Project Devs, mail: dev (at) doris.apache.org

    SkyWalking

    [GSOC] [SkyWalking] Self-Observability of the query subsystem in BanyanDB

    Background

    SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store Metrics, Tracing and Logging data.

    Objectives

    1. Support EXPLAIN[1] for both measure query and stream query
    2. Add self-observability including trace and metrics for query subsystem
    3. Support EXPLAIN in the client SDK & CLI and add query plan visualization in the UI

    [1]: EXPLAIN in MySQL

    Recommended Skills

    1. Familiar with Go
    2. Have a basic understanding of database query engine
    3. Have an experience of Apache SkyWalking or other APMs

    Mentor

    Apache ShenYu Gsoc 2023 - Shenyu-Admin Internationalization

    Background

    ShenYu is a native API gateway for service proxy, protocol translation and API governance. APIs can be managed and maintained through Shenyu-admin, which supports internationalization in Chinese and English. Unfortunately, Shenyu-admin is only internationalized on the front end; the message prompts returned by the back-end interface are still in English. Therefore, we need to implement internationalization support for the back-end interface. This will lay a good foundation for ShenYu to move towards support for more languages.

    Relevant skills

    • Spring resources
    • Spring internationalization
    • Front-end React framework

    API reference

                java.util.Locale;
                org.springframework.context.MessageSource;
                org.springframework.context.support.ResourceBundleMessageSource; 
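
    In Spring, ResourceBundleMessageSource performs this lookup; the core resolve-with-fallback logic the back end needs can be sketched language-agnostically (Python here, and the message key below is invented for illustration, not an actual ShenYu key):

```python
# Message catalogs keyed by locale; "en" acts as the default fallback.
# The key and texts are illustrative only.
MESSAGES = {
    "en":    {"login.required": "token is error"},
    "zh-CN": {"login.required": "未登录"},
}
DEFAULT_LOCALE = "en"

def resolve_message(key, locale):
    """Resolve a message key for a locale, falling back to the default locale."""
    catalog = MESSAGES.get(locale, MESSAGES[DEFAULT_LOCALE])
    return catalog.get(key, MESSAGES[DEFAULT_LOCALE].get(key, key))

print(resolve_message("login.required", "zh-CN"))  # 未登录
print(resolve_message("login.required", "en"))     # token is error
print(resolve_message("login.required", "fr"))     # token is error (fallback)
```

    The fallback chain (requested locale, then default locale, then the key itself) mirrors how Spring's MessageSource resolves bundles.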

    Interface effect example

                ### zh request example
                POST http://localhost:9095/plugin
                Content-Type: application/json
                Location: zh-CN
                X-Access-Token: xxx
                {
                  "name": "test-create-plugin",
                  "role": "test-create-plugin",
                  "enabled": true,
                  "sort": 100
                }
                Response
                {
                  "code": 600,
                  "message": "未登录"
                }

                ### en request example
                POST http://localhost:9095/plugin
                Content-Type: application/json
                Location: en
                X-Access-Token: xxx
                {
                  "name": "test-create-plugin",
                  "role": "test-create-plugin",
                  "enabled": true,
                  "sort": 100
                }
                Response
                {
                  "code": 600,
                  "message": "token is error"
                }

    Task List

    • Discuss with the mentor how to implement internationalization of the shenyu-admin back end
    • Translate the prompt messages
    • Integrate with the existing front-end internationalization: obtain the client's region information through the HTTP protocol and respond in the language of that region.
    • Leave an extension interface for other languages, to make subsequent localization by users easier.
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Keguo Li, mail: likeguo (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    TrafficControl

    GSOC Varnish Cache support in Apache Traffic Control

    Background
    Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

    Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

    There are multiple aspects to this project:

    • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
    • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
    • Testing: Adding automated tests for new code
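
    The configuration-generation bullet amounts to templating VCL from the CDN topology. A minimal sketch of the idea (Python for brevity here, though the project itself calls for Go; the simplified backend fields are invented for illustration):

```python
# Render a (simplified, illustrative) VCL backend definition from topology data.
VCL_BACKEND_TEMPLATE = """backend {name} {{
    .host = "{host}";
    .port = "{port}";
}}
"""

def render_backends(parents):
    """Generate VCL backend blocks for a cache's parent servers."""
    return "".join(
        VCL_BACKEND_TEMPLATE.format(name=p["name"], host=p["host"], port=p["port"])
        for p in parents
    )

parents = [
    {"name": "mid_cache_1", "host": "mid1.example.net", "port": 80},
    {"name": "mid_cache_2", "host": "mid2.example.net", "port": 80},
]
print(render_backends(parents))
```

    In the real project the same generation logic would live in Traffic Ops and the cache-side utilities, driven by the actual delivery-service and topology data.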

    Skills:

    • Proficiency in Go is required
    • A basic knowledge of HTTP and caching is preferred, but not required for this project.
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Eric Friedrich, mail: friede (at) apache.org
    Project Devs, mail: dev (at) trafficcontrol.apache.org

    [GSOC] [SkyWalking] Unify query planner and executor in BanyanDB

    Background

    SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store Metrics, Tracing and Logging data.

    Objectives

    1. Fully unify/merge the query planner and executor for Measure and TopN

    Recommended Skills

    1. Familiar with Go
    2. Have a basic understanding of database query engine
    3. Have an experience of Apache SkyWalking

    Mentor

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Jiajing Lu, mail: lujiajing (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    [GSOC][SkyWalking] Add Terraform provider for Apache SkyWalking

    Now the deployment methods for SkyWalking are limited: we only have a Helm Chart for users to deploy in Kubernetes. Users that are not on Kubernetes have to do all the housekeeping themselves to set up SkyWalking on, for example, VMs.


    This issue aims to add a Terraform provider so that users can conveniently spin up a cluster for demonstration or testing. We should then evolve the provider to allow users to customize it to their needs, so that they can eventually use it in their production environments.


    In this task, we will mainly focus on support for AWS. In the Terraform provider, users provide their access key / secret key, and the provider does the rest: create VMs, create the database/OpenSearch or RDS, download the SkyWalking tars, configure SkyWalking, start the SkyWalking components (OAP/UI), create public IPs/domain names, etc.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhenxu Ke, mail: kezhenxu94 (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    Add server indicator if a server is a cache

    Difficulty: Trivial
    Project size: ~175 hour (medium)
    Potential mentors:
    Brennan Fieck, mail: ocket8888 (at) apache.org
    Project Devs, mail: dev (at) trafficcontrol.apache.org

    [GSOC] [SkyWalking] Add Overview page in BanyanDB UI

    Background

    SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store Metrics, Tracing and Logging data.


    The BanyanDB UI is a web interface provided by the BanyanDB server. It is developed with Vue3 and Vite3.

    Objectives

    The UI should have a user-friendly Overview page.
    The Overview page must display a list of nodes running in a cluster.
    For each node in the list, the following information must be shown:

    • Node ID or name
    • Uptime
    • CPU usage (percentage)
    • Memory usage (percentage)
    • Disk usage (percentage)
    • Ports (gRPC and HTTP)

    The web app must automatically refresh the node data at a configurable interval to show the most recent information.

    Recommended Skills

    1. Familiar with Vue and Vite
    2. Have a basic understanding of RESTful APIs
    3. Have an experience of Apache SkyWalking
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Hongtao Gao, mail: hanahmily (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    [

    GSoC

    GSOC] [SkyWalking] AIOps Log clustering with Flink (Algorithm Optimization)

    Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud-native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation on a revised architecture, and this task requires the student to focus on algorithm optimization for the clustering technique.

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yihao Chen, mail: yihaochen (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    [GSOC] [SkyWalking] AIOps Log clustering with Flink (Flink Integration)

    Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud-native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation on a revised architecture, and this task requires the student to focus on Flink and its integration with the SkyWalking OAP.

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yihao Chen, mail: yihaochen (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    [GSOC] [SkyWalking] Python Agent Performance Enhancement Plan

    Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud-native and container-based (Kubernetes) architectures. This task is about enhancing Python agent performance; the tracking issue can be seen here: https://github.com/apache/skywalking/issues/10408

    Mentor


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yihao Chen, mail: yihaochen (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    [SkyWalking] Build the OAP into GraalVM native image

    Currently the SkyWalking OAP is bundled as a tarball when releasing, and its startup time is long. We are looking for a way to distribute the binary executable more conveniently and to speed up bootstrap. GraalVM is a good fit: not only does it solve the two aforementioned points, but it also brings the benefit that we can rewrite our LAL or even MAL system in the future with a more secure and isolated method, wasm, which GraalVM supports too!

    So this task is to adjust the OAP, build it into a GraalVM native image, and make all OAP tests pass.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhenxu Ke, mail: kezhenxu94 (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    Beam

    [GSoC][Beam] Build out Beam Machine Learning Use Cases

    Today, you can do all sorts of Machine Learning using Apache Beam (https://beam.apache.org/documentation/ml/overview/).
     
    Many of our users, however, have a hard time getting started with ML and understanding how Beam can be applied to their day to day work. The goal of this project is to build out a series of Beam pipelines as Jupyter Notebooks demonstrating real world ML use cases, from NLP to image recognition to using large language models. As you go, there may be bugs or friction points as well which will provide opportunities to contribute back to Beam's core ML libraries.


    Mentor for this will be Danny McCormick

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    ...

    [GSoC][Teaclave (incubating)] Data Privacy Policy Definition and Function Verification

    Background

    Apache Teaclave (incubating) is a cutting-edge solution for confidential computing, providing Function-as-a-Service (FaaS) capabilities that enable the decoupling of data and function providers. Despite its impressive functionality and security features, Teaclave currently lacks a mechanism for data providers to enforce policies on the data they upload. For example, data providers may wish to restrict access to certain columns of data for third-party function providers.

    Open Policy Agent (OPA) offers flexible control over service behavior and has been widely adopted by the cloud-native community. If Teaclave were to integrate OPA, data providers could apply policies to their data, enhancing Teaclave's functionality.

    Another potential security loophole in Teaclave is the absence of a means to verify the expected behavior of a function. This gap leaves the system vulnerable to exploitation by malicious actors. Fortunately, most of Teaclave's interfaces can be reused; only the function uploading phase may require an overhaul to address this issue. Overall, the integration of OPA and the addition of a function verification mechanism would make Teaclave an even more robust and secure solution for confidential computing.

    Benefits

    If this proposal moves on smoothly, new functionality will be added to the Teaclave project that enables verification that a function's behavior strictly conforms to a prescribed policy.

    Deliverables

    • Milestones: Basic policies (e.g., addition, subtraction) of the data can be verified by Teaclave; Complex policies can be verified.
    • Components: Verifier for the function code; Policy language adapters (adapt policy language to verifier); Policy language parser; Function source code converter (append policies to the functions).
    • Documentation: The internal working mechanism of the verification; How to write policies for the data.

    Timeline Estimation

    • 0.5 month: Policy language parser and/or policy language design (if Rego is not an ideal choice).
    • 1.5 − 2 months: Verification contracts rewriting on the function source code based on the policy parsed.
    • ∼ 1 month: The function can be properly verified formally (by, e.g., querying the Z3 SMT solver).

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Mingshen Sun, Apache Teaclave (incubating) PPMC, mail: mssun (at) apache.org
    Project Devs, mail: dev (at) teaclave.apache.org

    Airflow

    [GSoC][Airflow] Automation for PMC

    This is a project to implement a tool for PMC task automation.


    This is a large project.


    Mentor will be aizhamal.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) airflow.apache.org

    SeaTunnel

    Apache SeaTunnel(Incubating) Http Client For SeaTunnel Zeta

    Apache SeaTunnel(Incubating)

    SeaTunnel is a very easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of records stably and efficiently every day, and is used in production at nearly 100 companies.

    SeaTunnel provides a Connector API that does not depend on a specific execution engine. Connectors (Source, Transform, Sink) developed based on this API can run on many different engines, such as the currently supported SeaTunnel Zeta, Flink and Spark. SeaTunnel already supports more than 100 Connectors, and the number is surging.

    Website: https://seatunnel.apache.org/

    GitHub: https://github.com/apache/incubator-seatunnel

    Background

    To use SeaTunnel today, a user first needs to create and write a config file that specifies the engine that runs the job, engine-related parameters, and the Source, Transform, and Sink of the job. We hope to provide a client that allows users to define the engine, Source, Transform, and Sink information of a job directly in code, without having to start with a config file. The user can then submit the job definition through the client, and SeaTunnel will run the job. After the job is submitted, the user can query its running status through the client. For jobs that are already running, users can use this client to manage them, for example stopping or suspending them.
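
    A rough sketch of how user code might interact with such a client (Python for illustration; every class and method name below is invented for this sketch and is not an existing SeaTunnel API — the real design would be discussed with the mentors and written in Java):

```python
from dataclasses import dataclass, field

# Illustrative job-definition object; field names are invented.
@dataclass
class JobConfig:
    engine: str
    sources: list = field(default_factory=list)
    transforms: list = field(default_factory=list)
    sinks: list = field(default_factory=list)

class SeaTunnelClient:
    """Toy stand-in showing the intended interaction flow, not a real client."""
    def __init__(self):
        self._jobs = {}

    def submit_job(self, config):
        job_id = len(self._jobs) + 1
        self._jobs[job_id] = "RUNNING"   # a real client would call the server
        return job_id

    def job_status(self, job_id):
        return self._jobs[job_id]

    def stop_job(self, job_id):
        self._jobs[job_id] = "STOPPED"

client = SeaTunnelClient()
job = JobConfig(engine="zeta",
                sources=[{"plugin": "FakeSource"}],
                sinks=[{"plugin": "Console"}])
job_id = client.submit_job(job)
print(client.job_status(job_id))  # RUNNING
client.stop_job(job_id)
print(client.job_status(job_id))  # STOPPED
```

    The point of the sketch is the flow — define a job in code, submit it, poll its status, manage it — rather than any particular API surface.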

    Task

    1. Discuss with the mentors what you need to do

    2. Learn the details of the Apache SeaTunnel project

    3. Discuss and complete design and development

    Relevant Skills

    1. Familiar with Java, Http
    2. Familiar with SeaTunnel is better

    Mentor

    • Mentor: Jun Gao, Apache SeaTunnel(Incubating) PPMC Member, gaojun2048@apache.org
    • Mentor: Li Liu, Apache SeaTunnel(Incubating) Commiter, ic4y@apache.org
    • Mailing List: dev@seatunnel.apache.org
    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Jun Gao, mail: gaojun2048 (at) apache.org
    Project Devs, mail: dev (at) seatunnel.apache.org


    CloudStack

    CloudStack GSoC 2023 - Improve ConfigDrive to store network information

    Github issue: https://github.com/apache/cloudstack/issues/2872


    ConfigDrive / cloud-init supports a network_data.json file which can contain network information for a VM.

    By providing the network information using ConfigDrive to a VM we can eliminate the need for DHCP and thus the Virtual Router in some use-cases.

    An example JSON file:

                {
                  "links": [
                    {
                      "ethernet_mac_address": "52:54:00:0d:bf:93",
                      "id": "eth0",
                      "mtu": 1500,
                      "type": "phy"
                    }
                  ],
                  "networks": [
                    {
                      "id": "eth0",
                      "ip_address": "192.168.200.200",
                      "link": "eth0",
                      "netmask": "255.255.255.0",
                      "network_id": "dacd568d-5be6-4786-91fe-750c374b78b4",
                      "routes": [
                        {
                          "gateway": "192.168.200.1",
                          "netmask": "0.0.0.0",
                          "network": "0.0.0.0"
                        }
                      ],
                      "type": "ipv4"
                    },
                    {
                      "id": "eth0",
                      "ip_address": "2001:db8:100::1337",
                      "link": "eth0",
                      "netmask": "64",
                      "network_id": "dacd568d-5be6-4786-91fe-750c374b78b4",
                      "routes": [
                        {
                          "gateway": "2001:db8:100::1",
                          "netmask": "0",
                          "network": "::"
                        }
                      ],
                      "type": "ipv6"
                    }
                  ],
                  "services": [
                    {
                      "address": "8.8.8.8",
                      "type": "dns"
                    }
                  ]
                }

    In Basic Networking and Advanced Networking zones which use a shared network, you wouldn't require a VR anymore.
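
    Generating such a file on the management-server side is mostly an exercise in assembling the structure from the NIC's data. A hedged sketch (Python for illustration; the `nic` field names are invented, not CloudStack's actual data model, and only the IPv4 case is shown):

```python
import json

def build_network_data(nic):
    """Assemble a minimal ConfigDrive network_data.json for a single NIC."""
    return {
        "links": [{
            "ethernet_mac_address": nic["mac"],
            "id": nic["device"],
            "mtu": nic.get("mtu", 1500),
            "type": "phy",
        }],
        "networks": [{
            "id": nic["device"],
            "ip_address": nic["ip"],
            "link": nic["device"],
            "netmask": nic["netmask"],
            "routes": [{"gateway": nic["gateway"],
                        "netmask": "0.0.0.0",
                        "network": "0.0.0.0"}],  # default route
            "type": "ipv4",
        }],
        "services": [{"address": dns, "type": "dns"} for dns in nic.get("dns", [])],
    }

nic = {"device": "eth0", "mac": "52:54:00:0d:bf:93", "ip": "192.168.200.200",
       "netmask": "255.255.255.0", "gateway": "192.168.200.1", "dns": ["8.8.8.8"]}
print(json.dumps(build_network_data(nic), indent=2))
```

    cloud-init inside the guest would then consume the resulting file from the ConfigDrive ISO instead of asking a DHCP server.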

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - User friendly name of Downloaded Templates Volumes and ISOs

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Test button addition in Domains LDAP config

    Github issue: https://github.com/apache/cloudstack/issues/6934


    Please add a button to test the LDAPS connection, or a button that lists some users.


    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Configure NFS version for Primary Storage

    Github issue: https://github.com/apache/cloudstack/issues/4482


    NFS Primary Storage mounts are handled by libvirt.

    Currently libvirt defaults to NFS version 3 when mounting while it does support NFS version 4 if provided in the XML definition: https://libvirt.org/formatstorage.html#StoragePoolSource

                
                <source>
                <host name='localhost'/>
                <dir path='/var/lib/libvirt/images'/>
                <format type='nfs'/>
                <protocol ver='4'/>
                </source>

     
    Maybe pass the argument 'nfsvers' to the URL provided to the Management Server and then pass this down to the Hypervisors which generate the XML for libvirt.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Use Calico or Cilium in CKS

    Github issue: https://github.com/apache/cloudstack/issues/6637


    The Weave project is looking for maintainers; it may be worth exploring which CNI is widely used and standard/stable for the CKS use-case.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - SSL LetsEncrypt the Console Proxy

    Github issue: https://github.com/apache/cloudstack/issues/3141


    New Global Option to enable LetsEncrypt on the console proxy, and a LetsEncrypt domain name option for automatic SSL certificate renewal

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Direct Download extension to Ceph storage

    Github issue: https://github.com/apache/cloudstack/issues/3065


    Extend the Direct Download functionality to work with Ceph storage

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

CloudStack GSoC 2023 - Autodetect IPs used inside the VM

Github issue: https://github.com/apache/cloudstack/issues/7142


    Description:

With regard to IP info reporting, CloudStack relies entirely on its DHCP databases and so on. When these are not available (L2 networks etc.) no IP information is shown for a given VM.

    I propose we introduce a mechanism for "IP autodetection" and try to discover the IPs used inside the machines by means of querying the hypervisors. For example with KVM/libvirt we can simply do something like this:

     
    [root@fedora35 ~]# virsh domifaddr win2k22 --source agent
     Name                        MAC address          Protocol     Address
    -------------------------------------------------------------------------------
     Ethernet                    52:54:00:7b:23:6a    ipv4         192.168.0.68/24
     Loopback Pseudo-Interface 1                      ipv6         ::1/128
     -                           -                    ipv4         127.0.0.1/8
      The above command queries the qemu-guest-agent inside the Windows VM. The VM needs to have the qemu-guest-agent installed and running as well as the virtio serial drivers (easily done in this case with virtio-win-guest-tools.exe ) as well as a guest-agent socket channel defined in libvirt.

    Once we have this information we could display it in the UI/API as "Autodetected VM IPs" or something like that.
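
As a rough sketch of the parsing step (illustrative only; a real implementation would go through the libvirt API rather than scraping virsh text output):

```python
def parse_domifaddr(output: str):
    """Parse `virsh domifaddr <vm> --source agent` text output into
    (interface, mac, protocol, address) tuples."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith(("Name", "---")):
            continue  # skip header and separator rows
        # Interface names may contain spaces ("Loopback Pseudo-Interface 1"),
        # so anchor on the last three whitespace-separated fields.
        *name, mac, proto, addr = line.split()
        rows.append((" ".join(name), mac, proto, addr))
    return rows
```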

    I imagine it's very similar for VMWare and XCP-ng.

    Thank you

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - SSL LetsEncrypt the Console Proxy

    Github issue: https://github.com/apache/cloudstack/issues/3141

    New Global Option For Letsencrypt enable on console proxy. Letsencrypt domain name option for letsencrypt ssl auto renew

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Direct Download extension to Ceph storage

    Github issue: https://github.com/apache/cloudstack/issues/3065

    Extend the Direct Download functionality to work with Ceph storage

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

CloudStack GSoC 2023 - Extend Import-Export Instances to the KVM Hypervisor

Github issue: https://github.com/apache/cloudstack/issues/7127

Description:

The Import-Export functionality is only allowed for the VMware hypervisor. The functionality is developed within a VM ingestion framework that allows extension to other hypervisors. The Import-Export functionality consists of a few APIs and the UI to interact with them:

    • listUnmanagedInstances: Lists unmanaged virtual machines (not existing in CloudStack but existing on the hypervisor side)
    • importUnmanagedInstance: Import an unmanaged VM into CloudStack (this implies populating the database with the corresponding data)
    • unmanageVirtualMachine: Make CloudStack forget a VM but do not remove it on the hypervisor side

The complexity on KVM should be in parsing the existing XML domains into different resources and mapping them in CloudStack to populate the database correctly.
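
A minimal sketch of that parsing step (the selected fields are illustrative; real domain XML carries much more state that would need mapping):

```python
import xml.etree.ElementTree as ET

def parse_domain(xml_text: str) -> dict:
    """Extract the basic resources CloudStack would need to register an
    unmanaged KVM instance from its libvirt domain XML."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "memory_kib": int(root.findtext("memory")),
        "vcpus": int(root.findtext("vcpu")),
        # disk image paths and NIC MAC addresses from the <devices> section
        "disks": [d.get("file") for d in root.findall("devices/disk/source")],
        "nics": [m.get("address") for m in root.findall("devices/interface/mac")],
    }
```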

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    Apache Nemo

    Enhance Nemo to support autoscaling for bursty loads

The load of streaming jobs usually fluctuates according to the input rate or operations (e.g., window). Supporting automatic scaling could reduce the operational cost of running streaming applications, while minimizing the performance degradation that can be caused by bursty loads.


    We can harness the cloud resources such as VMs and serverless frameworks to acquire computing resources on demand. To realize the automatic scaling, the following features should be implemented.


    1) state migration: scaling jobs require moving tasks (or partitioning a task to multiple ones). In this situation, the internal state of the task should be serialized/deserialized. 

    2) input/output rerouting: if a task is moved to a new worker, the input and output of the task should be redirected. 

3) dynamic Executor or Task creation/deletion: Executors or Tasks can be dynamically created or deleted.

    4) scaling policy: a scaling policy that decides when and how to scale out/in should be implemented. 
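
A scaling policy (item 4) could start from a simple utilization threshold; the sketch below is illustrative, with made-up thresholds, not a proposed Nemo interface:

```python
def scaling_decision(input_rate: float, processing_rate: float,
                     high: float = 0.8, low: float = 0.3) -> int:
    """Return +1 (scale out), -1 (scale in) or 0 (stay) depending on the
    ratio between the input rate and the current processing capacity."""
    load = input_rate / processing_rate
    if load > high:
        return 1   # bursty load: acquire VMs / serverless workers
    if load < low:
        return -1  # over-provisioned: release resources to cut cost
    return 0
```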

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
Tae-Geon Um, mail: taegeonum (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

    Collect task statistics necessary for estimating duration


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Hwarim Hyun, mail: hwarim (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Detect skewed task periodically


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Hwarim Hyun, mail: hwarim (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Dynamic Task Sizing on Nemo

    This is an umbrella issue to keep track of the issues related to the dynamic task sizing feature on Nemo.

Dynamic task sizing needs to consider a workload and try to decide on the optimal task size based on the runtime metrics and characteristics. It should have an effect on the parallelism and the partitions, on how many partitions intermediate data should be divided/shuffled into, and on how to effectively handle skews in the meanwhile.
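
One ingredient of such a decision could be deriving the partition count from runtime size metrics; a trivial sketch with illustrative names and bounds:

```python
import math

def choose_partition_count(total_bytes: int, target_partition_bytes: int,
                           max_parallelism: int) -> int:
    """Decide how many partitions intermediate data should be divided into,
    bounded below by 1 and above by the available parallelism."""
    needed = math.ceil(total_bytes / target_partition_bytes)
    return max(1, min(needed, max_parallelism))
```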

    Difficulty: Major
Project size: ~350 hour (large)
    Potential mentors:
Wonook, mail: wonook (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org


    Dynamic Work Stealing on Nemo for handling skews

    We aim to handle the problem on throttled resources (heterogeneous resources) and skewed input data. In order to solve this problem, we suggest dynamic work stealing that can dynamically track task statuses and steal workloads among each other. To do this, we have the following action items:

    • Dynamically collecting task statistics during execution
    • Detecting skewed tasks periodically
    • Splitting the data allocated in skewed tasks and reallocating them into new tasks
    • Synchronizing the optimization procedure
• Evaluation of the resulting implementations

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Wonook, mail: wonook (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org
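
The skew-detection item could start from per-task duration statistics, e.g. a z-score rule over running times (an illustrative sketch, not Nemo code):

```python
from statistics import mean, pstdev

def detect_skewed_tasks(durations: dict, threshold: float = 2.0) -> list:
    """Flag tasks whose running time exceeds the mean by more than
    `threshold` population standard deviations."""
    values = list(durations.values())
    avg, sd = mean(values), pstdev(values)
    if sd == 0:
        return []  # all tasks run equally long: no skew
    return [task for task, d in durations.items() if (d - avg) / sd > threshold]
```

Tasks flagged this way would then become candidates for splitting and work stealing by less-loaded executors.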


    Implement an Accurate Simulator based on Functional model

Missing a deadline often has significant consequences for the business, and a simulator can also contribute to other optimization approaches.

So we implement a simulator for stream processing based on functional models.

There are some requirements:

• Simulation should be able to execute before or during job execution.
• When a simulation is executed while the job is running, it must be fast enough not to affect the job.
• Information about the running environment is received through arguments.
• At least the network topology should be considered for the WAN environment.



    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
Lee Hae Dong, mail: Lemarais (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

Implement a model that represents task-level execution time with statistical analysis

The current SimulatedTaskExecutor is hardly usable, because it needs actual metrics to predict execution time. To make it more broadly applicable, we need a new model that predicts task-level execution time with statistical analysis.

Some of the related TODOs are as follows:

• Find factors that affect task-level execution time, with a loose grid search.
• Infer the most suitable model with a tight grid search.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
Lee Hae Dong, mail: Lemarais (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org
Implement spill mechanism on Nemo

    Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM(Out Of Memory) or GC when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in-memory.

We need to spill in-memory data to secondary storage when there is not enough memory in the executor.
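
A toy illustration of the spill path (not Nemo's actual block store; the memory budget here is a simple entry count):

```python
import os
import pickle
import tempfile

class SpillableStore:
    """Keep values in memory up to a limit, spill the rest to temp files."""

    def __init__(self, max_in_memory: int):
        self.max_in_memory = max_in_memory
        self.memory = {}
        self.spilled = {}  # key -> path of the spill file on disk

    def put(self, key, value):
        if len(self.memory) < self.max_in_memory:
            self.memory[key] = value
        else:  # memory budget exhausted: spill to secondary storage
            fd, path = tempfile.mkstemp(suffix=".spill")
            with os.fdopen(fd, "wb") as f:
                pickle.dump(value, f)
            self.spilled[key] = path

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        with open(self.spilled[key], "rb") as f:
            return pickle.load(f)
```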

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

Approximate the factors that affect the stage group level execution time

There are some factors that can affect the stage-group-level simulation, such as latency, the rate of skewed data, and the error rate of the executor. It is required to find a reasonable distribution form for these factors, such as the normal distribution or the Landau distribution. In an actual run, this makes it possible to approximate the model with a small amount of data.
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Lee Hae Dong, mail: Lemarais (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Efficient Caching and Spilling on Nemo

    In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs one.

• Identify and persist frequently used data, and unpersist it when its usage has ended
• Spill in-memory data to disk upon memory pressure
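
A toy sketch of the persist/unpersist idea (illustrative, not Nemo's block API): track how many consumers still need a block and drop it when that count reaches zero.

```python
class CachedBlockStore:
    """Persist blocks with a consumer count; unpersist automatically
    when no consumers remain."""

    def __init__(self):
        self.blocks = {}     # block_id -> data
        self.remaining = {}  # block_id -> consumers left

    def persist(self, block_id, data, consumers: int):
        self.blocks[block_id] = data
        self.remaining[block_id] = consumers

    def read(self, block_id):
        data = self.blocks[block_id]
        self.remaining[block_id] -= 1
        if self.remaining[block_id] == 0:  # usage has ended: unpersist
            del self.blocks[block_id]
            del self.remaining[block_id]
        return data
```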

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Runtime Level Caching Mechanism

If the compile time identifies what data can be cached, the runtime requires logic to make this happen.

    Implementation needs:

    • (Driver) receive and update the status of blocks from various Executors, right now this seems to be best implemented as part of BlockManagerMaster
    • (Driver) communicate to the  Executors the availability, location and status of blocks
    • Possible concurrency issues:
    1. Concurrency in Driver when multiple Executors update/inquire the same block information
    2. Concurrency in Executor when a single cached block is accessed simultaneously
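
A minimal sketch of the driver-side registry with the locking that the concurrency issues above call for (names are illustrative, not the actual BlockManagerMaster API):

```python
import threading

class BlockRegistrySketch:
    """Driver-side registry of block status and location, guarded by a
    lock so that concurrent updates/inquiries from Executors are safe."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = {}  # block_id -> {"status": ..., "location": ...}

    def update(self, block_id, status, location):
        with self._lock:
            self._blocks[block_id] = {"status": status, "location": location}

    def inquire(self, block_id):
        with self._lock:
            # Returns None for unknown blocks
            return self._blocks.get(block_id)
```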



    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
Dongjoo Lee, mail: codinggosu (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org


    Efficient Dynamic Reconfiguration in Stream Processing

    In Stream processing, we have many methods, starting from the primitive checkpoint-and-replay to a more fancy version of reconfiguration and reinitiation of stream workloads. We aim to find a way to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Wonook, mail: wonook (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Evaluate the performance of Work Stealing implementation



    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
Hwarim Hyun, mail: hwarim (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Nemo on Google Dataproc

    Issues for making it easy to install and use Nemo on Google Dataproc.


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
John Yang, mail: johnyangk (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

    Apache Gora

    GSoC/Outreachy Project Idea

    I would like to propose a project idea suitable for GSoC and Outreachy comprises of following sub tasks.

    1. Complete rest of unfinished work on ArangoDB module - https://issues.apache.org/jira/browse/GORA-650
    2. Upgrade HBase driver - https://issues.apache.org/jira/browse/GORA-706
    3. Upgrade Hive driver - https://issues.apache.org/jira/browse/GORA-707

    We could scope out the project adding / removing sub tasks based on the available capacity of the student and project


    Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Kevin Ratnasekera, mail: djkevincr (at) apache.org
Project Devs, mail: dev (at) gora.apache.org

    GSoC/Outreachy Project Idea

    I would like to propose a project idea suitable for GSoC and Outreachy comprises of following sub tasks.

    1. Complete rest of unfinished work on Geode module - https://issues.apache.org/jira/browse/GORA-698
    2. Upgrade Hadoop version - https://issues.apache.org/jira/browse/GORA-537

    We could scope out the project adding / removing sub tasks based on the available capacity of the student and project

Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Kevin Ratnasekera, mail: djkevincr (at) apache.org
Project Devs, mail: dev (at) gora.apache.org

    Apache Fineract

    Reduce Boilerplate Code by Introducing lombok to Reduce getters/setters and Mapstruct to map REST DTO to Entity Objects

    Lombok could help us to not only reduce a large amount of code, but also to fix a couple of inconsistencies in the code base:

    • getters/setters with non-standard characters (e. g. underscores)
    • getters/setters with typos

    The layered architecture of Fineract requires mapping between REST DTO classes and internal entity classes. The current code base contains various strategies to achieve this:

    • private functions
    • static functions
    • mapping classes

    All of these approaches are very manual (and error prone) and difficult to maintain. Mapstruct can help here:

    • throw errors at compile time (missing new attributes, type changes etc.)
    • one common concept (easier to understand)
    • reduce manually maintained code and replace mostly generated code

    Challenges:

    • maintain immutability (especially in DTO classes)
• should we use the fluent builder pattern?
    • backwards compatibility
    • these improvements cannot be introduced as one pull request, but have to be split up at least at the “module” level (clients, loans, accounts etc.). This would result in approximately 30 pull requests; if we split up Lombok and Mapstruct then it would be 30 PRs each (=60); we would need this fine grained approach to make a transition as painless as possible
    • some classes are maybe beyond repair (e. g. Loan.java with 6000 lines of code, the smaller part getters/setters and a long list of utility/business logic functions)
Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Rahul Goel, mail: rahul.usit12 (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

    Apache Dubbo

    Dubbo GSoC 2023 - Service Deployer

For many monolithic applications, problems such as poor performance are encountered during large-scale deployment. For interface-oriented programming languages, Dubbo provides the capability of RPC remote calls, and we can help applications decouple through interfaces. Therefore, we can provide a deployer to help users decouple and split microservices during deployment, and quickly provide performance optimization capabilities.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    ...

    Dubbo GSoC 2023 - Automatically configure pixiu as istio ingress gateway

In the istio mesh environment, a public dubbo/dubbo-go provider can be exposed outside the cluster through the http/https protocol via the istio ingress gateway. This requires the ingress gateway to complete the conversion from http to the dubbo protocol, which is the main scenario of pixiu. This project needs to complete:
1. Customize pixiu so that it can be used as an istio ingress gateway, proxying http/https requests and converting them into dubbo requests;
2. The gateway supports basic user authentication methods.

    Basic reference: https://istio.io/latest/blog/2019/custom-ingress-gateway/
    https://cloud.ibm.com/docs/containers?topic=containers-istio-custom-gateway

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC - Pixiu supports gRPC/dubbo protocol with WASM plug-in

Pixiu acts as a gateway, forwarding traffic to various services.
Pixiu needs to support communication between applications running in the browser, where WASM is used; currently, only the HTTP protocol is supported.
This project needs to implement the underlying communication protocols for WASM (gRPC is preferred):
1. Support the gRPC protocol
2. Support the dubbo protocol

For calling gRPC from the front end, see https://github.com/grpc/grpc-web

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    ...