...

[SkyWalking] Build the OAP into GraalVM native image


Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

[GSOC] [SkyWalking] Add Overview page in BanyanDB UI

Background

SkyWalking BanyanDB is an observability database that aims to ingest, analyze, and store metrics, tracing, and logging data.


The BanyanDB UI is a web interface provided by the BanyanDB server. It is developed with Vue3 and Vite3.

Objectives

The UI should have a user-friendly Overview page.
The Overview page must display a list of nodes running in a cluster.
For each node in the list, the following information must be shown:

  • Node ID or name
  • Uptime
  • CPU usage (percentage)
  • Memory usage (percentage)
  • Disk usage (percentage)
  • Ports (gRPC and HTTP)

The web app must automatically refresh the node data at a configurable interval to show the most recent information.

Recommended Skills

  1. Familiar with Vue and Vite
  2. Have a basic understanding of RESTful APIs
  3. Have experience with Apache SkyWalking
Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Hongtao Gao, mail: hanahmily (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

...

Apache ShardingSphere Enhance SQLNodeConverterEngine to support more MySQL SQL statements

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The ShardingSphere SQL federation engine provides support for complex SQL statements, and it can handle cross-database join queries, subqueries, aggregation queries, and other statements well. An important part of the SQL federation engine is converting the SQL statement parsed by ShardingSphere into a SqlNode, so that Calcite can be used to implement SQL optimization and federated queries.

Task

This issue is to solve the MySQL exceptions that occur during SQLNodeConverterEngine conversion. The specific case list is as follows.

  • select_char
  • select_extract
  • select_from_dual
  • select_from_with_table
  • select_group_by_with_having_and_window
  • select_not_between_with_single_table
  • select_not_in_with_single_table
  • select_substring
  • select_trim
  • select_weight_string
  • select_where_with_bit_expr_with_ampersand
  • select_where_with_bit_expr_with_caret
  • select_where_with_bit_expr_with_div
  • select_where_with_bit_expr_with_minus_interval
  • select_where_with_bit_expr_with_mod
  • select_where_with_bit_expr_with_mod_sign
  • select_where_with_bit_expr_with_plus_interval
  • select_where_with_bit_expr_with_signed_left_shift
  • select_where_with_bit_expr_with_signed_right_shift
  • select_where_with_bit_expr_with_vertical_bar
  • select_where_with_boolean_primary_with_comparison_subquery
  • select_where_with_boolean_primary_with_is
  • select_where_with_boolean_primary_with_is_not
  • select_where_with_boolean_primary_with_null_safe
  • select_where_with_expr_with_and_sign
  • select_where_with_expr_with_is
  • select_where_with_expr_with_is_not
  • select_where_with_expr_with_not
  • select_where_with_expr_with_not_sign
  • select_where_with_expr_with_or_sign
  • select_where_with_expr_with_xor
  • select_where_with_predicate_with_in_subquery
  • select_where_with_predicate_with_regexp
  • select_where_with_predicate_with_sounds_like
  • select_where_with_simple_expr_with_collate
  • select_where_with_simple_expr_with_match
  • select_where_with_simple_expr_with_not
  • select_where_with_simple_expr_with_odbc_escape_syntax
  • select_where_with_simple_expr_with_row
  • select_where_with_simple_expr_with_tilde
  • select_where_with_simple_expr_with_variable
  • select_window_function
  • select_with_assignment_operator
  • select_with_assignment_operator_and_keyword
  • select_with_case_expression
  • select_with_collate_with_marker
  • select_with_date_format_function
  • select_with_exists_sub_query_with_project
  • select_with_function_name
  • select_with_json_value_return_type
  • select_with_match_against
  • select_with_regexp
  • select_with_schema_name_in_column_projection
  • select_with_schema_name_in_shorthand_projection
  • select_with_spatial_function
  • select_with_trim_expr
  • select_with_trim_expr_from_expr

You need to compare the actual and expected results, and then correct the logic in SQLNodeConverterEngine so that the actual output is consistent with the expected output.

After you make changes, remember to add the case to SUPPORTED_SQL_CASE_IDS to ensure it can be tested.

Notice, this PR can be a good example.
https://github.com/apache/shardingsphere/pull/14492
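
For orientation, the IT converts each parsed SQL case into a Calcite SqlNode and compares it, rendered back to SQL text, against the expected SQL. A minimal sketch of that comparison step (the class and method names here are assumptions; the real harness lives in SQLNodeConverterEngineIT):

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.MysqlSqlDialect;

// Sketch of the comparison step only; the real logic lives in SQLNodeConverterEngineIT.
public final class ConversionAssertion {

    // Render the converted SqlNode back to MySQL text and compare it with the expected
    // SQL, which is how a conversion bug in a case such as select_trim becomes visible.
    static void assertConvert(SqlNode actual, String expected) {
        String actualSql = actual.toSqlString(MysqlSqlDialect.DEFAULT).getSql();
        if (!actualSql.equals(expected)) {
            throw new AssertionError("expected: " + expected + ", but was: " + actualSql);
        }
    }
}
```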

Relevant Skills

1. Master the Java language

2. Have a basic understanding of ANTLR g4 grammar files

3. Be familiar with MySQL and Calcite SqlNode

Target files

SQLNodeConverterEngineIT

https://github.com/apache/shardingsphere/blob/master/test/it/optimizer/src/test/java/org/apache/shardingsphere/test/it/optimize/SQLNodeConverterEngineIT.java 

Mentor

Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Zhengqiang Duan, mail: duanzhengqiang (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Enhance ComputeNode reconciliation

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere 

Background

There is a proposal about the new CRDs Cluster and ComputeNode as follows:

Currently we are promoting ComputeNode as the major CRD to represent a special ShardingSphere Proxy deployment, and plan to use Cluster to represent a special ShardingSphere Proxy cluster.

Task

This issue is to enhance the availability of ComputeNode reconciliation. The specific case list is as follows.

  •  Add IT test case for Deployment spec volume
  •  Add IT test case for Deployment spec template init containers
  •  Add IT test case for Deployment spec template spec containers
  •  Add IT test case for Deployment spec volume mounts
  •  Add IT test case for Deployment spec container ports
  •  Add IT test case for Deployment spec container image tag
  •  Add IT test case for Service spec ports
  •  Add IT test case for ConfigMap data serverconfig
  •  Add IT test case for ConfigMap data logback
     
Notice, this issue can be a good example:
chore: add more Ginkgo tests for ComputeNode #203

Relevant Skills

  1. Master the Go language and the Ginkgo test framework
  2. Have a basic understanding of Apache ShardingSphere concepts
  3. Be familiar with the Kubernetes operator pattern and the kubebuilder framework

Target files

ComputeNode IT - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/reconcile/computenode/compute_node_test.go

Mentor

Liyao Miao, Committer of Apache ShardingSphere,  miaoliyao@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Chuxin Chen, mail: tuichenchuxin (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Add the feature of switching logging framework

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere provides two adapters: ShardingSphere-JDBC and ShardingSphere-Proxy.

Now, ShardingSphere uses logback for logging, but consider the following situations:

  • Users may need to switch the logging framework to meet special needs; for example, log4j2 can provide better asynchronous performance;
  • When using the JDBC adapter, the user application may not use logback, which may cause some conflicts.


Why doesn't a logging facade suffice? Because ShardingSphere provides users with clustered logging configuration (such as changing the log level online), loggers must be constructed dynamically, which cannot be achieved with the facade alone.

Task

1. Design and implement a logging SPI to support multiple logging frameworks (such as logback and log4j2); a minimal sketch of such an SPI follows.
2. Allow users to choose which logging framework to use through the logging rule.
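
As a starting point for the design, here is a minimal sketch of what such an SPI could look like, using the standard Java ServiceLoader mechanism. The interface and method names are assumptions, not existing ShardingSphere code:

```java
import java.util.ServiceLoader;

// Assumed SPI shape, not existing ShardingSphere code: each logging framework ships one
// provider, registered via META-INF/services (the standard Java SPI mechanism).
public interface ShardingSphereLogBuilder {

    /** Framework identifier referenced by the logging rule, e.g. "LOGBACK" or "LOG4J2". */
    String getType();

    /** Dynamically construct a logger with the level configured for the cluster. */
    Object buildLogger(String loggerName, String level);

    static ShardingSphereLogBuilder load(String type) {
        for (ShardingSphereLogBuilder each : ServiceLoader.load(ShardingSphereLogBuilder.class)) {
            if (each.getType().equalsIgnoreCase(type)) {
                return each;
            }
        }
        throw new IllegalArgumentException("Unsupported logging framework: " + type);
    }
}
```

A logback module and a log4j2 module would each implement this interface, and the logging rule would select between them by type.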

Relevant Skills

1. Master the Java language

2. Basic knowledge of logback and log4j2

3. Maven

Mentor

Longtao Jiang, Committer of Apache ShardingSphere, jianglongtao@apache.org

Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Longtao Jiang, mail: jianglongtao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Support mainstream database metadata table query

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere has designed its own metadata database to simulate the metadata queries of various databases.

More details:

https://github.com/apache/shardingsphere/issues/21268
https://github.com/apache/shardingsphere/issues/22052

Task

  • Support PostgreSQL and openGauss `\d tableName`
  • Support PostgreSQL and openGauss `\d+`
  • Support PostgreSQL and openGauss `\d+ tableName`
  • Support PostgreSQL and openGauss `\l`
  • Support query for MySQL metadata `TABLES`
  • Support query for MySQL metadata `COLUMNS`
  • Support query for MySQL metadata `schemata`
  • Support query for MySQL metadata `ENGINES`
  • Support query for MySQL metadata `FILES`
  • Support query for MySQL metadata `VIEWS`

Notice, these PRs can be good examples; a simplified sketch follows the links.

https://github.com/apache/shardingsphere/pull/22053
https://github.com/apache/shardingsphere/pull/22057/
https://github.com/apache/shardingsphere/pull/22166/
https://github.com/apache/shardingsphere/pull/22182
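
To make "simulate" concrete: these dialect-specific queries are intercepted and answered from ShardingSphere's own logic metadata rather than forwarded to a physical database. A simplified sketch with hypothetical names (the real entry points are shown in the PRs above):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Simplified illustration of "simulating" a metadata query; class and method names
// here are assumptions, not ShardingSphere APIs.
public final class MetadataQuerySimulator {

    // One row of a simulated result set for MySQL's information_schema.TABLES.
    record TableRow(String schema, String name, String type) { }

    static List<TableRow> simulateInformationSchemaTables(String schemaName, List<String> logicTables) {
        return logicTables.stream()
                .map(table -> new TableRow(schemaName, table, "BASE TABLE"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A query like `SELECT * FROM information_schema.TABLES` is answered from the
        // logic schema, so sharded physical tables appear as one logical table.
        simulateInformationSchemaTables("demo_ds", Arrays.asList("t_order", "t_order_item"))
                .forEach(System.out::println);
    }
}
```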

Relevant Skills

  • Master the Java language
  • Have a basic understanding of ZooKeeper
  • Be familiar with MySQL/PostgreSQL SQL


Mentor

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Chuxin Chen, mail: tuichenchuxin (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Write a converter to generate DistSQL

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere 

Background

Currently we are promoting StorageNode as the major CRD to represent a set of storage units for ShardingSphere.

Task

The elementary task is that the storage node controller could manage the lifecycle of a set of storage units, like PostgreSQL, in Kubernetes.

We don't want to reinvent a wheel like pg-operator, so consider using a predefined parameter group to generate the target CRD.

  • [ ] Generate DistSQL according to the Golang struct `EncryptionRule`
  • [ ] Generate DistSQL according to the Golang struct `ShardingRule`
  • [ ] Generate DistSQL according to the Golang struct `ReadWriteSplittingRule`
  • [ ] Generate DistSQL according to the Golang struct `MaskRule`
  • [ ] Generate DistSQL according to the Golang struct `ShadowRule`

Relevant Skills

1. Master the Go language and the Ginkgo test framework
2. Have a basic understanding of Apache ShardingSphere concepts and DistSQL

Target files

DistSQL Converter - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/distsql/converter.go, etc.

Example

A struct defined as below:

```golang
type EncryptRule struct{}

func (t EncryptRule) ToDistSQL() string {
	// Assemble the DistSQL statement from the rule's fields (elided here).
	return "CREATE ENCRYPT RULE t_encrypt (...)"
}
```
Invoking ToDistSQL() will generate a DistSQL statement for an EncryptRule like:

```SQL
CREATE ENCRYPT RULE t_encrypt (....
```

Mentor
Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Liyao Miao, mail: miaoliyao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Introduce new CRD as StorageNode for better usability

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere 

Background

There is a proposal about the new CRDs Cluster and ComputeNode as follows:

  • #167
  • #166

Currently we are promoting StorageNode as the major CRD to represent a set of storage units for ShardingSphere.

Task

The elementary task is that the storage node controller could manage the lifecycle of a set of storage units, like PostgreSQL, in Kubernetes.

We don't want to reinvent a wheel like pg-operator, so consider using a predefined parameter group to generate the target CRD.

  • [ ] Create a PostgreSQL cluster when a StorageNode with PostgreSQL parameters is created
  • [ ] Update the PostgreSQL cluster when the StorageNode is updated
  • [ ] Delete the PostgreSQL cluster when the StorageNode is deleted; notice this may need a deletion strategy
  • [ ] Reconcile the StorageNode according to the status of the PostgreSQL cluster
  • [ ] Expose StorageNode status so it can be consumed by common storage-unit-related DistSQL statements

Relevant Skills

1. Master the Go language and the Ginkgo test framework
2. Have a basic understanding of Apache ShardingSphere concepts
3. Be familiar with the Kubernetes operator pattern and the kubebuilder framework

Target files

StorageNode Controller - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/controllers/storagenode_controller.go


Mentor

Liyao Miao, Committer of Apache ShardingSphere,  miaoliyao@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Liyao Miao, mail: miaoliyao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Introduce JVM chaos to ShardingSphere

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere

Background

There is a proposal about the background of chaos engineering as follows:

Introduce ChaosEngineering for ShardingSphere #32

We also proposed a generic controller for ShardingSphereChaos as follows:

[GSoC 2023] Introduce New CRD ShardingSphereChaos #272

The ShardingSphereChaos controller aims at different chaos tests, and this JVMChaos is an important one.

Task

Write several scripts to implement different JVMChaos for the main features of ShardingSphere. The specific case list is as follows.

  • Add scripts injecting chaos to DataSharding
  • Add scripts injecting chaos to ReadWritingSplitting
  • Add scripts injecting chaos to DatabaseDiscovery
  • Add scripts injecting chaos to Encryption
  • Add scripts injecting chaos to Mask
  • Add scripts injecting chaos to Shadow
Basically, these scripts will cause unexpected behavior while executing the related DistSQL.
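
As a sketch of the mechanism (an assumption, not existing project code): chaos can be injected through a Java agent, with the actual bytecode rewriting done by ByteMan rules or ByteBuddy.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Illustrative chaos agent skeleton; package with a Premain-Class manifest entry and
// attach with -javaagent:jvm-chaos.jar=<fully.qualified.TargetClass>.
public final class JvmChaosAgent {

    public static void premain(String args, Instrumentation inst) {
        String targetClass = args; // the feature class a chaos script wants to disturb
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className, Class<?> redefined,
                                    ProtectionDomain domain, byte[] classfileBuffer) {
                if (className != null && className.replace('/', '.').equals(targetClass)) {
                    // A real script would rewrite the bytecode here (e.g. via ByteMan rules
                    // or ByteBuddy) to inject latency or exceptions into target methods,
                    // causing the related DistSQL to behave unexpectedly.
                }
                return null; // returning null leaves the class bytes unchanged
            }
        });
    }
}
```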

Relevant Skills

  • Master the Go language and the Ginkgo test framework
  • Have a deep understanding of Apache ShardingSphere concepts and practices
  • Know JVM bytecode manipulation mechanisms such as ByteMan and ByteBuddy

Target files

JVMChaos Scripts - https://github.com/apache/shardingsphere-on-cloud/chaos/jvmchaos/scripts/

Mentor
Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org
Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Liyao Miao, mail: miaoliyao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Introduce New CRD ShardingSphereChaos

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere 

Background

There is a proposal about the background of chaos engineering as follows:

The ShardingSphereChaos controller aims at different chaos tests.

Task

Propose a generic controller for ShardingSphereChaos, which reconciles the ShardingSphereChaos CRD and prepares, executes, and verifies tests.

  • [ ] Support common ShardingSphere features; prepare test rules and datasets
  • [ ] Generate the chaos type according to the backend implementation
  • [ ] Verify test results with DistSQL or other tools

Relevant Skills

1. Master the Go language and the Ginkgo test framework
2. Have a deep understanding of Apache ShardingSphere concepts and practices
3. Be familiar with the Kubernetes operator pattern and kubebuilder

Target files

ShardingSphereChaos Controller - https://github.com/apache/shardingsphere-on-cloud/shardingsphere-operator/pkg/controllers/chaos_controller.go, etc.


Mentor

Liyao Miao, Committer of Apache ShardingSphere,  miaoliyao@apache.org

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Liyao Miao, mail: miaoliyao (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere Add ShardingSphere Kafka source connector

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The community recently added a CDC (change data capture) feature. The change feed is published over the created network connection after logging in, and can then be consumed.

Since Kafka is a popular distributed event streaming platform, it is useful to import the change feed into Kafka for later processing.

Task

  1. Get familiar with ShardingSphere CDC client usage; create a publication and subscribe to the change feed.
  2. Get familiar with Kafka connector development; develop a source connector, integrate it with ShardingSphere CDC, and persist the change feed to Kafka topics properly (a skeleton is sketched after this list).
  3. Add unit tests and E2E integration tests.
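
A rough skeleton of the source task half, using the standard Kafka Connect SourceTask API; `CdcClient` is a hypothetical stand-in for the ShardingSphere CDC client, not an existing class:

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public final class ShardingSphereCdcSourceTask extends SourceTask {

    private CdcClient client;

    @Override
    public void start(Map<String, String> props) {
        // Log in to the CDC server and subscribe to the change feed (see Local Test Steps).
        client = new CdcClient(props.get("cdc.server"), props.get("cdc.port"));
        client.subscribe(props.get("cdc.database"));
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        List<SourceRecord> records = new LinkedList<>();
        for (String change : client.fetchChanges()) { // next batch of change events
            records.add(new SourceRecord(
                    Collections.singletonMap("database", "demo_ds"),         // source partition
                    Collections.singletonMap("position", client.position()), // source offset
                    "shardingsphere-cdc-topic", Schema.STRING_SCHEMA, change));
        }
        return records;
    }

    @Override
    public void stop() {
        client.close();
    }

    @Override
    public String version() {
        return "0.1.0";
    }

    // Hypothetical stand-in for the ShardingSphere CDC client API.
    static final class CdcClient {
        CdcClient(String host, String port) { }
        void subscribe(String database) { }
        List<String> fetchChanges() { return Collections.emptyList(); }
        long position() { return 0L; }
        void close() { }
    }
}
```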

Relevant Skills

  1. Java language
  2. Basic knowledge of CDC and Kafka
  3. Maven

Local Test Steps

  1. Modify `conf/server.yaml` and uncomment `cdc-server-port: 33071` to enable CDC (refer to step 2).
  2. Configure the proxy: refer to `Prerequisites` and `Procedure` in build to configure the proxy (a newer version could be used too; the current stable version is 5.3.1).
  3. Start the proxy server; it will start the CDC server too.
  4. Download the ShardingSphere source code from https://github.com/apache/shardingsphere , then modify and run `org.apache.shardingsphere.data.pipeline.cdc.client.example.Bootstrap`. It will print `records:` by default in `Bootstrap`.
  5. Execute some INSERT/UPDATE/DELETE SQL statements in the proxy to generate the change feed, and then check it in the `Bootstrap` console.

Mentor

Hongsheng Zhong, PMC of Apache ShardingSphere, zhonghongsheng@apache.org

Xinze Guo, Committer of Apache ShardingSphere, azexin@apache.org


Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Hongsheng Zhong, mail: zhonghongsheng (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

...

Apache ShenYu GSoC 2023 - Design and implement shenyu ingress-controller in k8s

Background

Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.

Tasks

1. Discuss with mentors, and complete the requirements design and technical design of shenyu-ingress-controller.
2. Complete the initial version of shenyu-ingress-controller, implement reconciliation of the Kubernetes Ingress API, and make ShenYu the ingress gateway of Kubernetes.
3. Complete the CI tests of shenyu-ingress-controller to verify the correctness of the code.

Relevant Skills

1. Know how to use Apache ShenYu
2. Familiar with Java and Golang
3. Familiar with Kubernetes; able to use Java or Golang to develop a Kubernetes controller

Description

Issues: https://github.com/apache/shenyu/issues/4438
Website: https://shenyu.apache.org/

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Yu Xiao, mail: xiaoyu (at) apache.org
Project Devs, mail: dev (at) shenyu.apache.org

...

Apache ShenYu GSoC 2023 - ShenYu-Admin Internationalization

Background

ShenYu is a native API gateway for service proxy, protocol translation, and API governance. APIs can be managed and maintained through ShenYu-Admin, which supports internationalization in Chinese and English. Unfortunately, ShenYu-Admin is only internationalized on the front end; the message prompts returned by the back-end interface are still in English. Therefore, we need to implement internationalization support for the back-end interface. This will lay a good foundation for ShenYu to support more languages.

Relevant skills

  • Spring resources
  • Spring internationalization
  • Front-end React framework

API reference

            java.util.Locale;
            org.springframework.context.MessageSource;
            org.springframework.context.support.ResourceBundleMessageSource; 

Interface effect example

            ## zh request example
            POST http://localhost:9095/plugin
            Content-Type: application/json
            Location: cn-zh
            X-Access-Token: xxx
            {
            "name": "test-create-plugin",
            "role": "test-create-plugin",
            "enabled": true,
            "sort": 100
            }
            Response
            {
            "code": 600,
            "message": "未登录"
            }

            ## en request example
            POST http://localhost:9095/plugin
            Content-Type: application/json
            Location: en
            X-Access-Token: xxx
            {
            "name": "test-create-plugin",
            "role": "test-create-plugin",
            "enabled": true,
            "sort": 100
            }
            Response
            {
            "code": 600,
            "message": "token is error"
            }
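
The Spring APIs listed under "API reference" are sufficient to produce this behavior. A minimal, self-contained sketch (the bundle base name and message key are assumptions):

```java
import java.util.Locale;

import org.springframework.context.MessageSource;
import org.springframework.context.support.ResourceBundleMessageSource;

// Assumes two bundles on the classpath (base name and key are illustrative):
//   messages_zh.properties: login.error=未登录
//   messages_en.properties: login.error=token is error
public final class I18nDemo {

    public static void main(String[] args) {
        ResourceBundleMessageSource messageSource = new ResourceBundleMessageSource();
        messageSource.setBasename("messages");
        messageSource.setDefaultEncoding("UTF-8");

        // In ShenYu-Admin the locale would be resolved from the client's request header.
        MessageSource source = messageSource;
        System.out.println(source.getMessage("login.error", null, Locale.CHINESE)); // 未登录
        System.out.println(source.getMessage("login.error", null, Locale.ENGLISH)); // token is error
    }
}
```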

Task List

  • Discuss with the mentor how to achieve internationalization of the ShenYu-Admin back end
  • Translate the existing prompt messages
  • Integrate with the front-end internationalization: obtain the client's region information through the HTTP protocol and respond in the language of that region
  • Leave an extension interface for internationalization into other languages, to facilitate localization by subsequent users
Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Keguo Li, mail: likeguo (at) apache.org
Project Devs, mail: dev (at) shenyu.apache.org

...

[GSoC][Doris] Page Cache Improvement

Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
Page: https://doris.apache.org

Github: https://github.com/apache/doris

Background

Apache Doris accelerates high-concurrency queries by utilizing its page cache, where decompressed data is stored.
Currently, the page cache in Apache Doris uses a simple LRU algorithm, which reveals a few problems (a toy sketch after the list illustrates the first):

  • Hot data will be phased out in large queries
  • The page cache configuration is immutable and does not support GC.
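
To make the first problem concrete, here is a toy LRU cache (illustrative Java only; the actual Doris page cache is implemented in C++): a single large scan over cold pages evicts the hot page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain LRU page cache, showing why one large scan evicts hot pages.
public final class LruPageCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public LruPageCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU behavior
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used page
    }

    public static void main(String[] args) {
        LruPageCache<String, byte[]> cache = new LruPageCache<>(3);
        cache.put("hot-page", new byte[0]);
        // One large query touching many cold pages pushes the hot page out:
        for (int i = 0; i < 3; i++) {
            cache.put("cold-page-" + i, new byte[0]);
        }
        System.out.println(cache.containsKey("hot-page")); // false
    }
}
```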

Task

  • Phase One: Identify the impacts on queries when the decompressed data is stored in memory and on SSD, respectively, and then determine whether a full page cache is required.
  • Phase Two: Improve the cache strategy for Apache Doris based on the results from Phase One.

Learning Material

Page: https://doris.apache.org
Github: https://github.com/apache/doris

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Zhijing Lu, mail: luzhijing (at) apache.org
Project Devs, mail: dev (at) doris.apache.org

[GSoC][Doris] Support BigQuery/Apache Kudu/Apache Cassandra/Apache Druid in Federated Queries

Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
Page: https://doris.apache.org
Github: https://github.com/apache/doris

Background

Apache Doris supports acceleration of queries on external data sources to meet users' needs for federated queries and analysis.
Currently, Apache Doris supports multiple external catalogs including those from Hive, Iceberg, Hudi, and JDBC. Developers can connect more data sources to Apache Doris based on a unified framework.

Task
Phase One:

  • Get familiar with the Multi-Catalog structure of Apache Doris, including the metadata synchronization mechanism in FE and the data reading mechanism of BE.
  • Investigate how metadata should be acquired and how data access works regarding the picked data source(s); produce the corresponding design documentation.

Phase Two:

  • Develop connections to the picked data source(s) and implement access to metadata and data.

Learning Material

Page: https://doris.apache.org
Github: https://github.com/apache/doris

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Zhijing Lu, mail: luzhijing (at) apache.org
Project Devs, mail: dev (at) doris.apache.org

[GSoC][Doris] Dictionary Encoding Acceleration

Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
Page: https://doris.apache.org

Github: https://github.com/apache/doris

Background

In Apache Doris, dictionary encoding is performed during data writing and compaction. Dictionary encoding is applied to string data types by default. The dictionary size of a column for one segment is 1M at most. Dictionary encoding accelerates string handling during queries by, for example, converting strings into INT codes.
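
As an illustration of the idea (illustrative Java, not Doris code, which is C++): each distinct string is assigned a small integer code, and query-time predicates then compare integers instead of strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of column dictionary encoding: strings become small integer codes,
// so query-time comparisons operate on ints instead of full strings.
public final class DictionaryEncodedColumn {

    private final Map<String, Integer> dictionary = new HashMap<>();
    private final List<String> values = new ArrayList<>(); // code -> original string
    private final List<Integer> codes = new ArrayList<>(); // encoded column data

    public void append(String value) {
        Integer code = dictionary.get(value);
        if (code == null) {
            code = values.size();
            dictionary.put(value, code);
            values.add(value);
        }
        codes.add(code);
    }

    // A predicate like `col = 'beijing'` becomes one dictionary lookup plus int compares.
    public int countEquals(String value) {
        Integer code = dictionary.get(value);
        if (code == null) {
            return 0;
        }
        int count = 0;
        for (int c : codes) {
            if (c == code) {
                count++;
            }
        }
        return count;
    }
}
```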
 

Task

  • Phase One: Get familiar with the implementation of dictionary encoding in Apache Doris and learn how it accelerates queries.
  • Phase Two: Evaluate the effectiveness of full dictionary encoding and figure out how to optimize memory in such a case.

Learning Material

Page: https://doris.apache.org
Github: https://github.com/apache/doris

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Zhijing Lu, mail: luzhijing (at) apache.org
Project Devs, mail: dev (at) doris.apache.org

...

Apache SeaTunnel(Incubating) HTTP Client for SeaTunnel Zeta

Apache SeaTunnel(Incubating)

SeaTunnel is a very easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of records stably and efficiently every day, and has been used in production by nearly 100 companies.

SeaTunnel provides a Connector API that does not depend on a specific execution engine. Connectors (Source, Transform, Sink) developed based on this API can run on many different engines, such as the currently supported SeaTunnel Zeta, Flink, and Spark. SeaTunnel has supported more than 100 connectors, and the number is surging.

Website: https://seatunnel.apache.org/

GitHub: https://github.com/apache/incubator-seatunnel

Background

To use SeaTunnel, the current user needs to first create and write a config file that specifies the engine that runs the job, as well as engine-related parameters, and then define the Source, Transform, and Sink of the job. We hope to provide a client that allows users to define the engine, Source, Transform, and Sink information of a job directly in code, without having to start with a config file. The user can then submit the job definition through the client, and SeaTunnel will run the job. After the job is submitted, the user can obtain the status of the job through the client. For jobs that are already running, users can use this client to manage them, such as stopping or pausing jobs. A hypothetical sketch of such a client API follows.
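
None of the types below exist in SeaTunnel today; the names are assumptions meant to anchor the design discussion of what the client's surface could look like:

```java
import java.util.Map;

// Hypothetical API sketch -- these types do not exist in SeaTunnel yet.
public interface SeaTunnelJobClient {

    /** Submit a job assembled in code (engine + source/transform/sink definitions). */
    String submitJob(JobConfig job);

    /** Query the status of a submitted job (RUNNING, FINISHED, FAILED, ...). */
    String getJobStatus(String jobId);

    /** Manage a running job, e.g. stop it. */
    void stopJob(String jobId);

    // Hypothetical job definition carrying what the config file specifies today.
    final class JobConfig {
        public String engine;              // e.g. "zeta"
        public Map<String, String> source; // connector name + options
        public Map<String, String> transform;
        public Map<String, String> sink;
    }
}
```

An HTTP implementation of this interface would serialize the JobConfig and talk to the SeaTunnel Zeta engine, which is where the design and development work of this project would focus.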

Task

1. Discuss with the mentors what you need to do

2. Learn the details of the Apache SeaTunnel project

3. Discuss and complete design and development

Relevant Skills

  1. Familiar with Java and HTTP
  2. Familiarity with SeaTunnel is a plus

Mentor

  • Mentor: Jun Gao, Apache SeaTunnel(Incubating) PPMC Member, gaojun2048@apache.org
  • Mentor: Li Liu, Apache SeaTunnel(Incubating) Committer, ic4y@apache.org
  • Mailing List: dev@seatunnel.apache.org
Difficulty: Major
Project size: ~175 hours (medium)
Potential mentors:
Jun Gao, mail: gaojun2048 (at) apache.org
Project Devs, mail: dev (at) seatunnel.apache.org

...

Dubbo GSoC 2023 - Refactor Connection

Background

At present, the abstraction of connections by clients of the different protocols in Dubbo is imperfect. For example, there is a big discrepancy between the client-side connection abstractions of the dubbo and triple protocols. As a result, enhancing connection-related functionality in the client is complicated, and the implementations cannot be reused; at the same time, the client needs a lot of repetitive code when extending to a new protocol.

Target

Reduce the complexity of the client part when extending the protocol, and increase the reuse of connection-related modules.
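
One possible direction, sketched with assumed names (these are not existing Dubbo interfaces): a single protocol-agnostic connection contract that both the dubbo and triple clients implement, so that reconnect, idle detection, and close handling can be written once and reused.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.CompletableFuture;

// Hypothetical shared connection abstraction; names are assumptions, not Dubbo APIs.
public interface ProtocolConnection {

    /** Establish the underlying channel (TCP connection, HTTP/2 stream, ...). */
    CompletableFuture<Void> connect(InetSocketAddress address);

    /** Send a request; the protocol-specific codec is plugged in, not re-implemented. */
    CompletableFuture<Object> send(Object request);

    /** Shared concerns -- reconnect, idle detection, graceful close -- live in one place. */
    boolean isAvailable();

    void close();
}
```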

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Albumen Kevin, mail: albumenj (at) apache.org
Project Devs, mail:

Dubbo GSoC 2023 - IDL management

Background

Dubbo currently supports Protobuf as a serialization method. Protobuf relies on proto files (IDL) for code generation, but Dubbo currently lacks tools for managing IDL files. For example, Java users need the proto files for every compilation, which is troublesome, since everyone is used to depending on jar packages.

Target

Implement an IDL management and control platform that supports automatically generating dependency packages in various languages from IDL files and pushing them to the relevant dependency repositories.

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Albumen Kevin, mail: albumenj (at) apache.org
Project Devs, mail:

...

Dubbo GSoC 2023 - Refactor the http layer

Background

Dubbo currently supports the rest protocol based on HTTP/1 and the triple protocol based on HTTP/2, but these two HTTP-based protocols are implemented independently; they cannot swap their underlying implementations, and each is relatively costly to implement.

Target

In order to reduce maintenance costs, we hope to abstract the HTTP layer: the underlying HTTP implementation should be independent of the protocol, so that different protocols can reuse the related implementation. A possible shape of such an abstraction is sketched below.
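
A possible shape, with assumed type names (not existing Dubbo APIs): a single request/response contract that is indifferent to whether HTTP/1 or HTTP/2 carries it, so both rest and triple can sit on top of it.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical HTTP abstraction shared by rest (HTTP/1) and triple (HTTP/2);
// the types are illustrative assumptions, not existing Dubbo interfaces.
public interface HttpTransport {

    /** One request/response contract, whether carried over HTTP/1.1 or HTTP/2. */
    CompletableFuture<HttpResponse> send(HttpRequest request);

    record HttpRequest(String method, String path, Map<String, String> headers, byte[] body) { }

    record HttpResponse(int status, Map<String, String> headers, byte[] body) { }
}
```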

Difficulty: Major
Project size: ~350 hours (large)
Potential mentors:
Albumen Kevin, mail: albumenj (at) apache.org
Project Devs, mail:

...