This page is auto-generated! Please do NOT edit it, all changes will be lost on next update


James Server

Adopt Pulsar as the messaging technology backing the distributed James server

A good long term objective for the PMC is to drop RabbitMQ in
favor of Pulsar (third parties could package their own components using
RabbitMQ if they wish...)

This means:

  • Solve the bugs that were found during the Pulsar MailQueue review
  • Pulsar MailQueue needs to allow listing blobs in order to be
    deduplication friendly.
  • Provide an event bus based on Pulsar
  • Provide a task manager based on Pulsar
  • Package a distributed server backed by Pulsar, deprecate then replace
    the current one.
  • (optionally) support mail queue priorities

While contributions would of course be welcomed on this topic, we could
offer it as part of GSOC 2022, and we could co-mentor it with mentors of
the Pulsar community (see [3])


Would such a plan gain traction around here?

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Benoit Tellier, mail: btellier (at)
Project Devs, mail: dev (at)


GSOC Varnish Cache support in Apache Traffic Control

Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

There are multiple aspects to this project:

  • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
  • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
  • Testing: Adding automated tests for new code


  • Proficiency in Go is required
  • A basic knowledge of HTTP and caching is preferred, but not required for this project.
Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Eric Friedrich, mail: friede (at)
Project Devs, mail: dev (at)


Apache ShardingSphere Support mainstream database metadata table query

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.



ShardingSphere has designed its own metadata database to simulate metadata queries that support various databases.

More details:


  • Support PostgreSQL And openGauss `\d tableName`
  • Support PostgreSQL And openGauss `\d+`
  • Support PostgreSQL And openGauss `\d+ tableName`
  • Support PostgreSQL And openGauss `\l`
  • Support query for MySQL metadata `TABLES`
  • Support query for MySQL metadata `COLUMNS`
  • Support query for MySQL metadata `schemata`
  • Support query for MySQL metadata `ENGINES`
  • Support query for MySQL metadata `FILES`
  • Support query for MySQL metadata `VIEWS`

Notice, these issues can be a good example.

Relevant Skills

  •  Master JAVA language
  •  Have a basic understanding of Zookeeper
  •  Be familiar with MySQL/Postgres SQLs 


Chuxin Chen, Committer of Apache ShardingSphere,

Zhengqiang Duan, PMC of Apache ShardingSphere,

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Chuxin Chen, mail: tuichenchuxin (at)
Project Devs, mail: dev (at)

Apache ShardingSphere Add the feature of switching logging framework

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.



ShardingSphere provides two adapters: ShardingSphere-JDBC and ShardingSphere-Proxy.

Now, ShardingSphere uses logback for logging, but consider the following situations:

  • Users may need to switch the logging framework to meet special needs, such as log4j2 can provide better asynchronous performance;
  • When using the JDBC adapter, the user application may not use logback, which may cause some conflicts.

Why doesn't the log facade suffice? Because ShardingSphere provides users with clustered logging configurations (such as changing the log level online), this requires dynamic construction of logger, which cannot be achieved with only the log facade.


1. Design and implement logging SPI to support multiple logging frameworks (such as logback and log4j2)
2. Allow users to choose which logging framework to use through the logging rule
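
The two goals above can be pictured with a small, language-agnostic sketch. It is written in Python only for brevity and is not ShardingSphere code: `LoggingProvider`, `register`, `load_provider` and the shape of the logging rule are all hypothetical, and the two providers are in-memory stand-ins rather than real logback or log4j2 bindings. The point it illustrates is that the SPI exposes operations such as changing a log level at runtime, which a plain log facade does not offer.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class LoggingProvider(ABC):
    """Hypothetical SPI: one implementation per logging framework."""

    @abstractmethod
    def set_level(self, logger_name: str, level: str) -> None:
        """Change the level of a logger at runtime."""

    @abstractmethod
    def get_level(self, logger_name: str) -> str:
        """Return the current level of a logger."""


# Registry of available implementations, keyed by the type named in the rule.
PROVIDERS: Dict[str, Type[LoggingProvider]] = {}


def register(type_name: str):
    """Class decorator that makes a provider discoverable by name."""
    def decorator(cls):
        PROVIDERS[type_name] = cls
        return cls
    return decorator


class _InMemoryProvider(LoggingProvider):
    """Stand-in behaviour shared by both illustrative providers."""

    def __init__(self) -> None:
        self._levels: Dict[str, str] = {}

    def set_level(self, logger_name: str, level: str) -> None:
        self._levels[logger_name] = level.upper()

    def get_level(self, logger_name: str) -> str:
        return self._levels.get(logger_name, "INFO")


@register("LOGBACK")
class LogbackProvider(_InMemoryProvider):
    """Placeholder for a logback-backed implementation."""


@register("LOG4J2")
class Log4j2Provider(_InMemoryProvider):
    """Placeholder for a log4j2-backed implementation."""


def load_provider(logging_rule: dict) -> LoggingProvider:
    """Pick the provider named in the (hypothetical) logging rule."""
    type_name = logging_rule.get("type", "LOGBACK")
    try:
        return PROVIDERS[type_name]()
    except KeyError:
        raise ValueError(f"Unknown logging framework: {type_name}") from None


provider = load_provider({"type": "LOG4J2"})
provider.set_level("org.apache.shardingsphere", "debug")
```

In a real Java implementation the registry would more likely be backed by the project's own SPI loading mechanism; the dictionary here only stands in for that lookup.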

Relevant Skills

1. Master JAVA language

2. Basic knowledge of logback and log4j2

3. Maven


Longtao Jiang, Committer of Apache ShardingSphere,

Trista Pan, PMC of Apache ShardingSphere,

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Longtao Jiang, mail: jianglongtao (at)
Project Devs, mail: dev (at)

Apache ShardingSphere Add ShardingSphere Kafka source connector

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.



The community recently added a CDC (change data capture) feature. After a client logs in, the change feed is published over the established network connection, where it can then be consumed.

Since Kafka is a popular distributed event streaming platform, it is useful to import the change feed into Kafka for later processing.


  1. Get familiar with ShardingSphere CDC client usage: create a publication and subscribe to the change feed.
  2. Get familiar with Kafka connector development, develop a source connector and integrate it with ShardingSphere CDC. Persist the change feed to Kafka topics properly.
  3. Add unit tests and E2E integration tests.

Relevant Skills

1. Java language

2. Basic knowledge of CDC and Kafka

3. Maven



Hongsheng Zhong, PMC of Apache ShardingSphere,

Xinze Guo, Committer of Apache ShardingSphere,

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Hongsheng Zhong, mail: zhonghongsheng (at)
Project Devs, mail: dev (at)

Apache ShardingSphere Enhance ComputeNode reconciliation

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.



There is a proposal for the new CRDs Cluster and ComputeNode, as follows:

Currently we are promoting ComputeNode as the major CRD to represent a specific ShardingSphere Proxy deployment, and we plan to use Cluster to represent a specific ShardingSphere Proxy cluster.


This issue is to enhance ComputeNode reconciliation availability. The specific case list is as follows.

  •  Add IT test case for Deployment spec volume
  •  Add IT test case for Deployment spec template init containers
  •  Add IT test case for Deployment spec template spec containers
  •  Add IT test case for Deployment spec volume mounts
  •  Add IT test case for Deployment spec container ports
  •  Add IT test case for Deployment spec container image tag
  •  Add IT test case for Service spec ports
  •  Add IT test case for ConfigMap data serverconfig
  •  Add IT test case for ConfigMap data logback
    Notice, these issues can be a good example.
  • chore: add more Ginkgo tests for ComputeNode #203

Relevant Skills

  1. Master Go language, Ginkgo test framework
  2. Have a basic understanding of Apache ShardingSphere Concepts
  3. Be familiar with Kubernetes Operator, kubebuilder framework

Targets files

ComputeNode IT -


Liyao Miao, Committer of Apache ShardingSphere,

Chuxin Chen, Committer of Apache ShardingSphere,

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Chuxin Chen, mail: tuichenchuxin (at)
Project Devs, mail: dev (at)

Apache ShardingSphere Enhance SQLNodeConverterEngine to support more MySQL SQL statements

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.



The ShardingSphere SQL federation engine provides support for complex SQL statements, and it can well support cross-database join queries, subqueries, aggregation queries and other statements. An important part of SQL federation engine is to convert the SQL statement parsed by ShardingSphere into SqlNode, so that Calcite can be used to implement SQL optimization and federated query.


This issue is to solve the MySQL exception that occurs during SQLNodeConverterEngine conversion. The specific case list is as follows.

  • select_char
  • select_extract
  • select_from_dual
  • select_from_with_table
  • select_group_by_with_having_and_window
  • select_not_between_with_single_table
  • select_not_in_with_single_table
  • select_substring
  • select_trim
  • select_weight_string
  • select_where_with_bit_expr_with_ampersand
  • select_where_with_bit_expr_with_caret
  • select_where_with_bit_expr_with_div
  • select_where_with_bit_expr_with_minus_interval
  • select_where_with_bit_expr_with_mod
  • select_where_with_bit_expr_with_mod_sign
  • select_where_with_bit_expr_with_plus_interval
  • select_where_with_bit_expr_with_signed_left_shift
  • select_where_with_bit_expr_with_signed_right_shift
  • select_where_with_bit_expr_with_vertical_bar
  • select_where_with_boolean_primary_with_comparison_subquery
  • select_where_with_boolean_primary_with_is
  • select_where_with_boolean_primary_with_is_not
  • select_where_with_boolean_primary_with_null_safe
  • select_where_with_expr_with_and_sign
  • select_where_with_expr_with_is
  • select_where_with_expr_with_is_not
  • select_where_with_expr_with_not
  • select_where_with_expr_with_not_sign
  • select_where_with_expr_with_or_sign
  • select_where_with_expr_with_xor
  • select_where_with_predicate_with_in_subquery
  • select_where_with_predicate_with_regexp
  • select_where_with_predicate_with_sounds_like
  • select_where_with_simple_expr_with_collate
  • select_where_with_simple_expr_with_match
  • select_where_with_simple_expr_with_not
  • select_where_with_simple_expr_with_odbc_escape_syntax
  • select_where_with_simple_expr_with_row
  • select_where_with_simple_expr_with_tilde
  • select_where_with_simple_expr_with_variable
  • select_window_function
  • select_with_assignment_operator
  • select_with_assignment_operator_and_keyword
  • select_with_case_expression
  • select_with_collate_with_marker
  • select_with_date_format_function
  • select_with_exists_sub_query_with_project
  • select_with_function_name
  • select_with_json_value_return_type
  • select_with_match_against
  • select_with_regexp
  • select_with_schema_name_in_column_projection
  • select_with_schema_name_in_shorthand_projection
  • select_with_spatial_function
  • select_with_trim_expr
  • select_with_trim_expr_from_expr

You need to compare the difference between actual and expected, and then correct the logic in SQLNodeConverterEngine so that actual can be consistent with expected.

After you make changes, remember to add case to SUPPORTED_SQL_CASE_IDS to ensure it can be tested.

Notice, these issues can be a good example.

Relevant Skills

1. Master JAVA language

2. Have a basic understanding of Antlr g4 file

3. Be familiar with MySQL and Calcite SqlNode

Targets files



Zhengqiang Duan, PMC of Apache ShardingSphere,

Chuxin Chen, Committer of Apache ShardingSphere,

Trista Pan, PMC of Apache ShardingSphere,

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhengqiang Duan, mail: duanzhengqiang (at)
Project Devs, mail: dev (at)


Code Insights for Apache StreamPipes

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.


StreamPipes has grown significantly throughout recent years. We were able to introduce a lot of new features and attracted both users and contributors. Putting the cherry on the cake, we graduated as an Apache top level project in December 2022. We will of course continue developing new features and never rest to make StreamPipes even more amazing. However, since we are approaching our `1.0` release at full steam, we also want the project to become more mature. Therefore, we want to address one of our Achilles' heels: our test coverage.

Don't worry, this issue is not about implementing myriads of tests for our code base. As a first step, we would like to make the status quo transparent. That means we want to measure our code coverage consistently across the whole codebase (Backend, UI, Python library) and report the coverage to codecov. Furthermore, to benchmark ourselves and motivate us to provide tests with every contribution, we would like to lock the current test coverage as a lower threshold that we always want to meet (meaning that CI builds fail if coverage drops below it). Over time, we can then raise the required coverage level step by step.
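
As a rough illustration of such a threshold check, the sketch below compares a measured coverage percentage against a locked baseline and returns a process exit code that a CI step could use. The function name, the `tolerance` parameter and the example numbers are assumptions made for illustration; they are not part of the StreamPipes build or of the codecov tooling.

```python
def coverage_gate(current_percent: float,
                  baseline_percent: float,
                  tolerance: float = 0.0) -> int:
    """Return 0 when coverage holds the baseline, 1 when it dropped.

    `tolerance` allows a small drop (in percentage points) so that
    rounding noise does not fail the build.
    """
    if not 0.0 <= current_percent <= 100.0:
        raise ValueError("coverage must be a percentage between 0 and 100")
    if current_percent + tolerance >= baseline_percent:
        return 0
    return 1


# Example: the baseline is locked at 62.4 % (made-up number).
ok_exit_code = coverage_gate(current_percent=63.1, baseline_percent=62.4)
failed_exit_code = coverage_gate(current_percent=60.0, baseline_percent=62.4)
```

A CI job would typically pass the returned value to `sys.exit`, so that a drop below the baseline marks the build as failed; raising the baseline later is then a one-line change.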

Beyond monitoring our test coverage, we also want to invest in better, cleaner code. Therefore, we would like to adopt sonarcloud for our repository.


  • [ ] calculate test coverage for all main parts of the repo
  • [ ] send coverage to codeCov
  • [ ] determine coverage threshold and let CI fail if below
  • [ ] include sonarcloud in CI setup
  • [ ] include automatic coverage report in PR validation (see an example here ) -> optional
  • [ ] include automatic sonarcloud report in PR validation -> optional
  • [ ] whatever comes to mind 💡 further ideas are always welcome

❗Important Note❗

Do not create any account on behalf of Apache StreamPipes in Sonarcloud or in CodeCov or using the name of Apache StreamPipes for any account creation. Your mentor will take care of it.

Relevant Skills

  • basic knowledge about GitHub workflows

Learning Material


You can find our corresponding issue on GitHub here

Name and Contact Information

Name: Tim Bossenmaier

email:  bossenti[at]

community: dev[at]


Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Tim Bossenmaier, mail: bossenti (at)
Project Devs, mail: dev (at)

Improving End-to-End Test Infrastructure of Apache StreamPipes

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.


StreamPipes has grown significantly over the past few years, with new features and contributors joining the project. However, as the project continues to evolve, e2e test coverage must also be improved to ensure that all features remain functional. Modern frameworks, such as Cypress, make it quite easy and fun to automatically test even complex application functionalities. As StreamPipes approaches its 1.0 release, it is important to improve e2e testing to ensure the robustness of the project and its use in real-world scenarios.


  • [ ] Write e2e tests using Cypress to cover most functionalities and user interface components of StreamPipes.
  • [ ] Add more complex testing scenarios to ensure the reliability and robustness of StreamPipes in real-world use cases (e.g. automated tests for version updates)
  • [ ] Add e2e tests for the new Python client to ensure its integration with the main system and its functionalities (#774)
  • [ ] Document the testing infrastructure and the testing approach to allow for easy maintenance and future contributions.

    ❗ ***Important Note*** ❗

Do not create any account on behalf of Apache StreamPipes in Cypress or using the name of Apache StreamPipes for any account creation. Your mentor will take care of it.

Relevant Skills

  • Familiarity with testing frameworks, such as Cypress or Selenium
  • Experience with TypeScript or Java
  • Basic knowledge of Angular is helpful
  • Familiarity with Docker and containerization is a plus

    Learning Material


You can find our corresponding issue on GitHub here

Name and Contact Information

Name: Philipp Zehnder

email:  zehnder[at]

community: dev[at]


Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Philipp Zehnder, mail: zehnder (at)
Project Devs, mail: dev (at)

OPC-UA browser for Apache StreamPipes

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.


StreamPipes has grown significantly throughout recent years. We were able to introduce a lot of new features and attracted both users and contributors. Putting the cherry on the cake, we graduated as an Apache top level project in December 2022. We will of course continue developing new features and never rest to make StreamPipes even more amazing.

StreamPipes really shines when connecting Industrial IoT data. Such data sources typically originate from machine controllers, called PLCs (e.g., Siemens S7). But there are also new protocols such as OPC-UA which allow browsing the data available within the controller. Our goal is to make connectivity of industrial data sources a matter of minutes.

Currently, data sources can be connected using the built-in module `StreamPipes Connect` from the UI. We provide a set of adapters for popular protocols that can be customized, e.g., connection details can be added. 

To make it even easier to connect industrial data sources with StreamPipes, we plan to add an OPC-UA browser. This will be part of the entry page of StreamPipes connect and should allow users to enter connection details of an existing OPC-UA server. Afterwards, a new view in the UI shows available data nodes from the server, their status and current value. Users should be able to select values that should be part of a new adapter. Afterwards, a new adapter can be created by reusing the current workflow to create an OPC-UA data source.

This is a really cool project for participants interested in full-stack development who would like to get a deeper understanding of industrial IoT protocols. Have fun! 


  • [ ] get familiar with the OPC-UA protocol
  • [ ] develop mockups which demonstrate the user workflow
  • [ ] develop a data model for discovering data from OPC-UA
  • [ ] create the backend business logic for the OPC-UA browser 
  • [ ] create the frontend views to asynchronously browse data and to create a new adapter
  • [ ] write Junit, Component and E2E tests
  • [ ] whatever comes to mind 💡 further ideas are always welcome


 Relevant Skills

  • interest in Industrial IoT and protocols such as OPC-UA
  • Java development skills
  • Angular/Typescript development skills

Anyways, the most important relevant skill is motivation and readiness to learn during the project!

Learning Material


Github issue can be found here:

Name and contact information

  • Mentor: Dominik Riemer (riemer[at]
  • Mailing list: (dev[at]
  • Website:

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Dominik Riemer, mail: riemer (at)
Project Devs, mail: dev (at)


GSoC Implement python client for RocketMQ 5.0

Apache RocketMQ

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.



RocketMQ 5.0 has released clients in various languages, including Java, CPP, and Golang. To cover all major programming languages, a Python client needs to be implemented.

Related Repo:


The developer is required to be familiar with the Java implementation and capable of developing a Python client, while ensuring consistent functionality and semantics.

Relevant Skills
Python language
Basic knowledge of RocketMQ 5.0


Yangkun Ai, PMC of Apache RocketMQ,

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yangkun Ai, mail: aaronai (at)
Project Devs, mail: dev (at)

GSoC Integrate RocketMQ 5.0 client with Spring


Apache RocketMQ

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.



The RocketMQ 5.0 client has been released recently, and we need to integrate it with Spring.

Related issue:


  1. Familiar with RocketMQ 5.0 java client usage, you could see more details from and
  2. Integrate with Spring.

Relevant Skills

  1. Java language
  2. Basic knowledge of RocketMQ 5.0
  3. Spring


Rongtong Jin, PMC of Apache RocketMQ,

Yangkun Ai, PMC of Apache RocketMQ,

Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Yangkun Ai, mail: aaronai (at)
Project Devs, mail: dev (at)

GSoC Make RocketMQ support higher versions of Java

 Apache RocketMQ

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.



RocketMQ is a widely used message middleware system in the Java community, which mainly supports Java 8. As Java has evolved, many new features and improvements have been added to the language and the Java Virtual Machine (JVM). However, RocketMQ still lacks compatibility with the latest Java versions, preventing users from taking advantage of new features and performance improvements. Therefore, we are seeking community support to upgrade RocketMQ to support higher versions of Java and enable the use of new features and JVM parameters.


We aim to update the RocketMQ codebase to support newer versions of Java in a cross-compile manner. The goal is to enable RocketMQ to work with Java 17, while maintaining backward compatibility with previous versions of Java. This will involve identifying and updating any dependencies that need to be changed to support the new Java versions, as well as testing and verifying that the new version of RocketMQ works correctly. With these updates, users will be able to take advantage of the latest Java features and performance improvements. We hope that the community can come together to support this task and make RocketMQ a more versatile and powerful middleware system.

Relevant Skills

  1. Java language
  2. Having a good understanding of the new features in higher versions of Java, particularly LTS versions.


Yangkun Ai, PMC of Apache RocketMQ,

Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Yangkun Ai, mail: aaronai (at)
Project Devs, mail: dev (at)

[GSoC] [RocketMQ] The performance tuning of RocketMQ proxy

Apache RocketMQ

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.



RocketMQ 5.0 has released a new module called `proxy`, which supports gRPC and remoting protocol. Additionally, it can be deployed in two modes, namely Local and Cluster modes. The performance tuning task will provide contributors with a comprehensive understanding of Apache RocketMQ and its intricate data flow, presenting a unique opportunity for beginners to acquaint themselves with and actively participate in our community.


The task is to tune the RocketMQ proxy for optimal performance in terms of latency and throughput. The developer should possess a thorough knowledge of the Java implementation and be able to fine-tune Netty, gRPC, the operating system, and RocketMQ itself. We anticipate that the developer responsible for this task will provide a performance report with measurements of both latency and throughput.

Relevant Skills

Basic knowledge of RocketMQ 5.0, Netty, gRPC, and operating system.

Mailing List:
Zhouxiang Zhan, committer of Apache RocketMQ,

Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Zhouxiang Zhan, mail: zhouxzhan (at)
Project Devs, mail: dev (at)

RocketMQ TieredStore Integration with High Availability Architecture

Apache RocketMQ

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.



With the official release of RocketMQ 5.1.0, tiered storage has arrived as a new independent module in the Technical Preview milestone. This allows users to unload messages from local disks to other cheaper storage, extending message retention time at a lower cost.

Reference RIP-57:

In addition, RocketMQ introduced a new high availability architecture in version 5.0.

Reference RIP-44:

However, currently RocketMQ tiered storage only supports single replicas.


Currently, tiered storage only supports single replicas, and there are still the following issues in the integration with the high availability architecture:

  • Metadata synchronization: how to reliably synchronize metadata between master and slave nodes.
  • Disallowing message uploads beyond the confirm offset: to avoid message rollback, the maximum uploaded offset cannot exceed the confirm offset.
  • Starting tiered storage upload when the slave becomes the master, and stopping it when the master becomes the slave: only the master node has write and delete permissions, and after a slave node is promoted, it needs to quickly resume the upload from the point where it was interrupted.
  • Design of slave pull protocol: how a newly launched empty slave can properly synchronize data through the tiered storage architecture. (If synchronization is performed based on the first or last file, resumption of breakpoints may not be possible when switching again).
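
The second and third points above can be illustrated with a minimal sketch. It is not RocketMQ code: the function name and parameters are hypothetical, offsets are modelled as plain integers, `uploaded_offset` is taken to be the first offset not yet uploaded, and `confirm_offset` is treated as an exclusive upper bound.

```python
from typing import Optional, Tuple


def next_upload_range(uploaded_offset: int,
                      local_max_offset: int,
                      confirm_offset: int,
                      is_master: bool) -> Optional[Tuple[int, int]]:
    """Return the half-open offset range [start, end) that is safe to upload.

    Returns None when nothing may be uploaded, either because this node
    is a slave or because everything up to the confirm offset is already
    in tiered storage.
    """
    if not is_master:
        # Only the master has write permission on tiered storage.
        return None
    # Never upload past the confirm offset, otherwise a later rollback
    # could leave unconfirmed messages in tiered storage.
    end = min(local_max_offset, confirm_offset)
    if end <= uploaded_offset:
        return None
    return (uploaded_offset, end)


# A freshly promoted master resumes from the last uploaded position.
resume_range = next_upload_range(uploaded_offset=100, local_max_offset=250,
                                 confirm_offset=180, is_master=True)
```

How `uploaded_offset` is learned after a role switch is exactly the metadata synchronization question raised in the first point, so the sketch simply takes it as an input.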

So you need to provide a complete plan to solve the above issues and ultimately complete the integration of tiered storage and high availability architecture, while verifying it through the existing tiered storage file version and OpenChaos testing.

Relevant Skills

  • Interest in messaging middleware and distributed storage systems
  • Java development skills
  • Having a good understanding of RocketMQ tiered storage and high availability architecture
Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Rongtong Jin, mail: jinrongtong (at)
Project Devs, mail: dev (at)


[GSOC] [SkyWalking] AIOps Log clustering with Flink (Algorithm Optimization)

Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation using a revised architecture, and this task will require the student to focus on algorithm optimization for the clustering technique.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yihao Chen, mail: yihaochen (at)
Project Devs, mail: dev (at)

[GSOC] [SkyWalking] AIOps Log clustering with Flink (Flink Integration)

Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation using a revised architecture, and this task will require the student to focus on Flink and its integration with SkyWalking OAP.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yihao Chen, mail: yihaochen (at)
Project Devs, mail: dev (at)

[GSOC] [SkyWalking] Python Agent Performance Enhancement Plan

Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This task is about enhancing Python agent performance; the tracking issue can be seen here:

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yihao Chen, mail: yihaochen (at)
Project Devs, mail: dev (at)

[GSOC] [SkyWalking] Pending Task on K8s

Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This task is about a pending task on K8s.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yihao Chen, mail: yihaochen (at)
Project Devs, mail: dev (at)

[SkyWalking] Add Terraform provider for Apache SkyWalking

Currently, the deployment methods for SkyWalking are limited: we only provide a Helm Chart for users to deploy in Kubernetes, so users who are not using Kubernetes have to do all the housekeeping themselves to set up SkyWalking on, for example, a VM.

This issue aims to add a Terraform provider so that users can conveniently spin up a cluster for demonstration or testing. We should then evolve the provider to allow users to customize it as they need, so that they can finally use it in their production environment.

In this task, we will mainly focus on support for AWS. In the Terraform provider, users need to provide their access key / secret key, and the provider does the rest: create VMs, create database/OpenSearch or RDS, download SkyWalking tars, configure SkyWalking, start the SkyWalking components (OAP/UI), create public IPs/domain name, etc.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhenxu Ke, mail: kezhenxu94 (at)
Project Devs, mail: dev (at)


[GSoC][Doris]Dictionary Encoding Acceleration

Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.



In Apache Doris, dictionary encoding is performed during data writing and compaction. Dictionary encoding will be implemented on string data types by default. The dictionary size of a column for one segment is 1M at most. The dictionary encoding technology accelerates strings during queries, converting them into INT, for example.
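
To make the idea concrete, here is a minimal, generic sketch of dictionary encoding and of an equality filter that runs on the integer codes. It illustrates the technique only and does not reflect how Apache Doris implements it; the function names and the sample data are assumptions.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes) where codes[i] indexes into dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        code = index.get(value)
        if code is None:
            # First time this string is seen: assign the next free code.
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


def filter_equals(dictionary: List[str], codes: List[int],
                  needle: str) -> List[int]:
    """Return the row positions equal to needle, comparing integers only."""
    try:
        target = dictionary.index(needle)
    except ValueError:
        return []  # value is absent from the dictionary, so no row matches
    return [row for row, code in enumerate(codes) if code == target]


cities = ["Paris", "Tokyo", "Paris", "Lima", "Tokyo"]
dictionary, codes = dictionary_encode(cities)
matching_rows = filter_equals(dictionary, codes, "Tokyo")
```

The string comparison happens once, against the dictionary; every row is then checked with an integer comparison, which is where the query speed-up comes from. The memory cost of keeping a full dictionary per column is the trade-off that Phase Two asks to evaluate.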


  • Phase One: Get familiar with the implementation of dictionary encoding in Apache Doris; learn how it accelerates queries.
  • Phase Two: Evaluate the effectiveness of full dictionary encoding and figure out how to optimize memory usage in that case.

Learning Material



Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhijing Lu, mail: luzhijing (at)
Project Devs, mail: dev (at)

[GSoC][Doris] Supports BigQuery/Apache Kudu/Apache Cassandra/Apache Druid in Federated Queries

Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.


Apache Doris supports acceleration of queries on external data sources to meet users' needs for federated queries and analysis.
Currently, Apache Doris supports multiple external catalogs including those from Hive, Iceberg, Hudi, and JDBC. Developers can connect more data sources to Apache Doris based on a unified framework.


Phase One:

  • Get familiar with the Multi-Catalog structure of Apache Doris, including the metadata synchronization mechanism in FE and the data reading mechanism of BE.
  • Investigate how metadata should be acquired and how data access works regarding the picked data source(s); produce the corresponding design documentation.

Phase Two:

  • Develop connections to the picked data source(s) and implement access to metadata and data.

Learning Material



Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhijing Lu, mail: luzhijing (at)
Project Devs, mail: dev (at)

[GSoC][Doris] Page Cache Improvement

Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.



Apache Doris accelerates high-concurrency queries using a page cache, which stores decompressed data.
Currently, the page cache in Apache Doris uses a simple LRU algorithm, which has a few problems:

  • Hot data can be evicted by large queries.
  • The page cache configuration is immutable and does not support GC.
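The first problem can be reproduced with a minimal LRU cache. The sketch below is a generic illustration, not the Doris page cache: the `LRUPageCache` class, its capacity, and the page identifiers are all hypothetical. A single large scan that touches each page once is enough to push frequently used pages out.

```python
from collections import OrderedDict


class LRUPageCache:
    """Minimal LRU cache: the least recently used page is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()

    def get(self, page_id):
        if page_id not in self._pages:
            return None
        # A hit marks the page as most recently used.
        self._pages.move_to_end(page_id)
        return self._pages[page_id]

    def put(self, page_id, data):
        self._pages[page_id] = data
        self._pages.move_to_end(page_id)
        if len(self._pages) > self.capacity:
            # Drop the least recently used entry (front of the ordering).
            self._pages.popitem(last=False)

    def __contains__(self, page_id):
        return page_id in self._pages


cache = LRUPageCache(capacity=4)

# Two hot pages that are read repeatedly.
for hot in ("hot-1", "hot-2"):
    cache.put(hot, b"hot page")
cache.get("hot-1")
cache.get("hot-2")

# One large query reads ten pages, each exactly once.
for n in range(10):
    cache.put(f"scan-{n}", b"scan page")

hot_pages_survived = [p for p in ("hot-1", "hot-2") if p in cache]
```

After the scan, `hot_pages_survived` is empty: plain LRU only looks at recency, so pages that are read once by a large scan displace pages that are read often. Scan-resistant policies address this by also taking access frequency into account.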


  • Phase One: Identify the impacts on queries when the decompressed data is stored in memory and SSD, respectively, and then determine whether full page cache is required.
  • Phase Two: Improve the cache strategy for Apache Doris based on the results from Phase One.

Learning Material



Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhijing Lu, mail: luzhijing (at)
Project Devs, mail: dev (at)


Apache EventMesh EventMesh official website docs by version and demo show

Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.



Upstream Issue:


We hope that the community can contribute to the maintenance of documents, including the archiving of Chinese and English content of documents of different release versions, the maintenance of official website documents, the improvement of project quick start documents, feature introduction, etc.


1. Discuss with the mentors what you need to do

2. Learn the details of the Apache EventMesh project

3. Improve and supplement the documentation on GitHub, maintain the official website documents, and record EventMesh quick-start and feature demonstration videos

Recommended Skills

1. Familiar with Markdown

2. Familiar with Java/Go

Eason Chen, PPMC of Apache EventMesh

Mike Xue, PPMC of Apache EventMesh

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Xue Weiming, mail: mikexue (at)
Project Devs, mail: dev (at)

Apache EventMesh Integrate eventmesh runtime on Kubernetes

Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.



Upstream Issue:


Currently, EventMesh has good usability in microservice scenarios. However, EventMesh's support for Kubernetes is still relatively weak. We hope the community can contribute an integration of EventMesh with Kubernetes.


1. Discuss your implementation idea with the mentors

2. Learn the details of the Apache EventMesh project

3. Integrate EventMesh with Kubernetes

Recommended Skills

1. Familiar with Java

2. Familiar with Kubernetes

Eason Chen, PPMC of Apache EventMesh

Mike Xue, PPMC of Apache EventMesh

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Xue Weiming, mail: mikexue (at)
Project Devs, mail: dev (at)


Apache ShenYu GSoC 2023 - Support for Kubernetes Service Discovery


Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.


1. Support the registration of microservices deployed in K8s Pods to shenyu-admin and use K8s as the register center.
2. Discuss with mentors, and complete the requirements design and technical design of the ShenYu K8s Register Center.
3. Complete the initial version of the ShenYu K8s Register Center.
4. Complete the CI tests of the ShenYu K8s Register Center to verify the correctness of the code.
5. Write the necessary documentation, deployment guides, and instructions for users to connect microservices running inside K8s Pods to ShenYu.

Relevant Skills

1. Know how to use Apache ShenYu, especially the register center
2. Familiar with Java and Golang
3. Familiar with Kubernetes and able to develop for it in Java or Golang

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Yonglun Zhang, mail: zhangyonglun (at)
Project Devs, mail: dev (at)

Community Development

Add server indicator if a server is a cache

Difficulty: Trivial
Project size: ~175 hour (medium)
Potential mentors:
Brennan Fieck, mail: ocket8888 (at)
Project Devs, mail:

Apache Nemo

Dynamic Work Stealing on Nemo for handling skews

We aim to handle the problems caused by throttled (heterogeneous) resources and skewed input data. To solve them, we suggest dynamic work stealing, which dynamically tracks task statuses and lets tasks steal workloads from each other. To do this, we have the following action items:

  • Dynamically collecting task statistics during execution
  • Detecting skewed tasks periodically
  • Splitting the data allocated in skewed tasks and reallocating them into new tasks
  • Synchronizing the optimization procedure
  • Evaluation of the resulting implementations
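
The detection and splitting steps above could look roughly like the sketch below. It is a simplified, single-process illustration, not Nemo code. The function names, the median-based threshold, and the `factor` parameter are assumptions made for this example.

```python
from statistics import median
from typing import Dict, List


def detect_skewed(task_sizes: Dict[str, int], factor: float = 2.0) -> List[str]:
    """Treat tasks whose remaining input exceeds factor x the median as skewed.

    The median-based rule and the default factor are illustrative assumptions.
    """
    threshold = factor * median(task_sizes.values())
    return sorted(task for task, size in task_sizes.items() if size > threshold)


def split_task(task_id: str, records: List[int], parts: int) -> Dict[str, List[int]]:
    """Split records into `parts` nearly equal contiguous chunks, one per new task."""
    base, extra = divmod(len(records), parts)
    chunks: Dict[str, List[int]] = {}
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional record each.
        end = start + base + (1 if i < extra else 0)
        chunks[f"{task_id}-{i}"] = records[start:end]
        start = end
    return chunks


# Hypothetical statistics collected during execution: remaining records per task.
remaining = {"task-a": 100, "task-b": 120, "task-c": 900, "task-d": 110}
skewed = detect_skewed(remaining)

# Reallocate the skewed task's remaining data into four new tasks.
stolen = split_task("task-c", list(range(900)), parts=4)
```

In a real scheduler the statistics would be collected periodically from running executors, and splitting would have to be synchronized with the task that currently owns the data. Neither is modeled here.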
Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Wonook, mail: wonook (at)
Project Devs, mail: dev (at)


Airavata Jupyter Platform Services

  1. UI Framework
    1. To host the Jupyter environment, we will need to wrap the notebooks in a user interface and connect it with Apache Airavata services
    2. Leverage Airavata communications from within the Django Portal - 
    3. Explore whether the platform is better developed as VSCode extensions, leveraging Jupyter extensions like -
    4. Alternatively, explore developing a standalone native application using ElectronJS
  2. Draft a platform architecture: Airavata-based infrastructure with functionality similar to Colab.
  3. Authenticate with the Airavata Custos Framework - 
  4. Extend the notebook filesystem using the virtual file system approach, integrating with Airavata-based storage and catalog
  5. Register the notebooks with the Airavata app catalog and experiment catalog.

Advanced Possibilities:

Explore Multi-tenanted JupyterHub 

  • Can K8s namespace isolation accomplish this?
  • Make deployment of Jupyter support part of the default core
  • Data and user-level tenancy can be assumed; how do we make sure the infrastructure isolates tenants, so that, for example, one gateway cannot crash a hosting environment?
  1. How to leverage computational resources from JupyterHub
Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Suresh Marru, mail: smarru (at)
Project Devs, mail: dev (at)

Dashboards to get quick statistics

Gateway admins need periodic reports for various reporting and planning purposes.

Features Include:

  • Compute resources that had at least one job submitted during the period <start date - end date>
  • User groups created within a given period, with the number of users in each, their permission levels, and the number of jobs each user has submitted
  • Applications and the number of jobs for each application in a given period, grouped by job status
  • Number of users who submitted at least one job during the period <start date - end date>
  • Total number of unique users
  • User registration trends
  • Number of experiments for a given period <start date - end date>, grouped by experiment status
  • The total CPU-hours used by each user, sorted, quarterly, plotted over a period of time
  • The total CPU-hours consumed by each application, sorted, quarterly, plotted over a period of time
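
Two of these statistics can be sketched as simple aggregations over experiment records. The example below is a sketch only: the `Experiment` record, its fields, the status names, and the sample data are hypothetical and do not reflect the Airavata data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Set


@dataclass(frozen=True)
class Experiment:
    """Hypothetical, simplified experiment record."""
    user: str
    application: str
    status: str
    created: date


def in_period(exp: Experiment, start: date, end: date) -> bool:
    """True when the experiment was created within [start, end], inclusive."""
    return start <= exp.created <= end


def experiments_by_status(exps: List[Experiment], start: date, end: date) -> Dict[str, int]:
    """Number of experiments in the period, grouped by status."""
    return dict(Counter(e.status for e in exps if in_period(e, start, end)))


def active_users(exps: List[Experiment], start: date, end: date) -> Set[str]:
    """Users who submitted at least one experiment in the period."""
    return {e.user for e in exps if in_period(e, start, end)}


# Made-up sample data for illustration.
records = [
    Experiment("alice", "app-a", "COMPLETED", date(2023, 1, 10)),
    Experiment("alice", "app-a", "FAILED", date(2023, 2, 3)),
    Experiment("bob", "app-b", "COMPLETED", date(2023, 2, 20)),
    Experiment("carol", "app-b", "COMPLETED", date(2023, 5, 1)),
]
q1_start, q1_end = date(2023, 1, 1), date(2023, 3, 31)
status_counts = experiments_by_status(records, q1_start, q1_end)
q1_users = active_users(records, q1_start, q1_end)
```

In a real dashboard these aggregations would more likely be pushed down to the database as grouped queries rather than computed in application memory.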

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Suresh Marru, mail: smarru (at)
Project Devs, mail: dev (at)

Enhance File Transports in MFT

Complete all transports in MFT

  • Currently SCP and S3 are known to work
  • Others need effort to optimize, test, and declare readiness
  • Develop a complete, fully functional MFT command-line interface
  • Have a feature-complete Python SDK
  • A minimum implementation will be provided; students need to complete and test it.
Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Suresh Marru, mail: smarru (at)
Project Devs, mail: dev (at)

Custos Backup and Restore

Custos does not have the capability to efficiently back up and restore a live instance. This is essential for highly available services.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Suresh Marru, mail: smarru (at)
Project Devs, mail: dev (at)

Airavata Rich Client based on ElectronJS

Using the SEAGrid Rich Client as an example, develop a native application based on ElectronJS that mimics the Airavata Django Portal.

Reference example - 

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Suresh Marru, mail: smarru (at)
Project Devs, mail: dev (at)