
Implement a web ui for James administration

James today provides a command line tool for administration tasks such as creating a domain, listing users, setting quotas, etc.
It requires access to the JMX port, and even if a lot of admins are comfortable with such tools, to broaden our user base we should probably expose the same commands over REST and provide a fancy default web UI.
The task needs some basic skills with frontend tools to design an administration board, knowledge of what REST means, and enough Java understanding to add commands to the existing REST backend.
In the team, we have a strong focus on testing (who wants a mail server that is not tested enough?), so we will explain and/or teach the student how to get the right test coverage of the features using modern tools like Cucumber, Selenium, rest-assured, etc.
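To make the REST idea concrete, here is a minimal, self-contained sketch of what the routing logic for one such command (adding a domain) might look like. All names are hypothetical, not James' actual webadmin code; a real implementation would wire this into an HTTP server and delegate to the same backend logic the JMX command uses.

```java
import java.util.HashSet;
import java.util.Set;

public class DomainRouteSketch {

    // Hypothetical handler for "PUT /domains/{name}"; returns an HTTP status code.
    public static int handle(String method, String path, Set<String> domains) {
        if (!"PUT".equals(method) || !path.startsWith("/domains/")) {
            return 404; // unknown route
        }
        String domain = path.substring("/domains/".length());
        if (domain.isEmpty()) {
            return 400; // missing domain name
        }
        domains.add(domain);
        return 204; // success, no content
    }

    public static void main(String[] args) {
        Set<String> domains = new HashSet<>();
        int status = handle("PUT", "/domains/example.com", domains);
        System.out.println(status + " " + domains.contains("example.com")); // prints: 204 true
    }
}
```

A handler shaped like this is also easy to cover with rest-assured-style tests, which matches the testing focus mentioned above.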

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Matthieu Baechler, mail: matthieu (at) apache.org
Project Devs, mail: dev (at) james.apache.org

Beam

[GSoC][Beam] An IntelliJ plugin to develop Apache Beam pipelines and the Apache Beam SDKs

Beam library developers and Beam users would appreciate this : )

This project involves prototyping a few different solutions, so it will be large.


Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Pablo Estrada, mail: pabloem (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

TrafficControl

GSOC Varnish Cache support in Apache Traffic Control

Background
Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

There are multiple aspects to this project:

  • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
  • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
  • Testing: Add automated tests for new code
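As an illustration of the configuration-generation aspect, the sketch below renders a minimal VCL backend block from a host/port pair. It is written in Java for brevity here; per the project description, the real generator would be implemented in Go inside Traffic Ops and the cache-side utilities, and the function name is hypothetical.

```java
public class VclGenSketch {

    // Renders a minimal VCL backend declaration from a host/port pair.
    public static String backendVcl(String name, String host, String port) {
        return "backend " + name + " {\n"
             + "    .host = \"" + host + "\";\n"
             + "    .port = \"" + port + "\";\n"
             + "}\n";
    }

    public static void main(String[] args) {
        System.out.print(backendVcl("origin", "origin.example.com", "80"));
    }
}
```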

Skills:

  • Proficiency in Go is required
  • A basic knowledge of HTTP and caching is preferred, but not required for this project.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Eric Friedrich, mail: friede (at) apache.org
Project Devs, mail: dev (at) trafficcontrol.apache.org

Add server indicator if a server is a cache

Difficulty: Trivial
Project size: ~175 hour (medium)
Potential mentors:
Brennan Fieck, mail: ocket8888 (at) apache.org
Project Devs, mail: dev (at) trafficcontrol.apache.org

ShardingSphere

RocketMQ

[GSoC] RocketMQ TieredStore Integration with HDFS

Github Issue: https://github.com/apache/rocketmq/issues/6282

Apache RocketMQ and HDFS

  • Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
  • Hadoop Distributed File System (HDFS) is a distributed file system designed to store and manage large data sets across multiple servers or clusters. HDFS provides a reliable, scalable, and fault-tolerant platform for storing and accessing data that can be accessed by a variety of applications running on the Hadoop cluster.

Background

High-speed storage media, such as solid-state drives (SSDs), are typically more expensive than traditional hard disk drives (HDDs). To minimize storage costs, the local data disk size of a RocketMQ broker is often limited. HDFS can store large amounts of data at a lower cost, and it has better support for storing and retrieving data sequentially rather than randomly. In order to preserve message data over a long period or to facilitate message export, the RocketMQ project previously introduced a tiered storage plugin. Now it is necessary to implement a storage plugin that saves data on HDFS.

Relevant Skills

  • Interest in messaging middleware and distributed storage systems
  • Java development skills
  • A good understanding of the RocketMQ and HDFS models

In any case, the most important relevant skill is motivation and readiness to learn during the project!

Tasks

  • Understand the basic concepts and principles of distributed systems
  • Provide related design documents
  • Develop a storage plugin that uses HDFS as the backend to store RocketMQ message data
  • Write effective unit test code
  • (Optional) Suggest improvements to the tiered storage interface
  • (Optional) Whatever else comes to your mind; further ideas are always welcome
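To make the plugin task concrete, here is a rough sketch of the kind of backend contract such a plugin implements. The interface and names are hypothetical, not RocketMQ's actual TieredStore API; an in-memory map stands in for HDFS so the sketch is self-contained, whereas a real provider would call Hadoop's FileSystem API.

```java
import java.util.HashMap;
import java.util.Map;

public class TieredStoreSketch {

    // Hypothetical contract a tiered-storage backend provider implements.
    public interface SegmentStore {
        long append(String segment, byte[] data); // returns the write offset
        byte[] read(String segment, long offset, int length);
    }

    // In-memory stand-in for HDFS, used here so the sketch is runnable.
    public static final class MemoryStore implements SegmentStore {
        private final Map<String, byte[]> segments = new HashMap<>();

        public long append(String segment, byte[] data) {
            byte[] old = segments.getOrDefault(segment, new byte[0]);
            byte[] merged = new byte[old.length + data.length];
            System.arraycopy(old, 0, merged, 0, old.length);
            System.arraycopy(data, 0, merged, old.length, data.length);
            segments.put(segment, merged);
            return old.length; // offset where this write started
        }

        public byte[] read(String segment, long offset, int length) {
            byte[] all = segments.get(segment);
            byte[] out = new byte[length];
            System.arraycopy(all, (int) offset, out, 0, length);
            return out;
        }
    }

    public static void main(String[] args) {
        SegmentStore store = new MemoryStore();
        store.append("commitlog-0", "abc".getBytes());
        long offset = store.append("commitlog-0", "de".getBytes());
        System.out.println(offset + ":" + new String(store.read("commitlog-0", offset, 2))); // prints: 3:de
    }
}
```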

Learning Material

Apache ShardingSphere Support mainstream database metadata table query

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere has designed its own metadata database to simulate metadata queries that support various databases.

More details:

https://github.com/apache/shardingsphere/issues/21268
https://github.com/apache/shardingsphere/issues/22052

Task

  • Support PostgreSQL and openGauss `\d tableName`
  • Support PostgreSQL and openGauss `\d+`
  • Support PostgreSQL and openGauss `\d+ tableName`
  • Support PostgreSQL and openGauss `\l`
  • Support query for MySQL metadata `TABLES`
  • Support query for MySQL metadata `COLUMNS`
  • Support query for MySQL metadata `schemata`
  • Support query for MySQL metadata `ENGINES`
  • Support query for MySQL metadata `FILES`
  • Support query for MySQL metadata `VIEWS`

Note: these pull requests can serve as good examples.

https://github.com/apache/shardingsphere/pull/22053
https://github.com/apache/shardingsphere/pull/22057
https://github.com/apache/shardingsphere/pull/22166
https://github.com/apache/shardingsphere/pull/22182

Relevant Skills

  •  Mastery of the Java language
  •  A basic understanding of ZooKeeper
  •  Familiarity with MySQL/PostgreSQL SQL

Mentor

Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

Name and contact information

Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Chuxin Chen, mail: tuichenchuxin (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhimin Li, mail: lizhimin (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache ShardingSphere Add the feature of switching logging framework

GSoC Implement python client for RocketMQ 5.0

Apache RocketMQ

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

Page: https://rocketmq.apache.org

Background

RocketMQ 5.0 has released various language clients, including Java, C++, and Golang. To cover all major programming languages, a Python client needs to be implemented.

Related Repo: https://github.com/apache/rocketmq-clients

Task

The developer is required to be familiar with the Java implementation and capable of developing a Python client, while ensuring consistent functionality and semantics.

Relevant Skills
Python language
Basic knowledge of RocketMQ 5.0

Mentor

Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org

Difficulty: Major

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere provides two adapters: ShardingSphere-JDBC and ShardingSphere-Proxy.

Now, ShardingSphere uses logback for logging, but consider the following situations:

  • Users may need to switch the logging framework to meet special needs; for example, log4j2 can provide better asynchronous performance;
  • When using the JDBC adapter, the user application may not use logback, which may cause some conflicts.

Why doesn't a log facade suffice? Because ShardingSphere provides users with clustered logging configurations (such as changing the log level online), loggers must be constructed dynamically, which cannot be achieved with a log facade alone.

Task

1. Design and implement logging SPI to support multiple logging frameworks (such as logback and log4j2)
2. Allow users to choose which logging framework to use through the logging rule
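A minimal sketch of what such an SPI could look like. All type and method names here are hypothetical, not ShardingSphere's actual API, and a plain map stands in for the SPI loader so the sketch is self-contained.

```java
import java.util.Map;

public class LoggingSpiSketch {

    // Hypothetical SPI: each supported logging framework ships one provider.
    public interface LoggingFrameworkProvider {
        String type();                                  // e.g. "LOGBACK", "LOG4J2"
        void setLevel(String loggerName, String level); // dynamic reconfiguration hook
    }

    public static final class LogbackProvider implements LoggingFrameworkProvider {
        public String type() { return "LOGBACK"; }
        public void setLevel(String loggerName, String level) { /* delegate to logback */ }
    }

    public static final class Log4j2Provider implements LoggingFrameworkProvider {
        public String type() { return "LOG4J2"; }
        public void setLevel(String loggerName, String level) { /* delegate to log4j2 */ }
    }

    // In ShardingSphere the provider would be discovered through its SPI loader;
    // a plain map stands in for that mechanism here.
    public static LoggingFrameworkProvider load(String type) {
        Map<String, LoggingFrameworkProvider> registry = Map.of(
                "LOGBACK", new LogbackProvider(),
                "LOG4J2", new Log4j2Provider());
        return registry.get(type);
    }

    public static void main(String[] args) {
        // The "logging rule" from task 2 would carry this type string.
        System.out.println(load("LOG4J2").type()); // prints: LOG4J2
    }
}
```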

Relevant Skills

1. Mastery of the Java language

2. Basic knowledge of logback and log4j2

3. Maven

Mentor

Longtao Jiang, Committer of Apache ShardingSphere, jianglongtao@apache.org

Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Longtao Jiang, mail: jianglongtao (at) apache.org
Yangkun Ai, mail: aaronai (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache ShardingSphere Add ShardingSphere Kafka source connector

GSoC Make RocketMQ support higher versions of Java

 Apache RocketMQ

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

Page: https://rocketmq.apache.org

Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The community recently added a CDC (change data capture) feature. The change feed is published on a network connection created after logging in, and can then be consumed.

Since Kafka is a popular distributed event streaming platform, it is useful to import the change feed into Kafka for later processing.

Task

  1. Become familiar with ShardingSphere CDC client usage; create a publication and subscribe to the change feed.
  2. Become familiar with Kafka connector development; develop a source connector and integrate it with ShardingSphere CDC. Persist the change feed to Kafka topics properly.
  3. Add unit tests and E2E integration tests.
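A toy sketch of the record-mapping step such a connector performs. The shapes below are hypothetical; a real implementation would extend Kafka Connect's SourceConnector/SourceTask and emit SourceRecord objects built from the CDC change feed.

```java
import java.util.Map;

public class CdcToKafkaSketch {

    // Hypothetical CDC change record: table name, operation, and row values.
    public record ChangeRecord(String table, String op, Map<String, Object> row) {}

    // Maps a change record to a destination topic, e.g. one topic per table.
    public static String topicFor(String topicPrefix, ChangeRecord record) {
        return topicPrefix + "." + record.table();
    }

    // Serializes the payload; a real connector would build a structured value.
    public static String valueFor(ChangeRecord record) {
        return record.op() + ":" + record.row();
    }

    public static void main(String[] args) {
        ChangeRecord record = new ChangeRecord("t_order", "INSERT", Map.of("order_id", 1));
        System.out.println(topicFor("cdc", record)); // prints: cdc.t_order
    }
}
```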

Relevant Skills

1. Java language

2. Basic knowledge of CDC and Kafka

3. Maven

References

RocketMQ is a widely used message middleware system in the Java community, which mainly supports Java 8. As Java has evolved, many new features and improvements have been added to the language and the Java Virtual Machine (JVM). However, RocketMQ still lacks compatibility with the latest Java versions, preventing users from taking advantage of new features and performance improvements. Therefore, we are seeking community support to upgrade RocketMQ to support higher versions of Java and enable the use of new features and JVM parameters.

Task

We aim to update the RocketMQ codebase to support newer versions of Java in a cross-compile manner. The goal is to enable RocketMQ to work with Java17, while maintaining backward compatibility with previous versions of Java. This will involve identifying and updating any dependencies that need to be changed to support the new Java versions, as well as testing and verifying that the new version of RocketMQ works correctly. With these updates, users will be able to take advantage of the latest Java features and performance improvements. We hope that the community can come together to support this task and make RocketMQ a more versatile and powerful middleware system.
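One common approach to the cross-compile requirement above is the maven-compiler-plugin `release` option, which compiles against the Java 8 API even when the build runs on a newer JDK such as 17. This is a sketch; RocketMQ's actual module configuration may differ:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- compile against the Java 8 API so artifacts stay backward compatible,
         while the build itself can run on JDK 17 -->
    <release>8</release>
  </configuration>
</plugin>
```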

Relevant Skills

  1. Java language
  2. Having a good understanding of the new features in higher versions of Java, particularly LTS versions.

Mentor

Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org

Difficulty: Major
Project size: ~175 hour (medium)
Potential mentors:
Yangkun Ai, mail: aaronai (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

[GSoC] [RocketMQ] The performance tuning of RocketMQ proxy

Apache RocketMQ

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.

  • https://github.com/apache/shardingsphere/issues/22500
  • https://kafka.apache.org/documentation/#connect_development
  • https://github.com/apache/kafka/tree/trunk/connect/file/src
  • https://github.com/confluentinc/kafka-connect-jdbc

    Mentor

    Hongsheng Zhong, PMC of Apache ShardingSphere, zhonghongsheng@apache.org

    Xinze Guo, Committer of Apache ShardingSphere, azexin@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Hongsheng Zhong, mail: zhonghongsheng (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    Page: https://rocketmq.apache.org
    Repo: https://github.com/apache/rocketmq

    Background

    RocketMQ 5.0 has released a new module called `proxy`, which supports gRPC and remoting protocol. Additionally, it can be deployed in two modes, namely Local and Cluster modes. The performance tuning task will provide contributors with a comprehensive understanding of Apache RocketMQ and its intricate data flow, presenting a unique opportunity for beginners to acquaint themselves with and actively participate in our community.

    Task

    The task is to tune the RocketMQ proxy for optimal performance in terms of latency and throughput. It requires a thorough knowledge of the Java implementation and the ability to fine-tune Netty, gRPC, the operating system, and RocketMQ itself. We expect the developer responsible for this task to provide a performance report with measurements of both latency and throughput.

    Relevant Skills

    Basic knowledge of RocketMQ 5.0, Netty, gRPC, and operating systems.
     

    Mailing List: dev@rocketmq.apache.org
     
    Mentor
    Zhouxiang Zhan, Committer of Apache RocketMQ, zhouxzhan@apache.org

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Zhouxiang Zhan, mail: zhouxzhan (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    GSoC Integrate RocketMQ 5.0 client with Spring

     

    Apache RocketMQ

    Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

    Page: https://rocketmq.apache.org
    Github: https://github.com/apache/rocketmq

    Apache ShardingSphere Enhance ComputeNode reconciliation

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

    Page: https://shardingsphere.apache.org/
    Github: https://github.com/apache/shardingsphere 

    Background

    There is a proposal about the new CRDs Cluster and ComputeNode as follows:

    Currently we are promoting ComputeNode as the major CRD to represent a special ShardingSphere Proxy deployment, and we plan to use Cluster to indicate a special ShardingSphere Proxy cluster.

    Task

    This issue is to enhance ComputeNode reconciliation availability. The specific case list is as follows.

    •  Add IT test case for Deployment spec volume
    •  Add IT test case for Deployment spec template init containers
    •  Add IT test case for Deployment spec template spec containers
    •  Add IT test case for Deployment spec volume mounts
    •  Add IT test case for Deployment spec container ports
    •  Add IT test case for Deployment spec container image tag
    •  Add IT test case for Service spec ports
    •  Add IT test case for ConfigMap data serverconfig
    •  Add IT test case for ConfigMap data logback
       
      Note: this issue can serve as a good example.
    • chore: add more Ginkgo tests for ComputeNode #203

    Relevant Skills

    1. Mastery of the Go language and the Ginkgo test framework
    2. A basic understanding of Apache ShardingSphere concepts
    3. Familiarity with Kubernetes Operators and the kubebuilder framework

    Targets files

    ComputeNode IT - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/reconcile/computenode/compute_node_test.go

    Mentor

    Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

    Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org


    Background

    RocketMQ 5.0 client has been released recently; we need to integrate it with Spring.

    Related issue: https://github.com/apache/rocketmq-clients/issues/275

    Task

    1. Become familiar with RocketMQ 5.0 Java client usage; you can see more details at https://github.com/apache/rocketmq-clients/tree/master/java and https://rocketmq.apache.org/docs/quickStart/01quickstart
    2. Integrate with Spring.
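    As a rough illustration, the integration could expose a small template-style abstraction registered as a Spring bean. Everything below is hypothetical (not the actual rocketmq-clients or Spring API); a stub stands in for the real producer so the sketch is self-contained.

```java
public class RocketTemplateSketch {

    // Hypothetical minimal abstraction the Spring integration could expose.
    public interface MessageSender {
        String send(String topic, String body); // returns a message id
    }

    // Stub standing in for the real RocketMQ 5.0 producer; the actual bean
    // would delegate to the rocketmq-clients producer and return its id.
    public static final class StubSender implements MessageSender {
        private int sequence;

        public String send(String topic, String body) {
            return topic + "-" + (++sequence); // stand-in for a broker-assigned id
        }
    }

    public static void main(String[] args) {
        MessageSender sender = new StubSender();
        System.out.println(sender.send("greetings", "hello")); // prints: greetings-1
    }
}
```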

    Relevant Skills

    1. Java language
    2. Basic knowledge of RocketMQ 5.0
    3. Spring

    Mentor

    Rongtong Jin, PMC of Apache RocketMQ, jinrongtong@apache.org

    Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Yangkun Ai, mail: aaronai (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Chuxin Chen, mail: tuichenchuxin (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache ShardingSphere Enhance SQLNodeConverterEngine to support more MySQL SQL statements

    RocketMQ TieredStore Integration with High Availability Architecture

    Apache RocketMQ

    Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere


    Background

    With the official release of RocketMQ 5.1.0, tiered storage has arrived as a new independent module in the Technical Preview milestone. This allows users to unload messages from local disks to other cheaper storage, extending message retention time at a lower cost.

    Reference RIP-57: https://github.com/apache/rocketmq/wiki/RIP-57-Tiered-storage-for-RocketMQ

    Background

    The ShardingSphere SQL federation engine provides support for complex SQL statements, and supports cross-database join queries, subqueries, aggregation queries, and other statements well. An important part of the SQL federation engine is converting the SQL statement parsed by ShardingSphere into a SqlNode, so that Calcite can be used to implement SQL optimization and federated queries.

    Task

    This issue is to solve the MySQL exception that occurs during SQLNodeConverterEngine conversion. The specific case list is as follows.

    • select_char
    • select_extract
    • select_from_dual
    • select_from_with_table
    • select_group_by_with_having_and_window
    • select_not_between_with_single_table
    • select_not_in_with_single_table
    • select_substring
    • select_trim
    • select_weight_string
    • select_where_with_bit_expr_with_ampersand
    • select_where_with_bit_expr_with_caret
    • select_where_with_bit_expr_with_div
    • select_where_with_bit_expr_with_minus_interval
    • select_where_with_bit_expr_with_mod
    • select_where_with_bit_expr_with_mod_sign
    • select_where_with_bit_expr_with_plus_interval
    • select_where_with_bit_expr_with_signed_left_shift
    • select_where_with_bit_expr_with_signed_right_shift
    • select_where_with_bit_expr_with_vertical_bar
    • select_where_with_boolean_primary_with_comparison_subquery
    • select_where_with_boolean_primary_with_is
    • select_where_with_boolean_primary_with_is_not
    • select_where_with_boolean_primary_with_null_safe
    • select_where_with_expr_with_and_sign
    • select_where_with_expr_with_is
    • select_where_with_expr_with_is_not
    • select_where_with_expr_with_not
    • select_where_with_expr_with_not_sign
    • select_where_with_expr_with_or_sign
    • select_where_with_expr_with_xor
    • select_where_with_predicate_with_in_subquery
    • select_where_with_predicate_with_regexp
    • select_where_with_predicate_with_sounds_like
    • select_where_with_simple_expr_with_collate
    • select_where_with_simple_expr_with_match
    • select_where_with_simple_expr_with_not
    • select_where_with_simple_expr_with_odbc_escape_syntax
    • select_where_with_simple_expr_with_row
    • select_where_with_simple_expr_with_tilde
    • select_where_with_simple_expr_with_variable
    • select_window_function
    • select_with_assignment_operator
    • select_with_assignment_operator_and_keyword
    • select_with_case_expression
    • select_with_collate_with_marker
    • select_with_date_format_function
    • select_with_exists_sub_query_with_project
    • select_with_function_name
    • select_with_json_value_return_type
    • select_with_match_against
    • select_with_regexp
    • select_with_schema_name_in_column_projection
    • select_with_schema_name_in_shorthand_projection
    • select_with_spatial_function
    • select_with_trim_expr
    • select_with_trim_expr_from_expr

    You need to compare the difference between the actual and expected results, and then correct the logic in SQLNodeConverterEngine so that the actual output is consistent with the expected output.

    After you make changes, remember to add the case to SUPPORTED_SQL_CASE_IDS to ensure it can be tested.


    In addition, RocketMQ introduced a new high availability architecture in version 5.0.

    Reference RIP-44: https://github.com/apache/rocketmq/wiki/RIP-44-Support-DLedger-Controller

    However, currently RocketMQ tiered storage only supports single replicas.


    Task

    Currently, tiered storage only supports single replicas, and there are still the following issues in the integration with the high availability architecture:

    • Metadata synchronization: how to reliably synchronize metadata between master and slave nodes.
    • Disallowing message uploads beyond the confirm offset: to avoid message rollback, the maximum uploaded offset cannot exceed the confirm offset.
    • Starting multi-tier storage upload when the slave changes to master, and stopping tiered storage upload when the master becomes the slave: only the master node has write and delete permissions, and after the slave node is promoted, it needs to quickly resume tiered storage breakpoint resumption.
    • Design of slave pull protocol: how a newly launched empty slave can properly synchronize data through the tiered storage architecture. (If synchronization is performed based on the first or last file, resumption of breakpoints may not be possible when switching again).

    So you need to provide a complete plan to solve the above issues and ultimately complete the integration of tiered storage and high availability architecture, while verifying it through the existing tiered storage file version and OpenChaos testing.


    Relevant Skills

    • Interest in messaging middleware and distributed storage systems
    • Java development skills
    • Having a good understanding of RocketMQ tiered storage and high availability architecture
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Rongtong Jin, mail: jinrongtong (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    RocketMQ DLedger Controller Performance Optimization

    Apache RocketMQ

    Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.
    Page: https://rocketmq.apache.org
    Repo: https://github.com/apache/rocketmq

    Background

    RocketMQ 5.0 introduced a new component, the Controller, which controls the high-availability master-slave switch in multi-replica scenarios. It uses the DLedger Raft library as a consensus-replicated state machine for metadata. As a completely independent component, it runs normally in most scenarios, but large-scale clusters need to maintain a large number of broker groups, which poses a great challenge for operational capabilities and wastes resources. When dealing with a large number of broker groups, we need to optimize performance for large-scale scenarios, leveraging DLedger's own high-performance writes and optimizing the current Controller architecture.

    Task

    1. Polish the usage of DLedger

    Currently, on the Controller side, a single-threaded task queue is used for read and write requests to DLedger; that is, only one read/write request can be processed at a time. DLedger itself implements many optimizations for multi-client reads and writes and can ensure linearly consistent reads. However, all read and write processing is currently performed through a single logical DLedger client, which will become a serious performance bottleneck in large-scale scenarios.

    2. Optimization of DLedger features usage

    DLedger itself can implement many optimizations, such as ReadIndex read and FollowerRead read. After implementation, we can fully leverage the performance of reads. Currently, all Broker nodes communicate with the Leader node of the Controller. In large-scale scenarios, this will cause the requests of each Controller group to be concentrated on the Leader node, and the other Follower nodes will not share the request processing of the Leader, which will cause single-point performance bottlenecks for the Leader.

    3. Full asynchronous + parallel processing

    Currently, DLedger itself is fully asynchronous, but on the Controller side all requests to DLedger are synchronous, and many Controller-side operations, such as heartbeat checks and other timed tasks, are performed synchronously in a single thread. In large-scale scenarios, these single-threaded synchronous operations will block a large number of requests from the broker side, so asynchronous and parallel processing can be used for optimization.

    4. Correctness testing and performance testing

    After completing the above optimizations, it is necessary to conduct correctness testing on the new version and use distributed chaos testing frameworks such as OpenChaos to verify correct operation under fault scenarios such as network partition and random crashes.
    After completing the correctness testing, a detailed performance testing report can be produced by comparing the new and old versions.
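    The asynchronous and parallel processing direction in item 3 can be sketched with CompletableFuture: submit requests to DLedger concurrently instead of through a single-threaded queue, and aggregate results when they complete. The names below are hypothetical stand-ins, not the Controller's actual code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class AsyncControllerSketch {

    // Hypothetical stand-in for an asynchronous DLedger append; here it just
    // "writes" the entry and reports its size from a worker thread.
    public static CompletableFuture<Long> appendAsync(ExecutorService pool, String entry) {
        return CompletableFuture.supplyAsync(() -> (long) entry.length(), pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Submit many requests in parallel instead of one at a time.
        List<CompletableFuture<Long>> futures = List.of("a", "bb", "ccc").stream()
                .map(entry -> appendAsync(pool, entry))
                .collect(Collectors.toList());
        // Aggregate once all writes complete; no request blocks another.
        long total = futures.stream().mapToLong(CompletableFuture::join).sum();
        System.out.println(total); // prints: 6
        pool.shutdown();
    }
}
```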

    Skills Required

    • Strong interest in message middleware and distributed storage systems
    • Proficient in Java development
    • In-depth understanding of distributed consensus algorithms
    • In-depth understanding of the high-availability module of RocketMQ and the DLedger library
    • Understanding of distributed chaos testing and performance testing.
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Rongtong Jin, mail: jinrongtong (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    GSoC Observability Improvement for RocketMQ Streams

    RocketMQ Streams 

    RocketMQ Streams is a lightweight stream processing framework; applications gain stream processing ability by depending on RocketMQ Streams as an SDK.

    Background

    Repo of RocketMQ Streams: https://github.com/apache/rocketmq-streams

    Note: this issue can serve as a good example.
    https://github.com/apache/shardingsphere/pull/14492

    Relevant Skills

    1. Mastery of the Java language

    2. A basic understanding of Antlr g4 files

    3. Familiarity with MySQL and Calcite SqlNode

    Target files

    SQLNodeConverterEngineIT

    https://github.com/apache/shardingsphere/blob/master/test/it/optimizer/src/test/java/org/apache/shardingsphere/test/it/optimize/SQLNodeConverterEngineIT.java 

    Mentor

    Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

    Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

    The architecture document of RocketMQ Streams: see the RocketMQ Streams examples and the architecture doc.

    Task

    The observability needs to be enhanced in the following aspects:

    • Metrics for client/processor/thread/state/RocksDB;
    • The topology of the streaming process;
    • Multi-output of metrics.

    This task needs you to study the implementation details of RocketMQ Streams and find out the key indicators in the stream processing process. Design and implement a complete set of observability solutions, and finally use them to complete runtime problem diagnosis.

    Mentor

    Ni Ze, Committer of Apache RocketMQ, karp@apache.org

     

    Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhengqiang Duan, mail: duanzhengqiang (at) apache.org
    Ni Ze, mail: karp (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    RocketMQ

    StreamPipes

    Code Insights for Apache StreamPipes

    Apache StreamPipes

    Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.

    Background

    StreamPipes has grown significantly throughout recent years. We were able to introduce a lot of new features and attracted both users and contributors. Putting the cherry on the cake, we were graduated as an Apache top level project in December 2022. We will of course continue developing new features and never rest to make StreamPipes even more amazing. Although, since we are approaching with full stream towards our `1.0` release, we want to project also to get more mature. Therefore, we want to address one of our Achilles' heels: our test coverage.

    Don't worry, this issue is not about implementing myriads of tests for our code base. As a first step, we would like to make the status quo transparent. That means we want to measure our code coverage consistently across the whole codebase (Backend, UI, Python library) and report the coverage to codecov. Furthermore, to benchmark ourselves and motivate us to provide tests with every contributing, we would like to lock the current test coverage as an lower threshold that we always want to achieve (meaning in case we drop CI builds fail etc). With time we then can increase the required coverage lever step to step.

Beyond monitoring our test coverage, we also want to invest in better and cleaner code. Therefore, we would like to adopt sonarcloud for our repository.

    Tasks

    • [ ] calculate test coverage for all main parts of the repo
    • [ ] send coverage to codeCov
    • [ ] determine coverage threshold and let CI fail if below
    • [ ] include sonarcloud in CI setup
    • [ ] include automatic coverage report in PR validation (see an example here ) -> optional
    • [ ] include automatic sonarcloud report in PR validation -> optional
• [ ] whatever comes to your mind 💡 further ideas are always welcome
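The coverage-gate task above can be sketched as a tiny check. This is only a sketch with hypothetical numbers; in CI the measured value would come from a JaCoCo or codecov report:

```java
// Sketch of a coverage gate: fail the build when measured coverage drops
// below the locked threshold. The numbers are hypothetical; in CI the
// measured value would be parsed from a coverage report.
class CoverageGate {

    // Returns true when the measured coverage meets the locked threshold.
    static boolean passes(double measured, double threshold) {
        return measured >= threshold;
    }

    public static void main(String[] args) {
        double locked = 0.62;                     // hypothetical status-quo threshold
        System.out.println(passes(0.65, locked)); // coverage improved -> true
        System.out.println(passes(0.60, locked)); // coverage dropped  -> false
    }
}
```

Locking the threshold at the status quo means no contribution can lower coverage, while raising the threshold later tightens the gate step by step.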


    ❗Important Note❗

Do not create any account on behalf of Apache StreamPipes on Sonarcloud or CodeCov, or use the name of Apache StreamPipes for any account creation. Your mentor will take care of it.


    Relevant Skills

• basic knowledge about GitHub workflows

    Learning Material


    References

    You can find our corresponding issue on GitHub here


    Name and Contact Information

    Name: Tim Bossenmaier

    email:  bossenti[at]apache.org

    community: dev[at]streampipes.apache.org

website: https://streampipes.apache.org/

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Tim Bossenmaier, mail: bossenti (at) apache.org
    Project Devs, mail: dev (at) streampipes.apache.org

[GSoC] RocketMQ TieredStore Integration with HDFS

    Github Issue: https://github.com/apache/rocketmq/issues/6282

    Apache RocketMQ and HDFS

    • Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
    • Hadoop Distributed File System (HDFS) is a distributed file system designed to store and manage large data sets across multiple servers or clusters. HDFS provides a reliable, scalable, and fault-tolerant platform for storing and accessing data that can be accessed by a variety of applications running on the hadoop cluster.

    Background

High-speed storage media, such as solid-state drives (SSDs), are typically more expensive than traditional hard disk drives (HDDs). To minimize storage costs, the local data disk size of a RocketMQ broker is often limited. HDFS can store large amounts of data at a lower cost, and it is better suited to sequential rather than random reads and writes. In order to preserve message data over a long period or facilitate message export, the RocketMQ project previously introduced a tiered storage plugin. Now it is necessary to implement a storage plugin that saves data on HDFS.

    Relevant Skills

• Interest in messaging middleware and distributed storage systems
    • Java development skills
    • A good understanding of the RocketMQ and HDFS models

    Anyways, the most important relevant skill is motivation and readiness to learn during the project!

    Tasks

• understand the basic concepts and principles in distributed systems
    • provide related design documents
    • develop a storage plugin that uses HDFS as the backend to store RocketMQ message data
    • write effective unit test code
    • *suggest improvements to the tiered storage interface
    • *whatever comes to your mind; further ideas are always welcome
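As a rough illustration of what such a plugin has to provide, here is a plain-JDK sketch of an append-only segment store. The class and method names are hypothetical, not RocketMQ's tiered storage API, and a real implementation would write to HDFS (e.g., via FSDataOutputStream) instead of the in-memory buffer used here:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of the append-only segment contract a tiered-storage
// backend must honor: sequential appends that return the start offset, plus
// random reads for message fetch. An HDFS-backed implementation would replace
// the in-memory buffer with HDFS streams.
class SegmentStoreSketch {

    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();

    // Appends a message blob sequentially and returns its start offset.
    long append(byte[] data) {
        long offset = buf.size();
        buf.write(data, 0, data.length);
        return offset;
    }

    // Reads `length` bytes starting at `offset`.
    byte[] read(long offset, int length) {
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, (int) offset, (int) offset + length);
    }

    public static void main(String[] args) {
        SegmentStoreSketch store = new SegmentStoreSketch();
        long o1 = store.append("hello".getBytes());
        long o2 = store.append("world".getBytes());
        System.out.println(o1 + " " + o2);                 // 0 5
        System.out.println(new String(store.read(o2, 5))); // world
    }
}
```

The append-only shape matches why HDFS fits well here: it favors large sequential writes over in-place updates.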

    Learning Material

    Name and contact information

Website: https://rocketmq.apache.org/ and https://hadoop.apache.org/

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
Zhimin Li, mail: lizhimin (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    GSoC Implement python client for RocketMQ 5.0

    Apache RocketMQ

    Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

    Page: https://rocketmq.apache.org

    Background

RocketMQ 5.0 has released clients in various languages, including Java, CPP, and Golang. To cover all major programming languages, a Python client needs to be implemented.

    Related Repo: https://github.com/apache/rocketmq-clients

    Task

    The developer is required to be familiar with the Java implementation and capable of developing a Python client, while ensuring consistent functionality and semantics.

    Relevant Skills
    Python language
    Basic knowledge of RocketMQ 5.0

    Mentor

    Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yangkun Ai, mail: aaronai (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

Improving End-to-End Test Infrastructure of Apache StreamPipes

    Apache StreamPipes

    Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.

    Background

    StreamPipes has grown significantly over the past few years, with new features and contributors joining the project. However, as the project continues to evolve, e2e test coverage must also be improved to ensure that all features remain functional. Modern frameworks, such as Cypress, make it quite easy and fun to automatically test even complex application functionalities. As StreamPipes approaches its 1.0 release, it is important to improve e2e testing to ensure the robustness of the project and its use in real-world scenarios.

    Tasks

    • [ ] Write e2e tests using Cypress to cover most functionalities and user interface components of StreamPipes.
    • [ ] Add more complex testing scenarios to ensure the reliability and robustness of StreamPipes in real-world use cases (e.g. automated tests for version updates).
    • [ ] Add e2e tests for the new Python client to ensure its integration with the main system and its functionalities (#774: https://github.com/apache/streampipes/issues/774).
    • [ ] Document the testing infrastructure and the testing approach to allow for easy maintenance and future contributions.

    ❗ ***Important Note*** ❗

    Do not create any account on behalf of Apache StreamPipes in Cypress or use the name of Apache StreamPipes for any account creation. Your mentor will take care of it.

    Relevant Skills

    • Familiarity with testing frameworks, such as Cypress or Selenium
    • Experience with TypeScript or Java
    • Basic knowledge of Angular is helpful
    • Familiarity with Docker and containerization is a plus

    Learning Material

    References

    You can find our corresponding issue on GitHub here

    Name and Contact Information

    Name: Philipp Zehnder

    email:  zehnder[at]apache.org

    community: dev[at]streampipes.apache.org

    website: https://streampipes.apache.org/

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Philipp Zehnder, mail: zehnder (at) apache.org
    Project Devs, mail: dev (at) streampipes.apache.org

    GSoC Integrate RocketMQ 5.0 client with Spring

    Apache RocketMQ

    Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

    Page: https://rocketmq.apache.org
    Github: https://github.com/apache/rocketmq

    Background

    The RocketMQ 5.0 client has been released recently; we need to integrate it with Spring.

    Related issue: https://github.com/apache/rocketmq-clients/issues/275

    Task

    1. Get familiar with RocketMQ 5.0 Java client usage; see https://github.com/apache/rocketmq-clients/tree/master/java and https://rocketmq.apache.org/docs/quickStart/01quickstart for more details.
    2. Integrate it with Spring.

    Relevant Skills

    1. Java language
    2. Basic knowledge of RocketMQ 5.0
    3. Spring

    Mentor

    Rongtong Jin, PMC of Apache RocketMQ, jinrongtong@apache.org

    Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Yangkun Ai, mail: aaronai (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    GSoC Make RocketMQ support higher versions of Java

     Apache RocketMQ

    Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

    Page: https://rocketmq.apache.org
    Github: https://github.com/apache/rocketmq

    Background

RocketMQ is a widely used message middleware system in the Java community and mainly supports Java 8. As Java has evolved, many new features and improvements have been added to the language and the Java Virtual Machine (JVM). However, RocketMQ still lacks compatibility with the latest Java versions, preventing users from taking advantage of new features and performance improvements. Therefore, we are seeking community support to upgrade RocketMQ to support higher versions of Java and enable the use of new features and JVM parameters.

    Task

We aim to update the RocketMQ codebase to support newer versions of Java in a cross-compile manner. The goal is to enable RocketMQ to work with Java 17, while maintaining backward compatibility with previous versions of Java. This will involve identifying and updating any dependencies that need to be changed to support the new Java versions, as well as testing and verifying that the new version of RocketMQ works correctly. With these updates, users will be able to take advantage of the latest Java features and performance improvements. We hope that the community can come together to support this task and make RocketMQ a more versatile and powerful middleware system.
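As a small illustration (not RocketMQ code), a runtime feature guard lets version-dependent optimizations stay optional. Note that `Runtime.version()` itself only exists since Java 9, so genuinely Java 8-compatible code would have to use multi-release JARs or reflection for such a check; this sketch assumes it runs on a modern JVM:

```java
// Hypothetical sketch of a runtime version guard: code kept compatible with
// older Java can still enable newer-JVM behavior behind a feature check.
// Runtime.version().feature() requires Java 10+.
class JavaVersionGuard {

    // Feature release number, e.g. 8, 11, 17.
    static int featureVersion() {
        return Runtime.version().feature();
    }

    static boolean atLeast(int feature) {
        return featureVersion() >= feature;
    }

    public static void main(String[] args) {
        System.out.println("running on Java " + featureVersion());
        if (atLeast(17)) {
            System.out.println("Java 17+ behavior may be enabled");
        }
    }
}
```

In a real cross-compiled build, such guards (or multi-release JAR entries) keep one artifact working across LTS versions while unlocking newer JVM parameters where available.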

    Relevant Skills

    1. Java language
    2. Having a good understanding of the new features in higher versions of Java, particularly LTS versions.

    Mentor

Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Yangkun Ai, mail: aaronai (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    OPC-UA browser for Apache StreamPipes

    Apache StreamPipes

    Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.

    Background

StreamPipes has grown significantly throughout recent years. We were able to introduce a lot of new features and attracted both users and contributors. Putting the cherry on the cake, we graduated as an Apache top-level project in December 2022. We will of course continue developing new features and never rest to make StreamPipes even more amazing. 

StreamPipes really shines when connecting Industrial IoT data. Such data sources typically originate from machine controllers, called PLCs (e.g., Siemens S7). But there are also newer protocols such as OPC-UA which allow browsing the available data within the controller. Our goal is to make connectivity of industrial data sources a matter of minutes.

    Currently, data sources can be connected using the built-in module `StreamPipes Connect` from the UI. We provide a set of adapters for popular protocols that can be customized, e.g., connection details can be added. 

    To make it even easier to connect industrial data sources with StreamPipes, we plan to add an OPC-UA browser. This will be part of the entry page of StreamPipes connect and should allow users to enter connection details of an existing OPC-UA server. Afterwards, a new view in the UI shows available data nodes from the server, their status and current value. Users should be able to select values that should be part of a new adapter. Afterwards, a new adapter can be created by reusing the current workflow to create an OPC-UA data source.

    This is a really cool project for participants interested in full-stack development who would like to get a deeper understanding of industrial IoT protocols. Have fun! 

    Tasks

    • [ ] get familiar with the OPC-UA protocol
    • [ ] develop mockups which demonstrate the user workflow
    • [ ] develop a data model for discovering data from OPC-UA
    • [ ] create the backend business logic for the OPC-UA browser 
    • [ ] create the frontend views to asynchronously browse data and to create a new adapter
    • [ ] write Junit, Component and E2E tests
• [ ] whatever comes to your mind 💡 further ideas are always welcome
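A possible data model for the discovery step above could look like the following sketch. All names and fields are assumptions for illustration, not StreamPipes or OPC-UA library APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical data model for the OPC-UA browser: each discovered node
// carries its id, display name, status and current value, and the user marks
// the nodes to include in a new adapter.
class OpcUaBrowseSketch {

    static class DiscoveredNode {
        final String nodeId;
        final String displayName;
        final String status;   // e.g. "GOOD", "BAD"
        final Object value;
        boolean selected;      // ticked by the user in the browser view

        DiscoveredNode(String nodeId, String displayName, String status, Object value) {
            this.nodeId = nodeId;
            this.displayName = displayName;
            this.status = status;
            this.value = value;
        }
    }

    // Keeps only the nodes the user selected for the new adapter.
    static List<DiscoveredNode> selectedNodes(List<DiscoveredNode> nodes) {
        List<DiscoveredNode> out = new ArrayList<>();
        for (DiscoveredNode n : nodes) {
            if (n.selected) out.add(n);
        }
        return out;
    }

    public static void main(String[] args) {
        DiscoveredNode temp = new DiscoveredNode("ns=2;s=Temp", "Temperature", "GOOD", 21.5);
        DiscoveredNode state = new DiscoveredNode("ns=2;s=State", "Machine state", "GOOD", "RUNNING");
        temp.selected = true;
        System.out.println(selectedNodes(List.of(temp, state)).size()); // 1
    }
}
```

The selected subset would then feed the existing adapter-creation workflow to produce an OPC-UA data source.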

       

     Relevant Skills

• interest in Industrial IoT and protocols such as OPC-UA
    • Java development skills
    • Angular/Typescript development skills

    Anyways, the most important relevant skill is motivation and readiness to learn during the project!

    Learning Material

    Reference

    Github issue can be found here: https://github.com/apache/streampipes/issues/1390

    Name and contact information

    • Mentor: Dominik Riemer (riemer[at]apache.org).
    • Mailing list: (dev[at]streampipes.apache.org)
    • Website: streampipes.apache.org


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
Dominik Riemer, mail: riemer (at) apache.org
    Project Devs, mail: dev (at) streampipes.apache.org

    [GSoC] [RocketMQ] The performance tuning of RocketMQ proxy

    Apache RocketMQ

    Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.

    Page: https://rocketmq.apache.org
    Repo: https://github.com/apache/rocketmq

    Background

    RocketMQ 5.0 has released a new module called `proxy`, which supports gRPC and remoting protocol. Additionally, it can be deployed in two modes, namely Local and Cluster modes. The performance tuning task will provide contributors with a comprehensive understanding of Apache RocketMQ and its intricate data flow, presenting a unique opportunity for beginners to acquaint themselves with and actively participate in our community.

    Task

The task is to tune the RocketMQ proxy for optimal performance in terms of both latency and throughput. It requires a thorough knowledge of the Java implementation and the ability to fine-tune Netty, gRPC, the operating system, and RocketMQ itself. We expect the developer responsible for this task to provide a performance report with measurements of both latency and throughput.
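For the latency half of such a report, a minimal percentile computation might look like the following. This is illustrative only; a real benchmark would use a proper histogram recorder and a load-generation tool:

```java
import java.util.Arrays;

// Sketch of the latency side of the requested performance report: collect
// per-request latencies (microseconds) and report nearest-rank percentiles.
class LatencyReportSketch {

    // Nearest-rank percentile over a sorted copy of the samples (p in (0, 100]).
    static long percentile(long[] samplesMicros, double p) {
        long[] sorted = samplesMicros.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] micros = {120, 95, 300, 110, 2500, 105, 130, 99, 101, 115};
        System.out.println("p50=" + percentile(micros, 50)); // p50=110
        System.out.println("p99=" + percentile(micros, 99)); // p99=2500
    }
}
```

Reporting p99 alongside p50 matters here because proxy tuning (Netty buffers, gRPC flow control, OS settings) often shifts the tail far more than the median.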

    Relevant Skills

    Basic knowledge of RocketMQ 5.0, Netty, gRPC, and operating system.
     

Mailing List: dev@rocketmq.apache.org

    Mentor
    Zhouxiang Zhan, committer of Apache RocketMQ, zhouxzhan@apache.org

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Zhouxiang Zhan, mail: zhouxzhan (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ TieredStore Integration with High Availability Architecture

    Apache RocketMQ

    Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.

    Page: https://rocketmq.apache.org

    Background

    With the official release of RocketMQ 5.1.0, tiered storage has arrived as a new independent module in the Technical Preview milestone. This allows users to unload messages from local disks to other, cheaper storage, extending message retention time at a lower cost.

    Reference RIP-57: https://github.com/apache/rocketmq/wiki/RIP-57-Tiered-storage-for-RocketMQ

    In addition, RocketMQ introduced a new high availability architecture in version 5.0.

    Reference RIP-44: https://github.com/apache/rocketmq/wiki/RIP-44-Support-DLedger-Controller

    However, RocketMQ tiered storage currently only supports single replicas.

    Task

    Since tiered storage only supports single replicas, the following issues remain in the integration with the high availability architecture:

    • Metadata synchronization: how to reliably synchronize metadata between master and slave nodes.
    • Disallowing message uploads beyond the confirm offset: to avoid message rollback, the maximum uploaded offset cannot exceed the confirm offset.
    • Starting tiered storage upload when a slave changes to master, and stopping it when the master becomes a slave: only the master node has write and delete permissions, and after a slave node is promoted, it needs to quickly resume tiered storage uploads from the last breakpoint.
    • Design of the slave pull protocol: how a newly launched empty slave can properly synchronize data through the tiered storage architecture. (If synchronization is performed based on the first or last file, resumption from breakpoints may not be possible when switching again.)

    You need to provide a complete plan that solves the above issues and ultimately completes the integration of tiered storage with the high availability architecture, verifying it through the existing tiered storage file version and OpenChaos testing.

    Relevant Skills

    • Interest in messaging middleware and distributed storage systems
    • Java development skills
    • A good understanding of RocketMQ tiered storage and the high availability architecture

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Rongtong Jin, mail: jinrongtong (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    ShardingSphere

    Apache ShardingSphere Add the feature of switching logging framework

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

    ShardingSphere provides two adapters: ShardingSphere-JDBC and ShardingSphere-Proxy.

    Currently, ShardingSphere uses Logback for logging, but consider the following situations:

    • Users may need to switch the logging framework to meet special needs; for example, Log4j2 can provide better asynchronous performance;
    • When using the JDBC adapter, the user application may not use Logback, which may cause conflicts.


    Why doesn't a log facade suffice? Because ShardingSphere provides users with clustered logging configurations (such as changing the log level online), which requires dynamic construction of loggers; this cannot be achieved with a log facade alone.

    Task

    1. Design and implement a logging SPI to support multiple logging frameworks (such as Logback and Log4j2)
    2. Allow users to choose which logging framework to use through the logging rule

    Relevant Skills

    1. Master Java language

    2. Basic knowledge of Logback and Log4j2

    3. Maven

    Mentor

    Longtao Jiang, Committer of Apache ShardingSphere, jianglongtao@apache.org

    Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Longtao Jiang, mail: jianglongtao (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org
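The rule-driven framework selection described above can be sketched with plain-JDK code. All names here are illustrative, not ShardingSphere's actual SPI, and a real implementation would discover providers through ShardingSphere's SPI loader rather than manual registration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a logging SPI: implementations register under a
// type name and the active framework is chosen through the logging rule,
// allowing dynamic logger construction per backend.
class LoggingSpiSketch {

    interface LoggingProvider {
        String type();                 // e.g. "LOGBACK", "LOG4J2"
        String newLogger(String name); // stands in for dynamic logger construction
    }

    static final Map<String, LoggingProvider> REGISTRY = new HashMap<>();

    static void register(LoggingProvider provider) {
        REGISTRY.put(provider.type(), provider);
    }

    // Selects the provider named in the logging rule.
    static LoggingProvider select(String ruleType) {
        LoggingProvider p = REGISTRY.get(ruleType);
        if (p == null) {
            throw new IllegalArgumentException("unknown logging framework: " + ruleType);
        }
        return p;
    }

    public static void main(String[] args) {
        register(new LoggingProvider() {
            public String type() { return "LOGBACK"; }
            public String newLogger(String name) { return "logback:" + name; }
        });
        register(new LoggingProvider() {
            public String type() { return "LOG4J2"; }
            public String newLogger(String name) { return "log4j2:" + name; }
        });
        // The logging rule decides which framework backs the dynamic loggers.
        System.out.println(select("LOG4J2").newLogger("sharding-audit"));
    }
}
```

Because selection happens at runtime through the rule, the cluster can rebuild loggers (e.g., after an online log-level change), which a compile-time facade alone cannot do.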

Apache ShardingSphere Add ShardingSphere Kafka source connector

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

The community recently added the CDC (change data capture) feature. After logging in, the change feed is published over the created network connection and can then be consumed.

    Since Kafka is a popular distributed event streaming platform, it is useful to import the change feed into Kafka for later processing.

    Task

1. Get familiar with ShardingSphere CDC client usage; create a publication and subscribe to the change feed.
    2. Get familiar with Kafka connector development; develop a source connector integrated with ShardingSphere CDC that persists the change feed to Kafka topics properly.
    3. Add unit tests and an E2E integration test.
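The core of step 2 is mapping each change event to a Kafka record. The real connector would implement Kafka Connect's SourceConnector/SourceTask interfaces; the event shape and topic-naming scheme below are assumptions for illustration only:

```java
import java.util.Map;

// Hypothetical sketch of the record mapping a ShardingSphere CDC source
// connector performs: turn one change event into a Kafka topic/key/value
// triple. A real connector would emit Kafka Connect SourceRecords instead.
class CdcToKafkaSketch {

    record ChangeEvent(String database, String table, String op, Map<String, String> row) {}

    record KafkaRecord(String topic, String key, String value) {}

    // One topic per table; the primary-key column becomes the record key so
    // that changes to the same row land in the same partition.
    static KafkaRecord toKafkaRecord(ChangeEvent e, String keyColumn) {
        String topic = e.database() + "." + e.table();
        String key = e.row().get(keyColumn);
        String value = e.op() + ":" + e.row();
        return new KafkaRecord(topic, key, value);
    }

    public static void main(String[] args) {
        ChangeEvent e = new ChangeEvent("demo_db", "t_order", "INSERT",
                Map.of("order_id", "1001"));
        KafkaRecord r = toKafkaRecord(e, "order_id");
        System.out.println(r.topic() + " " + r.key()); // demo_db.t_order 1001
    }
}
```

Keying by primary key preserves per-row ordering in Kafka, which downstream consumers of a change feed usually rely on.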

    Relevant Skills

    1. Java language

    2. Basic knowledge of CDC and Kafka

    3. Maven

    References

    Mentor

    Hongsheng Zhong, PMC of Apache ShardingSphere, zhonghongsheng@apache.org

    Xinze Guo, Committer of Apache ShardingSphere, azexin@apache.org


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Hongsheng Zhong, mail: zhonghongsheng (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache ShardingSphere Enhance ComputeNode reconciliation

    Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

    Page: https://shardingsphere.apache.org/
    Github: https://github.com/apache/shardingsphere 

    Background

There is a proposal about the new CRDs Cluster and ComputeNode, as follows:

    Currently we are trying to promote ComputeNode as the major CRD to represent a special ShardingSphere Proxy deployment, and plan to use Cluster to indicate a special ShardingSphere Proxy cluster.

    Task

This issue is to enhance the availability of ComputeNode reconciliation. The specific case list is as follows.

    •  Add IT test case for Deployment spec volume
    •  Add IT test case for Deployment spec template init containers
    •  Add IT test case for Deployment spec template spec containers
    •  Add IT test case for Deployment spec volume mounts
    •  Add IT test case for Deployment spec container ports
    •  Add IT test case for Deployment spec container image tag
    •  Add IT test case for Service spec ports
    •  Add IT test case for ConfigMap data serverconfig
    •  Add IT test case for ConfigMap data logback
       
Notice, this issue can be a good example:
    • chore: add more Ginkgo tests for ComputeNode #203

    Relevant Skills

    1. Master Go language, Ginkgo test framework
    2. Have a basic understanding of Apache ShardingSphere Concepts
    3. Be familiar with Kubernetes Operator, kubebuilder framework

    Targets files

    ComputeNode IT - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/reconcile/computenode/compute_node_test.go

    Mentor

Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org

    Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Chuxin Chen, mail: tuichenchuxin (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache ShardingSphere Support mainstream database metadata table query

    Apache ShardingSphere

Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying database fragmentation.

Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

    ShardingSphere has designed its own metadata database to simulate metadata queries that support various databases.

More details:

    https://github.com/apache/shardingsphere/issues/21268
    https://github.com/apache/shardingsphere/issues/22052

    Task

• Support PostgreSQL and openGauss `\d tableName`
    • Support PostgreSQL and openGauss `\d+`
    • Support PostgreSQL and openGauss `\d+ tableName`
    • Support PostgreSQL and openGauss `l`
    • Support query for MySQL metadata `TABLES`
    • Support query for MySQL metadata `COLUMNS`
    • Support query for MySQL metadata `SCHEMATA`
    • Support query for MySQL metadata `ENGINES`
    • Support query for MySQL metadata `FILES`

    Notice, these issues can be a good example.

    https://github.com/apache/shardingsphere/pull/22053
    https://github.com/apache/shardingsphere/pull/22057/
    https://github.com/apache/shardingsphere/pull/22166/
    https://github.com/apache/shardingsphere/pull/22182
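The idea behind these tasks is that ShardingSphere intercepts a dialect metadata query (e.g., against MySQL's `information_schema.TABLES`) and answers it from its own simulated metadata instead of a physical database. The following plain-JDK sketch illustrates only that idea; table names and the column subset are made-up examples:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of answering a metadata query from the logical schema:
// rows mimicking a few information_schema.TABLES columns are built from the
// tables ShardingSphere manages, not fetched from a real database.
class MetadataQuerySketch {

    // ShardingSphere's logical view of the tables it manages (example data).
    static final List<String> LOGICAL_TABLES = List.of("t_order", "t_order_item");

    // Builds result rows for a simulated `SELECT ... FROM information_schema.TABLES`.
    static List<Map<String, String>> queryTables(String schema) {
        return LOGICAL_TABLES.stream().map(t -> {
            Map<String, String> row = new LinkedHashMap<>();
            row.put("TABLE_SCHEMA", schema);
            row.put("TABLE_NAME", t);
            row.put("TABLE_TYPE", "BASE TABLE");
            return row;
        }).toList();
    }

    public static void main(String[] args) {
        for (Map<String, String> row : queryTables("sharding_db")) {
            System.out.println(row);
        }
    }
}
```

The real implementations (see the PRs above) follow the same pattern per dialect: recognize the incoming metadata statement, then materialize rows from ShardingSphere's metadata.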

    Relevant Skills

•  Master Java language
    •  Have a basic understanding of ZooKeeper
    •  Be familiar with MySQL/PostgreSQL SQL 


    Mentor

    Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

    Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Chuxin Chen, mail: tuichenchuxin (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    OPC-UA browser for Apache StreamPipes

    Apache StreamPipes

    Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.

    Background

    StreamPipes has grown significantly throughout recent years. We were able to introduce a lot of new features and attracted both users and contributors. Putting the cherry on the cake, we graduated as an Apache top-level project in December 2022. We will of course continue developing new features and never rest to make StreamPipes even more amazing.

    StreamPipes really shines when connecting Industrial IoT data. Such data sources typically originate from machine controllers, called PLCs (e.g., Siemens S7). But there are also newer protocols such as OPC-UA which allow browsing the available data within the controller. Our goal is to make connectivity of industrial data sources a matter of minutes.

    Currently, data sources can be connected using the built-in module `StreamPipes Connect` from the UI. We provide a set of adapters for popular protocols that can be customized, e.g., connection details can be added. 

    To make it even easier to connect industrial data sources with StreamPipes, we plan to add an OPC-UA browser. This will be part of the entry page of StreamPipes connect and should allow users to enter connection details of an existing OPC-UA server. Afterwards, a new view in the UI shows available data nodes from the server, their status and current value. Users should be able to select values that should be part of a new adapter. Afterwards, a new adapter can be created by reusing the current workflow to create an OPC-UA data source.

    This is a really cool project for participants interested in full-stack development who would like to get a deeper understanding of industrial IoT protocols. Have fun! 

    Tasks

    • [ ] get familiar with the OPC-UA protocol
    • [ ] develop mockups which demonstrate the user workflow
    • [ ] develop a data model for discovering data from OPC-UA
    • [ ] create the backend business logic for the OPC-UA browser 
    • [ ] create the frontend views to asynchronously browse data and to create a new adapter
    • [ ] write Junit, Component and E2E tests
    • [ ] whatever comes to your mind 💡 further ideas are always welcome

     Relevant Skills

    • interest in Industrial IoT and protocols such as OPC-UA
    • Java development skills
    • Angular/Typescript development skills

    Anyways, the most important relevant skill is motivation and readiness to learn during the project!
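As a concrete starting point for the data-model task above, a discovered node could be represented as a small tree that the frontend renders and the user ticks. All class and field names below are assumptions for illustration, not the actual StreamPipes API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical data model for the OPC-UA browser: every discovered node keeps
// its node id, display name, current value and child nodes, so the frontend
// can render the server's address space as a selectable tree.
public class OpcUaNodeModel {

    final String nodeId;       // e.g. "ns=2;s=Machine/Temperature"
    final String displayName;
    final String dataType;     // simplified: the OPC-UA built-in type name
    final Object currentValue; // last sampled value, null for folder nodes
    final List<OpcUaNodeModel> children = new ArrayList<>();
    boolean selected;          // ticked by the user to become part of a new adapter

    OpcUaNodeModel(String nodeId, String displayName, String dataType, Object currentValue) {
        this.nodeId = nodeId;
        this.displayName = displayName;
        this.dataType = dataType;
        this.currentValue = currentValue;
    }

    // Collect the node ids the user ticked; these would feed the existing
    // OPC-UA adapter creation workflow.
    List<String> selectedNodeIds() {
        List<String> result = new ArrayList<>();
        if (selected) {
            result.add(nodeId);
        }
        for (OpcUaNodeModel child : children) {
            result.addAll(child.selectedNodeIds());
        }
        return result;
    }

    public static void main(String[] args) {
        OpcUaNodeModel root = new OpcUaNodeModel("ns=0;i=85", "Objects", "Folder", null);
        OpcUaNodeModel temp = new OpcUaNodeModel("ns=2;s=Machine/Temperature", "Temperature", "Double", 21.5);
        OpcUaNodeModel state = new OpcUaNodeModel("ns=2;s=Machine/State", "State", "String", "RUNNING");
        root.children.add(temp);
        root.children.add(state);
        temp.selected = true;
        System.out.println(root.selectedNodeIds()); // [ns=2;s=Machine/Temperature]
    }
}
```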

    Learning Material

    Reference

    Github issue can be found here: https://github.com/apache/streampipes/issues/1390

    Name and contact information

    • Mentor: Dominik Riemer (riemer[at]apache.org).
    • Mailing list: (dev[at]streampipes.apache.org)
    • Website: streampipes.apache.org
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Dominik Riemer, mail: riemer (at) apache.org
    Project Devs, mail: dev (at) streampipes.apache.org

    SkyWalking

    [GSOC] [SkyWalking] AIOps Log clustering with Flink (Algorithm Optimization)

    Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation with a revised architecture, and this task will require the student to focus on algorithm optimization for the clustering technique.
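To make the task concrete, here is a toy illustration of template-based log clustering, where logs that share enough tokens join a cluster and differing positions generalize to a wildcard. This is a didactic sketch only, not SkyWalking's actual algorithm:

```java
import java.util.ArrayList;
import java.util.List;

// Toy template-based log clustering: a log joins an existing cluster when its
// token-wise similarity to the cluster's template passes a threshold, and the
// template's differing positions are generalized to "<*>".
public class LogClusterSketch {

    static final double SIMILARITY_THRESHOLD = 0.5;
    final List<String[]> templates = new ArrayList<>();

    // Returns the index of the cluster the log line was assigned to.
    int assign(String logLine) {
        String[] tokens = logLine.split(" ");
        for (int i = 0; i < templates.size(); i++) {
            String[] template = templates.get(i);
            if (template.length == tokens.length && similarity(template, tokens) >= SIMILARITY_THRESHOLD) {
                merge(template, tokens);
                return i;
            }
        }
        templates.add(tokens.clone());
        return templates.size() - 1;
    }

    static double similarity(String[] a, String[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i].equals(b[i]) || a[i].equals("<*>")) {
                same++;
            }
        }
        return (double) same / a.length;
    }

    static void merge(String[] template, String[] tokens) {
        for (int i = 0; i < template.length; i++) {
            if (!template[i].equals(tokens[i])) {
                template[i] = "<*>";
            }
        }
    }

    public static void main(String[] args) {
        LogClusterSketch c = new LogClusterSketch();
        System.out.println(c.assign("connect to 10.0.0.1 failed")); // 0
        System.out.println(c.assign("connect to 10.0.0.2 failed")); // 0 (IP generalized to <*>)
        System.out.println(c.assign("user alice logged in"));       // 1 (new cluster)
    }
}
```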

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yihao Chen, mail: yihaochen (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    [GSOC] [SkyWalking] AIOps Log clustering with Flink (Flink Integration)

    Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation with a revised architecture, and this task will require the student to focus on Flink and its integration with SkyWalking OAP.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yihao Chen, mail: yihaochen (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    Apache ShardingSphere Enhance SQLNodeConverterEngine to support more MySQL SQL statements

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

    The ShardingSphere SQL federation engine provides support for complex SQL statements, and it can well support cross-database join queries, subqueries, aggregation queries and other statements. An important part of SQL federation engine is to convert the SQL statement parsed by ShardingSphere into SqlNode, so that Calcite can be used to implement SQL optimization and federated query.

    Task

    This issue is to solve the MySQL exception that occurs during SQLNodeConverterEngine conversion. The specific case list is as follows.

    • select_char
    • select_extract
    • select_from_dual
    • select_from_with_table
    • select_group_by_with_having_and_window
    • select_not_between_with_single_table
    • select_not_in_with_single_table
    • select_substring
    • select_trim
    • select_weight_string
    • select_where_with_bit_expr_with_ampersand
    • select_where_with_bit_expr_with_caret
    • select_where_with_bit_expr_with_div
    • select_where_with_bit_expr_with_minus_interval
    • select_where_with_bit_expr_with_mod
    • select_where_with_bit_expr_with_mod_sign
    • select_where_with_bit_expr_with_plus_interval
    • select_where_with_bit_expr_with_signed_left_shift
    • select_where_with_bit_expr_with_signed_right_shift
    • select_where_with_bit_expr_with_vertical_bar
    • select_where_with_boolean_primary_with_comparison_subquery
    • select_where_with_boolean_primary_with_is
    • select_where_with_boolean_primary_with_is_not
    • select_where_with_boolean_primary_with_null_safe
    • select_where_with_expr_with_and_sign
    • select_where_with_expr_with_is
    • select_where_with_expr_with_is_not
    • select_where_with_expr_with_not
    • select_where_with_expr_with_not_sign
    • select_where_with_expr_with_or_sign
    • select_where_with_expr_with_xor
    • select_where_with_predicate_with_in_subquery
    • select_where_with_predicate_with_regexp
    • select_where_with_predicate_with_sounds_like
    • select_where_with_simple_expr_with_collate
    • select_where_with_simple_expr_with_match
    • select_where_with_simple_expr_with_not
    • select_where_with_simple_expr_with_odbc_escape_syntax
    • select_where_with_simple_expr_with_row
    • select_where_with_simple_expr_with_tilde
    • select_where_with_simple_expr_with_variable
    • select_window_function
    • select_with_assignment_operator
    • select_with_assignment_operator_and_keyword
    • select_with_case_expression
    • select_with_collate_with_marker
    • select_with_date_format_function
    • select_with_exists_sub_query_with_project
    • select_with_function_name
    • select_with_json_value_return_type
    • select_with_match_against
    • select_with_regexp
    • select_with_schema_name_in_column_projection
    • select_with_schema_name_in_shorthand_projection
    • select_with_spatial_function
    • select_with_trim_expr
    • select_with_trim_expr_from_expr

    You need to compare the difference between actual and expected, and then correct the logic in SQLNodeConverterEngine so that actual can be consistent with expected.

    After you make changes, remember to add the case to SUPPORTED_SQL_CASE_IDS to ensure it can be tested.
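The conversion itself is essentially a recursive walk over the parsed segments that emits the Calcite-side node for each one. The sketch below mimics that shape with stand-in classes so it stays self-contained; the real code targets ShardingSphere's segment classes and Calcite's SqlNode, which are not reproduced here:

```java
// Miniature of the converter's job: walk parsed expression segments and emit
// the converted representation. Both sides are stand-in types, not the real
// ShardingSphere or Calcite classes.
public class ConverterSketch {

    // Stand-in for a parsed expression segment (ShardingSphere side).
    interface Segment {}
    record Literal(String text) implements Segment {}
    record Binary(String op, Segment left, Segment right) implements Segment {}

    // Stand-in for the converted node (Calcite SqlNode side): here we just
    // render SQL text. Each unsupported case in the list above corresponds to
    // a missing branch like the Binary one.
    static String convert(Segment segment) {
        if (segment instanceof Literal l) {
            return l.text();
        }
        if (segment instanceof Binary b) {
            return "(" + convert(b.left()) + " " + b.op() + " " + convert(b.right()) + ")";
        }
        throw new UnsupportedOperationException("unsupported segment: " + segment);
    }

    public static void main(String[] args) {
        // The bit_expr_with_ampersand case, in spirit.
        Segment where = new Binary("&", new Literal("col1"), new Literal("3"));
        System.out.println(convert(where)); // (col1 & 3)
    }
}
```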

     
    Notice, these issues can be good examples.

    https://github.com/apache/shardingsphere/pull/14492

    Relevant Skills

    1. Master JAVA language

    2. Have a basic understanding of Antlr g4 file

    3. Be familiar with MySQL and Calcite SqlNode

    Target files

     
    SQLNodeConverterEngineIT

    https://github.com/apache/shardingsphere/blob/master/test/it/optimizer/src/test/java/org/apache/shardingsphere/test/it/optimize/SQLNodeConverterEngineIT.java 

    Mentor

    Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org

    Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org

    Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhengqiang Duan, mail: duanzhengqiang (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    [GSOC] [SkyWalking] Pending Task on K8s

    Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This task is about a pending task on K8s.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yihao Chen, mail: yihaochen (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    Doris

    [GSoC][Doris]Dictionary Encoding Acceleration

    Apache Doris
    Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
    Page: https://doris.apache.org

    Github: https://github.com/apache/doris

    Background

    In Apache Doris, dictionary encoding is performed during data writing and compaction. Dictionary encoding is applied to string data types by default. The dictionary size of a column for one segment is 1M at most. Dictionary encoding accelerates queries on strings by converting them into integers, for example.
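The core idea can be illustrated with a self-contained sketch (this is not Doris code; all names are invented): each distinct string in a column is assigned a small integer code once, so later scans and equality predicates compare ints instead of strings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal dictionary encoding: distinct strings map to dense integer codes,
// and the column is stored as an int array plus the dictionary.
public class DictionaryEncodingSketch {

    final Map<String, Integer> dictionary = new HashMap<>();
    final List<String> reverse = new ArrayList<>();

    int encode(String value) {
        return dictionary.computeIfAbsent(value, v -> {
            reverse.add(v);
            return reverse.size() - 1;
        });
    }

    String decode(int code) {
        return reverse.get(code);
    }

    public static void main(String[] args) {
        DictionaryEncodingSketch dict = new DictionaryEncodingSketch();
        String[] column = {"shanghai", "beijing", "shanghai", "shanghai"};
        int[] encoded = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            encoded[i] = dict.encode(column[i]);
        }
        // A predicate like city = 'shanghai' now becomes a scan for code 0.
        System.out.println(Arrays.toString(encoded)); // [0, 1, 0, 0]
    }
}
```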
     

    Task

    • Phase One: Get familiar with the implementation of Apache Doris dictionary encoding; learning how Apache Doris dictionary encoding accelerates queries.
    •  Phase Two: Evaluate the effectiveness of full dictionary encoding and figure out how to optimize memory in such a case.

    Learning Material

    Page: https://doris.apache.org
    Github: https://github.com/apache/doris

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhijing Lu, mail: luzhijing (at) apache.org
    Project Devs, mail: dev (at) doris.apache.org

    [GSOC] [SkyWalking] Self-Observability of the query subsystem in BanyanDB

    Background

    SkyWalking BanyanDB is an observability database, aims to ingest, analyze and store Metrics, Tracing and Logging data.

    Objectives

    1. Support EXPLAIN[1] for both measure query and stream query
    2. Add self-observability including trace and metrics for query subsystem
    3. Support EXPLAIN in the client SDK & CLI and add query plan visualization in the UI

    [1]: EXPLAIN in MySQL

    Recommended Skills

    1. Familiar with Go
    2. Have a basic understanding of database query engine
    3. Have an experience of Apache SkyWalking or other APMs

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jiajing Lu, mail: lujiajing (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org
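Supporting EXPLAIN essentially means letting every query operator describe itself so the plan tree can be rendered for the client SDK, CLI and UI. BanyanDB itself is written in Go; the Java toy below only illustrates the shape of the feature, and every operator and field name is invented:

```java
import java.util.List;

// Toy query-plan tree: each operator renders itself with its children
// indented, which is the essence of a textual EXPLAIN output.
public class ExplainSketch {

    record PlanNode(String operator, String detail, List<PlanNode> children) {
        String explain(String indent) {
            StringBuilder sb = new StringBuilder(indent + operator + " (" + detail + ")\n");
            for (PlanNode child : children) {
                sb.append(child.explain(indent + "  "));
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        PlanNode scan = new PlanNode("IndexScan", "measure=service_cpm", List.of());
        PlanNode filter = new PlanNode("Filter", "service_id = 'svc-a'", List.of(scan));
        PlanNode top = new PlanNode("TopN", "n=10 by value desc", List.of(filter));
        System.out.print(top.explain(""));
        // TopN (n=10 by value desc)
        //   Filter (service_id = 'svc-a')
        //     IndexScan (measure=service_cpm)
    }
}
```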

    [GSOC] [SkyWalking] Unify query planner and executor in BanyanDB

    Background

    SkyWalking BanyanDB is an observability database, aims to ingest, analyze and store Metrics, Tracing and Logging data.

    Objectives

    1. Fully unify/merge the query planner and executor for Measure and TopN

    Recommended Skills

    1. Familiar with Go
    2. Have a basic understanding of database query engine
    3. Have an experience of Apache SkyWalking

    Mentor

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Jiajing Lu, mail: lujiajing (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    [SkyWalking] Add Terraform provider for Apache SkyWalking

    The deployment methods for SkyWalking are currently limited: we only have a Helm Chart for users to deploy in Kubernetes, so users that are not using Kubernetes have to do all the housekeeping themselves to set up SkyWalking on, for example, a VM.

    This issue aims to add a Terraform provider, so that users can conveniently spin up a cluster for demonstration or testing. We should evolve the provider to allow users to customize it as they need, so that they can eventually use it in their production environment.

    In this task, we will mainly focus on support for AWS. In the Terraform provider, users need to provide their access key / secret key, and the provider does the rest: create VMs, create the database/OpenSearch or RDS, download the SkyWalking tars, configure SkyWalking, start the SkyWalking components (OAP/UI), create public IPs/domain names, etc.
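To give a feel for the goal, a user's configuration might eventually look like the following. The provider does not exist yet, so every block, resource and attribute name here is a hypothetical sketch, not a released API:

```hcl
# Hypothetical usage of the proposed provider; all names are assumptions.
provider "skywalking" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region     = "us-east-1"
}

resource "skywalking_cluster" "demo" {
  oap_instances = 2
  ui_instances  = 1
  storage       = "opensearch"
}
```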

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhenxu Ke, mail: kezhenxu94 (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    Doris

    [GSoC][Doris] Supports BigQuery/Apache Kudu/Apache Cassandra/Apache Druid in Federated Queries

    Apache Doris
    Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
    Page: https://doris.apache.org
    Github: https://github.com/apache/doris

    Background

    Apache Doris supports acceleration of queries on external data sources to meet users' needs for federated queries and analysis.
    Currently, Apache Doris supports multiple external catalogs including those from Hive, Iceberg, Hudi, and JDBC. Developers can connect more data sources to Apache Doris based on a unified framework.

    Task

    Phase One:

    • Get familiar with the Multi-Catalog structure of Apache Doris, including the metadata synchronization mechanism in FE and the data reading mechanism of BE.
    • Investigate how metadata should be acquired and how data access works regarding the picked data source(s); produce the corresponding design documentation.

    Phase Two:

    • Develop connections to the picked data source(s) and implement access to metadata and data.

    Learning Material

    Page: https://doris.apache.org
    Github: https://github.com/apache/doris

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhijing Lu, mail: luzhijing (at) apache.org
    Project Devs, mail: dev (at) doris.apache.org

    [GSoC][Doris]Page Cache Improvement

    Apache Doris
    Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
    Page: https://doris.apache.org

    Github: https://github.com/apache/doris

    Background

    Apache Doris accelerates high-concurrency queries utilizing the page cache, where the decompressed data is stored.
    Currently, the page cache in Apache Doris uses a simple LRU algorithm, which reveals a few problems: 

    • Hot data will be phased out in large queries
    • The page cache configuration is immutable and does not support GC.

    Task

    • Phase One: Identify the impacts on queries when the decompressed data is stored in memory and SSD, respectively, and then determine whether a full page cache is required.
    • Phase Two: Improve the cache strategy for Apache Doris based on the results from Phase One.
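The first problem can be reproduced in a few lines with the JDK's access-ordered LinkedHashMap (an illustration only, not the Doris page cache): a single large scan streams cold pages through the cache once and evicts the hot page, even though the hot page is accessed far more often:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Demonstrates why plain LRU is not scan-resistant.
public class LruProblemSketch {

    public static void main(String[] args) {
        // Access-ordered LinkedHashMap acting as a 4-entry LRU cache.
        Map<String, byte[]> cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > 4;
            }
        };
        cache.put("hot-page", new byte[0]);
        cache.get("hot-page"); // frequently used page
        // A large scan streams cold pages through the cache once each...
        for (int i = 0; i < 4; i++) {
            cache.put("scan-page-" + i, new byte[0]);
        }
        // ...and the hot page is gone. Scan-resistant policies (e.g. LRU-K or
        // segmented LRU) are one possible direction for Phase Two.
        System.out.println(cache.containsKey("hot-page")); // false
    }
}
```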

    Learning Material

    Page: https://doris.apache.org
    Github: https://github.com/apache/doris

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhijing Lu, mail: luzhijing (at) apache.org
    Project Devs, mail: dev (at) doris.apache.org

    SkyWalking

    [GSOC] [SkyWalking] Python Agent Performance Enhancement Plan

    Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This task is about enhancing Python agent performance, the tracking issue can be seen here -< https://github.com/apache/skywalking/issues/10408

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yihao Chen, mail: yihaochen (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    Mentor

  • Mentor: Mingyu Chen, Apache Doris PMC Member & Committer, morningman@apache.org
  • Mentor: Calvin Kirs, Apache Geode PMC & Committer, Kirs@apache.org
  • Mailing List: dev@doris.apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhijing Lu, mail: luzhijing (at) apache.org
    Project Devs, mail: dev (at) doris.apache.org


    EventMesh


    Apache EventMesh official website docs by version and demo show

    Apache EventMesh (incubating)
    Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.

    Website: https://eventmesh.apache.org

    GitHub: https://github.com/apache/incubator-eventmesh

    Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3327

    Background

    We hope that the community can contribute to the maintenance of documents, including the archiving of Chinese and English content of documents of different release versions, the maintenance of official website documents, the improvement of project quick start documents, feature introduction, etc.

    Task

    1.Discuss with the mentors what you need to do

    2. Learn the details of the Apache EventMesh project

    3. Improve and supplement the content of documents on GitHub, maintain official website documents, record eventmesh quick user experience, and feature display videos

    Recommended Skills

    1.Familiar with MarkDown

    2. Familiar with Java\Go

    Mentor
    Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org

    Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Xue Weiming, mail: mikexue (at) apache.org
    Project Devs, mail: dev (at) eventmesh.apache.org


    Apache EventMesh Integrate eventmesh runtime on Kubernetes

    Apache EventMesh (incubating)
    Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.

    Website: https://eventmesh.apache.org

    GitHub: https://github.com/apache/incubator-eventmesh

    Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3327

    Background

    Currently, EventMesh has good usability in microservice scenarios. However, EventMesh's support for Kubernetes is still relatively weak. We hope the community can contribute EventMesh integration with K8s.

    Task

    1.Discuss with the mentors your implementation idea

    2. Learn the details of the Apache EventMesh project

    3. Integrate EventMesh with the k8s

    Recommended Skills

    1.Familiar with Java

    2.Familiar with Kubernetes

    Mentor
    Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org

    Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Xue Weiming, mail: mikexue (at) apache.org
    Project Devs, mail: dev (at) eventmesh.apache.org

    ShenYu

    Apache ShenYu Gsoc 2023 - Support for Kubernetes Service Discovery

    Background:

    Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.

    Tasks

    1. Support the registration of microservices deployed in K8s Pod to shenyu-admin and use K8s as the register center.
    2. Discuss with mentors, and complete the requirements design and technical design of Shenyu K8s Register Center.
    3. Complete the initial version of Shenyu K8s Register Center.
    4. Complete the CI test of Shenyu K8s Register Center, verify the correctness of the code.
    5. Write the necessary documentation, deployment guides, and instructions for users to connect microservices running inside the K8s Pod to ShenYu.

    Relevant Skills

    1. Know the use of Apache ShenYu, especially the register center
    2. Familiar with Java and Golang
    3. Familiar with Kubernetes and can use Java or Golang to develop

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yonglun Zhang, mail: zhangyonglun (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu Gsoc 2023 - ShenYu End-To-End SpringCloud plugin test case

    Background:

    Shenyu is a Java native API gateway for service proxy, protocol translation and API governance, but Shenyu lacks End-To-End tests.

    Relevant skills:

    1. Understand the architecture of ShenYu

    2. Understand SpringCloud micro-service and ShenYu SpringCloud proxy plugin.

    3. Understand ShenYu e2e framework and architecture.

    How to coding

    1. please refer to org.apache.shenyu.e2e.testcase.plugin.DividePluginCases

    How to test

    1. start shenyu admin in docker

    2. start shenyu bootstrap in docker

    3. run test case org.apache.shenyu.e2e.testcase.plugin.PluginsTest#testDivide

    Task List

    1. develop e2e tests of the springcloud plug-in.

    2. write shenyu e2e springcloud plugin documentation in shenyu-website.

    3. refactor the existing plugin test cases.

    Links:

    website: https://shenyu.apache.org/

    issues: https://github.com/apache/shenyu/issues/4474

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Fengen He, mail: hefengen (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu Gsoc 2023 - Design and implement shenyu ingress-controller in k8s

    Background

    Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.

    Tasks

    1. Discuss with mentors, and complete the requirements design and technical design of shenyu-ingress-controller.
    2. Complete the initial version of shenyu-ingress-controller, implement the reconcile of k8s ingress api, and make ShenYu as the ingress gateway of k8s.
    3. Complete the ci test of shenyu-ingress-controller, verify the correctness of the code.

    Relevant Skills

    1. Know the use of Apache ShenYu
    2. Familiar with Java and Golang
    3. Familiar with Kubernetes and can use java or golang to develop Kubernetes Controller

    Description

    Issues : https://github.com/apache/shenyu/issues/4438
    website : https://shenyu.apache.org/

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yu Xiao, mail: xiaoyu (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org


    Apache ShenYu Gsoc 2023 - Shenyu-Admin Internationalization

    Background

    Shenyu is a native API gateway for service proxy, protocol translation and API governance. It can manage and maintain APIs through Shenyu-admin, which supports internationalization in Chinese and English. Unfortunately, Shenyu-admin is only internationalized on the front end; the message prompts returned by the back-end interface are still in English. Therefore, we need to implement internationalization support for the back-end interface. This will lay a good foundation for Shenyu to move towards supporting more languages.

    Relevant skills

    • Related skills spring resources
    • Spring Internationalization
    • Front-end react framework

    API reference

                java.util.Locale;
                org.springframework.context.MessageSource;
                org.springframework.context.support.ResourceBundleMessageSource;

    Interface effect example

                ## zh request example
                POST http://localhost:9095/plugin
                Content-Type: application/json
                Location: cn-zh
                X-Access-Token: xxx
                {
                "name": "test-create-plugin",
                "role": "test-create-plugin",
                "enabled": true,
                "sort": 100
                }
                Response
                {
                "code": 600,
                "message": "未登录"
                }
                
                ### en request example
                POST http://localhost:9095/plugin
                Content-Type: application/json
                Location: en
                X-Access-Token: xxx
                {
                "name": "test-create-plugin",
                "role": "test-create-plugin",
                "enabled": true,
                "sort": 100
                }
                Response
                {
                "code": 600,
                "message": "token is error"
                } 
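For illustration, a minimal, framework-free sketch of the locale-based resolution behind the two responses above. The real backend would use Spring's MessageSource/ResourceBundleMessageSource backed by messages_*.properties files; the in-memory bundles, keys and class name here are assumptions for the sketch only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Framework-free sketch of locale-aware message resolution.
// Bundle contents mirror the zh/en example responses above.
public class I18nSketch {
    private static final Map<String, Map<String, String>> BUNDLES = new HashMap<>();
    static {
        Map<String, String> zh = new HashMap<>();
        zh.put("login.required", "未登录");
        Map<String, String> en = new HashMap<>();
        en.put("login.required", "token is error");
        BUNDLES.put("zh", zh);
        BUNDLES.put("en", en);
    }

    /** Resolve a message key for the locale parsed from the request header. */
    static String resolve(String key, Locale locale) {
        // Fall back to English when the requested language has no bundle.
        Map<String, String> bundle =
            BUNDLES.getOrDefault(locale.getLanguage(), BUNDLES.get("en"));
        return bundle.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(resolve("login.required", Locale.SIMPLIFIED_CHINESE)); // 未登录
        System.out.println(resolve("login.required", Locale.ENGLISH));           // token is error
    }
}
```

The English fallback is the extension point the task list mentions: adding another language is just another bundle.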

    Task List

    • Discuss with the mentor how to achieve internationalization of the shenyu-admin backend
    • Translate the existing prompt messages
    • Integrate with the front-end internationalization: obtain the client's region information through the HTTP protocol and respond in the language of the corresponding region.
    • Leave an extension interface for other languages, to make subsequent localization by users easier.
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Keguo Li, mail: likeguo (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Commons Statistics

    [GSoC] Summary statistics API for Java 8 streams

    Placeholder for tasks that could be undertaken in this year's GSoC.

    Ideas:

    • Design an updated summary statistics API for use with Java 8 streams based on the summary statistic implementations in the Commons Math stat.descriptive package including moments, rank and summary sub-packages.
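Such an API would likely build on JDK-style combinable accumulators so that statistics compose with parallel streams. A minimal sketch of that idea (class and method names are illustrative, not the proposed API; the pairwise-combinable mean/variance update follows Welford/Chan):

```java
import java.util.stream.DoubleStream;

// Sketch of a combinable moment accumulator usable with
// DoubleStream.collect(supplier, accumulator, combiner).
public class MeanVariance {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Add a single value (Welford update). */
    public void accept(double x) {
        n++;
        double d = x - mean;
        mean += d / n;
        m2 += d * (x - mean);
    }

    /** Merge another accumulator (Chan et al. parallel update). */
    public void combine(MeanVariance o) {
        if (o.n == 0) {
            return;
        }
        long total = n + o.n;
        double d = o.mean - mean;
        m2 += o.m2 + d * d * ((double) n * o.n) / total;
        mean += d * o.n / total;
        n = total;
    }

    public double mean() { return mean; }

    public double variance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }

    public static void main(String[] args) {
        MeanVariance s = DoubleStream.of(2, 4, 4, 4, 5, 5, 7, 9)
            .collect(MeanVariance::new, MeanVariance::accept, MeanVariance::combine);
        System.out.println(s.mean());     // ~5.0
        System.out.println(s.variance()); // ~32/7 (sample variance)
    }
}
```

The single-pass update avoids the cancellation problems of a naive sum-of-squares approach, which is exactly the numerical care the Commons Math stat.descriptive implementations provide.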
    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Alex Herbert, mail: aherbert (at) apache.org
    Project Devs, mail:


    Apache ShenYu Gsoc 2023 - Support for Kubernetes Service Discovery

    Background

    Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.

    Tasks

    1. Support the registration of microservices deployed in K8s Pod to shenyu-admin and use K8s as the register center.
    2. Discuss with mentors, and complete the requirements design and technical design of Shenyu K8s Register Center.
    3. Complete the initial version of Shenyu K8s Register Center.
    4. Complete the CI test of Shenyu K8s Register Center, verify the correctness of the code.
    5. Write the necessary documentation, deployment guides, and instructions for users to connect microservices running inside the K8s Pod to ShenYu

    Relevant Skills

    1. Know the use of Apache ShenYu, especially the register center
    2. Familiar with Java and Golang
    3. Familiar with Kubernetes and can use Java or Golang to develop

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yonglun Zhang, mail: zhangyonglun (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    EventMesh

    Apache EventMesh - EventMesh official website docs by version and demo show

    Apache EventMesh (incubating)
    Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.

    Website: https://eventmesh.apache.org

    GitHub: https://github.com/apache/incubator-eventmesh

    Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3327

    Background

    We hope that the community can contribute to the maintenance of documents, including the archiving of Chinese and English content of documents of different release versions, the maintenance of official website documents, the improvement of project quick start documents, feature introduction, etc.

    Task

    1.Discuss with the mentors what you need to do

    2. Learn the details of the Apache EventMesh project

    3. Improve and supplement the content of documents on GitHub, maintain the official website documents, and record quick-start and feature demonstration videos for EventMesh

    Recommended Skills

    1.Familiar with MarkDown

    2. Familiar with Java\Go

    Mentor
    Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org

    Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Xue Weiming, mail: mikexue (at) apache.org
    Project Devs, mail: dev (at) eventmesh.apache.org

    Apache EventMesh Integrate eventmesh runtime on Kubernetes

    Apache EventMesh (incubating)
    Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.

    Website: https://eventmesh.apache.org

    GitHub: https://github.com/apache/incubator-eventmesh

    Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3327

    Background

    Currently, EventMesh has good usability in microservice scenarios. However, EventMesh's support for Kubernetes is still relatively weak. We hope the community can contribute EventMesh integration with Kubernetes.

    Task

    1.Discuss with the mentors your implementation idea

    2. Learn the details of the Apache EventMesh project

    3. Integrate EventMesh with the k8s

    Recommended Skills

    1.Familiar with Java

    2.Familiar with Kubernetes

    Mentor
    Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org

    Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Xue Weiming, mail: mikexue (at) apache.org
    Project Devs, mail: dev (at) eventmesh.apache.org

    Commons Numbers

    Add support for extended precision floating-point numbers

    Add implementations of extended precision floating point numbers.

    An extended precision floating point number is a series of floating-point numbers that are non-overlapping such that:

    double-double (a, b):
                |a| > |b|
                a == a + b

    Common representations are double-double and quad-double (see for example David Bailey's paper on a quad-double library: QD).

    Many computations in the Commons Numbers and Statistics libraries use extended precision computations where the accumulated error of a double would lead to complete cancellation of all significant bits; or create intermediate overflow of integer values.

    This project would formalise the code underlying these use cases with a generic library applicable for use in the case where the result is expected to be a finite value and using Java's BigDecimal and/or BigInteger negatively impacts performance.

    An example would be the average of long values where the intermediate sum overflows or the conversion to a double loses bits:

                long[] values = {Long.MAX_VALUE, Long.MAX_VALUE};
                System.out.println(Arrays.stream(values).average().getAsDouble());
                System.out.println(Arrays.stream(values).mapToObj(BigDecimal::valueOf)
                    .reduce(BigDecimal.ZERO, BigDecimal::add)
                    .divide(BigDecimal.valueOf(values.length)).doubleValue());
                long[] values2 = {Long.MAX_VALUE, Long.MIN_VALUE};
                System.out.println(Arrays.stream(values2).asDoubleStream().average().getAsDouble());
                System.out.println(Arrays.stream(values2).mapToObj(BigDecimal::valueOf)
                    .reduce(BigDecimal.ZERO, BigDecimal::add)
                    .divide(BigDecimal.valueOf(values2.length)).doubleValue());

    Outputs:

    -1.0
                9.223372036854776E18
                0.0
                -0.5
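The non-overlapping pair defined above can be produced with an error-free transformation; a minimal sketch using Knuth's TwoSum (not the project's actual API — just the core building block such a library would formalise):

```java
// Knuth's TwoSum: s + e == a + b exactly, where e captures the
// rounding error of the floating-point sum s = a + b.
// The pair (s, e) satisfies the double-double property: s == s + e.
public class TwoSum {
    static double[] twoSum(double a, double b) {
        double s = a + b;
        double bb = s - a;                       // the part of b absorbed into s
        double e = (a - (s - bb)) + (b - bb);    // recovered rounding error
        return new double[] {s, e};
    }

    public static void main(String[] args) {
        double[] r = twoSum(1e100, 1.0);
        System.out.println(r[0]); // 1.0E100
        System.out.println(r[1]); // 1.0 -- the bits a plain double sum loses
    }
}
```

Chaining such transformations yields the double-double and quad-double expansions described above.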
    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Alex Herbert, mail: aherbert (at) apache.org
    Project Devs, mail: dev (at) commons.apache.org

    ...

    [GSoC][Beam] Build out Beam Machine Learning Use Cases

    Today, you can do all sorts of Machine Learning using Apache Beam (https://beam.apache.org/documentation/ml/overview/).

    Many of our users, however, have a hard time getting started with ML and understanding how Beam can be applied to their day-to-day work. The goal of this project is to build out a series of Beam pipelines as Jupyter Notebooks demonstrating real-world ML use cases, from NLP to image recognition to using large language models. As you go, there may be bugs or friction points as well, which will provide opportunities to contribute back to Beam's core ML libraries.

    Mentor for this will be Danny McCormick

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    Commons Math

    [GSoC] Update components including machine learning; linear algebra; special functions

    Placeholder for tasks that could be undertaken in this year's GSoC.

    Ideas (extracted from the "dev" ML):

    1. Redesign and modularize the "ml" package
      -> main goal: enable multi-thread usage.
    2. Abstract the linear algebra utilities
      -> main goal: allow switching to alternative implementations.
    3. Redesign and modularize the "random" package
      -> main goal: general support of low-discrepancy sequences.
    4. Refactor and modularize the "special" package
      -> main goals: ensure accuracy and performance and better API,
      add other functions.
    5. Upgrade the test suite to JUnit 5
      -> additional goal: collect a list of "odd" expectations.
    6. Review and finalize pending issues about the refactoring of the "genetic algorithm" functionality (cf. dedicated branch)

    Other suggestions welcome, as well as

    • delineating additional and/or intermediate goals,
    • signalling potential pitfalls and/or alternative approaches to the intended goal(s).
    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Gilles Sadowski, mail: erans (at) apache.org
    Project Devs, mail: dev (at) commons.apache.org

    Refactoring of GA functionality

    As discussed extensively on the "dev" ML[1][2], there are two competing designs (please review them on the dedicated git branch) for the refactoring of the basic functionality currently implemented in the org.apache.commons.math4.legacy.genetics "legacy" package.

    TL;DR;

    • The discussion has pointed to major (from a maintenance POV) issues of the design proposed by the OP.
    • The alternative (much simpler) design has been implemented as proof-of-concept (indicating that some corner might have been cut).
    • The OP mentioned correctness issues in the "simple" design but did neither fix them nor provide answers on the ML to that effect.
    • Questions concerning other possible "bloat" (e.g. on using a custom representation of the "binary chromosome" concept instead of the BitSet available from the JDK) were also left dangling.
    • Refactoring of the "basic" GA functionality (the purpose of the "proof-of-concept") must be decoupled from the new feature which the OP wanted to implement ("adaptive probability generation").
    • Unit tests (a.o. all those from the "legacy" code) must demonstrate that the refactored code does (or does not) behave correctly, and bugs should be fixed on the "simple" implementation, before implementing the new feature on top of it.

    [1] https://markmail.org/message/qn7gq2y7xjoxukzp
    [2] https://markmail.org/message/f66iii3a4kmjaprr
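On the BitSet question above, a minimal illustration of what a JDK-based binary chromosome could look like (class and method names are hypothetical, not the API of either competing branch):

```java
import java.util.BitSet;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: a binary chromosome represented directly as a
// JDK BitSet instead of a custom type, as raised in the discussion.
public class BitSetChromosome {

    /** Build a random chromosome of the given length. */
    static BitSet randomChromosome(int length) {
        BitSet b = new BitSet(length);
        for (int i = 0; i < length; i++) {
            if (ThreadLocalRandom.current().nextBoolean()) {
                b.set(i);
            }
        }
        return b;
    }

    /** Single-point crossover: swap the tails of the two parents after "point". */
    static BitSet[] crossover(BitSet a, BitSet b, int length, int point) {
        BitSet c1 = (BitSet) a.clone();
        BitSet c2 = (BitSet) b.clone();
        for (int i = point; i < length; i++) {
            boolean t = c1.get(i);
            c1.set(i, c2.get(i));
            c2.set(i, t);
        }
        return new BitSet[] {c1, c2};
    }
}
```

The point of the sketch is only that the standard operators fall out of the JDK type directly, which is the "bloat" argument referenced in the discussion.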

    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Gilles Sadowski, mail: erans (at) apache.org
    Project Devs, mail: dev (at) commons.apache.org

    Commons Imaging

    Placeholder for 1.0 release

    A placeholder ticket, to link other issues and organize tasks related to the 1.0 release of Commons Imaging.

    The 1.0 release of Commons Imaging has been postponed several times. Now we have a clearer idea of what's necessary for 1.0 (see issues with fixVersion 1.0 and 1.0-alpha3, and other open issues), and the tasks are interesting as they involve both basic and advanced programming, such as organizing how test images are loaded, or working on performance improvements at the byte level following image format specifications.

    The tasks are not too hard to follow, as normally there are example images that need to work with Imaging, as well as other libraries in C, C++, Rust, PHP, etc., that process these images correctly. Our goal with this issue is to a) improve our docs, b) improve our tests, c) fix possible security issues, d) get the parsers in Commons Imaging ready for the 1.0 release.

    Assigning the label for GSoC 2023 as a full-time project, although it would also be possible to work on a smaller set of tasks for 1.0 part-time.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Bruno P. Kinoshita, mail: kinow (at) apache.org
    Project Devs, mail:

    CloudStack




    CloudStack GSoC 2023 - Autodetect IPs used inside the VM

    Github issue: https://github.com/apache/cloudstack/issues/7142

    Description:

    With regards to IP info reporting, CloudStack relies entirely on its DHCP databases and so on. When this is not available (L2 networks etc.) no IP information is shown for a given VM.

    I propose we introduce a mechanism for "IP autodetection" and try to discover the IPs used inside the machines by means of querying the hypervisors. For example with KVM/libvirt we can simply do something like this:

     
                [root@fedora35 ~]# virsh domifaddr win2k22 --source agent
                Name                          MAC address          Protocol     Address
                -------------------------------------------------------------------------------
                Ethernet                      52:54:00:7b:23:6a    ipv4         192.168.0.68/24
                Loopback Pseudo-Interface 1                        ipv6         ::1/128
                -                             -                    ipv4         127.0.0.1/8

    The above command queries the qemu-guest-agent inside the Windows VM. The VM needs to have the qemu-guest-agent installed and running, as well as the virtio serial drivers (easily done in this case with virtio-win-guest-tools.exe) and a guest-agent socket channel defined in libvirt.

    Once we have this information we could display it in the UI/API as "Autodetected VM IPs" or something like that.

    I imagine it's very similar for VMWare and XCP-ng.
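A rough sketch of how such agent output could be parsed on the CloudStack side (the regex, class and method names are illustrative assumptions; lines without a MAC column, like the Loopback line above, are simply skipped here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for `virsh domifaddr --source agent` table rows:
// name, MAC (or "-"), protocol, address/prefix, whitespace separated.
public class DomIfAddrParser {
    private static final Pattern LINE = Pattern.compile(
        "\\s*(\\S.*?)\\s+((?:[0-9a-f]{2}:){5}[0-9a-f]{2}|-)\\s+(ipv4|ipv6)\\s+(\\S+)");

    /** Extract "protocol address/prefix" pairs from the raw command output. */
    static List<String> addresses(String output) {
        List<String> ips = new ArrayList<>();
        for (String line : output.split("\\R")) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                ips.add(m.group(3) + " " + m.group(4));
            }
        }
        return ips;
    }
}
```

The resulting list is what the management server could surface as "Autodetected VM IPs".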

    Thank you

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstackcommons.apache.org

    CloudStack GSoC 2023 - Extend Import-Export Instances to the KVM Hypervisor

    Github issue: https://github.com/apache/cloudstack/issues/7127

    Description:

    The Import-Export functionality is currently only supported for the VMware hypervisor. It is built on a VM ingestion framework that allows extension to other hypervisors. The functionality consists of a few APIs and the UI to interact with them:

    • listUnmanagedInstances: Lists unmanaged virtual machines (not existing in CloudStack but existing on the hypervisor side)
    • importUnmanagedInstance: Import an unmanaged VM into CloudStack (this implies populating the database with the corresponding data)
    • unmanageVirtualMachine: Make CloudStack forget a VM but do not remove it on the hypervisor side

    The complexity on KVM lies in parsing the existing XML domains into different resources and mapping them in CloudStack to populate the database correctly.
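As a flavour of that parsing step, a self-contained sketch that pulls disk source paths out of a libvirt domain XML with the JDK's DOM parser (this is not CloudStack's actual importer; the class name and the sample XML are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative sketch: extract disk sources from a libvirt domain XML,
// the kind of mapping an importUnmanagedInstance for KVM would need.
public class DomainXmlSketch {

    static List<String> diskSources(String domainXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(domainXml.getBytes(StandardCharsets.UTF_8)));
        NodeList disks = doc.getElementsByTagName("disk");
        List<String> sources = new ArrayList<>();
        for (int i = 0; i < disks.getLength(); i++) {
            Element disk = (Element) disks.item(i);
            Element src = (Element) disk.getElementsByTagName("source").item(0);
            if (src != null) {
                sources.add(src.getAttribute("file"));
            }
        }
        return sources;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<domain type='kvm'><name>i-2-10-VM</name><devices>"
            + "<disk type='file' device='disk'>"
            + "<source file='/var/lib/libvirt/images/root.qcow2'/></disk>"
            + "</devices></domain>";
        System.out.println(diskSources(xml)); // [/var/lib/libvirt/images/root.qcow2]
    }
}
```

Equivalent extraction would be needed for NICs, CPU/memory and other devices before populating the CloudStack database.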

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org



    CloudStack GSoC 2023 - Improve ConfigDrive to store network information

    Github issue: https://github.com/apache/cloudstack/issues/2872


    ConfigDrive / cloud-init supports a network_data.json file which can contain network information for a VM.

    By providing the network information using ConfigDrive to a VM we can eliminate the need for DHCP and thus the Virtual Router in some use-cases.

    An example JSON file:

                {
                "links": [
                {
                "ethernet_mac_address": "52:54:00:0d:bf:93",
                "id": "eth0",
                "mtu": 1500,
                "type": "phy"
                }
                ],
                "networks": [
                {
                "id": "eth0",
                "ip_address": "192.168.200.200",
                "link": "eth0",
                "netmask": "255.255.255.0",
                "network_id": "dacd568d-5be6-4786-91fe-750c374b78b4",
                "routes": [
                {
                "gateway": "192.168.200.1",
                "netmask": "0.0.0.0",
                "network": "0.0.0.0"
                }
                ],
                "type": "ipv4"
                },
                {
                "id": "eth0",
                "ip_address": "2001:db8:100::1337",
                "link": "eth0",
                "netmask": "64",
                "network_id": "dacd568d-5be6-4786-91fe-750c374b78b4",
                "routes": [
                {
                "gateway": "2001:db8:100::1",
                "netmask": "0",
                "network": "::"
                }
                ],
                "type": "ipv6"
                }
                ],
                "services": [
                {
                "address": "8.8.8.8",
                "type": "dns"
                }
                ]
                }

    In Basic Networking and Advanced Networking zones which are using a shared network you wouldn't require a VR anymore.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - User friendly name of Downloaded Templates Volumes and ISOs

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Test button addition in Domains LDAP config

    Github issue: https://github.com/apache/cloudstack/issues/6934


    Please add a button to test the LDAP(S) connection, or a list button that lists some users from the configured LDAP server.


    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Configure NFS version for Primary Storage

    Github issue: https://github.com/apache/cloudstack/issues/4482


    NFS Primary Storage mounts are handled by libvirt.

    Currently libvirt defaults to NFS version 3 when mounting while it does support NFS version 4 if provided in the XML definition: https://libvirt.org/formatstorage.html#StoragePoolSource

                <source>
                <host name='localhost'/>
                <dir path='/var/lib/libvirt/images'/>
                <format type='nfs'/>
                <protocol ver='4'/>
                </source>

    Maybe pass the argument 'nfsvers' in the URL provided to the Management Server and then pass this down to the Hypervisors which generate the XML for libvirt.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Improve ConfigDrive to store network information

    Github issue: https://github.com/apache/cloudstack/issues/2872

    ConfigDrive / cloud-init supports a network_data.json file which can contain network information for a VM.

    By providing the network information to a VM using ConfigDrive, we can eliminate the need for DHCP, and thus the Virtual Router, in some use-cases.

    An example JSON file:

                {
                "links": [
                {
                "ethernet_mac_address": "52:54:00:0d:bf:93",
                "id": "eth0",
                "mtu": 1500,
                "type": "phy"
                }
                ],
                "networks": [
                {
                "id": "eth0",
                "ip_address": "192.168.200.200",
                "link": "eth0",
                "netmask": "255.255.255.0",
                "network_id": "dacd568d-5be6-4786-91fe-750c374b78b4",
                "routes": [
                {
                "gateway": "192.168.200.1",
                "netmask": "0.0.0.0",
                "network": "0.0.0.0"
                }
                ],
                "type": "ipv4"
                },
                {
                "id": "eth0",
                "ip_address": "2001:db8:100::1337",
                "link": "eth0",
                "netmask": "64",
                "network_id": "dacd568d-5be6-4786-91fe-750c374b78b4",
                "routes": [
                {
                "gateway": "2001:db8:100::1",
                "netmask": "0",
                "network": "::"
                }
                ],
                "type": "ipv6"
                }
                ],
                "services": [
                {
                "address": "8.8.8.8",
                "type": "dns"
                }
                ]
                }

    In Basic Networking and Advanced Networking zones which use a shared network, a Virtual Router would no longer be required.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org
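    The 'nfsvers' idea from the NFS Primary Storage project could look roughly like this (an illustrative Python sketch; the URL shape `nfs://host/path?nfsvers=4` and the parameter name are assumptions, and CloudStack's actual code is Java):

```python
from urllib.parse import urlparse, parse_qs
from xml.etree.ElementTree import Element, SubElement, tostring

def nfs_source_xml(storage_url: str) -> str:
    """Build the libvirt storage pool <source> element from a storage URL.

    A 'nfsvers' query parameter selects the NFS protocol version; libvirt
    defaults to version 3 when none is given.
    """
    u = urlparse(storage_url)
    ver = parse_qs(u.query).get("nfsvers", ["3"])[0]
    src = Element("source")
    SubElement(src, "host", {"name": u.hostname})
    SubElement(src, "dir", {"path": u.path})
    SubElement(src, "format", {"type": "nfs"})
    SubElement(src, "protocol", {"ver": ver})
    return tostring(src, encoding="unicode")
```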

    CloudStack GSoC 2023 - Use Calico or Cilium in CKS

    Github issue: https://github.com/apache/cloudstack/issues/6637


    The Weave project is looking for maintainers, so it may be worth exploring which CNI is widely used and standard/stable for the CKS use-case.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - SSL LetsEncrypt the Console Proxy

    Github issue: https://github.com/apache/cloudstack/issues/3141


    Add a new global option to enable Let's Encrypt on the Console Proxy, and a Let's Encrypt domain name option for automatic SSL certificate renewal.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2023 - Direct Download extension to Ceph storage

    Github issue: https://github.com/apache/cloudstack/issues/3065


    Extend the Direct Download functionality to work with Ceph storage.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org


    Apache Nemo

    Enhance Nemo to support autoscaling for bursty loads

    The load of streaming jobs usually fluctuates according to the input rate or operations (e.g., window). Supporting automatic scaling could reduce the operational cost of running streaming applications, while minimizing the performance degradation that can be caused by bursty loads.


    We can harness the cloud resources such as VMs and serverless frameworks to acquire computing resources on demand. To realize the automatic scaling, the following features should be implemented.


    1) state migration: scaling jobs require moving tasks (or partitioning a task to multiple ones). In this situation, the internal state of the task should be serialized/deserialized. 

    2) input/output rerouting: if a task is moved to a new worker, the input and output of the task should be redirected. 

    3) dynamic Executor or Task creation/deletion: Executors or Tasks can be dynamically created or deleted. 

    4) scaling policy: a scaling policy that decides when and how to scale out/in should be implemented. 

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Tae-Geon Um, mail: taegeonum (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org
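    Item 4), the scaling policy, can be illustrated with a minimal threshold-based sketch (the numbers, names, and function shape are assumptions for illustration, not Nemo's API):

```python
import math

def scaling_decision(input_rate: float, executors: int,
                     capacity_per_executor: float, tolerance: float = 0.2) -> int:
    """Return the executor-count delta: positive = scale out, negative = scale in."""
    needed = max(1, math.ceil(input_rate / capacity_per_executor))
    # A tolerance band avoids oscillating on small input-rate fluctuations.
    if needed > executors * (1 + tolerance) or needed < executors * (1 - tolerance):
        return needed - executors
    return 0
```

    A production policy would additionally account for state-migration cost, since every scale-out triggers the serialization and rerouting steps listed above.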

    Collect task statistics necessary for estimating duration


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Hwarim Hyun, mail: hwarim (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Detect skewed task periodically


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Hwarim Hyun, mail: hwarim (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Dynamic Task Sizing on Nemo

    This is an umbrella issue to keep track of the issues related to the dynamic task sizing feature on Nemo.

    Dynamic task sizing needs to consider a workload and try to decide on the optimal task size based on the runtime metrics and characteristics. It should have an effect on the parallelism and the partitioning, i.e. how many partitions an intermediate dataset should be divided/shuffled into, while effectively handling skew along the way.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Wonook, mail: wonook (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Dynamic Work Stealing on Nemo for handling skews

    We aim to handle the problem of throttled (heterogeneous) resources and skewed input data. In order to solve this problem, we suggest dynamic work stealing, where tasks dynamically track each other's statuses and steal workloads from one another. To do this, we have the following action items:

    • Dynamically collecting task statistics during execution
    • Detecting skewed tasks periodically
    • Splitting the data allocated in skewed tasks and reallocating them into new tasks
    • Synchronizing the optimization procedure
    • Evaluation of the resulting implementations

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Wonook, mail: wonook (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org
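    The first three action items can be sketched as follows (the threshold, data shapes, and function names are assumptions for illustration):

```python
from statistics import median

def detect_skewed(remaining: dict[str, int], k: float = 2.0) -> list[str]:
    """A task whose remaining work exceeds k times the median is skewed."""
    m = median(remaining.values())
    return [task for task, work in remaining.items() if work > k * m]

def steal(remaining: dict[str, int], skewed_task: str, idle_task: str) -> dict[str, int]:
    """Move half of a skewed task's remaining work to an idle task."""
    moved = remaining[skewed_task] // 2
    out = dict(remaining)
    out[skewed_task] -= moved
    out[idle_task] = out.get(idle_task, 0) + moved
    return out
```

    The remaining items (synchronizing the optimization and rerouting inputs/outputs) are where the real engineering effort lies.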

    Implement an Accurate Simulator based on Functional model

    Missing a deadline often has significant consequences for the business, and a simulator can also contribute to other optimization approaches.

    So we want to implement a simulator for stream processing based on functional models.

    There are some requirements:

    • The simulation should be able to execute before or during job execution.
    • When a simulation is executed while the job is running, it must be fast enough not to affect the job.
    • Information about the running environment is received through arguments.
    • At least the network topology should be considered for the WAN environment.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Lee Hae Dong, mail: Lemarais (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org
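    As a toy illustration of a functional model (the stage names and costs are made up), each stage's finish time is a pure function of its dependencies' finish times plus its processing and per-dependency network cost, so job completion time can be estimated without running the job:

```python
def simulate(stages: dict[str, dict], order: list[str]) -> dict[str, float]:
    """Estimate finish times for a DAG of stages.

    stages: name -> {"deps": [names], "proc": seconds, "net": seconds per dep}.
    order must be a topological order of the DAG.
    """
    finish: dict[str, float] = {}
    for name in order:
        s = stages[name]
        # A stage is ready once every dependency has finished and transferred.
        ready = max((finish[d] + s["net"] for d in s["deps"]), default=0.0)
        finish[name] = ready + s["proc"]
    return finish
```

    Because the model is purely functional, it can be re-evaluated cheaply with different topologies or resource parameters, which is what makes "simulate while the job runs" feasible.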

    Implement a model that represents task-level execution time with statistical analysis

    The current SimulatedTaskExecutor is hardly usable, because it needs actual metrics to predict the execution time. To increase its utility, we need a new model that predicts task-level execution time with statistical analysis. 

    Some of the related TODOs are as follows:

    • Find factors that affect task-level execution time, with a loose grid search.
    • Infer the most suitable model with a tight grid search. 

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Lee Hae Dong, mail: Lemarais (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org
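    The grid-search TODOs could start from something as simple as this sketch (the linear model form `time ≈ a * input_size + b` is a hypothetical example): scan a loose grid first, then refine with a tighter grid around the winner.

```python
def grid_fit(samples: list[tuple[float, float]],
             a_grid: list[float], b_grid: list[float]) -> tuple[float, float]:
    """Fit time ~ a * x + b by exhaustively scoring a grid of (a, b) pairs.

    samples: (input_size, observed_time) pairs; returns the lowest-error pair.
    """
    def err(a: float, b: float) -> float:
        return sum((a * x + b - t) ** 2 for x, t in samples)
    return min(((a, b) for a in a_grid for b in b_grid), key=lambda ab: err(*ab))
```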

    Implement spill mechanism on Nemo

    Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) errors or GC pressure when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in-memory.

    We need to spill in-memory data to secondary storage when there is not enough memory in the executor.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org
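    A minimal sketch of such a mechanism (the budget, eviction choice, and on-disk layout are all assumptions, not Nemo's design): keep partitions in memory up to a budget, spill the largest entry to disk when the budget is exceeded, and read spilled entries back transparently.

```python
import os
import pickle
import tempfile

class SpillableStore:
    """Toy key-value store that spills its largest entries to disk."""

    def __init__(self, memory_budget: int):
        self.budget = memory_budget          # bytes allowed in memory
        self.mem: dict[str, bytes] = {}      # in-memory serialized entries
        self.disk: dict[str, str] = {}       # key -> spill file path
        self.dir = tempfile.mkdtemp(prefix="spill-")

    def put(self, key: str, value) -> None:
        self.mem[key] = pickle.dumps(value)
        # Spill largest entries until we are back under budget.
        while sum(map(len, self.mem.values())) > self.budget and len(self.mem) > 1:
            victim = max(self.mem, key=lambda k: len(self.mem[k]))
            path = os.path.join(self.dir, victim)
            with open(path, "wb") as f:
                f.write(self.mem.pop(victim))
            self.disk[victim] = path

    def get(self, key: str):
        if key in self.mem:
            return pickle.loads(self.mem[key])
        with open(self.disk[key], "rb") as f:
            return pickle.loads(f.read())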

    Approximate the factors that affect the stage group level execution time

    There are some factors that can affect the stage-group-level simulation, such as latency, the rate of skewed data, the error rate of the executor, etc. It is required to find a reasonable distribution form for these factors, such as the normal distribution or the Landau distribution. In an actual run, this makes it possible to approximate the model with a small amount of data.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Lee Hae Dong, mail: Lemarais (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Efficient Caching and Spilling on Nemo

    In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs one.

    • Identify and persist frequently used data and unpersist it when its usage ended
    • Spill in-memory data to disk upon memory pressure
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Runtime Level Caching Mechanism

    If the compile time identifies what data can be cached, the runtime requires logic to make this happen.

    Implementation needs:

    • (Driver) receive and update the status of blocks from various Executors, right now this seems to be best implemented as part of BlockManagerMaster
    • (Driver) communicate to the  Executors the availability, location and status of blocks
    • Possible concurrency issues:
    1. Concurrency in Driver when multiple Executors update/inquire the same block information
    2. Concurrency in Executor when a single cached block is accessed simultaneously

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Dongjoo Lee, mail: codinggosu (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org
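    The driver-side bookkeeping could be sketched like this (the name BlockRegistry is hypothetical, not Nemo's actual BlockManagerMaster API); a single lock around the map addresses concurrency issue 1, where multiple Executors update or inquire about the same block:

```python
import threading

class BlockRegistry:
    """Thread-safe driver-side map of block id -> executor locations."""

    def __init__(self):
        self._lock = threading.Lock()
        self._locations: dict[str, set] = {}

    def report(self, block_id: str, executor: str, status: str) -> None:
        """Executors report 'AVAILABLE' or 'LOST' for a block replica."""
        with self._lock:
            locs = self._locations.setdefault(block_id, set())
            if status == "AVAILABLE":
                locs.add(executor)
            else:
                locs.discard(executor)

    def locations(self, block_id: str) -> set:
        """Return a snapshot of executors currently holding the block."""
        with self._lock:
            return set(self._locations.get(block_id, set()))
```

    Concurrency issue 2 (simultaneous reads of one cached block inside an Executor) would need a separate per-block lock or read-only sharing of immutable data.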

    Efficient Dynamic Reconfiguration in Stream Processing

    In stream processing, we have many methods, starting from primitive checkpoint-and-replay up to fancier reconfiguration and reinitiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Wonook, mail: wonook (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Evaluate the performance of Work Stealing implementation


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Hwarim Hyun, mail: hwarim (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Nemo on Google Dataproc

    Issues for making it easy to install and use Nemo on Google Dataproc.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    John Yang, mail: johnyangk (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Apache Dubbo

    Dubbo GSoC 2023 - Refactor the http layer

    Background

    Dubbo currently supports the REST protocol based on HTTP/1 and the triple protocol based on HTTP/2, but these two HTTP-based protocols are implemented independently; they cannot share the underlying implementation, and their respective implementation costs are relatively high.

    Target

    In order to reduce maintenance costs, we hope to abstract the HTTP layer so that the underlying implementation is independent of the protocol, and different protocols can reuse the related implementations.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Integration suite on Kubernetes

    As a development framework that is closely related to users, Dubbo may have a huge impact on users if any problems occur during the iteration process. Therefore, Dubbo needs a complete set of automated regression testing tools.
    At present, Dubbo already has a set of testing tools based on docker-compose, but this set of tools cannot test the compatibility in the kubernetes environment. At the same time, we also need a more reliable test case construction system to ensure that the test cases are sufficiently complete

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Dubbo usage scanner

    As a development framework closely related to users, Dubbo provides many functional features (such as configuring timeouts, retries, etc.). We hope that a tool can be given to users to scan which features are used, which features are deprecated, which ones will be deprecated in the future, and so on. Based on this tool, we can provide users with a better migration solution.
    Suggestion: You can consider based on static code scanning or javaagent implementation

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:
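    A static-scanning variant of the Dubbo usage scanner could start as small as this (the deprecated-feature table below is invented for illustration; a real tool would maintain it from Dubbo's release notes and could also offer a javaagent mode for runtime detection):

```python
import re

# Made-up example entries; a real scanner would track deprecated Dubbo
# APIs and configuration keys per release.
DEPRECATED = {
    "com.example.dubbo.LegacyFilter": "deprecated; migrate to the Filter SPI",
    "timeout.legacy": "deprecated configuration key",
}

def scan_source(text: str) -> list[tuple[str, str]]:
    """Return (feature, migration note) pairs for deprecated features in text."""
    return [(feat, note) for feat, note in DEPRECATED.items()
            if re.search(re.escape(feat), text)]
```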

    Dubbo GSoC 2023 - Remove jprotoc in compiler

    Dubbo supports the communication mode based on the gRPC protocol through Triple. For this reason, Dubbo has developed a compiling plug-in for proto files based on jprotoc. Due to the activeness of jprotoc, currently Dubbo compiler cannot run well on the latest protobuf version. Therefore, we need to consider implementing a new compiler with reference to gRPC.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Dubbo i18n log

    Dubbo is a development framework that is closely related to users, and many usages by users may cause exceptions handled by Dubbo. Usually, in this case, users can only judge through logs. We hope to provide an i18n localized log output tool to provide users with a more friendly log troubleshooting experience

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Refactor dubbo project to gradle

    As more and more projects start to develop based on Gradle and profit from Gradle, Dubbo also hopes to migrate to the Gradle project. This task requires you to transform the dubbo project[1] into a gradle project.


     [1] https://github.com/apache/dubbo

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Metrics on Dubbo Admin

    Dubbo Admin is a console of Dubbo. Today, Dubbo's observability is becoming more and more powerful. We need to directly observe some indicators of Dubbo on Dubbo Admin, and even put forward suggestions for users to improve problems.

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:


    Dubbo GSoC 2023 - Refactor Connection

    Background

    At present, the abstraction of connection by client in different protocols in Dubbo is not perfect. For example, there is a big discrepancy between the client abstraction of connection in dubbo and triple protocols. As a result, the enhancement of connection-related functions in the client is more complicated, and the implementation cannot be reused. At the same time, the client also needs to implement a lot of repetitive code when extending the protocol.

    Target

    Reduce the complexity of the client part when extending the protocol, and increase the reuse of connection-related modules.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    ...

    ...

    Dubbo GSoC 2023 - IDL management

    Background

    Dubbo currently supports protobuf as a serialization method. Protobuf relies on proto (IDL) files for code generation, but Dubbo currently lacks tools for managing IDL files. For example, Java users need the proto files for each compilation, which is rather troublesome, and everyone is used to using jar packages for dependencies.

    Target

    Implement an IDL management and control platform, support IDL files to automatically generate dependency packages in various languages, and push them to the relevant dependency repositories.
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Service Deployer

    For a large number of monolithic applications, problems such as performance will be encountered during large-scale deployment. For interface-oriented programming languages, Dubbo provides the capability of RPC remote calls, and we can help applications decouple through interfaces. Therefore, we can provide a deployer to help users realize the decoupling and splitting of microservices during deployment, and quickly provide performance optimization capabilities.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - API manager

    Since Dubbo runs on a distributed architecture, it naturally has the problem of difficult API interface definition management. It is often difficult for us to know which interface is running in the production environment. So we can provide an API-definition reporting platform, or even a management platform. This platform can automatically collect all APIs of the cluster, or APIs can be directly defined by the user, and then unified distribution management is carried out through a mechanism similar to git and maven package management.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - JSON compatibility check

    Dubbo currently supports a large number of Java language features through hessian under the Java SDK, such as generics, interfaces, etc. These capabilities will not be compatible when calling across systems. Therefore, Dubbo needs to provide the ability to detect the interface definition and determine whether the interface published by the user can be described by native JSON.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:
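    The core of such a check can be sketched in a language-neutral way (the type encoding below is a hypothetical stand-in for a parsed Java signature): walk every parameter and return type and verify that it reduces to something plain JSON can describe.

```python
# Primitive type names that map directly onto JSON values.
JSON_SAFE = {"string", "number", "boolean", "null"}

def json_compatible(t) -> bool:
    """True if a (hypothetical) type description is representable in plain JSON."""
    if isinstance(t, str):
        return t in JSON_SAFE        # anything else, e.g. a Java interface, fails
    if isinstance(t, dict):          # object type: field name -> field type
        return all(json_compatible(v) for v in t.values())
    if isinstance(t, list):          # array type: [element type]
        return all(json_compatible(v) for v in t)
    return False

def check_method(params: list, ret) -> bool:
    """A method is JSON-describable if all parameters and its return type are."""
    return all(map(json_compatible, params)) and json_compatible(ret)
```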

    Dubbo GSoC 2023 - Automated Performance Testing Mechanism

    Dubbo currently provides a very simple performance testing tool. But for such a complex framework as Dubbo, the functional coverage is very low. We urgently need a testing tool that can test multiple complex scenarios. In addition, we also hope that this set of testing tools can be run automatically, so that we can track the current performance of Dubbo in time.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Dubbo SPI Extensions on WASM

    WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Many Dubbo capabilities are extensible, such as custom interceptors, routing, load balancing, etc. To let a user's extension implementation be reused across Dubbo's multi-language SDKs, we can run it cross-platform on top of Wasm.

    This topic requires designing a Wasm extension mechanism for Dubbo, covering both the Java and Go implementations, and supporting at least the Filter, Router and LoadBalance extension points.
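    The extension-point shape involved can be sketched as follows (a simplified stand-in for Dubbo's real SPI, with hypothetical names); a Wasm-backed extension would implement the same interface but delegate its work to a compiled Wasm module:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for an SPI extension point: extensions register
// under a name and are looked up by that name at runtime. A Wasm-backed
// Filter would register here too, delegating invoke() to a Wasm module.
public final class MiniExtensionLoader {

    public interface Filter {
        String invoke(String request);
    }

    private static final Map<String, Filter> FILTERS = new HashMap<>();

    public static void register(String name, Filter filter) {
        FILTERS.put(name, filter);
    }

    public static Filter get(String name) {
        Filter f = FILTERS.get(name);
        if (f == null) throw new IllegalStateException("no such filter: " + name);
        return f;
    }
}
```

    Because the lookup is by name and the contract is a narrow interface, the loader does not need to know whether an extension is native Java or a Wasm guest.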

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Metrics on Dubbo Admin

    Dubbo Admin is the console of Dubbo. As Dubbo's observability becomes more and more powerful, we want to surface some of Dubbo's metrics directly in Dubbo Admin, and even offer suggestions that help users resolve the problems those metrics reveal.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Dubbo Client on WASM

    WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. For web clients we can provide a Wasm-based Dubbo client, so that front-end developers can initiate Dubbo requests directly from the browser and keep the full call chain on Dubbo.

    This task requires an implementation that runs in a browser such as Chrome and initiates requests to a Dubbo backend.

    Dubbo GSoC 2023 - Refactor Connection

    Background

    At present, the connection abstraction on the client side differs across Dubbo's protocols; for example, the dubbo and triple protocols abstract connections quite differently. As a result, enhancing connection-related functionality in the client is complicated, implementations cannot be reused, and extending a protocol forces the client to re-implement a lot of repetitive code.

    Target

    Reduce the complexity of client-side code when extending a protocol, and increase the reuse of connection-related modules.
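    One possible shape for such a shared abstraction (purely illustrative; the names are hypothetical, not Dubbo's current classes): a protocol-agnostic Connection interface that dubbo- and triple-style clients both implement, so that cross-cutting logic such as retry is written once:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative, protocol-agnostic connection abstraction. Each protocol
// (dubbo, triple, ...) supplies its own transport behind this interface,
// so connection-related logic like retry can be shared across protocols.
public final class ConnectionDemo {

    public interface Connection {
        boolean isAvailable();
        String send(String payload);
        void close();
    }

    // Shared decorator: retries send() up to a fixed number of attempts.
    public static String sendWithRetry(Connection c, String payload, int attempts) {
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return c.send(payload);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    // A fake transport that fails a given number of times, then succeeds.
    public static Connection flaky(int failures) {
        AtomicInteger remaining = new AtomicInteger(failures);
        return new Connection() {
            public boolean isAvailable() { return true; }
            public String send(String payload) {
                if (remaining.getAndDecrement() > 0) throw new RuntimeException("transient failure");
                return "ok:" + payload;
            }
            public void close() { }
        };
    }
}
```

    Any protocol whose client implements the one interface gets the retry behavior for free, which is the kind of reuse the refactor is after.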

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Pure Dubbo RPC API

    At present, Dubbo provides both RPC capabilities and a large number of service governance capabilities. As a result, Dubbo is hard to use when some of Dubbo's own components need only the RPC capabilities, or when users want an extremely lightweight framework.
    Goal: provide a pure Dubbo RPC kernel, so that users can program directly against service calls and focus on RPC.
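    To make the goal concrete, the kind of minimal, governance-free programming model this project is after might look like the following (an entirely hypothetical API, shown with an in-memory stand-in for the wire transport):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a "pure RPC" programming model: export a
// handler, then call it by service name. No registry, routing or other
// governance machinery involved -- just the call itself.
public final class RpcKernelSketch {

    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    public void export(String service, Function<String, String> handler) {
        handlers.put(service, handler);
    }

    public String call(String service, String request) {
        Function<String, String> h = handlers.get(service);
        if (h == null) throw new IllegalStateException("no provider for " + service);
        return h.apply(request); // in-memory stand-in for the wire transport
    }
}
```

    The point of the sketch is the surface area: a kernel this small is what the governance layers would then be built on top of.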

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - IDL management

    Background

    Dubbo currently supports protobuf as a serialization method. Protobuf relies on proto (IDL) files for code generation, but Dubbo currently lacks tooling for managing these IDL files. Java users, for example, must work with raw proto files on every compilation, which is cumbersome compared with the jar-based dependency workflow they are used to.

    Target

    Implement an IDL management platform that automatically generates dependency packages in multiple languages from IDL files and pushes them to the relevant package repositories.
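    For instance (a hypothetical convention, just to illustrate the kind of mapping the platform would own), a proto package and version could be mapped deterministically onto Maven coordinates, with analogous rules for npm, Go modules, and other ecosystems:

```java
// Hypothetical mapping from a proto package + version to the Maven
// coordinates the platform would publish. The "-proto" suffix and the
// group/artifact split are invented for illustration only.
public final class IdlCoordinates {

    public static String mavenCoordinate(String protoPackage, String version) {
        int lastDot = protoPackage.lastIndexOf('.');
        String groupId = lastDot < 0 ? protoPackage : protoPackage.substring(0, lastDot);
        String artifactId = (lastDot < 0 ? protoPackage : protoPackage.substring(lastDot + 1)) + "-proto";
        return groupId + ":" + artifactId + ":" + version;
    }
}
```

    A deterministic mapping like this is what lets the platform publish generated stubs automatically and lets consumers depend on them without ever touching the proto files.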

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - HTTP/3 Rest Support

    HTTP/3 was formalized as a standard in the last year. As a framework that supports publishing and invoking web services, Dubbo needs to support the HTTP/3 protocol.

    This task requires extending the current rest protocol implementation to publish HTTP/3 services and to call HTTP/3 services.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    Dubbo GSoC 2023 - Refactor the http layer

    Background

    Dubbo currently supports the rest protocol on top of HTTP/1 and the triple protocol on top of HTTP/2, but the two HTTP-based protocols are implemented independently: the underlying implementations cannot be swapped, and each carries a relatively high implementation cost.

    Target

    In order to reduce maintenance costs, we hope to abstract the HTTP layer so that the underlying HTTP implementation is decoupled from any particular protocol, and different protocols can reuse the same implementation.
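    The target shape could resemble the following (hypothetical interfaces, not Dubbo's current code): a protocol-neutral HTTP transport that both rest and triple sit on, so the HTTP/1, HTTP/2, or HTTP/3 implementation underneath can be swapped freely:

```java
import java.util.Map;

// Hypothetical protocol-neutral HTTP layer: rest and triple would both
// program against HttpTransport, and HTTP/1, HTTP/2 or HTTP/3 backends
// could be swapped in without touching either protocol.
public final class HttpLayerSketch {

    public interface HttpTransport {
        HttpResponse exchange(String method, String path, Map<String, String> headers, byte[] body);
    }

    public static final class HttpResponse {
        public final int status;
        public final byte[] body;
        public HttpResponse(int status, byte[] body) { this.status = status; this.body = body; }
    }

    // A loopback transport standing in for a real HTTP/1-3 implementation.
    public static HttpTransport loopback() {
        return (method, path, headers, body) ->
                new HttpResponse(200, (method + " " + path).getBytes());
    }
}
```

    Because protocols would only ever see HttpTransport, replacing the loopback with an HTTP/3 backend is a one-line change at wiring time, which is exactly the reuse the refactor aims for.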

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    ...