...
Contents
- James Server
- Beam
- TrafficControl
- ShardingSphere
- StreamPipes
- RocketMQ
- SkyWalking
- ShenYu
- EventMesh
- Commons Statistics
- Commons Numbers
- Commons Math
- Commons Imaging
- CloudStack
- Nemo
- Apache Dubbo
- Dubbo GSoC 2023 - Refactor the http layer
- Dubbo GSoC 2023 - Refactor Connection
- Dubbo GSoC 2023 - IDL management
- Apache Commons All
- Airavata
...
Adopt Pulsar as the messaging technology backing the distributed James server
https://www.mail-archive.com/server-dev@james.apache.org/msg71462.html
A good long-term objective for the PMC is to drop RabbitMQ in favor of Pulsar (third parties could package their own components using RabbitMQ if they wish).
This means:
- Solve the bugs that were found during the Pulsar MailQueue review
- Allow the Pulsar MailQueue to list blobs, in order to be deduplication friendly
- Provide an event bus based on Pulsar
- Provide a task manager based on Pulsar
- Package a distributed server backed by Pulsar, then deprecate and replace the current one
- (optionally) Support mail queue priorities
While contributions would of course be welcomed on this topic, we could
offer it as part of GSOC 2022, and we could co-mentor it with mentors of
the Pulsar community (see [3])
[3] https://lists.apache.org/thread/y9s7f6hmh51ky30l20yx0dlz458gw259
Would such a plan gain traction around here?
Implement a web ui for James administration
James today provides a command-line tool to perform administration tasks like creating a domain, listing users, setting quotas, etc.
It requires access to the JMX port, and even if a lot of admins are comfortable with such tools, to broaden our user base we should probably expose the same commands over REST and provide a decent default web UI.
The task requires some basic frontend skills to design an administration board, knowledge of what REST means, and enough Java understanding to add commands to the existing REST backend.
In the team, we have a strong focus on testing (who wants a mail server that is not tested enough?), so we will explain and/or teach the student how to get the right test coverage of the features using modern tools like Cucumber, Selenium, REST Assured, etc.
[GSOC] James as a (distributed) MX server
Why ?
Alternatives like Postfix...
- Do not offer a unified view of the mail queue across nodes
- Require stateful persistent storage
Given Apache James's recent push to adopt a distributed mail queue based on Pulsar supporting delays (JAMES-3687), it makes sense to start developing MX-related tooling.
I propose to mentor a GSoC on this topic.
Benefits for the student
At the end of this GSOC you will...
- Have a solid understanding of email relaying and associated mechanics
- Understand James modular architecture (mailet/ matcher / routes)
- Have a hands-on expertise in SQL / NoSQL working with technologies like Cassandra, Redis, JPA...
- Identify and solve architecture problems.
- Conduct performance tests and develop an operational mindset
Inventory...
James ships a couple of MX-related tools within smtp-hooks/mailets in the default packages. It would make sense to me to move those into an extension.
James supports today...
Checks against DNS blacklists: the `DNSRBLHandler` or `URIRBLHandler` SMTP hooks, for instance. These can be moved into an extension, IMO.
We would need a little performance benchmark to document the performance implications of activating DNS-RBL.
Finally, as noted by a Gitter user, it would make more sense to implement this as a MailHook rather than a RcptHook, as that would avoid doing the same job over and over for each recipient. See JAMES-3820.
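For reference, the query mechanics behind such a check are simple. The sketch below illustrates the standard DNSBL scheme (it is not James's actual handler code, and the zone name is just an example): the client's IPv4 octets are reversed, the blocklist zone is appended, and a successful A-record lookup of the resulting name means "listed".

```java
// Standard DNSBL query-name construction (illustrative, not James code).
public class DnsblQuery {
    static String queryName(String ipv4, String zone) {
        String[] o = ipv4.split("\\.");
        // Reverse the octets and append the blocklist zone.
        return o[3] + "." + o[2] + "." + o[1] + "." + o[0] + "." + zone;
    }

    public static void main(String[] args) {
        // Resolving this name against the list decides whether the IP is listed.
        System.out.println(queryName("203.0.113.7", "zen.spamhaus.org"));
    }
}
```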
Grey listing: there's an existing implementation using JDBC as the underlying storage.
Move it into an extension.
Remove the JDBC storage and propose two storage possibilities: in-memory for a single node, Redis for a distributed topology.
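A minimal sketch of the in-memory variant (class and method names here are hypothetical, not James's API): the first delivery attempt for an unknown (ip, sender, recipient) triplet is temporarily rejected, and a retry after the delay window is accepted, which most spam engines never perform.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory greylist: remembers when a triplet was first seen
// and only accepts it once the configured delay has elapsed.
public class InMemoryGreyList {
    private final Map<String, Instant> firstSeen = new ConcurrentHashMap<>();
    private final Duration delay;

    public InMemoryGreyList(Duration delay) {
        this.delay = delay;
    }

    /** Returns true once the triplet has waited out the greylist delay. */
    public boolean accept(String ip, String sender, String recipient, Instant now) {
        String key = ip + "|" + sender + "|" + recipient;
        Instant first = firstSeen.putIfAbsent(key, now);
        return first != null && !now.isBefore(first.plus(delay));
    }
}
```

A Redis-backed implementation for the distributed topology could follow the same shape, replacing the map with a `SET key NX` plus a TTL.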
Some work around whitelist mailets? Move them into an extension, and propose JPA, Cassandra, and XML-configured implementations, with a route to manage entries for the JPA and Cassandra ones?
I would expect a student to do their own audit and come up with extra suggestions!
Beam
[GSoC][Beam] An IntelliJ plugin to develop Apache Beam pipelines and the Apache Beam SDKs
Beam library developers and Beam users would appreciate this : )
This project involves prototyping a few different solutions, so it will be large.
...
GSoC Integrate RocketMQ 5.0 client with Spring
Apache RocketMQ
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.
Page: https://rocketmq.apache.org
Github: https://github.com/apache/rocketmq
Background
The RocketMQ 5.0 client has been released recently, and we need to integrate it with Spring.
Related issue: https://github.com/apache/rocketmq-clients/issues/275
Task
- Become familiar with RocketMQ 5.0 Java client usage; you can find more details at https://github.com/apache/rocketmq-clients/tree/master/java and https://rocketmq.apache.org/docs/quickStart/01quickstart
- Integrate with Spring.
Relevant Skills
- Java language
- Basic knowledge of RocketMQ 5.0
- Spring
Mentor
Rongtong Jin, PMC of Apache RocketMQ, jinrongtong@apache.org
Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org
SkyWalking
[GSOC] [SkyWalking] AIOps Log clustering with Flink (Algorithm Optimization)
Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation with a revised architecture, and this task will require the student to focus on algorithm optimization for the clustering technique.
[GSOC] [SkyWalking] AIOps Log clustering with Flink (Flink Integration)
Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation with a revised architecture, and this task will require the student to focus on Flink and its integration with SkyWalking OAP.
GSoC Make RocketMQ support higher versions of Java
Apache RocketMQ
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.
Page: https://rocketmq.apache.org
Github: https://github.com/apache/rocketmq
Background
RocketMQ is a widely used message middleware system in the Java community, which mainly supports Java 8. As Java has evolved, many new features and improvements have been added to the language and the Java Virtual Machine (JVM). However, RocketMQ still lacks compatibility with the latest Java versions, preventing users from taking advantage of new features and performance improvements. Therefore, we are seeking community support to upgrade RocketMQ to support higher versions of Java and enable the use of new features and JVM parameters.
Task
We aim to update the RocketMQ codebase to support newer versions of Java in a cross-compile manner. The goal is to enable RocketMQ to work with Java 17, while maintaining backward compatibility with previous versions of Java. This will involve identifying and updating any dependencies that need to be changed to support the new Java versions, as well as testing and verifying that the new version of RocketMQ works correctly. With these updates, users will be able to take advantage of the latest Java features and performance improvements. We hope that the community can come together to support this task and make RocketMQ a more versatile and powerful middleware system.
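One possible starting point for the cross-compile setup (a sketch, not the project's actual build configuration; the plugin version is an example) is to build with a recent JDK while emitting Java 8 compatible bytecode via the compiler plugin's release flag, with a separate CI matrix entry running the test suite on Java 17:

```xml
<!-- Sketch: compile with a newer JDK but target Java 8 bytecode so older
     runtimes stay supported; plugin version shown is only an example. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <version>3.11.0</version>
  <configuration>
    <release>8</release>
  </configuration>
</plugin>
```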
Relevant Skills
- Java language
- Having a good understanding of the new features in higher versions of Java, particularly LTS versions.
Mentor
Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org
[GSOC] [SkyWalking] Python Agent Performance Enhancement Plan
Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This task is about enhancing Python agent performance; the tracking issue can be seen here: https://github.com/apache/skywalking/issues/10408
[GSOC] [SkyWalking] Pending Task on K8s
Apache SkyWalking is an application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This task is about a pending task on K8s.
ShenYu
Apache ShenYu GSoC 2023 - Design and implement shenyu ingress-controller in k8s
Background
Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.
Tasks
1. Discuss with mentors, and complete the requirements design and technical design of shenyu-ingress-controller.
2. Complete the initial version of shenyu-ingress-controller, implement the reconciliation of the k8s Ingress API, and make ShenYu the ingress gateway of k8s.
3. Complete the CI tests of shenyu-ingress-controller to verify the correctness of the code.
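As an illustration of the target user experience (the ingressClassName value, service name, and port below are assumptions to be settled during the design phase), a standard Ingress resource would be reconciled by shenyu-ingress-controller and routed through ShenYu:

```yaml
# Hypothetical example of an Ingress handled by the new controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
spec:
  ingressClassName: shenyu   # assumed class name for the new controller
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-service   # assumed backend service
                port:
                  number: 8080
```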
Relevant Skills
1. Know how to use Apache ShenYu
2. Familiarity with Java and Golang
3. Familiarity with Kubernetes and the ability to use Java or Golang to develop a Kubernetes controller
Description
Issues : https://github.com/apache/shenyu/issues/4438
website : https://shenyu.apache.org/
Commons Statistics
[GSoC] RocketMQ TieredStore Integration with HDFS
Github Issue: https://github.com/apache/rocketmq/issues/6282
Apache RocketMQ and HDFS
- Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
- Hadoop Distributed File System (HDFS) is a distributed file system designed to store and manage large data sets across multiple servers or clusters. HDFS provides a reliable, scalable, and fault-tolerant platform for storing and accessing data that can be accessed by a variety of applications running on the hadoop cluster.
Background
High-speed storage media, such as solid-state drives (SSDs), are typically more expensive than traditional hard disk drives (HDDs). To minimize storage costs, the local data disk size of a RocketMQ broker is often limited. HDFS can store large amounts of data at a lower cost, and it has better support for storing and retrieving data sequentially rather than randomly. In order to preserve message data over a long period or facilitate message export, the RocketMQ project previously introduced a tiered storage plugin. Now it is necessary to implement a storage plugin that saves data on HDFS.
Relevant Skills
- Interest in messaging middleware and distributed storage systems
- Java development skills
- A good understanding of the RocketMQ and HDFS models
Anyways, the most important relevant skill is motivation and readiness to learn during the project!
Tasks
- Understand the basic concepts and principles of distributed systems
- Provide related design documents
- Develop a plugin that uses HDFS as the backend storage to store RocketMQ message data
- Write effective unit test code
- *Suggest improvements to the tiered storage interface
- *Whatever else comes to mind; further ideas are always welcome
Learning Material
- RocketMQ HomePage (https://rocketmq.apache.org) Github: https://github.com/apache/rocketmq
- RocketMQ Tiered Storage Design (https://github.com/apache/rocketmq/wiki/RIP-57-Tiered-storage-for-RocketMQ)
- HDFS HomePage (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html)
Name and contact information
- Mentor: Zhimin Li, Apache RocketMQ Committer, lizhimin@apache.org
- Mailing List: dev@rocketmq.apache.org
- Website: https://rocketmq.apache.org/ and https://hadoop.apache.org/
[GSoC] Summary statistics API for Java 8 streams
Placeholder for tasks that could be undertaken in this year's GSoC.
Ideas:
- Design an updated summary statistics API for use with Java 8 streams based on the summary statistic implementations in the Commons Math stat.descriptive package including moments, rank and summary sub-packages.
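As a sketch of the shape such an API might take (class and method names here are illustrative, not the actual Commons Statistics API), a stream-friendly statistic needs an accumulator plus a combiner so that Stream.collect can evaluate it in parallel:

```java
// Illustrative stream-friendly statistic: accept() accumulates one value,
// combine() merges two partial results, as required for parallel streams.
public class Mean {
    private long n;
    private double mean; // running mean, Welford-style increment

    public void accept(double x) {
        n++;
        mean += (x - mean) / n;
    }

    public Mean combine(Mean other) {
        long total = n + other.n;
        if (total != 0) {
            mean = (n * mean + other.n * other.mean) / total;
        }
        n = total;
        return this;
    }

    public double getAsDouble() {
        return mean;
    }
}
```

A stream would then use it as `LongStream.rangeClosed(1, 100).parallel().collect(Mean::new, (m, x) -> m.accept(x), (a, b) -> a.combine(b))`.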
Commons Numbers
[GSoC] [RocketMQ] The performance tuning of RocketMQ proxy
Apache RocketMQ
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.
Page: https://rocketmq.apache.org
Repo: https://github.com/apache/rocketmq
Background
RocketMQ 5.0 has released a new module called `proxy`, which supports gRPC and remoting protocol. Additionally, it can be deployed in two modes, namely Local and Cluster modes. The performance tuning task will provide contributors with a comprehensive understanding of Apache RocketMQ and its intricate data flow, presenting a unique opportunity for beginners to acquaint themselves with and actively participate in our community.
Task
The task is to tune the RocketMQ proxy for optimal latency and throughput. It requires a thorough knowledge of the Java implementation and the ability to fine-tune Netty, gRPC, the operating system, and RocketMQ itself. We anticipate that the developer responsible for this task will provide a performance report with measurements of both latency and throughput.
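For the performance report, latencies are usually summarized by percentiles rather than averages; a minimal helper of that kind (illustrative only, not part of RocketMQ) could look like:

```java
import java.util.Arrays;

// Nearest-rank percentile over recorded request latencies (microseconds).
public class LatencyReport {
    /** Nearest-rank percentile, p in (0, 100]. */
    public static long percentile(long[] latenciesMicros, double p) {
        long[] sorted = latenciesMicros.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```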
Relevant Skills
Basic knowledge of RocketMQ 5.0, Netty, gRPC, and operating systems.
Mailing List: dev@rocketmq.apache.org
Mentor
Zhouxiang Zhan, committer of Apache RocketMQ, zhouxzhan@apache.org
Add support for extended precision floating-point numbers
Add implementations of extended precision floating point numbers.
An extended precision floating point number is a series of floating-point numbers that are non-overlapping such that:
double-double (a, b):
|a| > |b|
a == a + b
Common representations are double-double and quad-double (see for example David Bailey's paper on a quad-double library: QD).
Many computations in the Commons Numbers and Statistics libraries use extended precision computations where the accumulated error of a double would lead to complete cancellation of all significant bits; or create intermediate overflow of integer values.
This project would formalise the code underlying these use cases with a generic library applicable for use in the case where the result is expected to be a finite value and using Java's BigDecimal and/or BigInteger negatively impacts performance.
An example would be the average of long values where the intermediate sum overflows or the conversion to a double loses bits:
long[] values = {Long.MAX_VALUE, Long.MAX_VALUE};
System.out.println(Arrays.stream(values).average().getAsDouble());
System.out.println(Arrays.stream(values).mapToObj(BigDecimal::valueOf)
    .reduce(BigDecimal.ZERO, BigDecimal::add)
    .divide(BigDecimal.valueOf(values.length)).doubleValue());

long[] values2 = {Long.MAX_VALUE, Long.MIN_VALUE};
System.out.println(Arrays.stream(values2).asDoubleStream().average().getAsDouble());
System.out.println(Arrays.stream(values2).mapToObj(BigDecimal::valueOf)
    .reduce(BigDecimal.ZERO, BigDecimal::add)
    .divide(BigDecimal.valueOf(values2.length)).doubleValue());
Outputs:
-1.0
9.223372036854776E18
0.0
-0.5
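The non-overlapping pair (a, b) in the definition above is typically produced by an error-free transformation such as Knuth's branch-free two-sum, which recovers the exact rounding error of a double addition; a minimal sketch:

```java
// Knuth's branch-free two-sum: returns {s, e} where s is the rounded sum
// and e is the exact rounding error, so s + e equals a + b exactly and
// (s, e) forms a non-overlapping double-double.
public class TwoSum {
    static double[] twoSum(double a, double b) {
        double s = a + b;
        double bVirtual = s - a;
        double e = (a - (s - bVirtual)) + (b - bVirtual);
        return new double[] {s, e};
    }

    public static void main(String[] args) {
        double[] r = twoSum(1e16, 1.0); // 1.0 is lost in a plain double add
        System.out.println(r[0]);       // prints 1.0E16
        System.out.println(r[1]);       // prints 1.0 (the recovered error)
    }
}
```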
Commons Math
[GSoC] Update components including machine learning; linear algebra; special functions
Placeholder for tasks that could be undertaken in this year's GSoC.
Ideas (extracted from the "dev" ML):
- Redesign and modularize the "ml" package
  -> main goal: enable multi-thread usage.
- Abstract the linear algebra utilities
  -> main goal: allow switching to alternative implementations.
- Redesign and modularize the "random" package
  -> main goal: general support of low-discrepancy sequences.
- Refactor and modularize the "special" package
  -> main goals: ensure accuracy and performance, provide a better API, add other functions.
- Upgrade the test suite to JUnit 5
  -> additional goal: collect a list of "odd" expectations.
- Review and finalize pending issues about the refactoring of the "genetic algorithm" functionality (cf. dedicated branch)
Other suggestions welcome, as well as
- delineating additional and/or intermediate goals,
- signalling potential pitfalls and/or alternative approaches to the intended goal(s).
Commons Imaging
EventMesh
Apache EventMesh official website docs by version and demo show
Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.
Website: https://eventmesh.apache.org
GitHub: https://github.com/apache/incubator-eventmesh
Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3327
Background
We hope that the community can contribute to the maintenance of documents, including the archiving of Chinese and English content of documents of different release versions, the maintenance of official website documents, the improvement of project quick start documents, feature introduction, etc.
Task
1. Discuss with the mentors what you need to do
2. Learn the details of the Apache EventMesh project
3. Improve and supplement the document content on GitHub, maintain the official website documents, and record quick-start walkthroughs and feature demonstration videos for EventMesh
Recommended Skills
1. Familiarity with Markdown
2. Familiarity with Java/Go
Mentor
Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org
Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org
Placeholder for 1.0 release
A placeholder ticket, to link other issues and organize tasks related to the 1.0 release of Commons Imaging.
The 1.0 release of Commons Imaging has been postponed several times. Now we have a clearer idea of what's necessary for 1.0 (see issues with fixVersion 1.0 and 1.0-alpha3, and other open issues), and the tasks are interesting as they involve both basic and advanced programming, such as organizing how test images are loaded, or working on performance improvements at the byte level while following image format specifications.
The tasks are not too hard to follow, as normally there are example images that need to work with Imaging, as well as other libraries in C, C++, Rust, PHP, etc., that process these images correctly. Our goal with this issue is to a) improve our docs, b) improve our tests, c) fix possible security issues, d) get the parsers in Commons Imaging ready for the 1.0 release.
Assigning the label for GSoC 2023, full time, although it would also be possible to work part time on a smaller set of tasks for 1.0.
CloudStack
Apache EventMesh Integrate eventmesh runtime on Kubernetes
Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.
Website: https://eventmesh.apache.org
GitHub: https://github.com/apache/incubator-eventmesh
Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3327
Background
Currently, EventMesh has good usability in microservice scenarios. However, EventMesh's support for Kubernetes is still relatively weak. We hope the community can contribute to EventMesh's integration with Kubernetes.
Task
1. Discuss with the mentors your implementation idea
2. Learn the details of the Apache EventMesh project
3. Integrate EventMesh with Kubernetes
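A minimal starting sketch for running the runtime on Kubernetes is shown below (the image tag, port, and replica count are assumptions; actual values depend on the EventMesh runtime configuration):

```yaml
# Hypothetical Deployment for eventmesh-runtime; values are examples only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventmesh-runtime
spec:
  replicas: 2
  selector:
    matchLabels:
      app: eventmesh-runtime
  template:
    metadata:
      labels:
        app: eventmesh-runtime
    spec:
      containers:
        - name: eventmesh-runtime
          image: apache/eventmesh:latest   # assumed image tag
          ports:
            - containerPort: 10000         # assumed TCP listener port
```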
Recommended Skills
1. Familiarity with Java
2. Familiarity with Kubernetes
Mentor
Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org
Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org
Commons Math
Refactoring of GA functionality
As discussed extensively on the "dev" ML[1][2], there are two competing designs (please review them on the dedicated git branch) for the refactoring of the basic functionality currently implemented in the org.apache.commons.math4.legacy.genetics "legacy" package.
TL;DR:
- The discussion has pointed to major (from a maintenance POV) issues in the design proposed by the OP.
- The alternative (much simpler) design has been implemented as a proof of concept (meaning that some corners might have been cut).
- The OP mentioned correctness issues in the "simple" design but neither fixed them nor provided answers on the ML to that effect.
- Questions concerning other possible "bloat" (e.g. on using a custom representation of the "binary chromosome" concept instead of the BitSet available from the JDK) were also left dangling.
- Refactoring of the "basic" GA functionality (the purpose of the proof of concept) must be decoupled from the new feature which the OP wanted to implement ("adaptive probability generation").
- Unit tests (a.o. all those from the "legacy" code) must demonstrate that the refactored code does (or does not) behave correctly, and bugs should be fixed in the "simple" implementation before implementing the new feature on top of it.
[1] https://markmail.org/message/qn7gq2y7xjoxukzp
[2] https://markmail.org/message/f66iii3a4kmjaprr
CloudStack
CloudStack GSoC 2023 - Autodetect IPs used inside the VM
Github issue: https://github.com/apache/cloudstack/issues/7142
Description:
With regards to IP info reporting, CloudStack relies entirely on its DHCP databases and so on. When this is not available (L2 networks, etc.) no IP information is shown for a given VM.
I propose we introduce a mechanism for "IP autodetection" and try to discover the IPs used inside the machines by querying the hypervisors. For example, with KVM/libvirt we can simply do something like this:
[root@fedora35 ~]# virsh domifaddr win2k22 --source agent
 Name                        MAC address          Protocol     Address
-------------------------------------------------------------------------------
 Ethernet                    52:54:00:7b:23:6a    ipv4         192.168.0.68/24
 Loopback Pseudo-Interface 1                      ipv6         ::1/128
 -                           -                    ipv4         127.0.0.1/8
The above command queries the qemu-guest-agent inside the Windows VM. The VM needs to have the qemu-guest-agent installed and running as well as the virtio serial drivers (easily done in this case with virtio-win-guest-tools.exe ) as well as a guest-agent socket channel defined in libvirt.
Once we have this information we could display it in the UI/API as "Autodetected VM IPs" or something like that.
I imagine it's very similar for VMWare and XCP-ng.
Thank you
CloudStack GSoC 2023 - Extend Import-Export Instances to the KVM Hypervisor
Github issue: https://github.com/apache/cloudstack/issues/7127
Description:
The Import-Export functionality is currently only available for the VMware hypervisor. It is built on a VM ingestion framework that allows extension to other hypervisors. The Import-Export functionality consists of a few APIs and the UI to interact with them:
- listUnmanagedInstances: Lists unmanaged virtual machines (not existing in CloudStack but existing on the hypervisor side)
- importUnmanagedInstance: Import an unmanaged VM into CloudStack (this implies populating the database with the corresponding data)
- unmanageVirtualMachine: Make CloudStack forget a VM but do not remove it on the hypervisor side
The complexity on KVM lies in parsing the existing XML domains into different resources and mapping them in CloudStack to populate the database correctly.
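To make the KVM parsing step concrete, here is a minimal sketch of extracting a few of the resources CloudStack would need from a libvirt domain XML. The `UnmanagedVm` holder class and its fields are invented for illustration; the real importUnmanagedInstance data model differs:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pull name, vCPU count, and disk image paths out of a
// libvirt domain XML, as a first step toward populating CloudStack's database.
public class DomainXmlParser {

    public static class UnmanagedVm {
        public String name;
        public int vcpus;
        public List<String> diskSources = new ArrayList<>();
    }

    public static UnmanagedVm parse(String domainXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(domainXml.getBytes(StandardCharsets.UTF_8)));
            UnmanagedVm vm = new UnmanagedVm();
            vm.name = doc.getElementsByTagName("name").item(0).getTextContent();
            vm.vcpus = Integer.parseInt(
                    doc.getElementsByTagName("vcpu").item(0).getTextContent().trim());
            NodeList disks = doc.getElementsByTagName("disk");
            for (int i = 0; i < disks.getLength(); i++) {
                NodeList sources = ((Element) disks.item(i)).getElementsByTagName("source");
                if (sources.getLength() > 0) {
                    String file = ((Element) sources.item(0)).getAttribute("file");
                    if (!file.isEmpty()) vm.diskSources.add(file);
                }
            }
            return vm;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable domain XML", e);
        }
    }
}
```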
Apache Nemo
Enhance Nemo to support autoscaling for bursty loads
The load of streaming jobs usually fluctuates according to the input rate or the operations involved (e.g., windowing). Supporting automatic scaling could reduce the operational cost of running streaming applications while minimizing the performance degradation caused by bursty loads.
We can harness cloud resources such as VMs and serverless frameworks to acquire computing resources on demand. To realize automatic scaling, the following features should be implemented.
1) state migration: scaling jobs require moving tasks (or partitioning a task to multiple ones). In this situation, the internal state of the task should be serialized/deserialized.
2) input/output rerouting: if a task is moved to a new worker, the input and output of the task should be redirected.
3) dynamic Executor or Task creation/deletion: Executors or Tasks can be dynamically created or deleted.
4) scaling policy: a scaling policy that decides when and how to scale out/in should be implemented.
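As an illustration of item 4), a minimal scaling policy might compare the observed input rate against aggregate processing capacity and decide how many workers to add or remove. The class, its parameters, and the headroom factor are assumptions for this sketch, not part of Nemo:

```java
// Hypothetical sketch of a scaling policy: keep enough workers to absorb the
// current input rate plus a configurable safety margin.
public class ScalingPolicy {

    private final double headroom; // e.g. 1.2 = keep 20% spare capacity

    public ScalingPolicy(double headroom) {
        this.headroom = headroom;
    }

    /**
     * @param inputRate           observed events/sec entering the job
     * @param perWorkerThroughput events/sec a single worker can process
     * @param currentWorkers      number of workers currently allocated
     * @return positive = scale out by that many workers, negative = scale in
     */
    public int decide(double inputRate, double perWorkerThroughput, int currentWorkers) {
        int needed = (int) Math.ceil(inputRate * headroom / perWorkerThroughput);
        needed = Math.max(needed, 1); // never scale to zero workers
        return needed - currentWorkers;
    }
}
```

A real policy would also smooth the input rate over a window and rate-limit decisions to avoid oscillation.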
Collect task statistics necessary for estimating duration
Detect skewed tasks periodically
Dynamic Task Sizing on Nemo
This is an umbrella issue to keep track of the issues related to the dynamic task sizing feature on Nemo.
Dynamic task sizing needs to consider the workload and decide on the optimal task size based on runtime metrics and characteristics. It affects the parallelism and the partitions, i.e., how many partitions an intermediate dataset should be divided/shuffled into, while effectively handling skew in the meanwhile.
Dynamic Work Stealing on Nemo for handling skews
We aim to handle the problem of throttled (heterogeneous) resources and skewed input data. To solve this problem, we suggest dynamic work stealing, which dynamically tracks task statuses and steals workloads among tasks. To do this, we have the following action items:
- Dynamically collecting task statistics during execution
- Detecting skewed tasks periodically
- Splitting the data allocated in skewed tasks and reallocating them into new tasks
- Synchronizing the optimization procedure
- Evaluation of the resulting implementations
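The "detecting skewed tasks" step above could be sketched as an outlier test over per-task statistics: flag any task whose processed-bytes metric exceeds the mean by more than k standard deviations. The metric choice and threshold are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: periodically scan collected task statistics and return
// the indices of tasks whose size is a statistical outlier (mean + k * stddev).
public class SkewDetector {

    public static List<Integer> detectSkewed(List<Long> taskBytes, double k) {
        double mean = taskBytes.stream().mapToLong(Long::longValue).average().orElse(0);
        double variance = taskBytes.stream()
                .mapToDouble(b -> (b - mean) * (b - mean)).average().orElse(0);
        double stddev = Math.sqrt(variance);
        List<Integer> skewed = new ArrayList<>();
        for (int i = 0; i < taskBytes.size(); i++) {
            if (taskBytes.get(i) > mean + k * stddev) skewed.add(i);
        }
        return skewed;
    }
}
```

The flagged tasks would then be candidates for splitting and reallocation as described in the action items.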
Implement an Accurate Simulator based on Functional model
Missing a deadline often has significant consequences for the business, and a simulator can contribute to other approaches for optimization.
We therefore implement a simulator for stream processing based on functional models.
There are some requirements:
- Simulation should be able to run before or during job execution.
- When a simulation is executed while the job is running, it must be fast enough not to affect the job.
- Information about the running environment is received through arguments.
- At least the network topology should be considered for WAN environments.
Implement a model that represents task-level execution time with statistical analysis
The current SimulatedTaskExecutor is hardly usable, because it needs actual metrics to predict execution time. To increase its utility, we need a new model that predicts task-level execution time with statistical analysis.
Some of the related TODOs are as follows:
- Find the factors that affect task-level execution time with a loose grid search.
- Infer the most suitable model with a tight grid search.
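As a toy version of the kind of model such a search might select, the sketch below fits task execution time against a single factor (input data size) with ordinary least squares. Real factors (locality, GC, network) would add dimensions; everything here is a one-variable illustration, not Nemo code:

```java
// Hypothetical sketch: a linear model time = slope * dataSize + intercept,
// fitted with ordinary least squares over observed (size, time) pairs.
public class ExecTimeModel {

    private final double slope;
    private final double intercept;

    private ExecTimeModel(double slope, double intercept) {
        this.slope = slope;
        this.intercept = intercept;
    }

    public static ExecTimeModel fit(double[] sizes, double[] times) {
        int n = sizes.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += sizes[i];
            sy += times[i];
            sxx += sizes[i] * sizes[i];
            sxy += sizes[i] * times[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new ExecTimeModel(slope, intercept);
    }

    public double predict(double dataSize) {
        return slope * dataSize + intercept;
    }
}
```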
Implement spill mechanism on Nemo
Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) errors or GC pressure when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in memory.
We need to spill in-memory data to secondary storage when there is not enough memory in the executor.
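A minimal sketch of the idea, assuming a per-buffer byte budget: keep serialized partitions in memory and spill the oldest ones to a temp file once the budget is exceeded. The class and method names are invented; Nemo's real block store would differ:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a spill mechanism: append serialized partitions and
// evict the oldest ones to disk whenever the in-memory byte budget is exceeded.
public class SpillableBuffer {

    private final long memoryBudgetBytes;
    private long inMemoryBytes = 0;
    private long spilledBytes = 0;
    private final Deque<byte[]> inMemory = new ArrayDeque<>();
    private final Path spillFile;

    public SpillableBuffer(long memoryBudgetBytes) {
        this.memoryBudgetBytes = memoryBudgetBytes;
        try {
            this.spillFile = Files.createTempFile("nemo-spill-", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void append(byte[] partition) {
        inMemory.addLast(partition);
        inMemoryBytes += partition.length;
        // Spill oldest partitions until we are back under budget (keep at least one).
        while (inMemoryBytes > memoryBudgetBytes && inMemory.size() > 1) {
            byte[] victim = inMemory.removeFirst();
            try {
                Files.write(spillFile, victim, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            inMemoryBytes -= victim.length;
            spilledBytes += victim.length;
        }
    }

    public long inMemoryBytes() { return inMemoryBytes; }
    public long spilledBytes() { return spilledBytes; }
}
```

A production version would also need to read spilled partitions back for downstream tasks and account for serialization overhead in the budget.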
Approximate the factors that affect the stage group level execution time
There are factors that can affect stage-group-level simulation, such as latency, the rate of skewed data, and the error rate of the executor. It is necessary to find a reasonable distribution for these factors, such as the normal distribution or the Landau distribution. In an actual run, this makes it possible to approximate the model with a small amount of data.
Efficient Caching and Spilling on Nemo
In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs them.
- Identify and persist frequently used data, and unpersist it when its usage has ended
- Spill in-memory data to disk upon memory pressure
Runtime Level Caching Mechanism
If compile time identifies what data can be cached, the runtime requires logic to make this happen.
Implementation needs:
- (Driver) receive and update the status of blocks from various Executors; right now this seems best implemented as part of BlockManagerMaster
- (Driver) communicate to the Executors the availability, location and status of blocks
- Possible concurrency issues:
- Concurrency in Driver when multiple Executors update/inquire the same block information
- Concurrency in Executor when a single cached block is accessed simultaneously.
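The driver-side bookkeeping above can be sketched as a concurrency-safe registry mapping a cached block ID to the executors holding it, in the spirit of BlockManagerMaster. The class and method names here are illustrative assumptions, not Nemo's actual API:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track which Executors hold which cached blocks, safe
// against multiple Executors updating or inquiring the same block concurrently.
public class CachedBlockRegistry {

    private final Map<String, Set<String>> blockLocations = new ConcurrentHashMap<>();

    // Called when an executor reports that it now caches a block.
    public void registerLocation(String blockId, String executorId) {
        blockLocations
            .computeIfAbsent(blockId, id -> ConcurrentHashMap.newKeySet())
            .add(executorId);
    }

    // Called when an executor evicts or loses a block.
    public void removeLocation(String blockId, String executorId) {
        blockLocations.computeIfPresent(blockId, (id, executors) -> {
            executors.remove(executorId);
            return executors.isEmpty() ? null : executors; // drop empty entries
        });
    }

    public Set<String> locationsOf(String blockId) {
        return blockLocations.getOrDefault(blockId, Set.of());
    }
}
```

Using `computeIfAbsent`/`computeIfPresent` keeps each per-block update atomic, which addresses the driver-side concurrency issue without a global lock.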
Efficient Dynamic Reconfiguration in Stream Processing
In stream processing, we have many methods, ranging from primitive checkpoint-and-replay to fancier reconfiguration and reinitiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.
Evaluate the performance of Work Stealing implementation
Nemo on Google Dataproc
Issues for making it easy to install and use Nemo on Google Dataproc
Apache Dubbo
Dubbo GSoC 2023 - Integration suite on Kubernetes
As a development framework closely related to users, Dubbo may have a huge impact on users if any problems occur during the iteration process. Therefore, Dubbo needs a complete set of automated regression testing tools.
At present, Dubbo already has a set of testing tools based on docker-compose, but these tools cannot test compatibility in a Kubernetes environment. At the same time, we also need a more reliable test case construction system to ensure that the test cases are sufficiently complete.
...
Dubbo GSoC 2023 - Metrics on Dubbo Admin
Dubbo Admin is the console of Dubbo. Today, Dubbo's observability is becoming more and more powerful. We need to surface some of Dubbo's metrics directly on Dubbo Admin, and even offer suggestions to help users fix problems.
Dubbo GSoC 2023 - Refactor Connection
Background
At present, the abstraction of connections by clients across different protocols in Dubbo is not perfect. For example, there is a big discrepancy between the client connection abstractions of the dubbo and triple protocols. As a result, enhancing connection-related functions in the client is complicated, and implementations cannot be reused. At the same time, the client also needs to implement a lot of repetitive code when extending the protocol.
Target
Reduce the complexity of the client part when extending the protocol, and increase the reuse of connection-related modules.
Dubbo GSoC 2023 - Refactor the http layer
Background
Dubbo currently supports the rest protocol based on http1 and the triple protocol based on http2, but these two http-based protocols are implemented independently; they cannot share or replace the underlying implementation, and their respective implementation costs are relatively high.
Target
In order to reduce maintenance costs, we hope to abstract http so that the underlying http implementation is independent of the protocol, and different protocols can reuse related implementations.
Dubbo GSoC 2023 - IDL management
Background
Dubbo currently supports protobuf as a serialization method. Protobuf relies on proto (IDL) files for code generation, but Dubbo currently lacks tools for managing IDL files. For example, Java users must use proto files for each compilation, which is troublesome, as everyone is used to depending on jar packages.
Target
Implement an IDL management and control platform that supports automatically generating dependency packages in various languages from IDL files and pushing them to the relevant dependency repositories.
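To make the http-abstraction idea more tangible, here is one possible shape for a protocol-agnostic http layer: a request/response pair plus a transport interface that both the http1-based and http2-based protocols could target. All names here are invented for this sketch and do not reflect Dubbo's actual APIs:

```java
import java.util.Map;

// Hypothetical sketch: a protocol-independent http abstraction. Each wire
// version (http/1.1, http/2) provides one HttpTransport implementation, and
// protocols such as rest or triple depend only on the interface.
public class HttpAbstractionSketch {

    public record HttpRequest(String method, String path,
                              Map<String, String> headers, byte[] body) {}

    public record HttpResponse(int status, Map<String, String> headers, byte[] body) {}

    public interface HttpTransport {
        String version();                         // e.g. "http/1.1" or "http/2"
        HttpResponse exchange(HttpRequest request);
    }

    // A trivial in-memory transport, only to show a protocol using the abstraction.
    public static class EchoTransport implements HttpTransport {
        public String version() { return "http/2"; }
        public HttpResponse exchange(HttpRequest request) {
            return new HttpResponse(200, Map.of("x-echo-path", request.path()), request.body());
        }
    }
}
```

With this split, swapping the underlying http implementation means providing another HttpTransport, without touching protocol code.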
...