James Server
Adopt Pulsar as the messaging technology backing the distributed James server
https://www.mail-archive.com/server-dev@james.apache.org/msg71462.html
A good long-term objective for the PMC is to drop RabbitMQ in
favor of Pulsar (third parties could package their own components using
RabbitMQ if they wish...)
This means:
- Solve the bugs that were found during the Pulsar MailQueue review
- Pulsar MailQueue needs to allow listing blobs in order to be deduplication friendly.
- Provide an event bus based on Pulsar
- Provide a task manager based on Pulsar
- Package a distributed server backed by Pulsar, deprecate then replace the current one.
- (optionally) support mail queue priorities
While contributions would of course be welcomed on this topic, we could
offer it as part of GSOC 2022, and we could co-mentor it with mentors of
the Pulsar community (see [3])
[3] https://lists.apache.org/thread/y9s7f6hmh51ky30l20yx0dlz458gw259
Would such a plan gain traction around here ?
...
Apache SkyWalking: Add the webapp of banyandb
BanyanDB, as an observability database, aims to ingest, analyze and store Metrics, Tracing, and Logging data. It's designed to handle observability data generated by Apache SkyWalking.
We need a web-based application to
- Query the data from BanyanDB's data nodes
- Monitor the performance of the backend
- Render the topology of server nodes
...
APISIX
Apache APISIX: Multi programming languages SDK support
Project title:
Multiple programming languages client SDK support with OpenAPI generator.
Apache APISIX is a dynamic, real-time, high-performance API gateway.
It provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, and more.
Page: https://apisix.apache.org/
Github: https://github.com/apache/apisix
Background:
OpenAPI Generator allows the generation of API client libraries (SDK generation), server stubs, documentation, and configuration automatically given an OpenAPI Spec.
We can use it to provide Apache APISIX Admin and Control API SDKs in multiple programming languages. In the future, we may integrate the Java SDK into the Spring framework and a Spring Boot starter, or even integrate with ASP.NET.
Task:
Generate a multilingual SDK through the definition files of the OpenAPI specification and use the OpenAPI Generator tool to generate client SDKs for Admin and Control APIs.
Difficulty: Normal
Project size: ~350 hours.
References:
https://github.com/OpenAPITools/openapi-generator
Apache APISIX: Elasticsearch plugin
Apache APISIX is a dynamic, real-time, high-performance API gateway.
It provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, and more.
Page: https://apisix.apache.org/
Github: https://github.com/apache/apisix
Background: Elasticsearch is a widespread search engine based on Apache Lucene. It allows users to index, store, and search for data via a REST API. Data going through APISIX are good candidates to be transferred to Elasticsearch for later analysis.
Task: The intern should evaluate different possible designs, analyze their pros and cons, and implement at least one in agreement with the mentor.
In particular, the intern should investigate ES requirements for writing data (amount of data, frequency, etc.) prior to any development.
Difficulty: Normal
Project size: ~175 hours.
References:
Potential Mentor: ZhengSong Tu, https://github.com/tzssangglass
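One design question raised above is how much data to send per write and how often. As a minimal sketch of one option, the snippet below groups log entries into fixed-size batches and serializes each batch into the newline-delimited body accepted by the Elasticsearch `_bulk` endpoint. The index name `apisix-logs`, the entry fields, and the batch size are illustrative assumptions, and an actual APISIX plugin would be written in Lua rather than Python.

```python
import json


def build_bulk_payload(entries, index_name):
    """Serialize entries into a newline-delimited _bulk request body."""
    lines = []
    for entry in entries:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(entry))
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n" if lines else ""


def batches(entries, max_batch_size):
    """Yield successive slices of at most max_batch_size entries."""
    for start in range(0, len(entries), max_batch_size):
        yield entries[start:start + max_batch_size]


# Hypothetical access-log entries collected by the gateway.
logs = [
    {"uri": "/a", "status": 200},
    {"uri": "/b", "status": 502},
    {"uri": "/c", "status": 200},
]
payloads = [build_bulk_payload(batch, "apisix-logs") for batch in batches(logs, 2)]
```

Batching by count is only one possible policy; batching by payload size or by a flush interval are alternatives the evaluation could compare.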
Apache APISIX: Support local file and data center configuration conversion, import and export
Apache APISIX is a dynamic, real-time, high-performance API gateway.
It provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, and more.
Page: https://apisix.apache.org/
Github: https://github.com/apache/apisix
Project title:
Support conversion, export and import of data center and local file configuration via the Apache APISIX CLI.
Background:
Apache APISIX supports running in standalone mode. In this mode, Apache APISIX relies on the local configuration file `conf/apisix.yaml` for routing and policy settings.
Having the Apache APISIX CLI support the conversion, import and export of data center and local file configuration data makes Apache APISIX easier to switch and apply between different environments and scenarios.
Task:
Add two commands, `bin/apisix conf_export` and `bin/apisix conf_import`, to the Apache APISIX CLI, and implement the conversion, import and export of remote data center and local file configuration data through these commands.
Difficulty: Normal
Project size: ~350 hours.
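The conversion step behind the two proposed commands can be sketched as a pair of pure functions. The snippet assumes that the data center stores each resource under a key of the form `/apisix/<type>/<id>` with a JSON value, and that the local file groups resources into per-type lists; both layouts and the function names are assumptions made for illustration and would need to be checked against the real formats. Writing the result out as YAML is left aside.

```python
import json


def export_to_local(remote_items, prefix="/apisix"):
    """Group key/value pairs from the data center by resource type."""
    local = {}
    for key, raw_value in sorted(remote_items.items()):
        if not key.startswith(prefix + "/"):
            continue
        parts = key[len(prefix) + 1:].split("/")
        if len(parts) != 2:
            continue  # this sketch ignores directory markers and nested keys
        resource_type, resource_id = parts
        item = dict(json.loads(raw_value))
        item["id"] = resource_id
        local.setdefault(resource_type, []).append(item)
    return local


def import_from_local(local, prefix="/apisix"):
    """Turn the grouped local form back into key/value pairs."""
    remote_items = {}
    for resource_type, items in local.items():
        for item in items:
            body = {k: v for k, v in item.items() if k != "id"}
            key = f"{prefix}/{resource_type}/{item['id']}"
            remote_items[key] = json.dumps(body, sort_keys=True)
    return remote_items


# Hypothetical data center content: one route and one upstream.
remote = {
    "/apisix/routes/1": json.dumps({"uri": "/hello", "upstream_id": "1"}, sort_keys=True),
    "/apisix/upstreams/1": json.dumps(
        {"type": "roundrobin", "nodes": {"127.0.0.1:1980": 1}}, sort_keys=True
    ),
}
local_config = export_to_local(remote)
```

Keeping export and import as inverse functions makes a round-trip test straightforward, which is useful when the same configuration must move between environments.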
Apache APISIX: Java Plugin Runner Improvement
Background:
At the moment, the Java plugin runner requires users to start from an existing template project and change it according to their needs.
Task:
Improve developer experience on the existing Java plugin runner so that we can attract and increase the number of users from the Java community.
Limitations:
- The architecture doesn’t manage multiple plugins. All need to be set in the same project
- The standard Java unit of deployment is the JAR.
- The plugin doesn’t allow for other widespread JVM-based languages (e.g., Scala, Kotlin, Clojure, Groovy). Though it would be technically feasible, we would need to change the template’s language
Requirements:
The new plugin runner:
- MUST use the JAR as the unit of deployment
- MUST NOT require the usage of a project template
- MAY require the plugin to follow a certain class hierarchy (i.e., extends JavaPlugin)
- MAY use a more specific format to enforce a structure
- MUST allow multiple plugins to be deployed
- MUST use isolated classloader for each plugin
- MUST allow any JVM-compatible bytecode to run, whatever the language it was generated from
- MAY allow hot reloading of Java plugins
- MAY require a single JAR per plugin (to ease the classpath management of shared libraries)
- MUST define a minimum JVM version
Difficulty: Normal
Project size: ~350 hours.
Community Development
[SkyWalking] Log outlier detection
Currently, Apache SkyWalking can collect logs from various sources, such as user agents and Envoy access logs. It also provides a log analysis language to analyze the logs and produce metrics; with those metrics, users can configure rules to trigger alerts and react to abnormal or exceptional logs.
But in real production environments, exceptional logs are not known in advance, and users can't enumerate all possible exceptional logs.
This task aims to add an algorithm that can identify outlier logs among the massive volume of logs and draw the user's attention to check whether there is an error in the system.
The algorithm should be able to learn from both the historical logs and the streaming logs, and adjust itself to increase its accuracy.
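As a rough illustration of what "learning from both history and stream" can mean, the sketch below reduces each log line to a template by masking numbers, counts how often each template has been seen, and flags lines whose template is still rare. This is one simple frequency-based approach chosen for illustration, not the algorithm the task prescribes; the class name, the masking rule, and the `min_seen` threshold are all assumptions.

```python
import re
from collections import Counter


class TemplateOutlierDetector:
    """Flags log lines whose masked template has rarely been seen."""

    def __init__(self, min_seen=2):
        self.min_seen = min_seen
        self.counts = Counter()

    @staticmethod
    def template(line):
        # Mask numbers (ids, durations, ports) so similar lines share a template.
        return re.sub(r"\d+", "<n>", line)

    def learn(self, lines):
        """Train on historical logs."""
        for line in lines:
            self.counts[self.template(line)] += 1

    def observe(self, line):
        """Score one streaming line, then update the counts with it."""
        key = self.template(line)
        is_outlier = self.counts[key] < self.min_seen
        self.counts[key] += 1
        return is_outlier


detector = TemplateOutlierDetector()
detector.learn([
    "GET /users/1 200 in 12ms",
    "GET /users/2 200 in 9ms",
    "GET /users/3 200 in 15ms",
])
```

Because `observe` also updates the counts, a new kind of line stops being reported once it has become common, which is one way for the detector to adjust itself over time.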
TrafficControl
GSOC: Varnish Cache support in Apache Traffic Control
Background
Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.
Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.
There are multiple aspects to this project:
- Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
- Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
- Testing: Adding automated tests for new code
Skills:
- Proficiency in Go is required
- A basic knowledge of HTTP and caching is preferred, but not required for this project.
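The configuration generation aspect amounts to rendering VCL text from data held by the control plane. A minimal sketch, assuming a hypothetical mapping from backend names to host and port, is shown below; it only emits the version line and backend definitions. The project itself calls for Go, so Python is used here purely to illustrate the shape of the output.

```python
def render_backend(name, host, port):
    """Render one VCL backend definition."""
    return (
        f"backend {name} {{\n"
        f'    .host = "{host}";\n'
        f'    .port = "{port}";\n'
        "}\n"
    )


def render_vcl(backends):
    """Render a VCL file with one backend per entry, in name order."""
    parts = ["vcl 4.1;\n"]
    for name, (host, port) in sorted(backends.items()):
        parts.append(render_backend(name, host, port))
    return "\n".join(parts)


# Hypothetical origin server taken from the control plane's data.
vcl_text = render_vcl({"origin": ("origin.example.com", 80)})
```

Sorting the backends keeps the generated file stable between runs, so a configuration is only reloaded when its content actually changes.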
Commons Numbers
GSoC 2022
Placeholder for tasks that could be undertaken in this year's GSoC.
Ideas:
- Update the support for complex numbers in the complex package to allow operations to be performed on lists of complex numbers. This requires abstracting the representation of multiple complex numbers into a list structure storing real and imaginary parts that can be efficiently iterated to apply all the operations supported by the Complex class. Operations should modify the numbers in place allowing efficient, zero allocation complex number math to be performed on large datasets.
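The layout described in this idea can be sketched as two parallel arrays of doubles, one for real parts and one for imaginary parts, with operations that overwrite the stored values. The class and method names below are hypothetical, and Python is used only to show the data structure; the actual work would target the Java `Complex` API.

```python
from array import array


class ComplexList:
    """Stores complex numbers as separate real and imaginary arrays."""

    def __init__(self, values=()):
        self.re = array("d")
        self.im = array("d")
        for z in values:
            self.add(z)

    def add(self, z):
        z = complex(z)
        self.re.append(z.real)
        self.im.append(z.imag)

    def __len__(self):
        return len(self.re)

    def get(self, i):
        return complex(self.re[i], self.im[i])

    def conj_in_place(self):
        # Only the imaginary array is touched.
        for i in range(len(self.im)):
            self.im[i] = -self.im[i]

    def multiply_in_place(self, factor):
        # (a + bi)(c + di) = (ac - bd) + (ad + bc)i, written back in place.
        c, d = factor.real, factor.imag
        for i in range(len(self.re)):
            a, b = self.re[i], self.im[i]
            self.re[i] = a * c - b * d
            self.im[i] = a * d + b * c


numbers = ComplexList([1 + 2j, 3 - 1j])
numbers.multiply_in_place(2j)
```

No per-element object is created by the in-place operations, which is the property the idea asks for on large datasets.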
Commons Math
GSoC 2022
Placeholder for tasks that could be undertaken in this year's GSoC.
Ideas (extracted from the "dev" ML):
- Redesign and modularize the "ml" package
-> main goal: enable multi-thread usage.
- Abstract the linear algebra utilities
-> main goal: allow switching to alternative implementations.
- Redesign and modularize the "random" package
-> main goal: general support of low-discrepancy sequences.
- Refactor and modularize the "special" package
-> main goals: ensure accuracy and performance and better API,
add other functions.
- Upgrade the test suite to JUnit 5
-> additional goal: collect a list of "odd" expectations.
Other suggestions welcome, as well as
- delineating additional and/or intermediate goals,
- signalling potential pitfalls and/or alternative approaches to the intended goal(s).
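For readers unfamiliar with the low-discrepancy sequences mentioned in the "random" package idea, the sketch below computes the van der Corput sequence and, from it, Halton points. It is a textbook construction shown with exact fractions, not code from Commons Math, and the function names are illustrative.

```python
from fractions import Fraction


def van_der_corput(index, base=2):
    """Mirror the base-b digits of index around the radix point."""
    value = Fraction(0)
    denominator = 1
    while index > 0:
        index, digit = divmod(index, base)
        denominator *= base
        value += Fraction(digit, denominator)
    return value


def halton(index, bases=(2, 3)):
    """One Halton point: a van der Corput value per coprime base."""
    return tuple(van_der_corput(index, base) for base in bases)


first_values = [van_der_corput(i) for i in range(1, 5)]
```

Successive values fill the unit interval evenly (1/2, 1/4, 3/4, 1/8, ...), which is the behavior a general low-discrepancy API would expose for several sequence families.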
Beam
A generic Beam IO Sink for Java
It would be desirable to develop a Beam Sink that supports all of the 'best practices' like throttling, auto-sharding, exactly-once capabilities, etc.
A design proposal is here: https://docs.google.com/document/d/1UIWv6wnD86GYAkeqbVWCG3mx4dTZ9WstUUThPWQmcFM/edit#heading=h.smc16ifdre2
A prototype for the API and parts of implementation is here: https://github.com/apache/beam/pull/16763
Contact Pablo Estrada on dev@beam.apache.org if you have questions, or comment here.
CloudStack
CloudStack Terraform Provider - Add support for Kubernetes Clusters
As of now, the CloudStack Terraform Provider does not support managing CKS clusters.
This proposal aims to add support to the CloudStack Terraform Provider for managing CKS clusters.
This would involve supporting the following actions on CKS clusters :
- Create
- Stop / Start
- Scale
- Upgrade
- Delete
[Optional]
Support the following actions on the binary ISOs :
- Register
- Enable / Disable
- Delete
Duration
- 175 hours
Potential Mentors
- Harikrishna Patnala
- David Jumani
References
View Logs in the UI
As of now, when an admin encounters an issue or error in CloudStack, the most information they can immediately get is the API failure response, which provides a reason for the failure. At times this might not be sufficient to diagnose the error and would require the admin to investigate the CloudStack logs. This would require the admin or the sysadmin to log into the VM running CloudStack and either view or export the logs, and then dive into identifying the issue. This idea aims to eliminate that step.
The goal of this is to provide admins the ability to view the logs directly in the UI. This would make diagnosing failures and RCAs much quicker.
Provide the ability to display the logs in the UI
Add an API / WebSocket (and UI) support to :
- View the logs
- Live follow the logs (similar to 'tail -f')
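The "live follow" behavior can be built on a polling primitive that returns only the lines appended since the last call. The sketch below shows such a primitive under the assumption that the log is a plain text file; the function name and the file name are illustrative, and the API or WebSocket layer that would call it is not shown.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return complete lines written after offset, plus the new offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Consume only complete lines; a partial last line waits for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


with tempfile.TemporaryDirectory() as directory:
    log_path = os.path.join(directory, "management-server.log")
    with open(log_path, "wb") as log_file:
        log_file.write(b"line 1\nline 2\npart")
    first, position = read_new_lines(log_path, 0)
    with open(log_path, "ab") as log_file:
        log_file.write(b"ial\n")
    second, position = read_new_lines(log_path, position)
```

A server could call this on a timer per connected client and push the returned lines, keeping one offset per client. Log rotation would need extra handling that this sketch omits.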
Duration
- 175 hours
Potential Mentors
- David Jumani
References
Add the ability to Safely Shutdown / restart CloudStack
Shutting down / restarting CloudStack is a necessary step in upgrades, system maintenance, etc. As of now, there is no way to safely shut down or restart CloudStack. It is directly terminated via systemd. Because of this, any asynchronous job or background task is abruptly terminated and can fail. As of now, CloudStack maintains a list of asynchronous jobs within its database along with their status.
This idea aims to provide a way to safely shut down CloudStack. It involves two parts :
- Prevent new asynchronous jobs from being added to CloudStack when a safe shutdown is triggered
- Check the status of the async jobs and shut down CloudStack when all the jobs have been completed
Provide the ability to safely shut down CloudStack
Add API (and/or UI) support to :
- Trigger a safe shutdown
- (Optional) Support restarts
- (Optional) Support a forced shutdown where CloudStack quits even if there are async jobs running
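The two parts of the safe shutdown, together with the optional forced variant, can be sketched as a small state machine. The class below is a hypothetical in-memory stand-in for the job list that CloudStack keeps in its database; none of its names come from the CloudStack code base.

```python
class AsyncJobManager:
    """Tracks running jobs and refuses new ones once shutdown is requested."""

    def __init__(self):
        self.shutdown_requested = False
        self.running = set()
        self._next_id = 1

    def submit(self, description):
        # Part 1: no new asynchronous jobs after a safe shutdown is triggered.
        if self.shutdown_requested:
            raise RuntimeError("shutdown in progress: new jobs are rejected")
        job_id = self._next_id
        self._next_id += 1
        self.running.add(job_id)
        return job_id

    def complete(self, job_id):
        self.running.discard(job_id)

    def trigger_shutdown(self):
        self.shutdown_requested = True

    def can_shut_down(self, force=False):
        # Part 2: stop only when every job has finished, unless forced.
        return self.shutdown_requested and (force or not self.running)
```

A caller would trigger the shutdown, then poll `can_shut_down()` until it returns true before stopping the process. In a multi-node setup the check would have to cover jobs owned by every management server, which this sketch does not model.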
Duration
- Some Experience : 175 hours
- Newbie : 350 hours
Potential Mentors
- David Jumani
References
Cassandra
Produce and verify BoundedReadCompactionStrategy as a unified general purpose compaction algorithm
The existing compaction strategies have a number of drawbacks that make all three unsuitable as a general-use compaction strategy. For example, STCS creates giant files that are hard to back up, that hurt read performance and the page cache, and that led to many of the early re-open bugs. LCS improved dramatically on this but also has various issues, e.g. the lack of a performant full compaction, or problems caused by strict leveling, such as during bulk loading when writes exceed the rate at which the L0 - L1 promotion can be done.
In this talk I proposed a novel compaction strategy that aims to expose a single tunable that the user can control for the read amplification. Raise the min_threshold_levels and you trade off read/space performance for write performance. Since then a proof-of-concept patch has been published along with some rudimentary documentation, but this is still not tracked in Jira.
The remaining work here is:
1. Validate that the algorithm is correct via test suites, stress testing, and benchmarking with OSS tools (e.g. cassandra-stress, tlp-stress, or ndbench). When issues are found (there likely will be issues as the patch is a PoC), devise how to adjust the algorithm and implementation appropriately. The key metric of success is that we can run Cassandra stably for more than 24 hours while applying sustained load, with minimal compaction load (and with compaction keeping up).
2. Do more in-depth experiments measuring performance across a wide range of workloads (e.g. write heavy, read heavy, balanced, time series, register update, etc ...) and in comparison with LCS (leveled), STCS (size tiered), and TWCS (time window). The key metrics of success are establishing that as we tune max_read_per_read we get more predictable read latency under low system load (ρ < 30%) while not degrading at high system load (ρ > 70%), and that we match LCS performance under low load while doing better at high load.
Once this is validated, a Cassandra blog post reporting on the findings (positive or negative) may be advisable.
Add support for EXPLAIN statements
We should provide users a way to understand how their query will be executed and some information on the amount of work that will be performed.
Explain statements are the most common way to do that.
A CEP draft has been opened for that: (DRAFT) CEP-4: Explain. This draft proposes to add support for EXPLAIN and EXPLAIN ANALYZE, but I believe that we should split the work into 2 parts because a simple EXPLAIN would already provide relevant information.
To complete this work I believe that the following steps will be required:
- Rework and submit the CEP
- Add missing statistics
- Implement the logic behind the EXPLAIN statements
Apache Nemo
Efficient Dynamic Reconfiguration in Stream Processing
In stream processing, there are many methods for this, ranging from the primitive checkpoint-and-replay to more sophisticated reconfiguration and reinitiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.
Application structure-aware caching on Nemo
Nemo has a policy layer that allows powerful optimization with configurable runtime modules. In terms of caching, it is possible to identify frequently used data and decide to cache them in-memory ahead of execution, without user annotation.
Implementation would need:
- On the policy layer, build a compile-time pass that identifies reused data and marks it as cached
- At runtime, design and implement a caching mechanism that manages per-executor cached data and discards it when it is no longer used.
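The compile-time part of this idea can be illustrated on a plain list of edges: any vertex whose output feeds more than one consumer produces reused data and is a candidate for caching. The sketch below uses a hypothetical edge list and function name and does not rely on Nemo's actual IR or pass API.

```python
from collections import defaultdict


def find_reused_outputs(edges):
    """Return the producers whose output is consumed by several vertices."""
    consumers = defaultdict(set)
    for producer, consumer in edges:
        consumers[producer].add(consumer)
    return {producer for producer, users in consumers.items() if len(users) > 1}


# Hypothetical job: the parsed data feeds both training and evaluation.
job_edges = [
    ("read", "parse"),
    ("parse", "train"),
    ("parse", "evaluate"),
    ("train", "report"),
]
to_cache = find_reused_outputs(job_edges)
```

The same consumer counts can drive the runtime side: once every consumer of a cached output has run, the data can be discarded.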
Implement spill mechanism on Nemo
Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) or GC when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in-memory.
We need to spill in-memory data to secondary storage when there is not enough memory in an executor.
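A minimal sketch of the spilling behavior follows. It keeps a bounded number of blocks in memory and writes the oldest block to a file when the bound is exceeded. The entry-count limit stands in for a real memory measurement, and the class name, file naming, and use of `pickle` are assumptions for illustration only.

```python
import os
import pickle
import tempfile


class SpillableStore:
    """Keeps at most max_in_memory blocks in memory and spills the rest."""

    def __init__(self, max_in_memory, spill_dir):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir
        self.memory = {}   # insertion-ordered, oldest block first
        self.spilled = {}  # key -> path of the spilled file
        self._file_counter = 0

    def put(self, key, value):
        self._drop_spilled(key)
        self.memory[key] = value
        while len(self.memory) > self.max_in_memory:
            self._spill(next(iter(self.memory)))

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        with open(self.spilled[key], "rb") as handle:
            return pickle.load(handle)

    def _spill(self, key):
        path = os.path.join(self.spill_dir, f"block-{self._file_counter}.bin")
        self._file_counter += 1
        with open(path, "wb") as handle:
            pickle.dump(self.memory.pop(key), handle)
        self.spilled[key] = path

    def _drop_spilled(self, key):
        # A block that is written again replaces its stale spilled copy.
        path = self.spilled.pop(key, None)
        if path is not None:
            os.remove(path)
```

Readers of a spilled block pay a disk read instead of failing with OOM. Choosing which block to evict (oldest, largest, least reused) is a policy decision the real design would need to evaluate.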
Efficient Caching and Spilling on Nemo
In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs them.
- Identify and persist frequently used data, and unpersist it when it is no longer used
- Spill in-memory data to disk upon memory pressure
Apache Commons Statistics
GSoC 2022
Placeholder for tasks that could be undertaken in this year's GSoC.
Ideas:
- Design an updated summary statistics API for use with Java 8 streams based on the summary statistic implementations in the Commons Math stat.descriptive package including moments, rank and summary sub-packages.
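A statistic that works with parallel streams needs two operations: accepting one value and combining two partial results. The sketch below shows both for the mean and variance, using the standard incremental update and pairwise merge formulas. The method names echo `accept` and `combine` from Java's `DoubleSummaryStatistics`; the class itself is hypothetical and is not a proposed API.

```python
import math


class Moments:
    """Running mean and variance that can be merged across partitions."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def accept(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def combine(self, other):
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        return self

    def variance(self):
        """Sample variance, or NaN when fewer than two values were seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else math.nan


def summarize(values):
    moments = Moments()
    for value in values:
        moments.accept(value)
    return moments


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
merged = summarize(data[:3]).combine(summarize(data[3:]))
```

Merging the two halves gives the same mean and variance as a single pass over all values, up to floating-point rounding, which is what lets a stream collector split the work.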
Apache Commons Geometry
GSoC 2022
Placeholder for tasks that could be undertaken in this year's GSoC.
Ideas:
- Examine and potentially redesign the API and algorithms in the commons-geometry-enclosing module. The goal here is to make consistent use of the newer geometry API and ensure that the algorithms are sound.
- Examine and potentially redesign the API and algorithms in the commons-geometry-hull module. The goal here is to make consistent use of the newer geometry API and ensure that the algorithms are sound (see GEOMETRY-144).
- Design and implement a parser/writer for the PLY file format in the commons-geometry-io-euclidean module.
- Design an API for advanced 3D mesh data structures (e.g. halfedge meshes) and operations (e.g. surface subdivision, smoothing, etc). This may end up being another module, e.g. commons-geometry-mesh.
- Create a series of user guides and/or tutorials demonstrating best-practice use of the library.
- other ideas ... ?
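For the PLY idea in the list above, the writer half can be sketched for the ASCII variant of the format: a header declaring the vertex and face elements, followed by one line per vertex and one line per face. The function name and the in-memory mesh representation are illustrative and unrelated to the existing commons-geometry-io API; binary PLY and extra properties are left out.

```python
def write_ply_ascii(vertices, faces):
    """Serialize a mesh as ASCII PLY text."""
    lines = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(vertices)}",
        "property float x",
        "property float y",
        "property float z",
        f"element face {len(faces)}",
        "property list uchar int vertex_indices",
        "end_header",
    ]
    for x, y, z in vertices:
        lines.append(f"{x} {y} {z}")
    for face in faces:
        # Each face line starts with its vertex count, then the indices.
        lines.append(" ".join(str(value) for value in (len(face), *face)))
    return "\n".join(lines) + "\n"


triangle = write_ply_ascii(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0, 1, 2)],
)
```

A parser would do the reverse: read the header to learn the element counts and properties, then consume that many lines per element.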