...

CloudStack Terraform Provider - Add support for Kubernetes Clusters

As of now, the CloudStack Terraform Provider does not support managing CKS clusters.

This proposal aims to add support to the CloudStack Terraform Provider for managing CKS clusters.

This would involve supporting the following actions on CKS clusters:

  • Create
  • Stop / Start
  • Scale
  • Upgrade
  • Delete

[Optional]
Support the following actions on the binary ISOs:

  • Register
  • Enable / Disable
  • Delete


Duration

  • 175 hours


Potential Mentors

  • Harikrishna Patnala
  • David Jumani

References


Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
David Jumani, mail: davidjumani (at) apache.org
Project Devs, mail: dev (at) cloudstack.apache.org


GSoC 2022 Idea Instant Instance Deploy (using VM Definitions)

Background

Currently, deploying Instances/Virtual Machines (VMs) in CloudStack requires specifying offerings, a template, and other settings through the API (see: https://cloudstack.apache.org/api/apidocs-4.16/apis/deployVirtualMachine.html) or the 'Instance Deployment Wizard' in the UI.

Requirement

Provide users/operators a way to quickly deploy an instance using a VM definition/profile. The VM definition/profile would hold the details of the template, offerings (including any custom values - size, iops), ssh keypair, instance group, affinity group and other settings (boot type, dynamic scaling, userdata, keyboard language, etc.) that are required, so that the definition/profile id alone can be used to launch an instance. At a minimum, the definition should hold all the mandatory details for deploying an instance. With this, only the VM definitions/profiles (and other important options, with the associated billing details) need to be exposed to users for VM deployment, instead of the offerings and other VM options.

Need to add new APIs (and/or UI) support for the VM definition/profile CRUD operations, and support for definition in the deployVirtualMachine API.
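A minimal sketch of how a stored definition could be merged with per-deployment overrides into the parameter map a deployVirtualMachine-style call would receive. The class and field names here are hypothetical illustrations, not actual CloudStack API parameters:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical shape of a VM definition/profile. Field names are illustrative
# only; they are not actual CloudStack API parameters.
@dataclass
class VMDefinition:
    name: str
    template_id: str           # mandatory: template to deploy from
    service_offering_id: str   # mandatory: compute offering
    zone_id: str               # mandatory: target zone
    extras: dict = field(default_factory=dict)  # ssh keypair, userdata, etc.

def build_deploy_params(definition: VMDefinition,
                        overrides: Optional[dict] = None) -> dict:
    """Merge a stored definition with per-deployment overrides into the
    parameter map a deployVirtualMachine-style call would receive."""
    params = {
        "templateid": definition.template_id,
        "serviceofferingid": definition.service_offering_id,
        "zoneid": definition.zone_id,
        **definition.extras,
    }
    params.update(overrides or {})
    return params

web = VMDefinition("web-small", "tmpl-1", "offer-1", "zone-1",
                   extras={"keypair": "ops-key"})
params = build_deploy_params(web, overrides={"name": "web-01"})
```

The point of the sketch is the merge order: the definition supplies the mandatory defaults, and the deploy call only supplies instance-specific overrides.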

Relevant Skills

  • Java, MySQL
  • Vue.js (for UI)
  • Some knowledge of Virtualization and CloudStack

Difficulty

Medium

Potential Mentors

  • Suresh Kumar Anaparti
  • David Jumani

Project Scope/Duration

Medium / 175 hours

References

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Suresh Kumar Anaparti, mail: sureshkumar.anaparti (at) apache.org
Project Devs, mail: dev (at) cloudstack.apache.org

GSoC 2022 Idea Report / Manage the VM jobs in CloudStack

Background

CloudStack allows users/operators to perform various operations on Virtual Machines (VMs). When multiple operations are performed on a VM at the same time, these operations are maintained and synced using the sync queues. Any long-running job (e.g. a volume snapshot) on a VM keeps other jobs in a waiting/pending state; they are only picked up once the active job finishes. Currently, it is not possible for an operator to list the pending jobs on a VM, or to cancel or re-prioritise a job if needed.

Requirement

Provide admins/operators a way to list the pending jobs of a VM, and to cancel or re-prioritise a job if needed. Also, allow clearing all the pending jobs of a VM.

Add API (and/or UI) support to

  • List the active jobs for a VM
  • List all the pending jobs of a VM (in queue, by their order of execution)
  • Re-prioritise a job from the pending jobs (if possible)
  • Cancel any job from the pending jobs
  • Clear all the pending jobs of a VM
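The listed operations can be sketched as a small priority queue per VM. This is a toy model of the semantics only — CloudStack's real async-job framework is DB-backed, and the class and method names here are made up:

```python
import heapq
import itertools

# Toy per-VM job queue illustrating list / re-prioritise / cancel / clear.
# Uses the standard heapq lazy-deletion recipe: cancelled entries are marked
# with a sentinel and skipped when listing.
class VMJobQueue:
    REMOVED = object()

    def __init__(self):
        self._heap = []                 # entries: [priority, seq, job_id]
        self._entries = {}              # job_id -> live heap entry
        self._seq = itertools.count()   # FIFO tie-breaker within a priority

    def submit(self, job_id, priority=10):
        if job_id in self._entries:
            self.cancel(job_id)
        entry = [priority, next(self._seq), job_id]
        self._entries[job_id] = entry
        heapq.heappush(self._heap, entry)

    def cancel(self, job_id):
        entry = self._entries.pop(job_id, None)
        if entry is not None:
            entry[2] = VMJobQueue.REMOVED   # lazy deletion

    def reprioritise(self, job_id, priority):
        self.cancel(job_id)
        self.submit(job_id, priority)

    def clear(self):
        for job_id in list(self._entries):
            self.cancel(job_id)

    def pending(self):
        """Pending jobs in their order of execution."""
        return [e[2] for e in sorted(self._heap)
                if e[2] is not VMJobQueue.REMOVED]

q = VMJobQueue()
q.submit("snapshot")
q.submit("stop")
q.submit("attach-volume")
q.reprioritise("stop", priority=1)   # run "stop" before the others
q.cancel("attach-volume")
```

After these calls, `q.pending()` lists "stop" first, then "snapshot"; the real implementation would persist this state and expose it through the new APIs.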

Relevant Skills

  • Java, MySQL
  • Vue.js (for UI)
  • Some knowledge of CloudStack and its Job framework

Difficulty

Medium

Potential Mentors

  • Suresh Kumar Anaparti
  • Any Developer from CS Community

Project Scope/Duration

Large / 350 hours (can be Medium / 175 hours - with reduced scope of API/UI work)

References

Future Extensions

This can be extended for other resources (hosts, primary storage, network, etc).
[APIs should take resource type as a param for generic implementation]

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Suresh Kumar Anaparti, mail: sureshkumar.anaparti (at) apache.org
Project Devs, mail: dev (at) cloudstack.apache.org

Cassandra

Produce and verify BoundedReadCompactionStrategy as a unified general purpose compaction algorithm

The existing compaction strategies have a number of drawbacks that make all three unsuitable as a general-purpose compaction strategy. For example, STCS creates giant files that are hard to back up, hurt read performance and the page cache, and led to many of the early re-open bugs. LCS improved dramatically on this but also has various issues, e.g. the lack of a performant full compaction, or, due to the strict leveling, bulk loading when writes exceed the rate at which we can do the L0 - L1 promotion.

In this talk I proposed a novel compaction strategy that aims to expose a single tunable that the user can control for the read amplification. Raise min_threshold_levels and you trade off read/space performance for write performance. Since then, a proof-of-concept patch has been published along with some rudimentary documentation, but this is still not tracked in Jira.

The remaining work here is

1. Validate the algorithm is correct via test suites, performance testing, stress testing, and benchmarking with OSS tools (e.g. cassandra-stress, tlp-stress, or ndbench). When issues are found (there likely will be, as the patch is a PoC), devise how to adjust the algorithm and implementation appropriately. The key metric of success is that we can run Cassandra stably for more than 24 hours under sustained load, with minimal compaction load (and compaction keeping up).

2. Do more in-depth experiments measuring performance across a wide range of workloads (e.g. write-heavy, read-heavy, balanced, time series, register update, etc.) in comparison with LCS (leveled), STCS (size-tiered), and TWCS (time-window). The key metrics of success are establishing that, as we tune max_read_per_read, we get more predictable read latency under low system load (ρ < 30%) while not degrading at high system load (ρ > 70%), and that we match LCS performance under low load while doing better at high load.
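The read/write trade-off behind the single tunable can be illustrated with a toy model: treat the number of sorted runs as the read amplification, and compact the smallest runs together whenever it exceeds the bound. This is purely illustrative, not the actual Cassandra implementation:

```python
# Toy model of the idea behind BoundedReadCompactionStrategy: the number of
# sorted runs is the worst-case reads-per-read, and compaction merges the two
# smallest runs whenever that count exceeds a single tunable bound.

def flush(runs, memtable_size):
    """A memtable flush adds one new sorted run of the given size."""
    runs.append(memtable_size)

def maybe_compact(runs, max_read_per_read):
    """Merge the two smallest runs until read amplification is bounded.
    Returns total bytes rewritten, a proxy for write amplification."""
    written = 0
    while len(runs) > max_read_per_read:
        runs.sort()
        merged = runs.pop(0) + runs.pop(0)
        runs.append(merged)
        written += merged
    return written

runs = []
total_written = 0
for _ in range(8):              # eight flushes of 10 "units" each
    flush(runs, 10)
    total_written += maybe_compact(runs, max_read_per_read=4)
```

Raising `max_read_per_read` lets runs accumulate (cheaper writes, more files touched per read); lowering it forces more merging (more rewritten bytes, fewer files per read) — the single-knob trade-off the proposal describes.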

Once this is validated, a Cassandra blog post reporting on the findings (positive or negative) may be advisable.


Difficulty: Normal
Project size: ~350 hour (large)
Potential mentors:
, mail: (at) apache.org
Project Devs, mail: dev (at) cassandra.apache.org

...

A Complex Event Processing (CEP) library/extension for Apache Beam

Apache Beam [1] is a unified and portable programming model for data processing jobs. The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

Complex Event Processing [5] lets you match patterns of events in streams to detect important patterns in data and react to them.

Some example uses of CEP are fraud detection by spotting unusual behavior (patterns of activity), e.g. network intrusion or suspicious banking transactions. Trend detection is another interesting use case in the context of sensors and IoT.

The goal of this issue is to implement an efficient pattern matching library inspired by [6] and existing libraries like Apache Flink CEP [7] using the Apache Beam Java SDK and the Beam style guides [8]. Because of the time constraints of GSoC we will probably try to cover first simple patterns of the ‘a followed by b followed by c’ kind, and then if there is still time try to cover more advanced ones e.g. optional, atLeastOne, oneOrMore, etc.
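The simple 'a followed by b followed by c' pattern class can be sketched as a plain stateful scan over an event stream. This is a standalone toy, not Beam code — in the actual project this state machine would live inside a keyed, stateful PTransform:

```python
# Minimal non-contiguous sequence matcher: advance one step each time the
# next pattern element appears, and report the indices where the full
# sequence completed (earliest-match semantics, no overlapping matches).
def match_sequence(pattern, events):
    matches, step, current = [], 0, []
    for i, event in enumerate(events):
        if event == pattern[step]:
            current.append(i)
            step += 1
            if step == len(pattern):      # full 'a then b then c' seen
                matches.append(tuple(current))
                step, current = 0, []
    return matches

stream = ["a", "x", "b", "a", "c", "b", "c"]
hits = match_sequence(["a", "b", "c"], stream)
```

The more advanced constructs mentioned above (optional, atLeastOne, oneOrMore) generalize this single-step advance into a small NFA, which is what libraries like Flink CEP do.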

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
[5] https://en.wikipedia.org/wiki/Complex_event_processing
[6] https://people.cs.umass.edu/~yanlei/publications/sase-sigmod08.pdf
[7] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/cep.html
[8] https://beam.apache.org/contribute/ptransform-style-guide/


Difficulty: P3
Project size: ~350 hour (large)
Potential mentors:
Ismaël Mejía, mail: iemejia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

A Beam runner for Ray

Ray (https://ray.io) is a framework to develop distributed applications. There is a push to develop several libraries to support various forms of AI/ML analytics with Ray. There is an opportunity to develop a Beam runner for Ray.


https://docs.google.com/document/u/1/d/1vt78s48Q0aBhaUCHrVrTUsProJSP8-EBqDDRGTPEr0Y/edit

Difficulty: P2
Project size: ~350 hour (large)
Potential mentors:
Pablo Estrada, mail: pabloem (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Run code in examples in Beam's Pydoc

We have the Beam Pydoc set up, and some functions have examples written into their documentation; however, we do not run the examples that we express in the Pydoc.

This work item consists of improving the Pydoc for Apache Beam to run its examples, adding some examples, and reformatting any existing examples / existing Pydoc that needs to be better expressed for Beam.
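One standard way to run the examples expressed in documentation is to write them as doctests, which Python can find and execute against the shown output. The `format_window` function below is a made-up example to demonstrate the mechanics, not a Beam API:

```python
import doctest

def format_window(start, end):
    """Format an interval window for display.

    >>> format_window(0, 60)
    '[0, 60)'
    """
    return f"[{start}, {end})"

# Find the examples embedded in the docstring and execute them, checking
# each result against the expected output written in the docstring.
tests = doctest.DocTestFinder().find(
    format_window, "format_window", module=False,
    globs={"format_window": format_window})
runner = doctest.DocTestRunner(verbose=False)
for t in tests:
    runner.run(t)
```

Wiring this into the docs build (e.g. via Sphinx's doctest machinery) would make stale examples fail CI instead of silently rotting.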

Difficulty: P2
Project size: ~175 hour (medium)
Potential mentors:
Pablo Estrada, mail: pabloem (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

CLONE - A generic Beam IO Sink for Java

It would be desirable to develop a Beam Sink that supports all of the 'best practices' like throttling, auto-sharding, exactly-once capabilities, etc.

A design proposal is here: https://docs.google.com/document/d/1UIWv6wnD86GYAkeqbVWCG3mx4dTZ9WstUUThPWQmcFM/edit#heading=h.smc16ifdre2

A prototype for the API and parts of implementation is here: https://github.com/apache/beam/pull/16763

Contact Pablo Estrada on dev@beam.apache.org if you have questions, or comment here.
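One of the 'best practices' named above, client-side throttling, can be sketched as a token bucket that a generic sink would consult before each write. This is illustrative only; the linked design doc defines the real API surface:

```python
import time

# Token-bucket throttler: a burst of up to `capacity` writes is allowed
# immediately, after which writes are admitted at `rate` per second.
class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        """Take n tokens if available; otherwise the caller should back off."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=100, capacity=10)
accepted = sum(bucket.try_acquire() for _ in range(50))  # a burst of 50 writes
```

Auto-sharding and exactly-once delivery need runner cooperation and are correspondingly harder; the value of a generic sink is packaging all of these behind one API.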


Difficulty: P2
Project size: ~350 hour (large)
Potential mentors:
Pablo Estrada, mail: pabloem (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Apache Nemo

Efficient Dynamic Reconfiguration in Stream Processing

In stream processing, we have many methods, starting from primitive checkpoint-and-replay up to fancier forms of reconfiguration and re-initiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Wonook, mail: wonook (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

Application structure-aware caching on Nemo

Nemo has a policy layer that allows powerful optimization with configurable runtime modules. In terms of caching, it is possible to identify frequently used data and decide to cache them in-memory ahead of execution, without user annotation.

Implementation would need:

  • On the policy layer, build a compile-time pass that identifies reused data and marks it as cached
  • On the runtime, design and implement a caching mechanism that manages per-executor cached data and discards it when it is no longer used.
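The two pieces above can be sketched together: a compile-time pass that marks data consumed by more than one downstream vertex as cached, and a per-executor store that discards a cached block after its last consumer has read it. A toy model, not Nemo's actual policy/runtime API:

```python
# Compile-time pass: cache any vertex whose output has multiple consumers.
def mark_cached(dag):
    """dag: {vertex: [downstream vertices]}. Returns the vertices to cache."""
    return {v for v, consumers in dag.items() if len(consumers) > 1}

# Runtime side: reference-counted per-executor cache that discards a block
# once every consumer identified at compile time has read it.
class CacheStore:
    def __init__(self, dag):
        self.remaining = {v: len(dag[v]) for v in mark_cached(dag)}
        self.blocks = {}

    def put(self, vertex, data):
        if vertex in self.remaining:      # only cache marked vertices
            self.blocks[vertex] = data

    def read(self, vertex):
        data = self.blocks.get(vertex)
        if vertex in self.remaining:
            self.remaining[vertex] -= 1
            if self.remaining[vertex] == 0:   # last consumer: discard
                del self.blocks[vertex]
                del self.remaining[vertex]
        return data

dag = {"load": ["filterA", "filterB"],
       "filterA": ["sink"], "filterB": ["sink"], "sink": []}
store = CacheStore(dag)
store.put("load", [1, 2, 3])
first, second = store.read("load"), store.read("load")
```

Because the consumer counts come from the DAG, no user annotation is needed — which is exactly the property the section describes.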

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Jeongyoon Eo, mail: jeongyoon (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

...

Efficient Caching and Spilling on Nemo

In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs them.

  • Identify and persist frequently used data, and unpersist it when its usage has ended
  • Spill in-memory data to disk upon memory pressure

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Jeongyoon Eo, mail: jeongyoon (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

Implement spill mechanism on Nemo

Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) errors or GC pressure when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in-memory.

We need to spill in-memory data to secondary storage when there is not enough memory in the executor.
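The core mechanism can be sketched as a store with a byte budget that spills its largest block to disk when the budget is exceeded and transparently reloads it on access. A toy model under those assumptions, not Nemo's runtime API:

```python
import os
import pickle
import tempfile

# Toy spillable store: keep blocks in memory up to `budget_bytes`, spill the
# largest block to a temp file when over budget, reload from disk on access.
class SpillableStore:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.mem = {}      # key -> in-memory object
        self.disk = {}     # key -> temp file path of spilled object

    def _size(self):
        return sum(len(pickle.dumps(v)) for v in self.mem.values())

    def put(self, key, value):
        self.mem[key] = value
        while self._size() > self.budget and len(self.mem) > 1:
            # Evict the largest block (its serialized size) to disk.
            victim = max(self.mem, key=lambda k: len(pickle.dumps(self.mem[k])))
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, "wb") as f:
                pickle.dump(self.mem.pop(victim), f)
            self.disk[victim] = path

    def get(self, key):
        if key in self.mem:
            return self.mem[key]
        with open(self.disk[key], "rb") as f:
            return pickle.load(f)

store = SpillableStore(budget_bytes=4096)
store.put("small", list(range(10)))
store.put("big", list(range(2000)))   # exceeds the budget, spills "big"
big = store.get("big")
```

A real implementation would spill in serialized partitions and account memory via the executor's memory manager rather than by re-pickling, but the control flow is the same.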

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Jeongyoon Eo, mail: jeongyoon (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

Enhance Nemo to support autoscaling for bursty loads

The load of streaming jobs usually fluctuates according to the input rate or operations (e.g., windowing). Supporting automatic scaling could reduce the operational cost of running streaming applications, while minimizing the performance degradation that can be caused by bursty loads.


We can harness cloud resources such as VMs and serverless frameworks to acquire computing resources on demand. To realize automatic scaling, the following features should be implemented.


1) state migration: scaling jobs requires moving tasks (or partitioning a task into multiple ones). In this situation, the internal state of the task should be serialized/deserialized.

2) input/output rerouting: if a task is moved to a new worker, the input and output of the task should be redirected.

3) dynamic Executor or Task creation/deletion: Executors or Tasks can be dynamically created or deleted.

4) scaling policy: a scaling policy that decides when and how to scale out/in should be implemented.
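Feature 4 can be sketched as a threshold-based policy, assuming a known per-executor processing capacity. The thresholds and names here are made up for illustration:

```python
import math

# Threshold-based scaling policy: scale out when utilization exceeds `high`,
# scale in when it drops below `low`, and otherwise keep the current size.
def decide_scale(num_executors, input_rate, per_executor_capacity,
                 high=0.8, low=0.3):
    """Return the target executor count for the observed input rate."""
    capacity = num_executors * per_executor_capacity
    utilization = input_rate / capacity
    if utilization > high or utilization < low:
        # Size the cluster so utilization lands just under the high watermark.
        target = math.ceil(input_rate / (high * per_executor_capacity))
        return max(1, target)
    return num_executors
```

For example, with 2 executors each handling 1000 events/s, a burst of 1800 events/s scales out to 3 executors, while a lull of 500 events/s on 4 executors scales in to 1. The gap between `low` and `high` provides hysteresis so small fluctuations don't trigger the costly migration steps (1)-(3).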

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Tae-Geon Um, mail: taegeonum (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

...