
Apache APISIX Introduce a Storage abstraction

Background:

Some plugins require storing data. For example, limit-count needs to keep track of originators of requests to limit how many requests the same client can send.

The plugin provides several data stores: local, Redis single node, and Redis cluster.


Now, other plugins that need to store data would also need to provide such configuration. Moreover, what if users want to store the data in MongoDB, Hazelcast, or in a plain SQL database?


Tasks:

  • Introduce a Storage abstraction, on the same level as Upstream
  • Create Storage concretions for local, Redis single node, and Redis cluster
  • Migrate the limit-count plugin to use this abstraction
  • If time allows, create a new plugin that uses this abstraction
  • If time allows, create a new Storage implementation
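The tasks above can be sketched as follows. This is only an illustration in Python (APISIX plugins themselves are written in Lua), and every name in it (`Storage`, `LocalStorage`, `incr`, `allow_request`) is a hypothetical choice, not an existing APISIX API. It shows a limit-count style check that depends only on an abstract interface, so a Redis-backed implementation could be swapped in without touching the plugin logic.

```python
import time
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical storage interface that plugins would share."""

    @abstractmethod
    def incr(self, key: str, ttl: float) -> int:
        """Increment the counter for `key` and return its new value.

        A new counting window of `ttl` seconds starts when the key is new
        or its previous window has elapsed.
        """


class LocalStorage(Storage):
    """In-process backend; a Redis backend would implement the same method."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._counters = {}  # key -> (count, window expiry time)

    def incr(self, key, ttl):
        now = self._clock()
        count, expires = self._counters.get(key, (0, now + ttl))
        if now >= expires:  # window elapsed: start a fresh one
            count, expires = 0, now + ttl
        self._counters[key] = (count + 1, expires)
        return count + 1


def allow_request(storage: Storage, client: str, limit: int, window: float) -> bool:
    """limit-count style check that only depends on the Storage interface."""
    return storage.incr(f"limit-count:{client}", window) <= limit
```

Passing the clock as a parameter is a design choice made here so the window logic can be exercised without waiting for real time to pass.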


Potential Mentor: Bozhong Yu, email: imbozhong@gmail.com, https://github.com/zaunist


Difficulty: Normal
Project size: ~350 hours.

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Bobur Umurzokov, mail: bumurzokov (at) apache.org
Project Devs, mail: dev (at) apisix.apache.org

Apache APISIX Profile Toolkit

Background:
At the moment, Apache APISIX lacks a practical tool for profiling CPU or memory usage; developers can only rely on benchmarking or printing logs to profile Apache APISIX.
 
Description:
Use eBPF to create a profiling tool for Apache APISIX that captures Lua call stack information and renders it as a CPU flame graph.
 
Task
1. Use eBPF to capture and parse the Lua call stack information in Apache APISIX, summarize it, and generate a CPU flame graph
2. Use eBPF to capture and parse C and Lua mixed call stack information at the same time, summarize it and generate a CPU flame graph
3. Support capturing Apache APISIX processes running in Docker
4. Support capturing Apache APISIX in OpenResty luajit32/luajit64 mode
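The "summarize" step in tasks 1 and 2 can be illustrated with a small sketch. It assumes the eBPF side has already produced sampled call stacks as lists of frame names (the frame names below are made up), and collapses them into the semicolon-separated "folded" lines that common flame graph generators take as input. The function name `fold_stacks` is hypothetical.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks into folded lines: 'root;...;leaf count'.

    `samples` is an iterable of stacks, each a list of frame names
    ordered from the root frame to the leaf frame.
    """
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples; real ones would come from the eBPF stack walker.
samples = [
    ["main", "http_access_phase", "limit_count"],
    ["main", "http_access_phase", "limit_count"],
    ["main", "http_log_phase"],
]
folded = fold_stacks(samples)
```

For mixed C and Lua stacks, the same aggregation applies once both kinds of frames have been merged into one root-to-leaf list per sample.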
 
Recommended Skills:
1. Familiar with Lua/C
2. Have some knowledge about eBPF and OpenResty
3. Familiar with profiling
 
Mentor
Hui Li(Tencent), PMC of Apache APISIX

Difficulty:

Major

Hard
Project size: ~350 hour (large)
Potential mentors:
Hui Li, mail: yousa (at) apache.org

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Bobur UmurzokovZeping Bai, mail: bzp2010 bumurzokov (at) apache.org
Project Devs, mail: dev (at) apisix.apache.org

Apache APISIX Java Plugin Runner Improvement

Background:

At the moment, the Java plugin runner requires developers to start from an existing template project and change it according to their needs.

Task:

Improve developer experience on the existing Java plugin runner so that we can attract and increase the number of users from the Java community.

Limitations:

  • The architecture doesn’t manage multiple plugins. All need to be set in the same project
  • The standard Java unit of deployment is the JAR.
  • The plugin doesn’t allow for other widespread JVM-based languages (e.g., Scala, Kotlin, Clojure, Groovy). Though it would be technically feasible, we would need to change the template’s language

Requirements:

The new plugin runner:

  • MUST use the JAR as the unit of deployment
  • MUST not require the usage of a project template
  • MAY require the plugin to follow a certain class hierarchy (i.e., extends JavaPlugin)
  • MAY use a more specific format to enforce a structure
  • MUST allow multiple plugins to be deployed
  • MUST use an isolated classloader for each plugin
  • MUST allow any JVM-compatible bytecode to run, whatever the language it was generated from
  • MAY allow hot reloading of Java plugins
  • MAY require a single JAR per plugin (to ease the classpath management of shared libraries)
  • MUST define a minimum JVM version

Difficulty: Normal
Project size: ~350 hours.

Refactoring Dashboard plugin orchestration

Apache APISIX is a dynamic, real-time, high-performance API gateway.

It provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, and more.

Page: https://apisix.apache.org/

Github: https://github.com/apache/apisix


Project title:  Refactoring Dashboard plugin orchestration

Background: 

Apache APISIX Dashboard currently supports plugin orchestration, which supports designing the execution flow of plugins through a visual flow editor and finally generating Lua code that can be executed by Apache APISIX.

This feature currently suffers from poor usability: it cannot automatically fill in default configuration fields, it supports multi-stage plugins poorly, and the generated code is hard to use.

Task:

Refactor the frontend and backend modules to improve the experience of using the visual editor and the quality of code generation. Code generators written in Lua need to be ported to other languages to achieve better code readability and maintainability and reduce black boxes.

Skills:

  • Golang
  • JavaScript / TypeScript
  • Lua

Difficulty: Hard
Project size: ~350 hours.
Potential Mentor: Zeping Bai, bzp2010@apache.org, https://github.com/bzp2010

 

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zeping Bai, mail: bzp2010 (at) apache.org
Project Devs, mail: dev (at) apisix.apache.org

SkyWalking

[SkyWalking] Log outlier detection

Currently, Apache SkyWalking can collect logs from various sources such as user agents and Envoy access logs. It also provides a log analysis language to analyze the logs and produce metrics; with those metrics, users can configure rules that trigger alerts and react to abnormal or exceptional logs.

In practice, however, the exceptional logs of a production environment are not known in advance, and users cannot enumerate all possible exceptional logs.

This task aims to add an algorithm that can identify outlier logs among the mass of logs and draw the user's attention to possible errors in the system.

The algorithm should be able to learn from both the history logs and streaming logs, and adjust itself to increase its accuracy.
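To make the idea concrete, here is a deliberately simple sketch. It is not the algorithm this task asks for and not part of SkyWalking: it swaps in a basic rare-template heuristic, where numbers are masked to form a template and a line is flagged when its template is rare among everything seen so far. The class name, the masking rule, and the 5% threshold are all assumptions made for illustration.

```python
import re
from collections import Counter


class RareTemplateDetector:
    """Flags log lines whose template is rare among the lines seen so far."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold  # assumed cut-off on relative frequency
        self.counts = Counter()
        self.total = 0

    @staticmethod
    def template(line):
        # Mask numbers so lines differing only in IDs or durations share a template.
        return re.sub(r"\d+", "<*>", line)

    def learn(self, line):
        """Update the model from one line (history replay or live stream)."""
        self.counts[self.template(line)] += 1
        self.total += 1

    def is_outlier(self, line):
        """Score the line against what has been seen, then learn from it."""
        seen = self.counts[self.template(line)]
        outlier = self.total > 0 and seen / self.total < self.threshold
        self.learn(line)
        return outlier
```

History logs would be fed through `learn` first, and streaming logs through `is_outlier`, which keeps updating the counts so the model adjusts as traffic changes.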

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

Apache SkyWalking Add the webapp of banyandb

BanyanDB, as an observability database, aims to ingest, analyze and store Metrics, Tracing, and Logging data. It's designed to handle observability data generated by Apache SkyWalking. 

We need a web-based application to 

  • Query the data from the banyandb's data nodes
  • Monitor the performance of the backend
  • Render the topology of server nodes

Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Hongtao Gao, mail: hanahmily (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

    ShardingSphere

    Apache ShardingSphere Develop an external tool to convert YAML configuration into DistSQL scripts

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

    Since version 5.0.0, ShardingSphere provides its own management language, DistSQL, which makes it much easier for users to manage distributed databases.
    Many users now want to convert their legacy YAML configuration to DistSQL, and we want to design a tool to help them. (For ShardingSphere-Proxy only)
     
    More details:
    https://shardingsphere.apache.org/document/current/en/concepts/distsql/

    Task

    Design and implement a command line tool that takes the path to a YAML configuration file as input and outputs a DistSQL script file.
    Running the generated DistSQL script should produce a configuration equivalent to the original YAML file.

     
    We have provided a DistSQL for exporting schema configuration, which is related to this issue, to help you understand this issue.

    • The tool should convert both datasources and rule configuration in YAML to corresponding DistSQL RDL
    • The tool needs to run independently, but it can depend on the jar package of ShardingSphere.
    • When the tool starts, it is best to prompt the currently applicable ShardingSphere version.
    • It is best to use the Java language, so that the jar package provided by ShardingSphere can be reused

     
    Notice:

    • There is currently no suitable module in the ShardingSphere repository for standalone tools, so a new module needs to be added.

    Relevant Skills

     
    1. Proficiency in the Java language
    2. Understand the schema configurations of ShardingSphere-Proxy
    3. Understand DistSQL RDL 

    Mentor

    Longtao Jiang, Committer of Apache ShardingSphere, jianglongtao@apache.org
    Chengxiang Lan, Committer of Apache ShardingSphere, lanchengxiang@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Longtao Jiang, mail: jianglongtao (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache ShardingSphere Solve unsupported Postgres sql about statements that start with 'c' for ShardingSphere Parser

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere 

    Background

    The ShardingSphere parser engine parses SQL into an AST (Abstract Syntax Tree) and visits this tree to produce a SQLStatement (Java object). At present, the parser engine can handle SQL for `MySQL`, `PostgreSQL`, `SQLServer`, `openGauss` and `Oracle`, which means we have to understand the SQL dialects of different databases.
     
    More details:
    https://shardingsphere.apache.org/document/current/en/reference/sharding/parse/

    Task

    This issue is to solve the unsupported Postgres SQL statements in this file:

    • CALL

    • CHECKPOINT
    • CLOSE
    • CLUSTER
    • COMMENT
    • COPY
    • CREATE ACCESS METHOD
    • CREATE AGGREGATE
    • CREATE CAST
    • CREATE COLLATION
    • CREATE EVENT TRIGGER
    • CREATE FOREIGN DATA WRAPPER
    • CREATE FOREIGN TABLE
    • CREATE GROUP
    • CREATE MATERIALIZED VIEW
    • CREATE OPERATOR
    • CREATE POLICY
    • CREATE PUBLICATION

     
    You can learn more here.
    You may need to work out why a statement is not supported (is the ANTLR4 grammar missing, or is the visit method not implemented?). ANTLR4 plugins can help with the analysis, and you may need to consult the official documentation to check the grammar.

     
    Notice, these issues can be a good example.
    support alter foreign table for pg/og
    support alter materialized view for pg/og.

    Relevant Skills

     
    1. Proficiency in the Java language
    2. Have a basic understanding of Antlr g4 file
    3. Be familiar with Postgres SQLs 

    Targets files

     
    1. Postgres SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-postgresql/src/main/antlr4/org/apache/shardingsphere/sql/parser/autogen/PostgreSQLStatement.g4

    Mentor

    Zhengqiang Duan, Committer of Apache ShardingSphere, duanzhengqiang@apache.org
    Haoran Meng, PMC of Apache ShardingSphere, menghaoran@apache.org


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhengqiang Duan, mail: duanzhengqiang (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache ShardingSphere Solve unsupported Postgres sql about alter statement for ShardingSphere Parser

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

    The ShardingSphere parser engine parses SQL into an AST (Abstract Syntax Tree) and visits this tree to produce a SQLStatement (Java object). At present, the parser engine can handle SQL for `MySQL`, `PostgreSQL`, `SQLServer`, `openGauss` and `Oracle`, which means we have to understand the SQL dialects of different databases.

    More details:
    https://shardingsphere.apache.org/document/current/en/reference/sharding/parse/

    Task

    This issue is to solve the unsupported Postgres SQL about alter in this file:

    • ALTER OPERATOR
    • ALTER POLICY
    • ALTER PUBLICATION
    • ALTER ROUTINE
    • ALTER RULE
    • ALTER SCHEMA
    • ALTER SEQUENCE
    • ALTER SERVER
    • ALTER STATISTICS
    • ALTER SUBSCRIPTION
    • ALTER TABLE
    • ALTER TEXT SEARCH
    • ALTER TRIGGER
    • ALTER TYPE
    • ALTER VIEW

    You can learn more here.
    You may need to work out why a statement is not supported (is the ANTLR4 grammar missing, or is the visit method not implemented?). ANTLR4 plugins can help with the analysis, and you may need to consult the official documentation to check the grammar.

    Notice, these issues can be a good example.
    support alter foreign table for pg/og
    support alter materialized view for pg/og.

    Relevant Skills

    1. Master JAVA language
    2. Have a basic understanding of Antlr g4 file
    3. Be familiar with Postgres SQLs

    Targets files

    1. Postgres SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-postgresql/src/main/antlr4/org/apache/shardingsphere/sql/parser/autogen/PostgreSQLStatement.g4

    Mentor

    Trista Pan, PMC of Apache ShardingSphere, https://tristazero.github.io
    Zhengqiang Duan, Committer of Apache ShardingSphere, https://github.com/strongduanmu

    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Juan Pan, mail: panjuan (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    ShenYu

    Apache ShenYu add logging-elasticsearch plugin

    Apache ShenYu (incubating)

    A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream framework systems, supports hot plugging, lets users customize development, meets the current and future needs of users in a variety of scenarios, and has been tempered by large-scale scenarios.

    Website: https://shenyu.apache.org

    GitHub: https://github.com/apache/incubator-shenyu

    Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2896

    Description

    1. Add a logging-elasticsearch plugin that uses Elasticsearch to store ShenYu's logs.
    2. Take the ShenYu gateway log information, write it to Elasticsearch and display it.
    3. The module can be added like this:

                   shenyu-plugin
                   ------ shenyu-plugin-logging-elasticsearch

    Task

    • Add the shenyu-plugin-logging-elasticsearch module and implement writing logs to Elasticsearch
    • Complete unit test for this module
    • Complete the integration for this module
    • Complete doc for this module in shenyu website

    Recommended Skills

    • Familiar with Java and reactor Java
    • Know the usage of shenyu plugin ecology
    • Know the usage of the Elasticsearch Java client
    • Have some knowledge about Docker

    Mentor

    XiaoYu, PPMC of Apache ShenYu, https://github.com/yu199195, xiaoyu@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Xiao Yu, mail: xiaoyu (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu Improve integration test and deployment methods

    Apache ShenYu (incubating)

    A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream framework systems, supports hot plugging, lets users customize development, meets the current and future needs of users in a variety of scenarios, and has been tempered by large-scale scenarios.

    Website: https://shenyu.apache.org

    GitHub: https://github.com/apache/incubator-shenyu

    Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2890

    Background

    1. ShenYu does not yet support Helm deployment, so we need to write charts for it and then complete the integration test.
    2. ShenYu already has a relatively complete integration testing framework, but some plugins have not been tested, and some tests are incomplete.

    Task

    • Write helm chart for Apache ShenYu
    • Complete the integration test of deploying Apache ShenYu with helm in Kubernetes
    • Documentation for helm deployment
    • Complete the integration test of the Oauth2 plugin
    • Improve the integration tests of other existing plugins

    Recommended Skills

    • Familiar with Java
    • Know the usage of spring-framework
    • Have some knowledge about Kubernetes and Docker

    Mentor

    Kunshuai Zhu, Committer of Apache ShenYu, https://github.com/JooKS-me, jooks@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Kunshuai Zhu, mail: jooks (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu add logging-kafka plugin

    Apache ShenYu (incubating)

    A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream framework systems, supports hot plugging, lets users customize development, meets the current and future needs of users in a variety of scenarios, and has been tempered by large-scale scenarios.

    Website: https://shenyu.apache.org

    GitHub: https://github.com/apache/incubator-shenyu

    Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2917

    Description

    1. Add a logging-kafka plugin that uses Kafka to store ShenYu's logs.
    2. Take the ShenYu gateway log information, write it to Kafka and display it.
    3. The module can be added like this:
      shenyu-plugin
      shenyu-plugin-logging-kafka

    Task

    • Add the shenyu-plugin-logging-kafka module and implement writing logs to Kafka
    • Complete unit test for this module
    • Complete the integration for this module
    • Complete doc for this module in shenyu website

    Recommended Skills

    • Familiar with Java
    • Know the usage of shenyu plugin ecology
    • Know the usage of the Kafka Java client
    • Have some knowledge about Docker

    Mentor

    Zhang Yonglun, PPMC of Apache ShenYu, https://github.com/tuohai666, zhangyonglun@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yonglun Zhang, mail: zhangyonglun (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShardingSphere Solve unsupported Postgres sql about statements that start with 'd', 'e', 'f', 'i' for ShardingSphere Parser

    Apache ShardingSphere

    Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.

    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

    The ShardingSphere parser engine parses SQL into an AST (Abstract Syntax Tree) and visits this tree to produce a SQLStatement (Java object). At present, the parser engine can handle SQL for `MySQL`, `PostgreSQL`, `SQLServer`, `openGauss` and `Oracle`, which means we have to understand the SQL dialects of different databases.

    More details:
    https://shardingsphere.apache.org/document/current/en/reference/sharding/parse/

    Task

    This issue is to solve the unsupported Postgres SQL statements in this file:

    • CALL
    • DO
    • DROP FUNCTION
    • DROP INDEX
    • DROP INSTANCE RULE
    • DROP REWRITE RULE
    • EXECUTE
    • EXPLAIN
    • FETCH
    • FETCH ABSOLUTE
    • FETCH ALL
    • FETCH BACKWARD
    • FETCH FIRST
    • FETCH LAST
    • FETCH NEXT
    • FETCH PRIOR
    • FETCH RELATIVE
    • IMPORT FOREIGN SCHEMA

    You can learn more here.
    You may need to work out why a statement is not supported (is the ANTLR4 grammar missing, or is the visit method not implemented?). ANTLR4 plugins can help with the analysis, and you may need to consult the official documentation to check the grammar.

    Notice, these issues can be a good example.
    support alter foreign table for pg/og
    support alter materialized view for pg/og.

    Relevant Skills

    1. Master JAVA language
    2. Have a basic understanding of Antlr g4 file
    3. Be familiar with Postgres SQLs

    Targets files

    1. Postgres SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-postgresql/src/main/antlr4/org/apache/shardingsphere/sql/parser/autogen/PostgreSQLStatement.g4

    Mentor

    Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org
    Zhengqiang Duan, Committer of Apache ShardingSphere, duanzhengqiang@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Chuxin Chen, mail: tuichenchuxin (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

    TrafficControl

    GSOC Varnish Cache support in Apache Traffic Control

    Background
    Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

    Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

    There are multiple aspects to this project:

    • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
    • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
    • Testing: Adding automated tests for new code

    Skills:

    • Proficiency in Go is required
    • A basic knowledge of HTTP and caching is preferred, but not required for this project.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Eric Friedrich, mail: friede (at) apache.org
    Project Devs, mail: dev (at) trafficcontrol.apache.org

    ShenYu

    Apache ShenYu add logging-elasticsearch plugin

    Apache ShenYu (incubating)

    A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream frameworks, supports hot plugging, and lets users customize development to meet current and future needs in a variety of scenarios. It has been tested in large-scale scenarios.

    Description

    1. Add a logging-elasticsearch plugin that uses Elasticsearch to store ShenYu's logs.
    2. Take the ShenYu gateway log information, write it to Elasticsearch, and display it.
    3. The module can be added like this:

                   shenyu-plugin
                   ------ shenyu-plugin-logging-elasticsearch

    Task

    • Add the shenyu-plugin-logging-elasticsearch module and implement writing logs to Elasticsearch
    • Complete the unit tests for this module
    • Complete the integration for this module
    • Complete the documentation for this module on the ShenYu website

    Recommended Skills

    • Familiar with Java and Reactor
    • Know how to use the ShenYu plugin ecosystem
    • Know how to use the Elasticsearch Java client
    • Have some knowledge of Docker

    Mentor

    XiaoYu, PPMC of Apache ShenYu, https://github.com/yu199195, xiaoyu@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Xiao Yu, mail: xiaoyu (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu add logging-kafka plugin

    Apache ShenYu (incubating)

    A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream frameworks, supports hot plugging, and lets users customize development to meet current and future needs in a variety of scenarios. It has been tested in large-scale scenarios.

    Description

    1. Add a logging-kafka plugin that uses Kafka to store ShenYu's logs.
    2. Take the ShenYu gateway log information, write it to Kafka, and display it.
    3. The module can be added like this:
      shenyu-plugin
      shenyu-plugin-logging-kafka

    Task

    • Add the shenyu-plugin-logging-kafka module and implement writing logs to Kafka
    • Complete the unit tests for this module
    • Complete the integration for this module
    • Complete the documentation for this module on the ShenYu website

    Recommended Skills

    • Familiar with Java
    • Know how to use the ShenYu plugin ecosystem
    • Know how to use the Kafka Java client
    • Have some knowledge of Docker

    Mentor

    Zhang Yonglun, PPMC of Apache ShenYu, https://github.com/tuohai666, zhangyonglun@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Yonglun Zhang, mail: zhangyonglun (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    Apache ShenYu Integration tests cover more scenarios

    Apache ShenYu (incubating)

    A high-performance, multi-protocol, extensible, responsive API gateway. It is compatible with a variety of mainstream frameworks, supports hot plugging, and lets users customize development to meet current and future needs in a variety of scenarios. It has been tested in large-scale scenarios.

    Website: https://shenyu.apache.org

    GitHub: https://github.com/apache/incubator-shenyu

    Linked GitHub Issue: https://github.com/apache/incubator-shenyu/issues/2890

    Background

    ShenYu already has a relatively complete integration testing framework, but some plugins have not been tested, such as the oauth2 plugin, cache plugin, metrics plugin, and log-rocketmq plugin.

    Task

    • Complete the integration test of the Oauth2 plugin
    • Complete the integration test of the cache plugin
    • Complete the integration test of the metrics plugin
    • Complete the integration test of the log-rocketmq plugin
    • And more, if you want.

    Recommended Skills

    Familiar with Java

    Know the usage of spring-framework

    Have some knowledge about Docker

    Mentor

    Kunshuai Zhu, PPMC of Apache ShenYu, https://github.com/JooKS-me, jooks@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Kunshuai Zhu, mail: jooks (at) apache.org
    Project Devs, mail: dev (at) shenyu.apache.org

    TrafficControl

    GSOC Varnish Cache support in Apache Traffic Control

    Background
    Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

    Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

    There are multiple aspects to this project:

    • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
    • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
    • Testing: Adding automated tests for new code

    Skills:

    • Proficiency in Go is required
    • A basic knowledge of HTTP and caching is preferred, but not required for this project.
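
    To make the configuration generation aspect more concrete, here is a minimal sketch of rendering VCL backend definitions from a list of origins. It is written in Python purely for illustration (the project itself states that this code will be written in Go), and the function name `render_vcl` and the sample origin are assumptions, not part of Traffic Control.

```python
from typing import Iterable, Tuple


def render_vcl(backends: Iterable[Tuple[str, str, int]]) -> str:
    """Render a VCL 4.1 file containing one backend block per origin.

    Each backend is a hypothetical (name, host, port) tuple.
    """
    lines = ["vcl 4.1;", ""]
    for name, host, port in backends:
        lines += [
            f"backend {name} {{",
            f'    .host = "{host}";',
            f'    .port = "{port}";',
            "}",
            "",
        ]
    return "\n".join(lines)


# Illustrative origin only; real values would come from Traffic Ops data.
vcl_text = render_vcl([("origin1", "origin.example.com", 80)])
```

    A real generator would also have to emit routing and caching logic, which this sketch leaves out.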

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Eric Friedrich, mail: friede (at) apache.org
    Project Devs, mail: dev (at) trafficcontrol.apache.org

    RocketMQ

    GSOC Support connect to Doris in Apache RocketMQ Streams

    Apache RocketMQ™ is a unified messaging engine and lightweight data processing platform.

    Apache RocketMQ Streams is a lightweight streaming project for RocketMQ, which can be deployed separately or in cluster mode.
    Various types of data input and output: source supports RocketMQ while sink supports databases and RocketMQ, etc.

    Apache Doris is an MPP-based interactive SQL data warehouse for reporting and analysis. Its original name was Palo, and it was developed at Baidu. After being donated to the Apache Software Foundation, it was renamed Doris.

    • Doris provides highly concurrent, low-latency point query performance, as well as high-throughput ad-hoc analysis queries.
    • Doris provides batch data loading and real-time mini-batch data loading.
    • Doris provides high availability, reliability, fault tolerance, and scalability.

    The main advantages of Doris are the simplicity (of developing, deploying and using) and meeting many data serving requirements in a single system. For details, refer to Overview.

    The Apache Doris Sink in RocketMQ allows moving data from RocketMQ to Doris. It writes data from topics in RocketMQ to tables in Doris.

    So, in this project, you need to implement a sink based on the RocketMQ Streams API that will be executed on the RocketMQ Streams runtime.

    You should learn before applying for this topic

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Li Wei, mail: tigerlee (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    GSOC Support connect to Clickhouse in Apache RocketMQ Connect

    Apache RocketMQ™ is a unified messaging engine and lightweight data processing platform.

    Apache RocketMQ Streams is a lightweight streaming project for RocketMQ, which can be deployed separately or in cluster mode.
    Various types of data input and output: source supports RocketMQ while sink supports databases and RocketMQ, etc.

    ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).

    • True Column-Oriented Database Management System
    • Data Compression
    • Disk Storage of Data
    • Parallel Processing on Multiple Cores
    • Distributed Processing on Multiple Servers
    • SQL Support
    • Vector Computation Engine
    • Real-time Data Updates
    • Primary Index
    • Secondary Indexes
    • Suitable for Online Queries
    • Support for Approximated Calculations
    • Adaptive Join Algorithm
    • Data Replication and Data Integrity Support
    • Role-Based Access Control
    • Features that Can Be Considered Disadvantages

    The Clickhouse Sink in RocketMQ allows moving data from RocketMQ to Clickhouse. It writes data from topics in RocketMQ to tables in Clickhouse.

    So, in this project, you need to implement a sink based on the RocketMQ Streams API that will be executed on the RocketMQ Streams runtime.

    You should learn before applying for this topic

    Mentor

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Li Wei, mail: tigerlee (at) apache.org
    Project Devs, mail: dev (at) rocketmq.apache.org

    DolphinScheduler

    GSOC Support etcd as registry

    Apache DolphinScheduler

    Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces, dedicated to solving complex job dependencies in the data pipeline and providing various types of jobs available out of box.

    Website: https://dolphinscheduler.apache.org/en-us/index.html

    GitHub: https://github.com/apache/dolphinscheduler

    Linked GitHub Issue: https://github.com/apache/dolphinscheduler/issues/8975

    Background

    Right now, we use ZooKeeper as the registry, and we also use ZooKeeper to store some metadata of the master and worker.

    We have already implemented the registry plug-in architecture, and we now need to support etcd as a new registry plugin option. This will help users who are only familiar with etcd to use DolphinScheduler.

    Task

    This task aims to support etcd as a registry.

    Recommended Skills

    • Familiar with Java
    • Know how to use etcd

    Mentors

    Wenjun Ruan, wenjun@apache.org

    ShunFeng Cai, caishunfeng@apache.org


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Wenjun Ruan, mail: wenjun (at) apache.org
    Project Devs, mail: dev (at) dolphinscheduler.apache.org

    GSoC Python API CLI enhancement

    About pydolphinscheduler

    PyDolphinScheduler is the Python API for Apache DolphinScheduler, which allows you to define your workflow in Python code, also known as workflow-as-code. You can find more detail about PyDolphinScheduler in its documentation[4]. All the source code is held as a submodule in the DolphinScheduler main codebase[5].

    The Goal

    Make pydolphinscheduler's CLI more powerful, so that it can operate on DolphinScheduler's models, run pydolphinscheduler code, and visualize the DAG graph in the terminal.

    Detail

    Up to now, the Apache DolphinScheduler Python API has a CLI that supports only a limited set of commands. Our community wishes it to become a more powerful tool and support as many commands as possible (unless a command has a security issue).

    It only supports `version` and `config` for now; see [1] for more detail.

    Basically, we think the following commands would be helpful for the CLI. You can add other commands if they are needed (but make sure to discuss them in the community first):

    • `run <DAG name> [--example]`: Run a local workflow DAG file or a built-in example
    • `users`: User operations, CRUD
    • `projects`: Project operations, CRUD, grant to other users
    • `tenants`: Tenant operations, CRUD
    • `workflow`: Workflow operations, CRUD; a name change should also change the local Python file name
    • `visualize`: Show the task graph in the terminal
    • etc.
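
    The command list above could be laid out as a sub-command tree. The sketch below is only an illustration: it uses the standard-library `argparse` rather than `click` so that it is self-contained, and the action names (`create`, `read`, `update`, `delete`), the positional `name` argument, and the returned strings are assumptions rather than the existing pydolphinscheduler CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical command tree mirroring the proposed commands."""
    parser = argparse.ArgumentParser(prog="pydolphinscheduler")
    sub = parser.add_subparsers(dest="command", required=True)

    # `run <DAG name> [--example]`
    run = sub.add_parser("run", help="run a local workflow DAG file or a built-in example")
    run.add_argument("dag_name")
    run.add_argument("--example", action="store_true")

    # Resource commands share the same create/read/update/delete actions.
    for resource in ("users", "projects", "tenants", "workflow"):
        res = sub.add_parser(resource, help=f"CRUD operations on {resource}")
        res.add_argument("action", choices=["create", "read", "update", "delete"])
        res.add_argument("name")

    # `visualize`: show the task graph in the terminal.
    vis = sub.add_parser("visualize", help="show the task graph in the terminal")
    vis.add_argument("dag_name")
    return parser


def main(argv=None) -> str:
    """Parse arguments and return a description of the action (no real API calls)."""
    args = build_parser().parse_args(argv)
    if args.command == "run":
        source = "built-in example" if args.example else "local file"
        return f"run {args.dag_name} ({source})"
    if args.command == "visualize":
        return f"visualize {args.dag_name}"
    return f"{args.command}: {args.action} {args.name}"
```

    For example, `main(["users", "create", "alice"])` only describes the requested action; a real implementation would call the DolphinScheduler API at that point.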

    Besides the functional additions, we should also consider the output of the CLI and make it clearer and more attractive. We may consider using the following packages (we should also look for other interesting packages):

    • rich[2]: For highlighting our output, or using an existing rich plugin like `click-rich`
    • tabulate[3]: For visualizing tables in the terminal

    What Can You Learn

    We hope everyone joining GSoC can learn something from the project. When you finish this project, you will have learned:

    • How to write production-level Python code and docs: you can improve your Python syntax and learn how to write tests with `pytest` and `tox`, how to write documentation with `sphinx` and its related plugins, and how to format and lint your Python code
    • Knowledge about task scheduling systems: what they are, what they focus on, and how they run

    If You Are Interested in It

    If you want to take this ticket, you should have:

    • (Must) Python skills, especially with packages such as click and pytest
    • A little knowledge of task scheduling systems
    • (Optional) Basic Java knowledge, because the Apache DolphinScheduler core is written in Java and you may need to add some functional code to it

    Mentors

    • Calvin Kirs: Committer of Apache {DolphinScheduler, SeaTunnel, Wayang}, DolphinScheduler PMC and SeaTunnel PPMC
    • Jiajie Zhong: Committer of Apache {Airflow, DolphinScheduler, SeaTunnel}, SeaTunnel PPMC


    [1]: https://dolphinscheduler.apache.org/python/cli.html

    [2]: https://github.com/Textualize/rich

    [3]: https://github.com/astanin/python-tabulate

    [4]: https://dolphinscheduler.apache.org/python/index.html

    [5]: https://github.com/apache/dolphinscheduler/tree/dev/dolphinscheduler-python/pydolphinscheduler

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jiajie Zhong, mail: zhongjiajie (at) apache.org
    Project Devs, mail: dev (at) dolphinscheduler.apache.org

    Community Development

    Apache IoTDB integration with gRPC

    Background:

    Apache IoTDB uses Thrift as its RPC layer. However, there are some voices in the community asking: do we need to support gRPC?

    We noticed:

    • Thrift has to allocate memory for each RPC call (reading data from the network into a byte array, and then converting the bytes to objects), and it is hard to control the overall memory cost for large RPCs.
    • Thrift connections may break when there are too many concurrent connections.
    • Thrift does not support stream mode.


    So, we'd like to know whether gRPC is better.


    Tasks:

    • Implement IoTDB's RPC layer using gRPC.
      • including the sync/async modes
      • sub-tasks: C++, C#, and Python API wrappers are also desired.
    • Run a performance test
      • throughput, memory cost, jitter, etc.
    • Write a report comparing the two
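
    As a rough illustration of the performance test, the sketch below times repeated calls to a client function and reports throughput, mean latency, jitter (taken here as the standard deviation of the latencies), and peak memory. The `fake_rpc` function is a placeholder assumption standing in for a real Thrift or gRPC call, and `tracemalloc` only sees Python-level allocations, so a real comparison would need measurements on the server side as well.

```python
import statistics
import time
import tracemalloc
from typing import Callable, Dict


def benchmark(call: Callable[[], object], iterations: int = 1000) -> Dict[str, float]:
    """Measure throughput, latency jitter and peak Python memory of `call`."""
    latencies = []
    tracemalloc.start()
    started = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "throughput_per_s": iterations / elapsed,
        "mean_latency_s": statistics.mean(latencies),
        "jitter_s": statistics.pstdev(latencies),
        "peak_memory_bytes": float(peak_bytes),
    }


def fake_rpc() -> bytes:
    """Placeholder for a real Thrift or gRPC client call."""
    return bytes(1024)


report = benchmark(fake_rpc, iterations=200)
```

    Running the same harness against both RPC implementations with identical payloads would give the numbers needed for the comparison report.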


    References:

    IoTDB's current Thrift RPC specification:

    1.  https://github.com/apache/iotdb/tree/master/thrift
    2. there are some ongoing Thrift APIs: thrift-datanode, thrift-confignode, thrift-cluster, thrift-sync


    Difficulty: Major
    Project size: ~175 hour (medium)
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail:

    Apache EventMesh EventMesh supports dashboard

    Apache EventMesh (incubating)

    Apache EventMesh is a dynamic cloud-native eventing infrastructure used to decouple the application and backend middleware layer, which supports a wide range of use cases that encompass complex multi-cloud, widely distributed topologies using diverse technology stacks.

    Website: https://eventmesh.apache.org

    GitHub: https://github.com/apache/incubator-eventmesh

    Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/700

    Background

    1. Currently, there is no console page for EventMesh. We hope the community can contribute a visual control page based on EventMesh.

    Task

    • Learn the details of Apache EventMesh
    • Improve the functionalities of the EventMesh Administration Module
    • Implement a web-based dashboard for EventMesh 

    Recommended Skills

    Familiar with Java

    Familiar with HTML, CSS, TypeScript, React.js or Vue.js

    Basic knowledge of RESTful API and HTTP communication

    Mentor

    Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org

    Xiaoyang Liu, Committer of Apache EventMesh, https://github.com/xiaoyang-sde, xiaoyang@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Xue Weiming, mail: mikexue (at) apache.org
    Project Devs, mail:

    Apache EventMesh Support Knative as Eventing Infra

    Apache EventMesh (incubating)

    Apache EventMesh is a dynamic cloud-native eventing infrastructure used to decouple the application and backend middleware layer, which supports a wide range of use cases that encompass complex multi-cloud, widely distributed topologies using diverse technology stacks.

    Website: https://eventmesh.apache.org

    GitHub: https://github.com/apache/incubator-eventmesh

    Linked GitHub Issue: https://github.com/apache/incubator-eventmesh/issues/790

    Background

    1. Knative Eventing provides tools for routing events from event producers to sinks, enabling developers to use an event-driven architecture with their applications.
    2. Apache EventMesh supports the CloudEvents specification, thus it could be integrated with Knative as an event broker.

    Task

    • Learn the details of the CloudEvents specification
    • Learn the basics of Knative Eventing and its communication protocol
    • Implement the EventMesh Knative-Connector module to deliver events to Knative
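
    To give a feel for what delivering an event involves, here is a minimal sketch of encoding an event in the CloudEvents HTTP binary content mode, where the required context attributes (`specversion`, `id`, `source`, `type`) travel as `ce-` prefixed headers and the payload is the request body. The function name, event type, and source below are made-up examples, and the actual connector would be written in Java inside EventMesh.

```python
import json
import uuid
from typing import Dict, Tuple


def to_binary_http(event_type: str, source: str, data: dict) -> Tuple[Dict[str, str], bytes]:
    """Encode an event as HTTP headers plus body (CloudEvents binary content mode)."""
    headers = {
        "ce-specversion": "1.0",
        "ce-id": str(uuid.uuid4()),       # unique per event
        "ce-source": source,
        "ce-type": event_type,
        "Content-Type": "application/json",
    }
    return headers, json.dumps(data).encode("utf-8")


# Hypothetical event; the names are for illustration only.
headers, body = to_binary_http("com.example.order.created", "/eventmesh/demo", {"orderId": 1})
```

    The resulting headers and body could then be sent with an HTTP POST to whatever endpoint the Knative side exposes.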

    Recommended Skills

    Familiar with Java

    Basic knowledge of Docker and Kubernetes

    Basic knowledge of Knative and CloudEvents

    Mentor

    Easonc Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Xue Weiming, mail: mikexue (at) apache.org
    Project Devs, mail:

    Commons Statistics

    GSoC 2022

    Placeholder for tasks that could be undertaken in this year's GSoC.

    Ideas:

    • Design an updated summary statistics API for use with Java 8 streams based on the summary statistic implementations in the Commons Math stat.descriptive package including moments, rank and summary sub-packages.
    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Alex Herbert, mail: aherbert (at) apache.org
    Project Devs, mail:
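
    A stream-friendly summary statistic generally needs two operations: accept one value, and combine two partial results (for parallel streams). The sketch below shows that shape using Welford's update and the pairwise merge formula for the mean and variance. It is written in Python for brevity, and the class and method names are assumptions, not the Commons Statistics API.

```python
from dataclasses import dataclass


@dataclass
class SummaryStats:
    """Running count, mean and sum of squared deviations (m2)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def accept(self, x: float) -> "SummaryStats":
        """Add a single value (Welford's online update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def combine(self, other: "SummaryStats") -> "SummaryStats":
        """Merge another partial result into this one."""
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total
        return self

    @property
    def variance(self) -> float:
        """Sample variance (n - 1 denominator); 0.0 for fewer than two values."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


left = SummaryStats()
for value in (1.0, 2.0, 3.0):
    left.accept(value)
right = SummaryStats()
for value in (4.0, 5.0):
    right.accept(value)
merged = left.combine(right)
```

    The accept/combine pair mirrors what a Java 8 collector needs, which is why the same structure is used in this illustration.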

    Commons Numbers

    GSoC 2022

    Placeholder for tasks that could be undertaken in this year's GSoC.

    Ideas:

    • Update the support for complex numbers in the complex package to allow operations to be performed on lists of complex numbers. This requires abstracting the representation of multiple complex numbers into a list structure storing real and imaginary parts that can be efficiently iterated to apply all the operations supported by the Complex class. Operations should modify the numbers in place allowing efficient, zero allocation complex number math to be performed on large datasets.
    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Alex Herbert, mail: aherbert (at) apache.org
    Project Devs, mail: dev (at) commons.apache.org
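
    The idea of storing real and imaginary parts in a list structure and modifying them in place can be sketched as follows. This is an illustration in Python, not the Commons Numbers design: the class name `ComplexList` and its methods are assumptions, and Python cannot demonstrate true zero-allocation behaviour, only the data layout and the in-place updates.

```python
from array import array


class ComplexList:
    """Complex numbers stored as two parallel arrays of doubles."""

    def __init__(self, values):
        values = [complex(v) for v in values]
        self.re = array("d", (v.real for v in values))
        self.im = array("d", (v.imag for v in values))

    def __len__(self) -> int:
        return len(self.re)

    def __getitem__(self, i: int) -> complex:
        return complex(self.re[i], self.im[i])

    def multiply_in_place(self, factor: complex) -> "ComplexList":
        """Multiply every element by `factor`, overwriting the stored parts."""
        fr, fi = factor.real, factor.imag
        for i in range(len(self.re)):
            a, b = self.re[i], self.im[i]
            self.re[i] = a * fr - b * fi
            self.im[i] = a * fi + b * fr
        return self

    def conjugate_in_place(self) -> "ComplexList":
        """Negate every imaginary part."""
        for i in range(len(self.im)):
            self.im[i] = -self.im[i]
        return self


numbers = ComplexList([1 + 2j, 3 - 1j])
numbers.multiply_in_place(1j)
```

    Keeping the two parts in separate contiguous arrays is what allows the operations to iterate without creating an object per number.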

    Commons Math

    GSoC 2022

    Placeholder for tasks that could be undertaken in this year's GSoC.

    Ideas (extracted from the "dev" ML):

    1. Redesign and modularize the "ml" package
      -> main goal: enable multi-thread usage.
    2. Abstract the linear algebra utilities
      -> main goal: allow switching to alternative implementations.
    3. Redesign and modularize the "random" package
      -> main goal: general support of low-discrepancy sequences.
    4. Refactor and modularize the "special" package
      -> main goals: ensure accuracy and performance and better API,
      add other functions.
    5. Upgrade the test suite to Junit 5
      -> additional goal: collect a list of "odd" expectations.

    Other suggestions welcome, as well as

    • delineating additional and/or intermediate goals,
    • signalling potential pitfalls and/or alternative approaches to the intended goal(s).
    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Gilles Sadowski, mail: erans (at) apache.org
    Project Devs, mail: dev (at) commons.apache.org

    Commons Geometry

    GSoC 2022

    Placeholder for tasks that could be undertaken in this year's GSoC.

    Ideas:

    • Examine and potentially redesign the API and algorithms in the commons-geometry-enclosing module. The goal here is to make consistent use of the newer geometry API and ensure that the algorithms are sound.
    • Examine and potentially redesign the API and algorithms in the commons-geometry-hull module. The goal here is to make consistent use of the newer geometry API and ensure that the algorithms are sound (see GEOMETRY-144).
    • Design and implement a parser/writer for the PLY file format in the commons-geometry-io-euclidean module.
    • Design an API for advanced 3D mesh data structures (e.g. halfedge meshes) and operations (e.g. surface subdivision, smoothing, etc). This may end up being another module, e.g. commons-geometry-mesh.
    • Create a series of user guides and/or tutorials demonstrating best-practice use of the library.
    • other ideas ... ?
    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Matt Juntunen, mail: mattjuntunen (at) apache.org
    Project Devs, mail:

    CloudStack

    GSoC Idea 2022 - Bypass Secondary Storage (Direct Download) on VMware &/or XenServer

    Background

    The default way of registering / downloading templates in CloudStack involves caching them on the secondary store and then during VM deployment, the template is copied to the primary store. However, from ACS version 4.11.1 onward, a feature was added for KVM hypervisor to enable direct download to primary store. This massively reduces the usage of secondary store and also quickens the entire VM deployment process, as there is no need to copy the template from secondary to primary store. 

    Requirement

    We would like to propose an idea to extend this feature of direct download of templates onto primary store to other hypervisors, namely VMware and XenServer. This would greatly benefit end-users by letting them use the secondary storage efficiently and by reducing the overall time of VM deployment on the respective hypervisors.

    Relevant Skills:

    Java
    MySQL
    Vue.js

    Difficulty:

    175 hours (Only VMware)
    350 hours (VMware & XenServer)

    Potential Mentors:

    Abhishek Kumar (abhishek.mrt22@gmail.com)
    Pearl Dsilva (pearl1594@gmail.com)

    References

    https://www.shapeblue.com/how-to-deploy-templates-without-using-secondary-storage-on-kvm/
    https://www.shapeblue.com/cloudstack-feature-first-look-direct-download-agnostic-of-the-storage-provider/
    https://cwiki.apache.org/confluence/display/CLOUDSTACK/Bypass+Secondary+Storage+%28Direct+Download%29+on+KVM
    https://www.youtube.com/watch?v=SwepUTfGiKc

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Pearl Dsilva, mail: pearl11594 (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    GSoC 2022 CloudStack Terraform Provider - Add datasources for the existing resources

    Background

    Terraform is an Infrastructure as Code (IaC) tool that provides a consistent CLI workflow to manage resources in many cloud services. The CloudStack Terraform provider integrates with CloudStack to aid in managing and automating the deployment of resources in CloudStack. We have recently made the first release of the CloudStack Terraform Provider, v0.4.0: https://github.com/apache/cloudstack-terraform-provider

    Requirement

    Terraform defines a datasource as, "something that allows Terraform to use the information defined outside of Terraform, defined by another separate Terraform configuration, or modified by functions". Most resources offer data sources alongside their set of resource types. However, the CloudStack Terraform Provider currently has only one datasource, for templates. Hence, we propose an idea for students to get involved in enhancing the features of the CloudStack Terraform Provider by adding support for datasources.

    If the students are enjoying the project, the scope can be extended to support adding further resources in Terraform such that the CloudStack Terraform Provider may become a de-facto tool for automating CloudStack deployments.

    The current set of resources that the CloudStack Terraform provider supports is listed at:
    https://registry.terraform.io/providers/cloudstack/cloudstack/latest/docs , whereas its counterpart Ansible boasts a more evolved list of resources ( https://docs.ansible.com/ansible/latest/collections/ngine_io/cloudstack/index.html ), mainly zones, clusters, accounts, domains, etc. It would be great if interested students want to go a step further and help add support for these too.

    Relevant Skills:

    GoLang (basic)

    Difficulty:

    Medium

    Potential Mentors:

    Harikrishna Patnala
    Pearl Dsilva

    Example and references

    https://registry.terraform.io/providers/cloudstack/cloudstack/latest/docs : check Resources and Data Sources section under CloudStack Provider
    Depends on CloudStack Go SDK - https://github.com/apache/cloudstack-go

    Github issue: https://github.com/apache/cloudstack/issues/6016

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Harikrishna Patnala, mail: harikrishna (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    GSoC 2022 Idea CloudStack Edge Zones

    Background

    Over recent years, Edge computing has been gaining popularity as it defines a model that brings compute and storage closer to where they are consumed by the end-user. By being closer to the end-user, a better experience can be provided with reduced overall latency, lower bandwidth requirements, lower TCO and a more flexible hardware/software model, while also ensuring security and reliability. To align ACS with this evolving cloud computing model, we would like to propose an idea of supporting Edge Zones in CloudStack, which can also be looked upon as lightweight zones with minimal resources.

    Requirement

    Today, when a Zone is set up in CloudStack, it comes up by default with a secondary storage VM (SSVM) and a console proxy VM (CPVM). As part of this project, we would need to define a new zone type and decide the change in workflow required to ensure that a CPVM and an SSVM are not spawned by default. Basic characteristics of an Edge zone include:

    • no need for Secondary Storage
    • no Secondary Storage VM
    • no Console Proxy VM
    • Local storage only, as an edge device typically consists of a single compute node (host)
    • Support for L2 and Isolated networks

    A high-level view of an edge zone would look something like:

    Relevant Skills:

    Java
    MySQL
    Vue.js (Basic)

    Difficulty:

    Medium

    Project Duration:

    175 hours

    Potential Mentors:

    Alex Mattioli
    Nicolas Vazquez
    Pearl Dsilva

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Pearl Dsilva, mail: pearl11594 (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    View Logs in the UI

    As of now, when an admin encounters an issue or error in CloudStack, the most information they can immediately get is the API failure response, which provides a reason for the failure. At times this might not be sufficient to diagnose the error, and the admin has to investigate the CloudStack logs. This requires the admin or the sysadmin to log into the VM running CloudStack and either view or export the logs, and then dive into identifying the issue. This idea aims to eliminate that step.

    The goal of this is to provide admins the ability to view the logs directly in the UI. This would make diagnosing failures and RCAs much quicker.

    Provide the ability to display the logs in the UI

    Add an API / WebSocket (and UI) support to :

    • View the logs
    • Live follow the logs (similar to 'tail -f')
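The 'tail -f' behaviour could be served by a polling loop that remembers a byte offset and returns only what was appended since the last call. This is a minimal sketch in Python, not CloudStack code; the function name `read_new_lines` is hypothetical, and a real endpoint would also need access control and a limit on how much is returned per call.

```python
import os


def read_new_lines(path, offset):
    """Return (complete lines appended since offset, new offset)."""
    # If the file shrank (for example after log rotation), start again from the top.
    if os.path.getsize(path) < offset:
        offset = 0
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Only hand out complete lines; a partial last line waits for the next poll.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A WebSocket handler could call this on a timer and push the returned lines to the browser, while a plain API could return the new offset so that the client passes it back on the next request.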

    Duration

    • 175 hours

    Potential Mentors

    • David Jumani

    References

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    David Jumani, mail: davidjumani (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    Add the ability to Safely Shutdown / restart CloudStack

    Shutting down / restarting CloudStack is a necessary step in upgrades, system maintenance, etc. As of now, there is no way to safely shut down or restart CloudStack; it is terminated directly via systemd. As a result, any asynchronous job or background task is abruptly terminated and can fail. CloudStack maintains a list of asynchronous jobs within its database along with their status.

    This idea aims to provide a way to safely shut down CloudStack. It involves two parts:

    • Prevent new asynchronous jobs from being added to CloudStack when a safe shutdown is triggered
    • Check the status of the async jobs and shut down CloudStack when all the jobs have been completed

    Provide the ability to safely shutdown CloudStack

    Add API (and/or UI) support to :

    • Trigger a safe shutdown
    • (Optional) Support restarts
    • (Optional) Support a forced shutdown when CloudStack will quit even if there are async jobs running
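The two parts of a safe shutdown, plus the optional forced variant, can be sketched as a small gate around job submission. The sketch is in Python and is not CloudStack code: the class `JobGate` is hypothetical, and it counts jobs in memory, whereas CloudStack keeps its asynchronous jobs and their status in the database.

```python
import threading


class JobGate:
    """Hypothetical gate that tracks running async jobs during a safe shutdown."""

    def __init__(self):
        self._cond = threading.Condition()
        self._running = 0
        self._draining = False

    def try_start(self):
        # Part one: refuse new jobs once a safe shutdown has been triggered.
        with self._cond:
            if self._draining:
                return False
            self._running += 1
            return True

    def finish(self):
        # Called by a job when it completes; wakes up a waiting shutdown.
        with self._cond:
            self._running -= 1
            self._cond.notify_all()

    def shutdown(self, timeout=None):
        # Part two: wait until every running job has completed.
        # Returns False if the timeout expired first, so the caller can
        # decide whether to force the shutdown.
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._running == 0, timeout)
```

With `timeout=None` the call waits for all jobs; a forced shutdown would pass a timeout and proceed even when the result is `False`.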

    Duration

    • Some Experience : 175 hours
    • Newbie : 350 hours

    Potential Mentors

    David Jumani

    References

    https://github.com/apache/cloudstack/issues/6021

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    David Jumani, mail: davidjumani (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack Terraform Provider - Add support for Kubernetes Clusters

    As of now the CloudStack Terraform Provider does not support managing CKS clusters

    This proposal aims to add support to the CloudStack Terraform Provider to manage CKS clusters

    This would involve supporting the following actions on CKS clusters :

    • Create
    • Stop / Start
    • Scale
    • Upgrade
    • Delete

    [Optional]
    Support the following actions on the binary ISOs :

    • Register
    • Enable / Disable
    • Delete

    Duration

    • 175 hours

    Potential Mentors

    • Harikrishna Patnala
    • David Jumani

    References

    https://github.com/apache/cloudstack/issues/6040

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    David Jumani, mail: davidjumani (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    GSoC 2022 Idea Instant Instance Deploy (using VM Definitions)

    Background

    Currently, deploying Instances/Virtual Machines (VMs) in CloudStack requires specifying some offerings, a template and other settings through the API (check the API here: https://cloudstack.apache.org/api/apidocs-4.16/apis/deployVirtualMachine.html) or the 'Instance Deployment Wizard' in the UI.

    Requirement

    Provide a way for the user/operator to quickly deploy an instance using a VM definition/profile. The VM definition/profile would hold the details of the template, offerings (including any custom values such as size and IOPS), SSH keypair, instance group, affinity group and other settings (boot type, dynamic scaling, userdata, keyboard language, etc.) that are required, and the underlying definition/profile ID can be used to launch an instance. At the minimum, the definition should hold all the mandatory details for deploying an instance. With this, only the VM definitions/profiles (and other important options, with the associated billing details) can be exposed to the users for VM deployment, instead of the offerings and other VM options.

    New APIs (and/or UI support) need to be added for the VM definition/profile CRUD operations, along with support for the definition in the deployVirtualMachine API.
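To make the idea concrete, a stored definition can be thought of as a set of default deployVirtualMachine parameters that is merged with per-request values. The sketch below is in Python and purely illustrative: the function `build_deploy_params` is hypothetical, and the check assumes that `serviceofferingid`, `templateid` and `zoneid` are the mandatory parameters of the 4.16 API.

```python
def build_deploy_params(definition, overrides=None):
    """Merge a stored VM definition with per-request overrides."""
    params = dict(definition)
    params.update(overrides or {})
    # Assumption: these three are the mandatory deployVirtualMachine
    # parameters; check the API reference for the version in use.
    required = ("serviceofferingid", "templateid", "zoneid")
    missing = [key for key in required if not params.get(key)]
    if missing:
        raise ValueError("definition is missing: " + ", ".join(missing))
    params["command"] = "deployVirtualMachine"
    return params
```

For example, a definition holding the three IDs could be combined with `{"name": "vm1"}` at request time, so the user supplies only the definition ID and a name.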

    Relevant Skills

    • Java, MySQL
    • Vue.js (for UI)
    • Some knowledge of Virtualization and CloudStack

    Difficulty

    Medium

    Potential Mentors

    • Suresh Kumar Anaparti
    • David Jumani

    Project Scope/Duration

    Medium / 175 hours

    References

  • http://docs.cloudstack.apache.org/en/latest/adminguide/index.html#working-with-virtual-machines
  • https://cloudstack.apache.org/api/apidocs-4.16/apis/deployVirtualMachine.html
  • https://cloudstack.apache.org/api/apidocs-4.16/

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Kumar Anaparti, mail: sureshkumar.anaparti (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    GSoC 2022 More granularity on affinity/anti-affinity groups

    Currently, defining an affinity or anti-affinity rule works only at the host level. I would like to have more detail on the affinity group, extending it to different levels (cluster, pod, zone, ...) and also, within the same level, being able to add or remove resources from the group.

    For hosts and storage pools, administrators can make use of host tags or storage tags to get a similar result. However, the extension of affinity/anti-affinity groups would make the administration easier.

    Size of the project: Medium (~175hs)

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    GSoC 2022 Idea Keep track of VM's "last known state" and enforce it after an outage

    An infrastructure outage can take out several or all VMs. In the aftermath it's not always possible to know which VMs were supposed to be ON or OFF, especially if HA is not enabled. People keep powered off VMs around all the time for many reasons.

    I propose we add a feature where CloudStack keeps track of the "last known state" of a VM and, after an outage, either enforces it (i.e. starts the VM or leaves it off) or at least shows some information to the operator in the UI/API so they can do it themselves; perhaps make this behaviour configurable in the global settings.
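The comparison after an outage can be sketched as a function that takes the recorded states and the states actually found, and returns what to do per VM. This is a hypothetical Python sketch; the state names "Running" and "Stopped" and the `enforce` flag (standing in for the proposed global setting) are assumptions.

```python
def plan_recovery(last_known, current, enforce=True):
    """Compare recorded VM states with the states found after an outage."""
    actions = {}
    for vm, wanted in last_known.items():
        found = current.get(vm, "Unknown")
        if found == wanted:
            continue  # nothing to do, e.g. a VM that was off stays off
        if enforce and wanted == "Running":
            actions[vm] = "start"
        else:
            # Only surface the mismatch to the operator.
            actions[vm] = "report"
    return actions
```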

    Thanks

    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Nux, mail: nuxro (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    GSoC 2022 Idea Report / Manage the VM jobs in CloudStack

    Background

    CloudStack allows users/operators to perform various operations on the Virtual Machines (VMs). When multiple operations are performed on a VM at the same time, these operations are maintained and synchronized using the sync queues. Any long-running job (e.g. a volume snapshot) of a VM keeps other jobs in a waiting/pending state, and they are picked up only once the active job is finished. Currently, it is not possible for an operator to list the pending jobs on a VM, or to cancel or re-prioritise any job if needed.

    Requirement

    Provide a way for the admin/operator to list the pending jobs of a VM, and to cancel or re-prioritise a job if needed. Also, allow clearing all the pending jobs of a VM.

    Add API (and/or UI) support to

    • List the active jobs for a VM
    • List all the pending jobs of a VM (in queue, by their order of execution)
    • Re-prioritise a job from the pending jobs (if possible)
    • Cancel any job from the pending jobs
    • Clear all the pending jobs of a VM
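The operations listed above map naturally onto a per-VM queue with one active job and an ordered list of pending jobs. The following Python sketch is illustrative only; the class `VmJobQueue` and its method names are hypothetical and do not reflect the CloudStack job framework.

```python
class VmJobQueue:
    """Hypothetical per-VM queue: one active job, the rest pending in order."""

    def __init__(self):
        self.active = None
        self.pending = []

    def submit(self, job_id):
        if self.active is None:
            self.active = job_id
        else:
            self.pending.append(job_id)

    def list_pending(self):
        # Pending jobs in their order of execution.
        return list(self.pending)

    def reprioritise(self, job_id, position=0):
        self.pending.remove(job_id)  # raises ValueError for unknown jobs
        self.pending.insert(position, job_id)

    def cancel(self, job_id):
        self.pending.remove(job_id)

    def clear(self):
        dropped, self.pending = self.pending, []
        return dropped

    def complete_active(self):
        # The next pending job, if any, becomes the active one.
        self.active = self.pending.pop(0) if self.pending else None
```

Only pending jobs can be cancelled or moved in this sketch; what should happen to the active job is left open, as in the description above.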

    Relevant Skills

    • Java, MySQL
    • Vue.js (for UI)
    • Some knowledge of CloudStack and its Job framework

    Difficulty

    Medium

    Potential Mentors

    • Suresh Kumar Anaparti
    • Any Developer from CS Community

    Project Scope/Duration

    Large / 350hrs (can be Medium / 175 hours - with reduced scope of API/UI work)

    References

    Future Extensions

    This can be extended for other resources (hosts, primary storage, network, etc).
    [APIs should take resource type as a param for generic implementation]

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Kumar Anaparti, mail: sureshkumar.anaparti (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    GSoC 2022 - CloudStack OAuth2 Plugin

    This can be an interesting task for an engineer with domain knowledge on backend services in Java and some knowledge in Vue.js or relevant tech. Also, domain knowledge of OAuth authentication is desirable.

    The main objectives of this task are:

    • Create a new CloudStack authentication plugin: this plugin will allow authentication via third-party providers such as Google, Facebook, GitHub, etc.
    • Extend CloudStack configurations: allow administrators to enable/disable the plugin and configure the auth provider

    More information about the task on: https://github.com/apache/cloudstack/issues/4834

    Size: Medium

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Nicolás Vázquez, mail: nvazquez (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    GSoC 2022 Idea Autodetect IPs used inside the VM

    With regard to IP info reporting, CloudStack relies entirely on its DHCP databases and so on. When this is not available (L2 networks etc.), no IP information is shown for a given VM.

    I propose we introduce a mechanism for "IP autodetection" and try to discover the IPs used inside the machines by means of querying the hypervisors. For example with KVM/libvirt we can simply do something like this:


                [root@fedora35 ~]# virsh domifaddr win2k22 --source agent
                 Name                          MAC address         Protocol   Address
                -------------------------------------------------------------------------------
                 Ethernet                      52:54:00:7b:23:6a   ipv4       192.168.0.68/24
                 Loopback Pseudo-Interface 1                       ipv6       ::1/128
                 -                             -                   ipv4       127.0.0.1/8

    The above command queries the qemu-guest-agent inside the Windows VM. The VM needs the qemu-guest-agent installed and running, the virtio serial drivers (easily installed in this case with virtio-win-guest-tools.exe), and a guest-agent socket channel defined in libvirt.
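
    A minimal sketch of how such output could be turned into a list of guest IPs is shown below. It only assumes that each data row ends with a protocol column and an address column, as in the example above; the function names are illustrative and not part of CloudStack or libvirt.

```python
import ipaddress
import re

# Matches the protocol and address columns at the end of a data row.
ROW = re.compile(r"\b(ipv4|ipv6)\s+(\S+)\s*$")


def parse_domifaddr(output):
    """Return (protocol, address/prefix) pairs found in `virsh domifaddr` output."""
    return [m.groups() for line in output.splitlines() if (m := ROW.search(line))]


def guest_ips(output):
    """Return the non-loopback addresses, without the prefix length."""
    ips = []
    for _protocol, address in parse_domifaddr(output):
        ip = ipaddress.ip_interface(address).ip
        if not ip.is_loopback:
            ips.append(str(ip))
    return ips


SAMPLE = """\
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 Ethernet   52:54:00:7b:23:6a    ipv4         192.168.0.68/24
 Loopback Pseudo-Interface 1     ipv6         ::1/128
 -          -                    ipv4         127.0.0.1/8
"""

print(guest_ips(SAMPLE))
```

    Matching only the last two columns avoids having to split interface names that contain spaces, such as the Windows loopback interface.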

    Once we have this information we could display it in the UI/API as "Autodetected VM IPs" or something like that.

    I imagine it's very similar for VMware and XCP-ng.

    Thank you

    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Nux, mail: nuxro (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    Cassandra

    Produce and verify BoundedReadCompactionStrategy as a unified general purpose compaction algorithm

    The existing compaction strategies have a number of drawbacks that make all three unsuitable as a general-purpose compaction strategy. For example, STCS creates giant files that are hard to back up, that hurt read performance and the page cache, and that led to many of the early re-open bugs. LCS improved dramatically on this but has its own issues, e.g. the lack of a performant full compaction, or problems caused by strict leveling, such as bulk loading when writes exceed the rate at which the L0 - L1 promotion can run.

    In this talk I proposed a novel compaction strategy that aims to expose a single tunable that the user can control for the read amplification. Raise the min_threshold_levels and you trade off read/space performance for write performance. Since then, a proof-of-concept patch has been published along with some rudimentary documentation, but this is still not tracked in Jira.

    The remaining work here is:

    1. Validate that the algorithm is correct via test suites, performance testing, stress testing, and benchmarking with OSS tools (e.g. cassandra-stress, tlp-stress, or ndbench). When issues are found (there likely will be, as the patch is a PoC), devise how to adjust the algorithm and implementation appropriately. The key metric of success is that we can run Cassandra stably for more than 24 hours under sustained load, with minimal compaction load (and with compaction keeping up).

    2. Do more in-depth experiments measuring performance across a wide range of workloads (e.g. write heavy, read heavy, balanced, time series, register update, etc.) and in comparison with LCS (leveled), STCS (size tiered), and TWCS (time window). The key metrics of success are establishing that, as we tune max_read_per_read, we get more predictable read latency under low system load (ρ < 30%) while not degrading at high system load (ρ > 70%), and that we match LCS performance under low load while doing better at high load.

    Once this is validated a Cassandra blog post reporting on the findings (positive or negative) may be advisable.


    Difficulty: Normal
    Project size: ~350 hour (large)
    Potential mentors:
    , mail: (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Add support for EXPLAIN statements

    We should provide users a way to understand how their query will be executed and some information on the amount of work that will be performed.
    Explain statements are the most common way to do that.
    A CEP Draft has been opened for that: (DRAFT) CEP-4: Explain. This draft proposes to add support for EXPLAIN and EXPLAIN ANALYZE, but I believe that we should split the work into two parts because a simple EXPLAIN would already provide relevant information.

    To complete this work I believe that the following steps will be required:

    • Rework and submit the CEP
    • Add missing statistics
    • Implement the logic behind the EXPLAIN statements
    Difficulty:
    Project size: ~350 hour (large)
    Potential mentors:
    , mail: (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Beam

    A Complex Event Processing (CEP) library/extension for Apache Beam

    Apache Beam [1] is a unified and portable programming model for data processing jobs. The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

    Complex Event Processing [5] lets you match patterns of events in streams to detect important patterns in data and react to them.

    Typical uses of CEP include fraud detection, for example by detecting unusual behavior (patterns of activity) such as network intrusion or suspicious banking transactions. Trend detection is another interesting use case in the context of sensors and IoT.

    The goal of this issue is to implement an efficient pattern matching library inspired by [6] and existing libraries like Apache Flink CEP [7] using the Apache Beam Java SDK and the Beam style guides [8]. Because of the time constraints of GSoC we will probably try to cover first simple patterns of the ‘a followed by b followed by c’ kind, and then if there is still time try to cover more advanced ones e.g. optional, atLeastOne, oneOrMore, etc.

    [1] https://beam.apache.org/
    [2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
    [3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
    [4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
    [5] https://en.wikipedia.org/wiki/Complex_event_processing
    [6] https://people.cs.umass.edu/~yanlei/publications/sase-sigmod08.pdf
    [7] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/cep.html
    [8] https://beam.apache.org/contribute/ptransform-style-guide/
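
    To make the simplest pattern named above concrete ('a followed by b followed by c'), here is a small sketch in plain Python. It is not Beam or Flink CEP API: the helper names are made up for illustration, events are assumed to arrive in order, unrelated events between steps are skipped, and matches do not overlap.

```python
def of_type(event_type):
    """Build a predicate that accepts events of the given type."""
    return lambda event: event["type"] == event_type


def match_sequence(events, predicates):
    """Yield lists of events that satisfy the predicates in order.

    Events that do not advance the current partial match are ignored,
    and a new match starts only after the previous one has completed.
    """
    partial = []
    for event in events:
        if predicates[len(partial)](event):
            partial.append(event)
            if len(partial) == len(predicates):
                yield partial
                partial = []


stream = [{"type": t, "seq": i} for i, t in enumerate("axbcabbc")]
pattern = [of_type("a"), of_type("b"), of_type("c")]
matches = list(match_sequence(stream, pattern))
```

    A real library would also need windowing, per-key state and out-of-order handling, which is where the Beam model's timers and state would come in.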


    Difficulty: P3
    Project size: ~350 hour (large)
    Potential mentors:
    Ismaël Mejía, mail: iemejia (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    A Beam runner for Ray

    Ray (https://ray.io) is a framework to develop distributed applications. There is a push to develop several libraries to support various forms of AI/ML analytics with Ray. There is an opportunity to develop a Beam runner for Ray.


    https://docs.google.com/document/u/1/d/1vt78s48Q0aBhaUCHrVrTUsProJSP8-EBqDDRGTPEr0Y/edit

    Difficulty: P2
    Project size: ~350 hour (large)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    Run code in examples in Beam's Pydoc

    We have the Beam Pydoc set up, and some functions have examples written into their documentation; however, we do not run the examples that we express in Pydoc.

    This work item consists in improving the Pydoc for Apache Beam to run examples, adding some examples, and reformatting any existing examples / existing Pydoc that needs to be better expressed for Beam.
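
    One possible way to run docstring examples is Python's standard doctest module. The sketch below is an assumption about the approach, not how Beam is currently set up; word_lengths is a made-up function standing in for a documented Beam API.

```python
import doctest


def word_lengths(words):
    """Return the length of each word.

    >>> word_lengths(["beam", "runner"])
    [4, 6]
    """
    return [len(word) for word in words]


def run_examples(func):
    """Run the examples in func's docstring and return the doctest results."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_examples(word_lengths)
```

    The returned value exposes the number of failed and attempted examples, so a test suite can fail the build when a documented example stops working.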

    Difficulty: P2
    Project size: ~175 hour (medium)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    CLONE - A generic Beam IO Sink for Java

    It would be desirable to develop a Beam Sink that supports all of the 'best practices' like throttling, auto-sharding, exactly-once capabilities, etc.

    A design proposal is here: https://docs.google.com/document/d/1UIWv6wnD86GYAkeqbVWCG3mx4dTZ9WstUUThPWQmcFM/edit#heading=h.smc16ifdre2

    A prototype for the API and parts of implementation is here: https://github.com/apache/beam/pull/16763

    Contact Pablo Estrada on dev@beam.apache.org if you have questions, or comment here.

    Difficulty: P2
    Project size: ~350 hour (large)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    A generic Beam IO Sink for Java

    It would be desirable to develop a Beam Sink that supports all of the 'best practices' like throttling, auto-sharding, exactly-once capabilities, etc.

    A design proposal is here: https://docs.google.com/document/d/1UIWv6wnD86GYAkeqbVWCG3mx4dTZ9WstUUThPWQmcFM/edit#heading=h.smc16ifdre2

    A prototype for the API and parts of implementation is here: https://github.com/apache/beam/pull/16763

    Contact Pablo Estrada on dev@beam.apache.org if you have questions, or comment here.

    Difficulty: P2
    Project size: ~350 hour (large)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    Runner Comparison / Capability Matrix revamp

    The goal for this project has changed: We now want to create a completely new Capability Matrix that is based on the ValidatesRunner tests that we run on the various Apache Beam runners.

    We can use the test in ./test-infra/validates-runner/ to generate a JSON file that contains the capabilities supported by various runners and tested by each individual test.
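
    As a sketch of how such a JSON file could be folded into a matrix, the example below assumes a hypothetical layout with one record per runner and test, listing the capabilities the test exercises and whether it passed. The field names are assumptions for illustration and may not match the real output of the test infrastructure.

```python
import json
from collections import defaultdict


def build_matrix(json_text):
    """Return {capability: {runner: "full" | "partial" | "none"}}."""
    outcomes = defaultdict(list)  # (capability, runner) -> list of pass/fail results
    for record in json.loads(json_text):
        for capability in record["capabilities"]:
            outcomes[(capability, record["runner"])].append(record["passed"])

    matrix = defaultdict(dict)
    for (capability, runner), results in outcomes.items():
        if all(results):
            level = "full"
        elif any(results):
            level = "partial"
        else:
            level = "none"
        matrix[capability][runner] = level
    return dict(matrix)


# Hypothetical records, for illustration only.
SAMPLE = """[
  {"runner": "direct", "test": "t1", "capabilities": ["ParDo"], "passed": true},
  {"runner": "direct", "test": "t2", "capabilities": ["Timers"], "passed": true},
  {"runner": "spark", "test": "t1", "capabilities": ["ParDo"], "passed": true},
  {"runner": "spark", "test": "t2", "capabilities": ["Timers"], "passed": false},
  {"runner": "spark", "test": "t3", "capabilities": ["Timers"], "passed": true},
  {"runner": "flink", "test": "t2", "capabilities": ["Timers"], "passed": false}
]"""

matrix = build_matrix(SAMPLE)
```

    Deriving full, partial and no support from test outcomes is one way to keep the published matrix tied to what the ValidatesRunner tests actually verify.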

    ----------------------------------------------------


    Discussion: https://lists.apache.org/thread.html/8aff7d70c254356f2dae3109fb605e0b60763602225a877d3dadf8b7@%3Cdev.beam.apache.org%3E

    Summarizing that discussion, we have a lot of issues/wishes. Some can be addressed as one-off and some need a unified reorganization of the runner comparison.

    Basic corrections:

    • Remove rows that are impossible not to support (ParDo)
    • Remove rows where "support" doesn't really make sense (Composite transforms)
    • Deduplicate rows that are actually the same model feature (all non-merging windowing / all merging windowing)
    • Clearly separate rows that represent optimizations (Combine)
    • Correct rows in the wrong place (Timers are actually a "what...?" row)
    • Separate or remove rows that have not been designed ([Meta]Data driven triggers, retractions)
    • Rename rows with names that appear nowhere else (Timestamp control, which is called a TimestampCombiner in Java)
    • Switch to a more distinct color scheme for full/partial support (currently just solid/faded colors)
    • Switch to something clearer than "~" for partial support, versus ✘ and ✓ for none and full.
    • Correct Gearpump support for merging windows (see BEAM-2759)
    • Correct Spark support for non-merging and merging windows (see BEAM-2499)

    Minor rewrites:

    • Lump all the basic stuff (ParDo, GroupByKey, Read, Window) into one row
    • Make sections as users see them, like "ParDo" / "side Inputs" not "What?" / "side inputs"
    • Add rows for non-model things, like portability framework support, metrics backends, etc

    Bigger rewrites:

    • Add versioning to the comparison, as in BEAM-166
    • Find a way to fit in a plain English summary of runner's support in Beam. It should come first, as it is what new users need before getting to details.
    • Find a way to describe production readiness of runners and/or testimonials of who is using it in production.
    • Have a place to compare non-model differences between runners

    Changes requiring engineering efforts:

    • Gather and add quantitative runner metrics, perhaps Nexmark results for mid-level, smaller benchmarks for measuring aspects of specific features, and larger end-to-end benchmarks to get an idea how it might actually perform on a use case
    • Tighter coupling of the matrix portion of the comparison with tags on ValidatesRunner tests

    If you care to address some aspect of this, please reach out and/or just file a subtask and address it.

    Difficulty: P3
    Project size: ~350 hour (large)
    Potential mentors:
    Kenneth Knowles, mail: kenn (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    Apache Nemo

    Efficient Dynamic Reconfiguration in Stream Processing

    In stream processing, there are many reconfiguration methods, ranging from primitive checkpoint-and-replay to more sophisticated reconfiguration and reinitiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Wonook, mail: wonook (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Application structure-aware caching on Nemo

    Nemo has a policy layer that allows powerful optimization with configurable runtime modules. In terms of caching, it is possible to identify frequently used data and decide to cache them in-memory ahead of execution, without user annotation.

    Implementation would need:

    • On the policy layer, build a compile-time pass that identifies reused data and marks it as cached
    • On the runtime, design and implement a caching mechanism that manages per-executor cached data and discards it when it is no longer used.
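
    The two steps above can be sketched as follows. This is a simplified Python illustration, not Nemo's policy or runtime API: the DAG is assumed to be a plain mapping from each vertex to its consumers, and all names are hypothetical.

```python
def find_reused(dag):
    """Compile-time step: return {vertex: consumer count} for reused outputs.

    `dag` maps each vertex name to the list of vertices that read its output.
    """
    return {vertex: len(consumers) for vertex, consumers in dag.items() if len(consumers) > 1}


class ExecutorCache:
    """Runtime step: keep marked data until its last expected read."""

    def __init__(self, expected_reads):
        self._remaining = dict(expected_reads)  # vertex -> reads still expected
        self._data = {}

    def put(self, vertex, value):
        # Only data marked by the compile-time step is cached.
        if self._remaining.get(vertex, 0) > 0:
            self._data[vertex] = value

    def get(self, vertex):
        value = self._data[vertex]
        self._remaining[vertex] -= 1
        if self._remaining[vertex] == 0:
            del self._data[vertex]  # discard once no consumer is left
        return value

    def __contains__(self, vertex):
        return vertex in self._data


dag = {"read": ["map"], "map": ["join", "count"], "join": [], "count": []}
cache = ExecutorCache(find_reused(dag))
```

    Counting the expected reads at compile time lets the runtime drop cached data without any user annotation.
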
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Implement spill mechanism on Nemo

    Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) or GC when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in-memory.

    We need to spill in-memory data to secondary storage when there is not enough memory in the executor.
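
    A minimal sketch of the idea, in Python rather than Nemo's Java code base: a buffer that keeps records in memory up to a limit and writes the overflow to a temporary file. The record count is used as a stand-in for real memory pressure, and the class name is hypothetical.

```python
import pickle
import tempfile


class SpillableBuffer:
    """Hold records in memory and spill them to disk past a size limit."""

    def __init__(self, max_in_memory):
        self._max = max_in_memory
        self._memory = []
        self._spill_file = None
        self.spilled_records = 0

    def append(self, record):
        self._memory.append(record)
        if len(self._memory) > self._max:
            self._spill()

    def _spill(self):
        if self._spill_file is None:
            self._spill_file = tempfile.TemporaryFile()
        self._spill_file.seek(0, 2)  # always write at the end of the file
        for record in self._memory:
            pickle.dump(record, self._spill_file)
        self.spilled_records += len(self._memory)
        self._memory.clear()

    def __iter__(self):
        # Spilled records were written first, so they are read back first.
        if self._spill_file is not None:
            self._spill_file.seek(0)
            for _ in range(self.spilled_records):
                yield pickle.load(self._spill_file)
        yield from self._memory


buffer = SpillableBuffer(max_in_memory=3)
for value in range(10):
    buffer.append(value)
```

    A production version would track bytes rather than record counts and would have to coordinate spilling with the executor's block manager.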

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Efficient Caching and Spilling on Nemo

    In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs both.

    • Identify and persist frequently used data, and unpersist it when it is no longer used
    • Spill in-memory data to disk upon memory pressure

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Enhance Nemo to support autoscaling for bursty loads

    The load of streaming jobs usually fluctuates according to the input rate or operations (e.g., window). Supporting automatic scaling could reduce the operational cost of running streaming applications, while minimizing the performance degradation that can be caused by bursty loads.


    We can harness the cloud resources such as VMs and serverless frameworks to acquire computing resources on demand. To realize the automatic scaling, the following features should be implemented.


    1) state migration: scaling jobs require moving tasks (or partitioning a task to multiple ones). In this situation, the internal state of the task should be serialized/deserialized. 

    2) input/output rerouting: if a task is moved to a new worker, the input and output of the task should be redirected. 

    3) dynamic Executor or Task creation/deletion: Executors or Tasks can be dynamically created or deleted.

    4) scaling policy: a scaling policy that decides when and how to scale out/in should be implemented. 
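
    As an illustration of item 4, the sketch below shows one simple threshold-based scaling policy in Python. It is only an assumption about what such a policy might look like: the utilization thresholds, the per-executor capacity and the function name are all hypothetical.

```python
import math


def target_executors(current, input_rate, capacity_per_executor,
                     high=0.8, low=0.3, min_executors=1, max_executors=64):
    """Return the number of executors the job should run with.

    Utilization is the input rate divided by the total capacity. Inside
    the [low, high] band nothing changes, which avoids oscillation;
    outside it, the job is resized so utilization lands mid-band.
    """
    utilization = input_rate / (current * capacity_per_executor)
    if low <= utilization <= high:
        return current
    desired = math.ceil(input_rate / (capacity_per_executor * (high + low) / 2))
    return max(min_executors, min(max_executors, desired))


steady = target_executors(current=4, input_rate=1000, capacity_per_executor=500)
burst = target_executors(current=4, input_rate=4000, capacity_per_executor=500)
idle = target_executors(current=4, input_rate=100, capacity_per_executor=500)
```

    Whatever the policy decides, the state migration and rerouting features listed above are what make the resulting scale-out or scale-in possible.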

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Tae-Geon Um, mail: taegeonum (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Apache Fineract

    Make Fineract.dev (~Mifos X) demo server multi tenant aware, more Cloud Native, and Performance Tested

    Mifos X was built to be cloud ready from the ground up. One of the most popular deployment environments for MifosX has been on Amazon EC2, however due to country specific regulation, many implementors are forced to seek alternative models that can scale as effectively. The aim of this project is two-fold:

    • Propose a scalable deployment model for Mifos on Google Cloud. Your application should highlight a starting point with some details of your planned deployment architecture, as mentors will not give you step-by-step instructions in this project, just "nudge" you along; you are expected to learn how to deploy Mifos yourself, autonomously using the available documentation and help from the public mailing list and IRC channel, and to figure out the details of the Cloud deployment.
    • Propose how the above proposed model could be contributed to Mifos in the form of e.g. ready-to-run "configurations" etc. allowing ANYONE to deploy THE LATEST VERSION of Mifos in the Cloud themselves, and then implement this approach in practice. (Contrast this with a "one-off exercise", e.g. taking the current Mifos X WAR file, and UI, and manually making some changes to it, and then manually deploying that to some Cloud PaaS - this would not be sufficient for this project's expectations.)
    • Implement a Continuous Deployment "Devops" EXAMPLE instance of this scalable blueprint using the latest nightly Mifos build artifacts.
    • Publish a high-level whitepaper of the same, which can be used as a reference for local implementors, who would additionally take care of provisioning their own hardware. This documentation should ideally be high-level, and what it describes must be automated; only providing lengthy step-by-step manual instructions would not be sufficient for this project's expectations.

    To prepare for this project, applying contributors must demonstrate at least that they have already successfully built and run a Mifos X REST back-end server and UI locally, populated the database etc., and provided a simple pull request proposing some minimal deployment-related improvement.

    Note that we now believe that a Platform as a Service (PaaS) is a more suitable foundation for this project than a raw Cloud Infrastructure as a Service (IaaS) platform (such as OpenStack, offered by public cloud providers such as Rackspace; or Azure, or raw Amazon EC2). This is because a PaaS, such as OpenShift, already comes with relevant features such as built-in, managed, supported and monitored HTTP load balancing (e.g. OpenShift comes with HAProxy).

    The MariaDB (MySQL) database used by Apache Fineract/Mifos does not offer clustering. We believe that this would not be required, and that proper configuration of the already existing cache facility (incl. distributed cache invalidation) available in Mifos X will add more value at significantly less operational complexity.

    You may need to develop some minor "adjustments" for Mifos X to work well in a PaaS. For example, writeable directories may be limited, and configuration changes may be needed to pick up allowed data directories from an environment variable configuration (but consider multi node distribution in this cluster setup!). Also a cloud PaaS like OpenShift may not support "always running" instances, and scheduled jobs may have to be configured to be kicked off via an explicit HTTP "wake up" request from a cron job.

    Difficulty: Minor
    Project size: ~175 hour (medium)
    Potential mentors:
    Rahul Goel, mail: rahul.usit12 (at) apache.org
    Project Devs, mail: dev (at) fineract.apache.org

    Fix Critical Vulnerabilities from Static Analysis and Vulnerability Scanning of Apache Fineract 1.x

    As our product is a core banking platform and our clients are financial institutions, we strive hard to make our code base as secure as possible. However, due to ever-increasing security threats and vulnerabilities, it is the need of the hour that we analyze our code base in depth for security vulnerabilities. During the pull request merge process, we have a process in place wherein we do peer code review, QA and integration tests. This practice has been very effective and our community is already reaping the benefits of such a strong code review process. However, we should test our code against the standard vulnerabilities which have been identified by reputed organisations like Mitre to gain more confidence. It has become a critical part of independent and partner-led deployments.


    We can make use of open-source tools like Jlint, FindBugs, SonarQube or frameworks like the Tool Output Integration Framework (TOIF), used by companies dedicated to producing military-grade secure systems. As our environments become more containerized, we can also utilize tools like Anchore, Snyk.io, and Docker Bench for Security.

    It would be worthwhile if we could dedicate one GSoC project to this analysis and the fixing of critical vulnerabilities and actual bugs. The student would be responsible for analysing the findings, generating reports, identifying whether each finding is really a bug, and then submitting a fix after consultation with the community. Of course, the student needs to demonstrate some basic understanding of security vulnerabilities (like buffer overflow etc.) and should have some academic-level experience working with static analysis tools.

    Prioritization of Focus would be on:

    • Vulnerabilities, Hotspots, Bugs, and Code Smells in that order.
    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Rahul Goel, mail: rahul.usit12 (at) apache.org
    Project Devs, mail: dev (at) fineract.apache.org

    A Complex Event Processing (CEP) library/extension for Apache Beam

    Apache Beam [1] is a unified and portable programming model for data processing jobs. The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

    Complex Event Processing [5] lets you match patterns of events in streams to detect important patterns in data and react to them.

    Some examples of uses of CEP are fraud detection for example by detecting unusual behavior (patterns of activity), e.g. network intrusion, suspicious banking transactions, etc. Also trend detection is another interesting use case in the context of sensors and IoT.

    The goal of this issue is to implement an efficient pattern matching library inspired by [6] and existing libraries like Apache Flink CEP [7] using the Apache Beam Java SDK and the Beam style guides [8]. Because of the time constraints of GSoC we will probably try to cover first simple patterns of the ‘a followed by b followed by c’ kind, and then if there is still time try to cover more advanced ones e.g. optional, atLeastOne, oneOrMore, etc.
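    To make the 'a followed by b followed by c' kind of pattern concrete, here is a minimal plain-Python sketch. It is not Beam or Flink CEP API: the function `match_sequence`, the dictionary-shaped events, and the "skip non-matching events" semantics are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Event = Dict[str, object]
Predicate = Callable[[Event], bool]


def match_sequence(events: Iterable[Event],
                   pattern: Sequence[Predicate]) -> List[List[Event]]:
    """Find 'p0 followed by p1 followed by ...' matches in an event stream.

    Assumed semantics: events that do not match the next stage of a
    partial match are skipped rather than aborting the match.
    """
    partial: List[List[Event]] = []   # matches still waiting for later stages
    complete: List[List[Event]] = []  # fully matched sequences
    for event in events:
        still_open: List[List[Event]] = []
        for run in partial:
            # Advance the run if the event satisfies its next stage.
            if pattern[len(run)](event):
                run = run + [event]
            if len(run) == len(pattern):
                complete.append(run)
            else:
                still_open.append(run)
        # Every event matching the first stage starts a new partial match.
        if pattern[0](event):
            if len(pattern) == 1:
                complete.append([event])
            else:
                still_open.append([event])
        partial = still_open
    return complete


def of_type(name: str) -> Predicate:
    """Build a predicate that matches events by their 'type' field."""
    return lambda event: event["type"] == name
```

    A real library would additionally need windowing, per-key state, and timeouts for partial matches, which this sketch leaves out.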

    [1] https://beam.apache.org/
    [2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
    [3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
    [4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
    [5] https://en.wikipedia.org/wiki/Complex_event_processing
    [6] https://people.cs.umass.edu/~yanlei/publications/sase-sigmod08.pdf
    [7] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/cep.html
    [8] https://beam.apache.org/contribute/ptransform-style-guide/

    Difficulty: P3
    Project size: ~350 hour (large)
    Potential mentors:
    Ismaël Mejía, mail: iemejia (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    A Beam runner for Ray

    Ray (https://ray.io) is a framework to develop distributed applications. There is a push to develop several libraries to support various forms of AI/ML analytics with Ray. There is an opportunity to develop a Beam runner for Ray.

    https://docs.google.com/document/u/1/d/1vt78s48Q0aBhaUCHrVrTUsProJSP8-EBqDDRGTPEr0Y/edit

    Static Analysis and Vulnerability Scanning of Apache Fineract CN

    As our product is a core banking platform and our clients are financial institutions, we strive to make our code base as secure as possible. However, due to ever-increasing security threats and vulnerabilities, it is the need of the hour that we analyze our code base in depth for security vulnerabilities. During the pull request merge process, we have peer code review, QA, and integration tests in place. This practice has been very effective, and our community is already reaping the benefits of such a strong code review process. However, we should also test our code against the standard vulnerabilities identified by reputed organisations like Mitre to gain more confidence. This has become a critical part of independent and partner-led deployments.


    We can make use of open source tools like Jlint, Findbugs, and SonarQube, or frameworks like the Tool Output Integration Framework (TOIF), which is used by companies dedicated to producing military-grade secure systems. As our environments become more containerized, we can also utilize tools like Anchore, Snyk.io, and Docker Bench for Security.

    It would be worthwhile to dedicate one GSoC project to this analysis. The student would be responsible for analysing the findings, generating reports, identifying whether each finding is really a bug, and then submitting a fix after consultation with the community. Of course, the student needs to demonstrate a basic understanding of security vulnerabilities (such as buffer overflows) and should have some academic-level experience working with static analysis tools.

    Difficulty: Minor
    Project size: ~175 hour (medium)
    Potential mentors:
    Rahul Goel, mail: rahul.usit12 (at) apache.org
    Project Devs, mail: dev (at) fineract.apache.org

    Optimize Containerization & Deployment of Apache Fineract CN

    The increasing need for fast and reliable access to financial services has prompted the expansion of Apache Fineract from a single complex financial platform to a platform constituted of multiple micro-services that interact and scale to meet up with this increased need - Apache Fineract CN. Apache Fineract CN is a digital financial application platform built to render financial services to consumers in a fast, reliable and scalable manner. Deploying this platform such that consumers get the latest features with no reduction impact requires an optimized release cycle in a CI/CD (continuous integration and continuous Deployment) environment.


    In view of that, last year Courage began this work by implementing the needed scripts to containerize and deploy the Fineract CN services using Docker, Docker Compose, and Kubernetes. For the Google Summer of Code 2020, you are required to complete this work by performing the following tasks:

    • Improve Docker-compose deployment configuration to deploy on a swarm node
    • Implement new Fineract service to generate RSA keys and complete the provisioning process.
    • Improve provisioner and migration script to work with both a swarm cluster and a single machine running multiple compose services.
    • Build and publish the Fineract images on Docker hub.
    • Link Docker Hub to Github service repositories via an Automation Server pipeline.
    • Publish the built Fineract CN services libraries to a Maven Artifactory so developers will not have to manually clone and publish these services by themselves.

    N.B: 

    • I would like to hear the applicants' own ideas.
    • The task for the completion of this project may change depending on input from the community, the mentors and the applicant.


    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Rahul Goel, mail: rahul.usit12 (at) apache.org
    Project Devs, mail: dev (at) fineract.apache.org

    Digital Bank UI

    A new reference user interface on Fineract CN for staff of financial institutions such as digital, challenger, and neo-banks that focus on individual accounts is needed for multiple reasons:

    1. The current fims-web-app reference UI on top of Fineract CN is incomplete, unpolished and doesn't serve as a good representation of capabilities of Fineract CN.
    2. As more financial inclusion providers focus on individual lending and savings products and more digital banks/neo-banks and fintechs that don't have group or center-based operations explore Mifos and Fineract CN, we'd need to have a reference UI that is more in line with those requirements. We don't want prospective users to come and see the microfinance-centric UI and immediately think that the platform might not be useful for them.

    Intern will work on the following tasks:

    • Upgrade dependencies to latest versions
    • Improve overall user experience and look and feel
    • Implement the front-end UI screens for the Fineract CN web UI for the following functionalities and use cases:
      • Account Details
      • Notifications
      • Transaction Details
      • Account Opening
      • Accounting
      • Reporting
    • More Use cases to be listed.
    Difficulty: Minor
    Project size: ~350 hour (large)
    Potential mentors:
    Rahul Goel, mail: rahul.usit12

    Run code in examples in Beam's Pydoc

    We have the Beam Pydoc set up, and some functions have examples written into their documentation. However, we do not run the examples that we express in Pydoc.

    This work item consists in improving the Pydoc for Apache Beam to run examples, adding some examples, and reformatting any existing examples / existing Pydoc that needs to be better expressed for Beam.
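    As an illustration of what "running the examples" can mean, the sketch below uses Python's standard `doctest` module to execute the example embedded in a docstring. The function `word_lengths` and the helper `run_examples` are hypothetical and are not part of Beam; how Beam would wire this into its Pydoc build is left open.

```python
import doctest


def word_lengths(words):
    """Map each word to its length.

    >>> word_lengths(["beam", "runner"])
    [4, 6]
    """
    return [len(word) for word in words]


def run_examples(func):
    """Run the docstring examples of one function.

    Returns a (failed, attempted) pair so a build step can fail when an
    example in the documentation no longer matches the code.
    """
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False skips module lookup; globs provides the names the
    # examples need in order to run.
    tests = finder.find(func, name=func.__name__, module=False,
                        globs={func.__name__: func})
    for test in tests:
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted
```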

    Difficulty: P2
    Project size: ~175 hour (medium)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    Expand Unit Testing Coverage of Fineract with Cucumber Testing Framework

    The goal of this project is to expand unit testing coverage across the Fineract platform. Currently most of our automated testing is only through integration tests, which take a long time to run and aren’t consistent. Cucumber is being implemented as the unit test framework, and this project would focus on converting existing integration tests to unit tests and writing new unit tests.

    Goals are to increase testing coverage of core modules, reduce the time that tests take to complete at build, and implement some automated reporting to show testing coverage.

    The student will be working on implementing the following things:

    1. Collaborate with mentor to implement Cucumber framework
    2. Collaborate with mentor to implement test containers
    3. Refine test data set and scripts
    4. Convert high priority integration tests to unit tests
    5. Write unit tests for key functional modules
    6. Implement reporting to show test coverage.
    Difficulty: Minor

    A generic Beam IO Sink for Java

    It would be desirable to develop a Beam Sink that supports all of the 'best practices' like throttling, auto-sharding, exactly-once capabilities, etc.

    A design proposal is here: https://docs.google.com/document/d/1UIWv6wnD86GYAkeqbVWCG3mx4dTZ9WstUUThPWQmcFM/edit#heading=h.smc16ifdre2

    A prototype for the API and parts of implementation is here: https://github.com/apache/beam/pull/16763

    Contact Pablo Estrada on dev@beam.apache.org if you have questions, or comment here.

    Difficulty: P2
    Project size: ~350 hour (large)
    Potential mentors:
    Pablo Estrada, mail: pabloem (at) apache.org
    Project Devs, mail: dev (at) beam.apache.org

    Apache Nemo

    Fineract-CN-Mobile-Version-4.0

    Just as we have a mobile field operations app on Apache Fineract 1.x, we have recently built, on top of the brand new Apache Fineract CN micro-services architecture, an initial version of a mobile field operations app with an MVP architecture and material design. Given the flexibility of the new architecture and its ability to support different methodologies (MFIs, credit unions, cooperatives, savings groups, agent banking, etc.), this mobile app will have different flavors, workflows, and functionalities.


    In 2021, our Google Summer of Code intern, Varun Jain worked on additional functionality in the Fineract CN mobile app. In 2022, the student will work on the following tasks:

    • Continuing migration to Kotlin
    • Incorporating more external transaction flows through Payment Hub EE
    • Improving notifications generation in the app.
    • Enhancing the customer onboarding and loan origination user experience
    • Refining the UI and look and feel of app.
    • More robust survey and data capture features.
    • Integration with external ID systems for biometric verification
    • Integration with voucher generation.
    • Improve GIS features like location tracking, dropping of pin into the app
    • Improve offline mode via Couchbase support
    • Write Unit Test, Integration Test and UI tests




    Difficulty: Minor
    Project size: ~175 hour (medium)
    Potential mentors:
    Rahul Goel, mail: rahul.usit12

    Efficient Dynamic Reconfiguration in Stream Processing

    In stream processing, there are many methods for reconfiguration, ranging from primitive checkpoint-and-replay to more sophisticated reconfiguration and reinitiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Wonook, mail: wonook (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Application structure-aware caching on Nemo

    Nemo has a policy layer that allows powerful optimization with configurable runtime modules. In terms of caching, it is possible to identify frequently used data and decide to cache them in-memory ahead of execution, without user annotation.

    Implementation would need:

    • On the policy layer, build a compile-time pass that identifies reused data and marks it as cached
    • On the runtime, design and implement a caching mechanism that manages per-executor cached data and discards it when it is no longer used.
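
    The two pieces above can be sketched in plain Python under simplifying assumptions. Nothing here is Nemo API: the DAG is assumed to be a dictionary from a vertex name to the names of its consumers, "reused" is assumed to mean "read by more than one consumer", and `mark_cached` and `ExecutorCache` are hypothetical names.

```python
from typing import Any, Dict, List, Set


def mark_cached(dag: Dict[str, List[str]]) -> Set[str]:
    """Compile-time pass: mark vertices whose output has several consumers."""
    return {vertex for vertex, consumers in dag.items()
            if len(set(consumers)) > 1}


class ExecutorCache:
    """Runtime side: keep a value in memory until its last expected read."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._remaining_reads: Dict[str, int] = {}

    def put(self, key: str, value: Any, expected_reads: int) -> None:
        self._values[key] = value
        self._remaining_reads[key] = expected_reads

    def get(self, key: str) -> Any:
        value = self._values[key]
        self._remaining_reads[key] -= 1
        # Discard the entry once every expected consumer has read it.
        if self._remaining_reads[key] <= 0:
            del self._values[key]
            del self._remaining_reads[key]
        return value

    def __contains__(self, key: str) -> bool:
        return key in self._values
```

    Counting expected reads is only one possible discard policy; it assumes the number of consumers is known at compile time.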
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Functional Enhancements - Mobile Banking App for Fineract CN

    Just as we have a client-facing mobile banking app for our generation 2 Apache Fineract 1.0 platform, we need to provide a reference mobile banking app on top of the Apache Fineract CN architecture which allows a client to securely authenticate against the microservices architecture and interact with his/her accounts. 


    A major focal area for the 2022 GSoC is to integrate with the Open Banking API layer built on top of the WSO2 API Gateway, which provides a secure authentication and integration layer for first-party applications, as the app currently only consumes a mock layer of data. Additional use cases would include better support for transactions via external payment systems, improving the workflow for sign-up and account creation, and implementing new UI designs.

    • Integrate with Fineract CN via Open Banking API layer on WSO2 API Gateway
    • Map APIs to Open Banking APIs
    • Improve workflow for self-guided sign-up, account creation, and initial authentication. 
    • Integration with external payment systems via Mojaloop and GSMA mobile money API. 
    Difficulty: Minor

    Implement spill mechanism on Nemo

    Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) errors or GC pressure when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in-memory.

    We need to spill in-memory data to secondary storage when there is not enough memory in an executor.
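
    The idea can be illustrated with a small plain-Python sketch. It is not Nemo code: `SpillableBuffer` is a hypothetical name, and memory pressure is approximated by a record-count limit rather than by measuring actual memory use.

```python
import os
import pickle
import tempfile
from typing import Any, Iterator, List


class SpillableBuffer:
    """Append-only buffer that moves records to a temp file when it grows."""

    def __init__(self, max_in_memory: int) -> None:
        self.max_in_memory = max_in_memory  # assumed stand-in for a memory budget
        self.spilled_count = 0
        self._memory: List[Any] = []
        self._spill_file = None

    def append(self, record: Any) -> None:
        self._memory.append(record)
        if len(self._memory) > self.max_in_memory:
            self._spill()

    def _spill(self) -> None:
        if self._spill_file is None:
            self._spill_file = tempfile.TemporaryFile()
        # Always write at the end, even if a reader moved the file position.
        self._spill_file.seek(0, os.SEEK_END)
        for record in self._memory:
            pickle.dump(record, self._spill_file)
        self.spilled_count += len(self._memory)
        self._memory.clear()

    def __iter__(self) -> Iterator[Any]:
        # Spilled records come first because they were appended first.
        if self._spill_file is not None:
            self._spill_file.seek(0)
            for _ in range(self.spilled_count):
                yield pickle.load(self._spill_file)
        yield from list(self._memory)
```

    Records are read back in insertion order, so a consumer does not need to know which part of the data was spilled.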

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Efficient Caching and Spilling on Nemo

    In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs both.

    • Identify and persist frequently used data, and unpersist it when it is no longer used
    • Spill in-memory data to disk upon memory pressure
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jeongyoon Eo, mail: jeongyoon (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org
    mentors:
    Rahul Goel, mail: rahul.usit12 (at) apache.org
    Project Devs, mail: dev (at) fineract.apache.org

    Reduce Boilerplate Code by Introducing lombok to Reduce getters/setters and Mapstruct to map REST DTO to Entity Objects

    Lombok could help us to not only reduce a large amount of code, but also to fix a couple of inconsistencies in the code base:

    • getters/setters with non-standard characters (e. g. underscores)
    • getters/setters with typos

    The layered architecture of Fineract requires mapping between REST DTO classes and internal entity classes. The current code base contains various strategies to achieve this:

    • private functions
    • static functions
    • mapping classes

    All of these approaches are very manual (and error prone) and difficult to maintain. Mapstruct can help here:

    • throw errors at compile time (missing new attributes, type changes etc.)
    • one common concept (easier to understand)
    • reduce manually maintained code and replace it mostly with generated code

    Challenges:

    • maintain immutability (especially in DTO classes)
    • should we use the fluent builder pattern?
    • backwards compatibility
    • these improvements cannot be introduced as one pull request, but have to be split up at least at the “module” level (clients, loans, accounts etc.). This would result in approximately 30 pull requests; if we split up Lombok and Mapstruct then it would be 30 PRs each (=60); we would need this fine grained approach to make a transition as painless as possible
    • some classes are maybe beyond repair (e. g. Loan.java with 6000 lines of code, the smaller part getters/setters and a long list of utility/business logic functions)
    Difficulty: Minor

    Enhance Nemo to support autoscaling for bursty loads

    The load of streaming jobs usually fluctuates according to the input rate or operations (e.g., window). Supporting automatic scaling could reduce the operational cost of running streaming applications, while minimizing the performance degradation that can be caused by bursty loads.

    We can harness the cloud resources such as VMs and serverless frameworks to acquire computing resources on demand. To realize the automatic scaling, the following features should be implemented.

    1) state migration: scaling jobs require moving tasks (or partitioning a task to multiple ones). In this situation, the internal state of the task should be serialized/deserialized. 

    2) input/output rerouting: if a task is moved to a new worker, the input and output of the task should be redirected. 

    3) dynamic Executor or Task creation/deletion: Executors or Tasks can be dynamically created or deleted.

    4) scaling policy: a scaling policy that decides when and how to scale out/in should be implemented. 
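
    As one possible shape for the scaling policy in item 4, the sketch below sizes the executor pool from the observed input rate. It is an assumption-laden illustration, not Nemo code: `scaling_decision`, the per-executor capacity, and the target utilization are all hypothetical parameters.

```python
import math
from typing import Tuple


def scaling_decision(input_rate: float,
                     per_executor_capacity: float,
                     current: int,
                     target_utilization: float = 0.8,
                     min_executors: int = 1,
                     max_executors: int = 64) -> Tuple[str, int]:
    """Decide whether to scale out, scale in, or hold.

    The target pool size keeps each executor at or below
    target_utilization of its assumed capacity (events per second).
    """
    usable_capacity = per_executor_capacity * target_utilization
    needed = math.ceil(input_rate / usable_capacity)
    target = max(min_executors, min(max_executors, needed))
    if target > current:
        return "out", target
    if target < current:
        return "in", target
    return "hold", target
```

    A production policy would likely add hysteresis or a cool-down period so that a briefly fluctuating input rate does not trigger repeated state migration.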

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Tae-Geon Um, mail: taegeonum (at) apache.org
    Project Devs, mail: dev (at) nemo.apache.org

    Apache Dubbo

    GSoC2022 Rust language implementation

    Dubbo provides implementations in almost all mainstream languages, from Java, Golang, JavaScript, and C# to Python, etc.

    In this project, we want to build a basic Rust implementation for Dubbo.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jun Liu, mail: chicken (at) apache.org
    Project Devs, mail:

    GSoC2022 Metrics and Observability for Dubbo-go

    Description

    Please read the Observability proposal here first to know about the ultimate goal behind this issue.

    If you are interested in this project and the objective described in the proposal, please leave comments on the corresponding Github issue below so we can further exchange information on the tasks that need to be done.

    https://github.com/apache/dubbo-go/issues/1807

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhixin Li, mail: laurence (at) apache.org
    Project Devs, mail:

    ...

    GSoC2022 Task demo demonstrating the usage of Dubbo3

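    • Goal
      First, gain a high-level, usage-oriented understanding of Dubbo and the concepts of microservice governance. On that basis, design a series of demo applications and, based on them, a series of microservice governance Tasks. Each Task covers one or more of Dubbo's service governance capabilities and uses a detailed use case to guide users through the Task step by step, helping them learn what Dubbo can do.

    For details, please join the discussion at https://github.com/apache/dubbo/issues/9887.

    • Task description
      Dubbo has a rich set of governance rules, such as service discovery, load balancing, and routing policies (tag routing, condition routing). However, these rules are somewhat difficult to use, and it is hard for users to see intuitively which scenarios they apply to. Dubbo therefore hopes to have scenario-based use cases that showcase its governance capabilities and help users apply the governance rules in real business scenarios.

    This is a relatively challenging task. The difficulty lies not in the coding itself but in having a good overall grasp of Dubbo and the microservice ecosystem. Completing it successfully will greatly broaden the participant's overall perspective. Participants can work together with their mentors to complete it.

    • Reference:
      The bookinfo application in Istio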
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jun Liu, mail: chicken (at) apache.org
    Project Devs, mail:

    GSoC2022

    Sidecar

    Proxyless Mesh support

    Please read the detailed proposal of Dubbo Sidecar Proxyless Mesh or Thin SDK here first to know about the ultimate goal behind this issue.

    The details of this project will be posted on the following GitHub issue, please keep posted there.

    https://github.com/apache/dubbo/issues/98859884

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jun Liu, mail: chicken (at) apache.org
    Project Devs, mail:

    GSoC2022

    Proxyless

    Sidecar Mesh support for Dubbo-go

    Please read read the detailed proposal of Dubbo Proxyless Mesh here Sidecar Mesh or Thin SDK here first to know about the ultimate goal behind this issue.

    The details of this project will be posted on the following GitHub issue, please keep posted there.

    https://github.com/apache/dubbo-go/issues/98841809

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Jun LiuZhixin Li, mail: chicken laurence (at) apache.org
    Project Devs, mail:

    GSoC2022

    Metrics and Observability

    Proxyless Mesh support for Dubbo-go

    Description

    Please read the Observasibility proposal here Please read the detailed proposal of Dubbo Proxyless Mesh here first to know about the ultimate goal behind this issue.

    If you are interested in this project and the objective described in the proposal, please leave comments on the corresponding Github issue below so we can further exchange information on the tasks that need to be doneThe details of this project will be posted on the following GitHub issue, please keep posted there.

    https://github.com/apache/dubbo-go/issues/18071808

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhixin Li, mail: laurence (at) apache.org
    Project Devs, mail:

    GSoC2022 Sidecar Mesh support for Dubbo-go

    Please read the detailed proposal of Dubbo Sidecar Mesh or Thin SDK here first to know about the ultimate goal behind this issue.

    The details of this project will be posted on the following GitHub issue, please keep posted there.

    https://github.com/apache/dubbo-go/issues/18099885

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Zhixin LiJun Liu, mail: laurence chicken (at) apache.org
    Project Devs, mail:

    GSoC2022 Proxyless Mesh support for Dubbo-go

    Please read the detailed proposal of Dubbo Proxyless Mesh here first to know about the ultimate goal behind this issue.

    GSoC 2022 Rust language service governance implementation for Dubbo3

    The details of this project will be posted on the following GitHub issue, please keep posted there.

    https://github.com/apache/dubbo-rust/issues/2

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    GSoC 2022 Rust language protocol implementation for Dubbo3

    The details of this project will be posted on the following GitHub issue, please keep posted there.

    https://github.com/apache/dubbo-rust/issues/1

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Albumen Kevin, mail: albumenj (at) apache.org
    Project Devs, mail:

    ...

    Interactive Hyracks Job Viewer

    We will utilize the ngx-graph library, similar to the interactive query plan viewer (ASTERIXDB-2863), in order to display an interactive query plan that supports DAGs.

    Features:

    • Colored nodes (by operator)
    • Zoom out to fit whole plan
    • Zoom and drag through the plan
    • Traverse the nodes or jump to nodes in a Depth First Search (DFS) fashion
    • Detail number of locations for execution 
    • Detailed mode (contains more information per node)
      • Search using string match
    • Clear all selections and reset the interactive plan
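
    The DFS traversal feature above amounts to computing a visiting order over the plan DAG. The following plain-Python sketch shows that order under stated assumptions: the plan is assumed to be a dictionary from an operator name to its input operators, and `dfs_order` is a hypothetical helper, not part of ngx-graph or AsterixDB.

```python
from typing import Dict, List


def dfs_order(plan: Dict[str, List[str]], root: str) -> List[str]:
    """Return operators in depth-first order, visiting shared nodes once."""
    seen = set()
    order: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # a DAG node can be reached along several paths
        seen.add(node)
        order.append(node)
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(plan.get(node, [])))
    return order
```

    A "next node" button in the viewer could then simply step through this list.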
       
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Preston Carman, mail: prestonc (at) apache.org
    Project Devs, mail:

    Airavata

    Enhance File Transports in MFT

    Complete all transports in MFT

    • Currently, SCP and S3 are known to work
    • Others need effort to optimize, test, and declare readiness
    • Develop a complete, fully functional MFT command-line interface
    • Have a feature-complete Python SDK
    • A minimum implementation will be provided; students need to complete it and test it.
       
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Marru, mail: smarru (at) apache.org
    Project Devs, mail:

    ...

    dev (at) airavata.apache.org

    Provide meta scheduling capabilities within Airavata

    As discussed on the architecture mailing list [1] and summarized at [2], Airavata will need to develop a metascheduler. In the short term, a user request (demeler, gobert) is to have Airavata throttle jobs to resources. In the future, more informed scheduling strategies need to be integrated. Hopefully, the actual scheduling algorithms can be borrowed from third-party implementations.

    [1] - http://markmail.org/message/tdae5y3togyq4duv
    [2] - https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
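
    One simple reading of "throttle jobs to resources" is a cap on the number of concurrently active jobs per compute resource, with excess jobs held in a queue. The sketch below illustrates that reading only; `JobThrottle` and its methods are hypothetical and do not reflect Airavata's API or the design discussed in [1] and [2].

```python
from collections import defaultdict, deque
from typing import Dict, Optional


class JobThrottle:
    """Limit concurrently active jobs per resource; queue the rest (FIFO)."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self.limits = limits              # assumed per-resource job caps
        self.active = defaultdict(int)    # resource -> running job count
        self.waiting = defaultdict(deque) # resource -> queued job ids

    def submit(self, job_id: str, resource: str) -> bool:
        """Return True if the job may start now, False if it was queued."""
        if self.active[resource] < self.limits.get(resource, 1):
            self.active[resource] += 1
            return True
        self.waiting[resource].append(job_id)
        return False

    def complete(self, resource: str) -> Optional[str]:
        """Record a finished job; return the queued job that takes its slot."""
        if self.waiting[resource]:
            # The freed slot is handed straight to the next queued job.
            return self.waiting[resource].popleft()
        self.active[resource] = max(0, self.active[resource] - 1)
        return None
```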

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Marru, mail: smarru (at) apache.org
    Project Devs, mail: dev (at) airavata.apache.org

    Airavata Rich Client based on ElectronJS

    Using SEAGrid Rich Client as an example, develop a native application based on electronJS to mimic Airavata Django Portal.

    Reference example - https://github.com/SciGaP/seagrid-rich-client 

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Marru, mail: smarru (at) apache.org
    Project Devs, mail: dev (at) airavata.apache.org

    Migrate Datalake from Neo4J to JanusGraph

    Airavata Data Lake is currently implemented in Neo4J. To increase the scale and broaden the use cases, we need to migrate to JanusGraph.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Marru, mail: smarru (at) apache.org
    Project Devs, mail: dev (at) airavata.apache.org

    Gateway adminportal monitoring module

    The proposed monitoring module is for individual gateway admins to generate the reports they need for various reporting and planning purposes. This documentation will explain the monitoring requirements of SciGaP gateway admins.

    Another main aspect of the monitoring module would be to have an audit trail. The audit should generate a report that states who has changed what at the gateway Settings level. The audit is required for all aspects of Admin Settings and should display who has created, updated, or deleted records within the gateway.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Eroma, mail: eroma_a (at) apache.org
    Project Devs, mail: dev (at) airavata.apache.org

    Airavata Jupyter Platform Services

    1. UI Framework
      1. To host the Jupyter environment, we will need to envelop the notebooks in a user interface and connect it with Apache Airavata services
      2. Leverage Airavata communications from within the Django Portal - https://github.com/apache/airavata-django-portal
      3. Explore whether the platform is better developed as VSCode extensions, leveraging Jupyter extensions like https://github.com/Microsoft/vscode-jupyter
      4. Alternatively, explore developing a standalone native application using ElectronJS
    2. Draft up a platform architecture - Airavata based infrastructure with functionality similar to collab.
    3. Authenticate with Airavata Custos Framework - https://github.com/apache/airavata-custos
    4. Extend the Notebook filesystem using the virtual file system approach, integrating with Airavata based storage and catalog
    5. Register the notebooks with the Airavata app catalog and experiment catalog so that notebooks are recognized as first-class


    Advanced Possibilities:

    Explore Multi-tenanted JupyterHub 

    • Can K8s namespace isolation accomplish this?
    • Make deployment of Jupyter support as part of the default core
    • Data and the user-level tenancy can be assumed, how to make sure infrastructure can isolate them, like not one gateway crashing a hosting environment.
    • How to leverage computational resources from JupyterHub
    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Marru, mail: smarru (at) apache.org
    Project Devs, mail: dev (at) airavata.apache.org

    SMILES data Models

    Extend Airavata Data Catalog to record metadata extracted from experimental and computational data in support of the small-molecule ionic isolation lattices SMILES data.

    Suggested flow:

    VueJS user interfaces -> Django App -> API Server -> Data Orchestrator -> Data Lake

    Refer to https://github.com/apache/airavata-data-lake

    The data models should be developed in JSON-LD https://json-ld.org/

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Marru, mail: smarru (at) apache.org
    Project Devs, mail: dev (at) airavata.apache.org

    Custos Backup and Restore

    Custos does not have the capability to efficiently back up and restore a live instance. This is essential for highly available services.

    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Marru, mail: smarru (at) apache.org
    Project Devs, mail: dev (at) airavata.apache.org

    Dashboards to get quick statistics

    Gateway admins need periodic reports for various reporting and planning purposes.

    Features Include:

    • Compute resources that had at least one job submitted during the period <start date - end date>
    • User groups created within a given period, how many users are in each group and with which permission levels, and the number of jobs each user has submitted
    • Applications and the number of jobs for each application in a given period, grouped by job status
    • Number of users that submitted at least one job during the period <start date - end date>
    • Total number of unique users
    • User registration trends
    • Number of experiments for a given period <start date - end date>, grouped by experiment status
    • The total CPU-hours used by each user, sorted, quarterly, plotted over a period of time
    • The total CPU-hours consumed by each application, sorted, quarterly, plotted over a period of time
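
    Several of these statistics reduce to filtering job records by period and grouping them. The sketch below shows two of them in plain Python; the record fields (`user`, `application`, `status`, `submitted`) and both function names are assumptions made for illustration and do not reflect Airavata's actual data model.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Set, Tuple

Job = Dict[str, object]


def jobs_by_application_and_status(jobs: Iterable[Job],
                                   start: date,
                                   end: date) -> Dict[Tuple[str, str], int]:
    """Count jobs per (application, status) submitted within [start, end]."""
    counts: Counter = Counter()
    for job in jobs:
        if start <= job["submitted"] <= end:
            counts[(job["application"], job["status"])] += 1
    return dict(counts)


def active_users(jobs: Iterable[Job], start: date, end: date) -> Set[str]:
    """Users that submitted at least one job within [start, end]."""
    return {job["user"] for job in jobs if start <= job["submitted"] <= end}
```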


    Difficulty: Major
    Project size: ~350 hour (large)
    Potential mentors:
    Suresh Marru, mail: smarru (at) apache.org
    Project Devs, mail: dev (at) airavata.apache.org