
Introduce event windowing to the StreamPipes core/sdk

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.

Background

Currently, windowing logic is defined individually per pipeline element. The whole windowing logic needs to be declared in the controller, and the runtime logic needs to be added separately based on the selected runtime wrapper (Java, Siddhi, Flink, etc.).

As many data processors benefit from using window functions (e.g., pipeline elements such as Event Counter, Count Aggregation, Rate Limiter), windowing logic is often duplicated because it needs to be implemented for every new pipeline element. In addition, the feature set of supported window operators differs (and often depends on the developer), as it is unclear which windows and parameters should or can be offered.

Therefore, adding support for explicit window semantics to the SDK/Core would make implementing data processors and sinks using windows much easier and less error-prone.

Tasks

  1. Design and introduce new processor and controller classes for windowed event processors (e.g., WindowedDataProcessor) which handle the windowing logic internally and only expose higher-level methods to users (e.g., onCurrentEvent, onExpiredEvent); a sketch of such an interface is shown after this list.
  2. Implement internal logic for a few window functions (e.g., TimeWindow, LengthWindow, TimeBatchWindow, LengthBatchWindow)
  3. Write a few sample pipeline-elements using your new API!
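To make the intended API more concrete, below is a minimal sketch in plain Java of what such a windowed processor interface could look like. All names here (WindowedDataProcessor, WindowContext and the callback methods) are hypothetical illustrations of task 1, not existing StreamPipes SDK classes.

import java.util.Map;

// Hypothetical sketch of a windowed processor API; none of these names exist in the SDK yet.
public interface WindowedDataProcessor {

    // Called for every event entering the window.
    void onCurrentEvent(Map<String, Object> event, WindowContext context);

    // Called for every event leaving (expiring from) the window.
    void onExpiredEvent(Map<String, Object> event, WindowContext context);

    // Context handed to user code; the window state itself (time-, length- and batch-based
    // variants) would be created and maintained by the core/SDK, not by the pipeline element.
    interface WindowContext {

        // Events currently held in the window.
        Iterable<Map<String, Object>> currentWindow();

        // Forward a result event downstream.
        void emit(Map<String, Object> event);
    }
}

A runtime wrapper (plain JVM, Siddhi, Flink, ...) would then be free to map these callbacks onto its native window operators.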

Relevant Skills

  • Basic knowledge in StreamPipes core (cloning the repo, going through the codebase/documents would do).
  • Basic knowledge of stream analytics window functions (this is not a must, but it's awesome if you know your way around analytics window functions).
  • Some Java experience.

Learning Material

For StreamPipes:

For Streaming Analytics:

For the context of the issue:

Mentor

  • Grainier Perera (grainier [at] apache.org).
Difficulty: Major
Potential mentors:
Grainier Perera, mail: grainier (at) apache.org
Project Devs, mail:


New Python Wrapper

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.


Background

Current wrappers such as standalone (JVM, Siddhi) or distributed (Flink) already allow developing new processors in the given runtime environment. The idea is to extend the list of standalone runtime wrappers to also support pure Python processors. We already have a minimal working version which, however, is pretty inflexible and still relies on Java as a proxy to the pipeline management in the backend service for the model declaration in the setup phase (capabilities, requirements, static properties of a processor) as well as the actual invocation in the execution phase (receiving the specific configuration from pipeline management when a pipeline is started). This issue is to track the status of the development.

Tasks

  1. Add API endpoints as an interface for registration/invocation (partly done)
  2. Port relevant model classes over to Python (declaration + invocation descriptions)
  3. Implement support for various transport protocols and transport formats
  4. Implement a developer-friendly alternative to the Java builder pattern for model declaration
  5. Implement the overall runtime logic for the Python wrapper



Relevant Skills

0. Don't be afraid! We'll guide you through your first steps with StreamPipes.

  1. Excellent Python skills
  2. Excellent understanding of the stream processing paradigm incl. message brokers such as Kafka, MQTT, etc.
  3. Good Understanding of RESTful web services (HTTP, etc.)
  4. Basic Java skills to understand existing wrapper logic

Info

Mentor

Patrick Wiener, PPMC Apache StreamPipes (wiener@apache.org)

Difficulty: Major
Potential mentors:
Patrick Wiener, mail: wiener (at) apache.org
Project Devs, mail:

More powerful real-time visualizations for StreamPipes

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.

Background

Currently, the live dashboard (implemented in Angular) offers an initial set of simple visualizations, such as line charts, gauges, tables and single values. More advanced visualizations, especially those relevant for condition monitoring tasks (e.g., monitoring sensor measurements from industrial machines), are currently missing. Visualizations can be flexibly created by users and there is an SDK that allows to express requirements (e.g., based on data type or semantic type) for visualizations to better guide users through the creation process.

Tasks

  1. Extend the set of real-time visualizations in StreamPipes, e.g., by integrating existing visualizations from Apache ECharts.
  2. Improve the existing dashboard, e.g., by introducing better filtering or more advanced customization options.

Relevant Skills

0. Don't be afraid! We'll guide you through your first steps with StreamPipes.

  1. Angular
  2. Basic knowledge of Apache ECharts

Mentor

Dominik Riemer, PPMC Apache StreamPipes (riemer@apache.org)

Difficulty: Major
Potential mentors:
Dominik Riemer, mail: riemer (at) apache.org
Project Devs, mail:

Spatial Information Systems

Coordinate operation methods to implement

This is an umbrella task for some coordinate operation methods not yet supported in Apache SIS. Coordinate operations include map projections (e.g. Transverse Mercator, Lambert Conic Conformal, etc.), datum shifts (e.g. transformations from NAD27 to NAD83 in United States), transformation of vertical coordinates, etc. We can of course not list all possible formulas that we do not support, but this JIRA task lists at least some of the operations listed in the EPSG guidance notes.

The main material for this work is the EPSG guidance notes, which can be downloaded freely from the following site:

IOGP Publication 373-7-2 – Geomatics Guidance Note number 7, part 2
Coordinate Conversions and Transformations including Formulas
http://www.epsg.org/GuidanceNotes

Google Summer of Code students interested in this work would need to be reasonably comfortable with the Java language (but not necessarily with the JDK library at large, since this work uses relatively few JDK classes outside Math), and in mathematics. In particular, this work requires a good understanding of affine transforms: their representation as a matrix, and how to map a term in a formula to a coefficient in the affine transform matrix.

Apache SIS has one advanced feature which is not easily found in popular geospatial software or text books: the capability to compute the derivative (or more precisely, the Jacobian) of a transformation at a given point. Implementation of this feature requires the capability to find the analytic derivative of a non-linear formula and to simplify it.
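As a small, hedged illustration of the two capabilities mentioned above (applying a coordinate operation and asking for its Jacobian at a point), the following sketch uses public Apache SIS / GeoAPI methods (CRS.forCode, CRS.findOperation, MathTransform.transform, MathTransform.derivative). The EPSG codes are arbitrary examples and the snippet assumes the EPSG dataset is available on the classpath.

import org.apache.sis.geometry.DirectPosition2D;
import org.apache.sis.referencing.CRS;
import org.opengis.referencing.crs.CoordinateReferenceSystem;
import org.opengis.referencing.operation.CoordinateOperation;
import org.opengis.referencing.operation.MathTransform;
import org.opengis.referencing.operation.Matrix;

public class JacobianDemo {
    public static void main(String[] args) throws Exception {
        // Geographic coordinates (EPSG:4326) to a Transverse Mercator projection (example: EPSG:32631).
        CoordinateReferenceSystem source = CRS.forCode("EPSG:4326");
        CoordinateReferenceSystem target = CRS.forCode("EPSG:32631");
        CoordinateOperation op = CRS.findOperation(source, target, null);
        MathTransform tr = op.getMathTransform();

        DirectPosition2D point = new DirectPosition2D(45, 3);   // latitude, longitude (axis order of EPSG:4326)
        System.out.println("Projected: " + tr.transform(point, null));

        // The Jacobian (partial derivatives) of the operation at that point.
        Matrix jacobian = tr.derivative(point);
        System.out.println("Jacobian:\n" + jacobian);
    }
}

New operation methods implemented for this task would be expected to provide both the transform() and the derivative() parts.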

Implementations of those formulas take place in one of the org.apache.sis.referencing.operation sub-packages (projection or transform). Implementations of JUnit tests happen partially in Apache SIS, and partially in the "conformance module" of the GeoAPI project, if possible through the Geospatial Integrity of Geoscience Software (GIGS) tests.

Difficulty: Major
Potential mentors:
Martin Desruisseaux, mail: desruisseaux (at) apache.org
Project Devs, mail:

Create metadata, CRS and tabular data editors in JavaFX

Create the foundation of a GUI application for Apache SIS based on JavaFX. This application should leverage the functionalities available in Apache SIS 0.8. In particular:

  • Read metadata from files in various formats (currently ISO 19139, GeoTIFF, NetCDF, LANDSAT, GPX, Moving Features)
  • Get Coordinate Reference System from a registry or from GML or WKT definitions and apply coordinate transformations.
  • Show vector data in a tabular format.

Since SIS does not yet have a renderer engine, we can not yet show maps in the application. However the application should be designed with this goal in mind.

This project should create a metadata editor showing the ISO 19115 metadata. We should provide a simplified view with only the essential information, and an advanced view showing all information. The information to show should be customizable. The user should be able to edit the metadata and save them in ISO 19139 format.

The project should also create the necessary widgets for showing a Coordinate Reference System (CRS) definition and allow the user to edit it. Another widget should use the CRS definitions for applying coordinate operations (map projections) using the existing Apache SIS referencing engine, and show the result in a table with information about accuracy and domain of validity.

Edit (March 2021): A JavaFX application has been created. It has widgets for metadata and vector data, but we still need a widget for Coordinate Reference System definitions. See the SIS wiki for screenshots.

Difficulty: Major
Potential mentors:
Martin Desruisseaux, mail: desruisseaux (at) apache.org
Project Devs, mail:


Solr

Refactor test infra to work with a managed SolrClient; ditch TestHarness

This is a proposal to substantially refactor SolrTestCaseJ4 and some of its intermediate subclasses in the hierarchy.  In essence, I envision that tests should work with a SolrClient typed "solrClient" field managed by the test infrastructure. With only a few lines of code, a test should be able to pick between an instance based on EmbeddedSolrServer (lighter tests), HttpSolrClient (tests HTTP/Jetty behavior directly or indirectly), SolrCloud, and perhaps a special one for our distributed search tests. STCJ4 would refactor its methods to use the solrClient field instead of TestHarness. TestHarness would disappear as-such; bits of its existing code would migrate elsewhere, such as to manage an EmbeddedSolrServer for testing.

I think we can do a transition like this in stages while minimally affecting most tests by adding some deprecated shims. Perhaps STCJ4 should become the deprecated shim so that users can still use it during 7.x and to help us with the transition internally too. More specifically, we'd add a new superclass to STCJ4 that is the future – "SolrTestCase".

Additionally, there are a bunch of methods on SolrTestCaseJ4 that I question the design of, especially ones that return XML strings like delI (generates a delete-by-id XML string) and adoc. Perhaps that used to be a fine idea before there was a convenient SolrClient API but we've got one now and a test shouldn't be building XML unless it's trying to test exactly that.

For consulting work I once developed a JUnit4 TestRule managing a SolrClient that is declared in a test with an annotation of @ClassRule. I had a variation for SolrCloud and EmbeddedSolrServer that was easy for a test to choose. Since TestRule is an interface, I was able to make a special delegating SolrClient subclass that implements TestRule. This isn't essential but makes use of it easier since otherwise you'd be forced to call something like getSolrClient(). We could go the TestRule route here, which I prefer (with or without having it subclass SolrClient), or we could alternatively do TestCase subclassing to manage the lifecycle.
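To illustrate the TestRule idea described above, here is a rough sketch (not the actual proposal, just one possible shape) of a JUnit 4 class rule that owns an EmbeddedSolrServer-backed SolrClient; the solr home path and core name are placeholders.

import java.nio.file.Path;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.junit.rules.ExternalResource;

// Sketch of a test rule that owns the SolrClient lifecycle for a test class.
public class EmbeddedSolrClientRule extends ExternalResource {

    private final Path solrHome;
    private final String coreName;
    private SolrClient client;

    public EmbeddedSolrClientRule(Path solrHome, String coreName) {
        this.solrHome = solrHome;
        this.coreName = coreName;
    }

    @Override
    protected void before() {
        // Starts an embedded core container backed by the given solr home.
        client = new EmbeddedSolrServer(solrHome, coreName);
    }

    @Override
    protected void after() {
        try {
            client.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public SolrClient getSolrClient() {
        return client;
    }
}

// Usage in a test class (paths and core name are placeholders):
//   @ClassRule
//   public static EmbeddedSolrClientRule solr =
//       new EmbeddedSolrClientRule(java.nio.file.Paths.get("src/test/resources/solr"), "collection1");

A SolrCloud or HttpSolrClient variant of the same rule would let a test pick its client flavor with a one-line change.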

Initially I'm just looking for agreement and refinement of the approach. After that, sub-tasks ought to be added. I won't have time to work on this for some time.


Difficulty: Major
Potential mentors:
David Smiley, mail: dsmiley (at) apache.org
Project Devs, mail:


Pulsar

Integration with Apache Ranger

Currently, Pulsar only supports storing authorization policies in local ZooKeeper. Is it possible to support [Ranger](https://github.com/apache/ranger)? Apache Ranger can provide a framework for central administration of security policies and monitoring of user access.

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:

Throttle the ledger rollover for the broker

In Pulsar, a ledger rollover splits the data of a topic into multiple segments. For each ledger rollover operation, the metadata of the topic needs to be updated in ZooKeeper. A high ledger rollover frequency may put the ZooKeeper cluster under heavy load. In order to make ZooKeeper run more stably, we should limit the ledger rollover rate.
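One possible direction, sketched below as plain Java rather than actual broker code, is to enforce a minimum interval between rollovers per topic, so that rollover attempts are simply skipped while the topic is still inside its cool-down period.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Illustrative throttle: allow at most one ledger rollover per topic within a minimum interval.
public class LedgerRolloverThrottle {

    private final long minIntervalNanos;
    private final Map<String, Long> lastRollover = new HashMap<>();

    public LedgerRolloverThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    /** Returns true if the given topic may roll over its current ledger now. */
    public synchronized boolean tryAcquire(String topic) {
        long now = System.nanoTime();
        Long last = lastRollover.get(topic);
        if (last != null && now - last < minIntervalNanos) {
            return false; // still in the cool-down period, defer the rollover
        }
        lastRollover.put(topic, now);
        return true;
    }
}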

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:

Support reset cursor by message index

Currently, Pulsar supports resetting the cursor according to time and message-id, e.g. you can reset the cursor to 3 hours ago or reset the cursor to a specific message-id. For cases where users want to reset the cursor to, say, 10,000 messages earlier, Pulsar does not support this operation yet.

PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic). This will provide the ability to reset the cursor according to the message index.
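For context, the sketch below shows the existing admin-client calls for resetting a cursor by time or message-id, plus, as a comment, the kind of index-based method this project could add on top of PIP-70. The resetCursorByIndex name is hypothetical, and the service URL and topic are placeholders.

import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.MessageId;

public class ResetCursorExample {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // placeholder admin URL
                .build();

        String topic = "persistent://public/default/my-topic";
        String subscription = "my-sub";

        // Existing behaviour: reset by time (3 hours ago) or to a specific message id.
        long threeHoursAgo = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(3);
        admin.topics().resetCursor(topic, subscription, threeHoursAgo);
        admin.topics().resetCursor(topic, subscription, MessageId.earliest);

        // Hypothetical API this project could add, built on the PIP-70 message index:
        // admin.topics().resetCursorByIndex(topic, subscription, currentIndex - 10_000);

        admin.close();
    }
}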

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:

Support publish and consume avro objects in pulsar-perf

We should extend the pulsar-perf tool so that it can benchmark producing and consuming messages using a Schema.
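To clarify what "using Schema" means here, the sketch below shows the client-side pattern that pulsar-perf should be able to exercise: producing and consuming a POJO with Schema.AVRO. The service URL, topic name and POJO are placeholders.

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class AvroSchemaExample {

    // Simple POJO used to derive the Avro schema.
    public static class SensorReading {
        public String sensorId;
        public double value;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder
                .build();

        Producer<SensorReading> producer = client.newProducer(Schema.AVRO(SensorReading.class))
                .topic("perf-avro-topic")
                .create();

        Consumer<SensorReading> consumer = client.newConsumer(Schema.AVRO(SensorReading.class))
                .topic("perf-avro-topic")
                .subscriptionName("perf-sub")
                .subscribe();

        SensorReading reading = new SensorReading();
        reading.sensorId = "sensor-1";
        reading.value = 42.0;
        producer.send(reading);

        SensorReading received = consumer.receive().getValue();
        System.out.println(received.sensorId + " = " + received.value);

        producer.close();
        consumer.close();
        client.close();
    }
}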


Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:


Expose the broker level message metadata to the client

PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata and already supports adding a message index and a broker timestamp for each message. But currently, the client can't get the broker level message metadata, since the broker skips this information when dispatching messages to the client. Provide a way to expose the broker level message metadata to the client.
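A rough sketch of what the consumer-facing side could look like once the metadata is passed through; everything uses the existing client API except the two accessors in the comments, which are hypothetical and would be part of this project's design.

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class BrokerMetadataExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/my-topic")
                .subscriptionName("my-sub")
                .subscribe();

        Message<byte[]> msg = consumer.receive();
        System.out.println("Client-side publish time: " + msg.getPublishTime());

        // Hypothetical accessors to be designed in this project, backed by the
        // PIP-70 broker entry metadata (which the broker skips today):
        // long index = msg.getIndex();                  // broker-assigned message index
        // long brokerTime = msg.getBrokerPublishTime(); // broker-added timestamp

        consumer.acknowledge(msg);
        client.close();
    }
}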

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:


Improve the message backlogs for the topic

In Pulsar, the client usually sends several messages in a batch. On the broker side, the broker receives a batch and writes the batched messages to the storage layer.

The message backlog tracks how many messages still need to be handled for a subscription. Unfortunately, the current backlog is based on batches, not messages. This confuses users: they may have pushed 1000 messages to the topic, but checking the backlog on the subscription side returns a lower value, such as 100 batches. The message-based backlog is not available because it is so expensive to calculate the number of messages in each batch.


PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic). This will provide the ability to calculate the number of messages between one message index and another, so we can leverage PIP-70 to improve the message backlog implementation and get a message-based backlog.


For the Exclusive or Failover subscription, it is easy to implement by calculating the messages between the mark-delete position and the LAC position. But for the Shared and Key_Shared subscriptions, the individual acknowledgments bring some complexity. We can cache the individual acknowledgment count in the broker memory, so the way to calculate the message backlog for the Shared and Key_Shared subscriptions is `backlogOfTheMarkdeletePosition` - `IndividualAckCount`.
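The calculation in the previous paragraph can be written down as a small helper, sketched here outside of the real broker code; the inputs (message index of the LAC, of the mark-delete position, and the cached individual-ack count) are assumed to be obtainable via PIP-70 and the broker's in-memory state.

// Illustrative helper for the message-based backlog calculation described above.
public final class MessageBacklogCalculator {

    private MessageBacklogCalculator() {
    }

    /** Exclusive / Failover: messages between the mark-delete position and the LAC. */
    public static long backlog(long lacIndex, long markDeleteIndex) {
        return Math.max(0, lacIndex - markDeleteIndex);
    }

    /** Shared / Key_Shared: subtract the individually acknowledged messages cached in memory. */
    public static long backlog(long lacIndex, long markDeleteIndex, long individualAckCount) {
        return Math.max(0, backlog(lacIndex, markDeleteIndex) - individualAckCount);
    }
}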

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:
Improve the message written count metrics for the topic

Currently, Pulsar exposes the message written count metrics through the Prometheus endpoint, and the metrics are maintained in the broker without being persisted. So if the topic ownership changes or the broker restarts, the message written count of the topic is reset to 0. This confuses users and makes it impossible to get the correct message written count metrics.

PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic). This provides the ability to calculate the precise message written count for a topic, so we can leverage PIP-70 to improve the message written count metrics for the topic.

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:

OODT

Improve OPSUI React.js UI with advanced functionalities

In GSoC 2019, we implemented a new OPSUI UI based on React.js. See the related blog posts [1] [2]. Several advanced features remain to be implemented, including:

  • Implement querying functionality at OPSUI side (scope can be determined)
  • Show progress of workflows and file ingestions
  • Introduce a proper REST API for resource manager component
  • Introduce proper packaging (with configurable external REST API URLs) and deployment mechanism (as a docker deployment or an npm package)

In this project, the student will have to work on the UI with React.js and will have to implement several REST APIs using JAX-RS. Furthermore, they will have to work on making OPSUI easy to deploy.

The existing wicket based OPSUI will be replaced by the new React.js based OPSUI at the end of this project. And the linked blog posts will be a good start to understand what the new React.js based OPSUI is capable of doing.

[1] https://medium.com/faun/gsoc-2019-apache-oodt-react-based-opsui-dashboard-d93a9083981c
[2] https://medium.com/faun/whats-new-in-apache-oodt-react-opsui-dashboard-4cc6701628a9
[3] https://medium.com/faun/apache-oodt-with-docker-84d32525c798

Difficulty: Major
Potential mentors:
Imesha Sudasingha, mail: imesha (at) apache.org
Project Devs, mail:


James Server

[GSOC-2021] Implement Thread support for JMAP

Why?

Mail user agents generally allow displaying emails grouped by conversations (replies, forwards, etc...).

As part of the JMAP RFC-8621 implementation, there is a dedicated concept: threads. We did implement JMAP Threads in a rather naive way: each email is a thread of its own.

This naive implementation is specification compliant but defeats the overall purpose of threads.

I propose myself to mentor the implementation of Threads as part of the James JMAP implementation.

See: https://jmap.io/spec-mail.html#threads
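To make the intended behaviour concrete, here is a stand-alone toy sketch (plain Java, not James code) of the common heuristic: emails belong to the same thread when their Message-ID / In-Reply-To / References headers connect them. A real implementation would of course operate on the JMAP/mailbox data model and persist the thread ids.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy thread grouping by Message-ID / In-Reply-To / References overlap (union-find).
public class ThreadGrouping {

    private final Map<String, String> parent = new HashMap<>();

    private String find(String id) {
        parent.putIfAbsent(id, id);
        String p = parent.get(id);
        if (!p.equals(id)) {
            p = find(p);
            parent.put(id, p);   // path compression
        }
        return p;
    }

    private void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    /** Each email is {messageId, referencedIds...}; returns messageId -> thread root id. */
    public Map<String, String> group(List<List<String>> emails) {
        for (List<String> email : emails) {
            String messageId = email.get(0);
            for (String ref : email.subList(1, email.size())) {
                union(messageId, ref);
            }
            find(messageId); // stand-alone emails get their own thread
        }
        Map<String, String> threadOf = new HashMap<>();
        for (List<String> email : emails) {
            threadOf.put(email.get(0), find(email.get(0)));
        }
        return threadOf;
    }

    public static void main(String[] args) {
        List<List<String>> emails = new ArrayList<>();
        emails.add(List.of("<a@example.org>"));                                        // original message
        emails.add(List.of("<b@example.org>", "<a@example.org>"));                     // reply to a
        emails.add(List.of("<c@example.org>", "<b@example.org>", "<a@example.org>"));  // reply to b
        emails.add(List.of("<d@example.org>"));                                        // unrelated message
        System.out.println(new ThreadGrouping().group(emails));
    }
}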

Difficulty: Major
Potential mentors:
Benoit Tellier, mail: btellier (at) apache.org
Project Devs, mail:

Fineract Cloud Native

Machine Learning Scorecard for Credit Risk Assessment Phase 4

Mentors

Overview & Objectives

Financial organizations using Mifos/Fineract depend on external agencies or their past experience for evaluating credit scores and identifying potential NPAs. Though information from external agencies is required, financial organizations can have an internal scorecard for evaluating loans so that preventive/proactive actions can be taken alongside external agencies' reports. In industry, organizations are using rule-based, statistical and machine learning methods for credit scoring, predicting potential NPAs, fraud detection and other activities. This project aims to implement a scorecard based on statistical and ML methods for credit scoring and identification of potential NPAs.

Description

The approach should factor in and improve last year's GSoC work (https://gist.github.com/SupreethSudhakaranMenon/a20251271adb341f949dbfeb035191f7) on Features/Characteristics, Criteria and evaluation (link). The design and implementation of the screens should follow Mifos application standards. It should implement statistical and ML methods with explainability in decision making, and should also be extensible for adding other functionalities such as fraud detection, cross-sell and up-sell, etc.

Helpful Skills

JAVA, Integrating Backend Service, MIFOS X, Apache Fineract, AngularJS, ORM, ML, Statistical Methods, Django

Impact

Streamlined Operations, Better RISK Management, Automated Response Mechanism

Other Resources

2019 Progress: https://gist.github.com/SupreethSudhakaranMenon/a20251271adb341f949dbfeb035191f7

https://gist.github.com/lalitsanagavarapu

Difficulty: Major
Potential mentors:
Ed Cable, mail: edcable (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Create Open Banking Layer for Fineract CN Self-Service App

Mentors

Overview & Objectives

Across our ecosystem we're seeing more and more adoption and innovation from fintechs. A huge democratizing force across the financial services sector is the Open Banking movement providing Open Banking APIs to enable third parties to directly interact with customers of financial institutions. We have recently started providing an Open Banking API layer that will allow financial institutions using Mifos and Fineract to offer third parties access to requesting account information and initiating payments via these APIs. Most recently the Mojaloop community, led by Google, has led the development of a centralized PISP API. We have chosen to follow the comprehensive UK Open Banking API standard which is being followed and adopted by a number of countries throughout Sub-Saharan Africa and Latin America.

Tremendous impact can be had at the Base of the Pyramid by enabling third parties to establish consent with customers to authorize transactions to be initiated or information to be accessed from accounts at their financial institution. This Open Banking API layer would enable any institution using Mifos or Fineract to provide a UK Open Banking API layer to third parties and fintechs.

The API Gateway to connect to is still being chosen (WSO2, Gravitee, etc.)

Description

The APIs that are consumed by the reference Fineract 1.x mobile banking application have been documented in the spreadsheet below. The APIs have also been categorized according to whether they are an existing self-service API or back-office API, whether they have an equivalent Open Banking API and, if so, a link to the corresponding Open Banking API.

For each API with an equivalent Open Banking API, the interns must: take the REST API, upload the Swagger definition, do the transformation in the Open Banking Adapter, and publish it on the API gateway.

For back-office and/or self-service APIs with no equivalent Open Banking API, the process is: take the REST API, upload the Swagger definition, and publish it on the API gateway.

For example:

Mifos Mobile CN API Matrix (completed by Garvit)
https://docs.google.com/spreadsheets/d/1-HrfPKhh1kO7ojK15Ylf6uzejQmaz72eXf5MzCBCE3M/edit#gid=0
https://docs.google.com/document/d/15LbxVoQQRoa4uU7QiV7FpJFVjkyyNb9_HJwFvS47O4I/edit?pli=1#
Mobile Wallet API Matrix (completed by Devansh)
https://docs.google.com/spreadsheets/d/1VgpIwN2JsljWWytk_Qb49kKzmWvwh6xa1oRgMNIAv3g/edit#gid=0

Helpful Skills

Android development, SQL, Java, Javascript, Git, Spring, OpenJPA, Rest, Kotlin, Gravitee, WSO2

Impact

By providing a standard UK Open Banking API layer we can provide both a secure way for our trusted first party apps to allow customers to authenticate and access their accounts as well as an API layer for third party fintechs to securely access Fineract and request information or initiate transactions with the consent of customers.

Other Resources

CGAP Research on Open Banking: https://www.cgap.org/research/publication/open-banking-how-design-financial-inclusion
Docs: https://mifos.gitbook.io/docs/wso2-1/setup-openbanking-apis
Self-Service APIs: https://demo.mifos.io/api-docs/apiLive.htm#selfbasicauth

Reference Open Banking Fintech App:
  • Backend: https://github.com/openMF/openbanking-tpp-server
  • GUI: https://github.com/openMF/openbanking-tpp-client

Open Banking Adapter: https://github.com/openMF/openbanking-adapter
  • Transforms Open Banking API to Fineract API
  • Works with both Fineract 1.x and Fineract CN
  • Can connect to different API gateways and can transform against different API standards.

Customer Self-Service Phase 2: https://cwiki.apache.org/confluence/display/FINERACT/Customer+Self-Service+Phase+2

Google Whitepaper on 3PPI: https://static.googleusercontent.com/media/nextbillionusers.google/en//tools/3PPI-2021-whitepaper.pdf

UK Open Banking API Standard: https://standards.openbanking.org.uk/

Open Banking Developer Zone: https://openbanking.atlassian.net/wiki/spaces/DZ/overview

Examples of Open Banking Apps: https://www.ft.com/content/a5f0af78-133e-11e9-a581-4ff78404524e

See https://openmf.github.io/mobileapps.github.io/

Difficulty: Major
Potential mentors:
Ed Cable, mail: edcable (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Functional Enhancements to Fineract CN Mobile

Mentors

Overview & Objectives

Just as we have a mobile field operations app on Apache Fineract 1.x, we have recently built out on top of the brand new Apache Fineract CN micro-services architecture, an initial version of a mobile field operations app with an MVP architecture and material design. Given the flexibility of the new architecture and its ability to support different methodologies - MFIs, credit unions, cooperatives, savings groups, agent banking, etc - this mobile app will have different flavors and workflows and functionalities.

Description

In 2020, our Google Summer of Code intern worked on additional functionality in the Fineract CN mobile app. In 2021, the student will work on the following tasks:

  • Integrate with Payment Hub to enable disbursements via the Mobile Money API
  • Improve the task management features in the app
  • Create UI for creating new accounts and displaying account details
  • Create UI for creating tellers and displaying teller details
  • Improve GIS features like location tracking and dropping of pins in the app
  • Improve offline mode via Couchbase support
  • Write unit tests, integration tests and UI tests

Helpful Skills

Android Development, Kotlin, Java, Git, OpenJPA, Rest API

Impact

Allows staff to go directly into the field to connect to the client. Reduces cost of operations by enabling organizations to go paperless and be more efficient.

Other Resources

  1. Repo on Github: https://github.com/apache/fineract-cn-mobile
  2. Fineract CN API documentation: https://izakey.github.io/fineract-cn-api-docs-site/
  3. https://github.com/aasaru/fineract-cn-api-docs
  4. https://cwiki.apache.org/confluence/display/FINERACT/Fineract+CN
  5. How to install and run Couchbase: https://gist.github.com/jawidMuhammadi/af6cd34058cacf20b100d335639b3ad8
  6. GSMA mobile money API: https://developer.mobilemoneyapi.io/1.1/oas3/22466
  7. Payment Hub: https://github.com/search?q=openMF%2Fph-ee&ref=opensearch
  8. Some UI designs: https://www.figma.com/file/KHXtZPdIpC3TqvdIVZu8CW/fineract-cn-mobile?node-id=0%3A1
  9. 2020 GSoC progress report: https://gist.github.com/jawidMuhammadi/9fa91d37b1cbe43d9cdfe165ad8f2102
  10. JIRA Task: https://issues.apache.org/jira/browse/FINCN-241?filter=-2&jql=project%20%3D%20FINCN%20order%20by%20created%20DESC

Difficulty: Major
Potential mentors:
Ed Cable, mail: edcable (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org


    SkyWalking

    Apache SkyWalking: Python agent supports profiling

    Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.

    SkyWalking is based on agent to instrument (automatically) monitored services, for now, we have many agents for different languages, Python agent [2] is one of them, which supports automatic instrumentations.

    The goal of this project is to extend the agent's features by supporting profiling [3] a function's invocation stack, help the users to analyze which method costs the most major time in a cross-services call.

To complete this task, you must be comfortable with Python and have some knowledge of tracing systems, otherwise you'll have a hard time coming up to speed.

[1] http://skywalking.apache.org
[2] http://github.com/apache/skywalking-python
[3] https://thenewstack.io/apache-skywalking-use-profiling-to-fix-the-blind-spot-of-distributed-tracing/


    Difficulty: Major
    Potential mentors:
    Zhenxu Ke, mail: kezhenxu94 (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    Apache SkyWalking: Python agent collects and reports PVM metrics to backend

    Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.

    Tracing distributed systems is one of the main features of SkyWalking, with those traces, it can analyze some service metrics such as CPM, success rate, error rate, apdex, etc. SkyWalking also supports receiving metrics from the agent side directly.

In this task, we expect the Python agent to report its Python Virtual Machine (PVM) metrics, including (but not limited to; whatever metrics are useful are also acceptable) CPU usage (%), memory used (MB), (active) thread/coroutine counts, garbage collection count, etc.

    To complete this task, you must be comfortable with Python and gRPC, otherwise you'll have a hard time coming up to speed.

    Live demo to play around: http://122.112.182.72:8080 (under reconstruction, maybe unavailable but latest demo address can be found at the GitHub index page http://github.com/apache/skywalking)

    [1] http://skywalking.apache.org


    Difficulty: Major
    Potential mentors:
Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

    ...

    ShardingSphere


    ShardingSphere: Proofread the DDL/TCL SQL definitions for ShardingSphere Parser

    Apache ShardingSphere

    Apache ShardingSphere is a distributed database middleware ecosystem, including 2 independent products, ShardingSphere JDBC and ShardingSphere Proxy presently. They all provide functions of data sharding, distributed transaction, and database orchestration.
    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

    ShardingSphere parser engine helps users parse a SQL to get the AST (Abstract Syntax Tree) and visit this tree to get SQLStatement (Java Object). At present, this parser engine can handle SQLs for `MySQL`, `PostgreSQL`, `SQLServer` and `Oracle`, which means we have to understand different database dialect SQLs.
    More details: https://shardingsphere.apache.org/document/current/en/features/sharding/principle/parse/

    Task

    This issue is to proofread the following definitions,

    • All the DDL SQL definitions for Oracle except for ALTER, DROP, CREATE and TRUNCATE.
    • All the TCL (Transaction Control Language) SQL definitions for Oracle

    You can learn more here.

As we have basic Oracle SQL syntax definitions that do not keep in line with the Oracle documentation, we need you to find the vague SQL grammar definitions and correct them by referring to the Oracle documentation.

    Notice, when you review these target SQLs above, you will find that these definitions will involve some basic elements of Oracle SQL. No doubt, these elements are included in this task as well.

    Relevant Skills

    1. Master JAVA language
    2. Have a basic understanding of Antlr g4 file
    3. Be familiar with Oracle SQLs

    Targets files

    1. DDL SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/DDLStatement.g4
    2. TCL SQLs g4 file:
    https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/TCLStatement.g4
    3. Basic elements g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/BaseRule.g4

    References

    1. Oracle SQL quick reference: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/SQL-Statements.html#GUID-1FA35EAD-AED2-4619-BFEE-348FF05D1F4A
    2. Detailed Oracle SQL info: https://docs.oracle.com/pls/topic/lookup?ctx=en/database/oracle/oracle-database/19/sqlqr&id=SQLRF008

    Mentor

Juan Pan, PMC of Apache ShardingSphere, panjuan@apache.org

    Difficulty: Major
    Potential mentors:
Juan Pan, mail: panjuan (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache SkyWalking: Python agent collects and reports PVM metrics to backend

    Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.

Tracing distributed systems is one of the main features of SkyWalking; with those traces, it can analyze service metrics such as CPM, success rate, error rate, apdex, etc. SkyWalking also supports receiving metrics from the agent side directly.

In this task, we expect the Python agent to report its Python Virtual Machine (PVM) metrics, including (but not limited to; any other useful metrics are also acceptable) CPU usage (%), memory used (MB), (active) thread/coroutine counts, garbage collection count, etc.

    To complete this task, you must be comfortable with Python and gRPC, otherwise you'll have a hard time coming up to speed.

    Live demo to play around: http://122.112.182.72:8080 (under reconstruction, maybe unavailable but latest demo address can be found at the GitHub index page http://github.com/apache/skywalking)

    [1] http://skywalking.apache.org

    Difficulty: Major
    Potential mentors:
    Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

Apache ShardingSphere: Add unit test for example

Apache ShardingSphere

Apache ShardingSphere is a distributed database middleware ecosystem, currently consisting of two independent products: ShardingSphere JDBC and ShardingSphere Proxy. Both provide functions of data sharding, distributed transaction, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The examples of ShardingSphere do not have test cases.
After mvn install, a developer only knows that the examples compile; there is no guarantee that the code behaves correctly, especially the configuration for YAML, Spring namespace and the Spring Boot starter.

Task

This issue is to add automated test cases with JUnit that assert the examples start up successfully and that their code logic is correct (see the sketch below).

Notice, the code of the current examples may need to be refactored to make them easy to test.
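A minimal sketch of what such a test could look like for a Spring Boot starter example, assuming JUnit 4 and Spring Boot's test support are available and the test lives inside the example module (so the example's own @SpringBootApplication class is picked up automatically); this is illustrative, not existing example code:

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.Statement;

import static org.junit.Assert.assertNotNull;

// Boots the example's application context and checks that the configured
// (sharding) DataSource can actually execute a statement, instead of only
// checking that the module compiles.
@RunWith(SpringRunner.class)
@SpringBootTest
public class SpringBootStarterExampleTest {

    @Autowired
    private DataSource dataSource;

    @Test
    public void contextStartsAndDataSourceWorks() throws Exception {
        assertNotNull(dataSource);
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement()) {
            // A trivial round trip proves the routing/config wiring is usable.
            statement.execute("SELECT 1");
        }
    }
}

The same pattern (start the configuration, run one statement, assert no exception) applies to the YAML and Spring namespace examples as well.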

Relevant Skills

1. Master JAVA language
2. Be familiar with spring framework
3. Have a basic understanding of JUnit

Targets files

Example repo: https://github.com/apache/shardingsphere/tree/master/examples

    Mentor
    Liang Zhang, PMC Chair of Apache ShardingSphere, zhangliang@apache.org

    Difficulty: Major
    Potential mentors:
    Liang Zhang, mail: zhangliang (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere: Proofread the DML SQL definitions for ShardingSphere Parser

Apache ShardingSphere

Apache ShardingSphere is a distributed database middleware ecosystem, currently consisting of two independent products: ShardingSphere JDBC and ShardingSphere Proxy. Both provide functions of data sharding, distributed transaction, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere parser engine helps users parse a SQL to get the AST (Abstract Syntax Tree) and visit this tree to get SQLStatement (Java Object). At present, this parser engine can handle SQLs for `MySQL`, `PostgreSQL`, `SQLServer` and `Oracle`, which means we have to understand different database dialect SQLs.
More details: https://shardingsphere.apache.org/document/current/en/features/sharding/principle/parse/

Task

This issue is to proofread the DML (SELECT/UPDATE/DELETE/INSERT) SQL definitions for Oracle. As we have basic Oracle SQL syntax definitions that do not keep in line with the Oracle documentation, we need you to find the vague SQL grammar definitions and correct them by referring to the Oracle documentation.

Notice, when you review these DML (SELECT/UPDATE/DELETE/INSERT) SQLs, you will find that these definitions involve some basic elements of Oracle SQL. No doubt, these elements are included in this task as well.

Relevant Skills

1. Master JAVA language
2. Have a basic understanding of Antlr g4 file
3. Be familiar with Oracle SQLs

Targets files

1. DML SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/DMLStatement.g4
2. Basic elements g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/BaseRule.g4

References

1. Oracle SQL quick reference: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/SQL-Statements.html#GUID-1FA35EAD-AED2-4619-BFEE-348FF05D1F4A
2. Detailed Oracle SQL info: https://docs.oracle.com/pls/topic/lookup?ctx=en/database/oracle/oracle-database/19/sqlqr&id=SQLRF008

Mentor

Juan Pan, PMC of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Potential mentors:
Juan Pan, mail: panjuan (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

    IoTDB

    Implement PISA index in Apache IoTDB

    Apache IoTDB is a highly efficient time series database, which supports high speed query process, including aggregation query.

Currently, IoTDB pre-calculates the aggregation info, also called the summary info (sum, count, max_time, min_time, max_value, min_value), for each page and each Chunk. The info is helpful for aggregation operations and some query filters. For example, if the query filter is value > 10 and the max value of a page is 9, we can skip the page. For another example, if the query is select max(value) and the max values of 3 chunks are 5, 10, 20, then the max(value) is 20.

    However, there are two drawbacks:

1. The summary info actually reduces the data that needs to be scanned to 1/k (suppose each page has k data points). However, the time complexity is still O(N). If we store long historical data, e.g., 2 years of data at 500 kHz, then the aggregation operation may still be time-consuming. So, a tree-based index that reduces the time complexity from O(N) to O(log N) is a good choice. Some basic ideas have been published in [1], but they can only handle data with a fixed frequency. So, improving that approach and implementing it in IoTDB is a good choice (a generic illustration of the O(log N) idea is sketched below).

2. The summary info is helpless for evaluating a query like where value > 8 if the max value = 10. If we can enrich the summary info, e.g., by storing a data histogram, we can use the histogram to evaluate how many points we can return.

    This proposal is mainly for adding an index for speeding up the aggregation query. Besides, if we can let the summary info be more useful, it could be better.

Notice that the premise is that the insertion speed should not be slowed down too much!

    By the way, IoTDB provides an index framework already. So, the PISA index should be compatible with the index framework.
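To make the O(N) vs. O(log N) argument concrete, here is a generic, self-contained illustration of a pre-aggregation tree over page summaries. This is not the PISA algorithm itself, only the complexity idea it relies on:

// Every tree node stores the max of its range, so a range-max query touches
// O(log N) nodes instead of scanning all N page summaries.
public final class RangeMaxTree {

    private final long[] tree;   // 1-based segment tree over per-page maxima
    private final int size;

    public RangeMaxTree(long[] pageMax) {
        this.size = pageMax.length;
        this.tree = new long[2 * size];
        System.arraycopy(pageMax, 0, tree, size, size);
        for (int i = size - 1; i > 0; i--) {
            tree[i] = Math.max(tree[2 * i], tree[2 * i + 1]);
        }
    }

    // max over pages [left, right), O(log N)
    public long max(int left, int right) {
        long result = Long.MIN_VALUE;
        for (int l = left + size, r = right + size; l < r; l >>= 1, r >>= 1) {
            if ((l & 1) == 1) {
                result = Math.max(result, tree[l++]);
            }
            if ((r & 1) == 1) {
                result = Math.max(result, tree[--r]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RangeMaxTree index = new RangeMaxTree(new long[]{5, 10, 20, 7});
        System.out.println(index.max(0, 3)); // 20, without scanning every page
    }
}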

    You should know:
    • IoTDB query process
    • TsFile structure and organization
    • Basic index knowledge
    • Java 

Reference:

[1] https://www.sciencedirect.com/science/article/pii/S0306437918305489

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB Integration Test

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

Now, IoTDB uses JUnit for its UT/IT tests.

However, there are two drawbacks:

1. There are many singleton class instances in IoTDB. Therefore, modifying something in a test may impact others, and it requires us to do a lot of cleanup work after a test.

Especially, after we open a server socket (by Thrift), though we have called socket.close, the socket may not be closed quickly (this is controlled by Thrift). But if the next test begins, then a "the port is already used" error will occur.

2. When testing IoTDB's cluster module, we may need to start at least 3 IoTDB instances on one server.
Using JUnit, the 3 instances will be in one JVM, which conflicts with the reality that "IoTDB has many singleton instances".

So, next, we want to use TestContainers [2], which combines Docker and JUnit; a minimal example is sketched below.

This task is for:

1. using TestContainers to re-implement all IT codes of IoTDB;
2. using TestContainers to add some IT codes for IoTDB's cluster module.

Needed skills:

• Java
• Docker (Docker-Compose better)
• Know or learn JUnit and TestContainers

[1] iotdb.apache.org
[2] https://www.testcontainers.org/
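A minimal sketch of what such an IT could look like with JUnit 4 and TestContainers; the Docker image tag, the RPC port 6667 and the root/root credentials are assumptions based on IoTDB defaults and should be adjusted to what the project actually ships:

import org.junit.Rule;
import org.junit.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import static org.junit.Assert.assertTrue;

// Starts a disposable IoTDB server in Docker for the test, so tests no longer
// share singletons or fight over ports inside one JVM.
public class IoTDBContainerIT {

    // Image name/tag and exposed port are assumptions; adjust as needed.
    @Rule
    public GenericContainer<?> iotdb =
            new GenericContainer<>(DockerImageName.parse("apache/iotdb:latest"))
                    .withExposedPorts(6667);

    @Test
    public void serverAcceptsJdbcConnections() throws Exception {
        String url = String.format("jdbc:iotdb://%s:%d/",
                iotdb.getHost(), iotdb.getMappedPort(6667));
        Class.forName("org.apache.iotdb.jdbc.IoTDBDriver");
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             Statement stmt = conn.createStatement()) {
            // A query result proves the containerized server is really up.
            assertTrue(stmt.execute("SHOW VERSION"));
        }
    }
}

For the cluster module, the same idea extends to starting several containers (or a Docker Compose file) so each node runs in its own JVM and network namespace.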

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB C# library

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

IoTDB has two kinds of client interfaces: SQL and the native API (also called the session API).

This task is for the native API.

IoTDB uses Apache Thrift [2] as its RPC framework, so all native APIs can be generated by Thrift. However, to accelerate performance, we may use some byte arrays in Thrift, rather than a Struct, which is not quite friendly to users.

That is why we provide our session API. The session API just wraps the interfaces of the generated Thrift code. Now we have Java [4], Python and C++ versions [3]. The C# version is left.

This task hopes you can provide a C# library for IoTDB.

Needed skills:

• Thrift
• C#
• know Java

[1] iotdb.apache.org
[2] http://thrift.apache.org/
[3] https://iotdb.apache.org/UserGuide/Master/Client/Programming%20-%20Other%20Languages.html
[4] https://iotdb.apache.org/UserGuide/Master/Client/Programming%20-%20Native%20API.html

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB: Metadata (Schema) Storage Engine

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

Different from traditional relational databases, IoTDB uses a tree-based structure in memory to manage the schema (a.k.a. metadata), and uses a Write-Ahead-Log-like file structure to persist the schema.

Now, each time series takes about 300 bytes in memory. However, an IoTDB instance may manage more than 100 million time series, which may take more than 30 GB of memory.

Therefore, we'd like to re-design the schema management module:
1. File: persist the tree on disk like a B-tree.
2. WAL: implement a WAL for the metadata, so we can update the tree on disk in batches, rather than one operation at a time.
3. Cache: we may not have enough memory to load the whole tree, so a cache is needed and queries must be able to read from the tree on disk.

A rough interface sketch is given below.

What knowledge you need to know:
1. Java
2. Basic design ideas about databases [2]

[1] iotdb.apache.org
[2] http://pages.cs.wisc.edu/~dbbook/openAccess/firstEdition/slides/pdfslides/mod2l1.pdf
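A rough shape of the store the three points above describe; every name here is hypothetical and illustrative, not an existing IoTDB class:

// Hypothetical interface only: it groups the File / WAL / Cache concerns of
// the redesigned schema store in one place.
public interface PersistentSchemaStore extends AutoCloseable {

    // Appends the operation to the metadata WAL first, then applies it to the
    // on-disk tree (possibly batched with other logged operations).
    void createTimeseries(String fullPath, byte dataType, byte encoding) throws Exception;

    void deleteTimeseries(String fullPath) throws Exception;

    // Served from an in-memory cache of hot tree nodes; on a miss, the node is
    // loaded from the B-tree-like file and inserted into the cache (with
    // eviction when the configured memory budget is exceeded).
    SchemaNode getNode(String fullPath) throws Exception;

    // Replays the WAL tail on restart so the on-disk tree catches up with the
    // operations that were logged but not yet flushed.
    void recover() throws Exception;

    interface SchemaNode {
        String name();
        boolean isMeasurement();
    }
}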

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB: GUI workbench

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

As a database, it is good to have a workbench to operate IoTDB using a GUI.

For example, there is a 3rd-party web-based workbench for Apache Cassandra [2]. MySQL supports a more complex workbench application [3].

We also want IoTDB to have a workbench.

Task:
1. Execute SQL and show results in a table or chart.
2. View the schema of IoTDB (how many storage groups, how many time series, etc.).
3. View and modify IoTDB's configuration.
4. View IoTDB's dynamic status (e.g., info that JMX can get).

(As we have integrated IoTDB with Apache Zeppelin, task 1 is done. So, we hope this workbench can be more lightweight than using Zeppelin.)

Better to use Java (Python or some others are also OK).

Needed Skills:

• Java
• Web application development

[1] iotdb.apache.org
[2] https://github.com/avalanche123/cassandra-web
[3] https://www.mysql.com/cn/products/workbench/

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org


    Apache IoTDB: Complex Arithmetic Operations in SELECT Clauses

    Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

    We have recently been working to improve the ease of use of IoTDB. For queries, we hope that IoTDB can provide more powerful analysis capabilities.

    IOTDB supports many types of queries: raw data queries, function queries (including UDF queries), and so on. However, currently there is no easy way to combine the results of multiple queries. Therefore, we hope that IoTDB can support complex arithmetic operations in the SELECT clause, which will greatly improve the analysis capabilities.

    Function description:
    Applied to: raw time series, literal numbers and function outputs.
    Applicable data types: all types except TIMESTAMP and TEXT.
Applicable operators: at least five binary operators (+, -, *, /, %) and two unary operators (+, -).

    Usage examples:

    1. raw queries
      SELECT -a FROM root.sg.d;
      SELECT a, b, c, b * b - 4 * a * c FROM root.sg.d WHERE b > 0;
      SELECT a, b, -(bool_value * (a - b)) FROM root.sg.d;
      SELECT -3.14 + a / 15 + 926 FROM root.sg.d;
      SELECT +a % 3.14 FROM root.sg.d WHERE a < 0;
2. function queries
  SELECT a + abs(a), sin(a) * cos(a) FROM root.sg.d;
  SELECT a, b, sqrt(a) * sqrt(b) / (a * b) FROM root.sg.d WHERE a < 0;
3. nested queries
      select a, b, a + b + udf(sin(a) * sin(b), cos(a) * cos(b)) FROM root.sg.d;
      select a, a + a, sin(sin(sin(a + a))) FROM root.sg.d WHERE a < 0;

    Additional requirements:
    1. For performance reasons, it's better to perform as few disk read operations as possible.
    Example:
    SELECT a, sin(a + a) FROM root.sg.d WHERE a < 0;
    The series root.sg.d.a should be read only once during the query.

2. For performance reasons, it's better to reuse intermediate calculation results as much as possible (see the sketch after this list).
    Example:
    SELECT a + a, sin(a + a) FROM root.sg.d WHERE a < 0;
    The intermediate calculation result a + a should only be evaluated once during the query.

    3. Need to consider memory-constrained scenarios.
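A generic illustration of requirement 2: identical sub-expressions such as the two occurrences of "a + a" can be evaluated once per row by caching results under a canonical textual form. All names here are illustrative, not IoTDB code:

import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

// Caches intermediate results per row so common sub-expressions are computed once.
final class ExpressionEvaluator {

    private final Map<String, Double> memo = new HashMap<>();

    double binary(String canonicalForm, double left, double right, DoubleBinaryOperator op) {
        return memo.computeIfAbsent(canonicalForm, k -> op.applyAsDouble(left, right));
    }

    void nextRow() {
        memo.clear(); // intermediate results are only shared within one row
    }

    public static void main(String[] args) {
        ExpressionEvaluator eval = new ExpressionEvaluator();
        double a = 3.0;
        // SELECT a + a, sin(a + a) ...: "a + a" is computed a single time per row.
        double sum = eval.binary("a + a", a, a, Double::sum);
        double sinOfSum = Math.sin(eval.binary("a + a", a, a, Double::sum));
        System.out.println(sum + " " + sinOfSum);
        eval.nextRow();
    }
}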

    What knowledge you need to know:
    1. Java
    2. Basic database knowledge (such as SQL, etc.)
    3. ANTLR
    4. IoTDB query process

    Links:
    [1] iotdb.apache.org

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org


    Apache IoTDB: integration with Chaos Mesh

    Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.


    Chaos Mesh [2] is a versatile chaos engineering solution that features all-around fault injection methods for complex systems on Kubernetes [3], covering faults in Pod, network, file system, and even the kernel.


We hope that Chaos Mesh can be used as a versatile chaos testing tool for the IoTDB cluster module, so as to verify the reliability of the IoTDB cluster module in a production environment.


    You should define a series of failure simulations for the cluster using Chaos Mesh, such as Network partition, Network packet loss and Node collapse, and then define a series of operations and the expected results of those operations.


This task hopes that you can set up an automated framework for chaos testing of the IoTDB cluster module, so that we can detect potential problems in the cluster module and iteratively fix them.


    Needed skills:

    • Java
    • Go
    • Kubernetes
• Chaos Mesh
    • Know iotdb-benchmark [4]


    [1] https://iotdb.apache.org

    [2] https://chaos-mesh.org

    [3] https://kubernetes.io

    [4] https://github.com/thulab/iotdb-benchmark

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org

    ...

    GSOC: Varnish Cache support in Apache Traffic Control

    Background
    Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

    Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

    There are multiple aspects to this project:

    • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
    • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
    • Testing: Adding automated tests for new code

    Skills:

    • Proficiency in Go is required
    • A basic knowledge of HTTP and caching is preferred, but not required for this project.
    Difficulty: Major
    Potential mentors:
    Eric Friedrich, mail: friede (at) apache.org
    Project Devs, mail: dev (at) trafficcontrol.apache.org

    ...

    DolphinScheduler

    Apache DolphinScheduler-Parameter coverage

    Apache DolphinScheduler

    Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces, dedicated to solving complex job dependencies in the data pipeline and providing various types of jobs available out of the box.

Page: https://dolphinscheduler.apache.org
    GitHub: https://github.com/apache/incubator-dolphinscheduler

    Background:
    Configuration parameter override

At present, our parameter configuration is mainly based on configuration files; you can refer to PropertiesUtils.

But usually important parameters will also be injected as Java runtime virtual machine parameters, so we need to support this way of parameter injection. At the same time, because different ways of parameter injection have different priorities, we need to implement configuration overriding. There are two main sources at present, SystemProperties and LocalFile. The priority of SystemProperties should be the highest, followed by LocalFile (that is, our various configuration files, such as master.properties). A minimal sketch of this lookup order follows the example below.

    issue:
    https://github.com/apache/incubator-dolphinscheduler/issues/5164

    for example:
1: Configure master.max.cpuload.avg=-1 in master.properties

    2: Java runtime virtual machine parameters -Dmaster.max.cpuload.avg=1

    3:PropertiesUtils.get("master.max.cpuload.avg") = 1
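A minimal sketch of the intended lookup order (system properties first, then the local file); the class shape is illustrative and does not reflect the existing PropertiesUtils implementation:

import java.io.InputStream;
import java.util.Properties;

// JVM system properties (-Dkey=value) take precedence over values loaded from
// a local configuration file such as master.properties.
public final class ConfigLookupSketch {

    private static final Properties LOCAL_FILE = new Properties();

    static {
        try (InputStream in = ConfigLookupSketch.class
                .getResourceAsStream("/master.properties")) {
            if (in != null) {
                LOCAL_FILE.load(in);
            }
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Returns the system property if present, otherwise the file value,
    // otherwise the supplied default.
    public static String get(String key, String defaultValue) {
        String fromJvm = System.getProperty(key);
        if (fromJvm != null) {
            return fromJvm;
        }
        return LOCAL_FILE.getProperty(key, defaultValue);
    }

    public static void main(String[] args) {
        // With -Dmaster.max.cpuload.avg=1 on the command line, this prints 1
        // even if master.properties contains master.max.cpuload.avg=-1.
        System.out.println(get("master.max.cpuload.avg", "-1"));
    }
}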

    Task: realize configuration parameter coverage

    Mentor: CalvinKirs kirs@apache.org

    Difficulty: Major
    Potential mentors:
    Calvin Kirs, mail: kirs (at) apache.org
Project Devs, mail: dev (at) dolphinscheduler.apache.org

    CouchDB

    GSoC: Apache CouchDB and Debezium integration

    Apache CouchDB software is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.

    Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.


    CouchDB has a change capture feed as a public HTTP API endpoint. Integrating with Debezium would provide an easy way to translate the _changes feed into a Kafka topic which plugs us into a much larger ecosystem of tools and alleviates the need for every consumer of data in CouchDB to build a bespoke “follower” of the _changes feed.


    The project for GSoC 2021 here is to design, implement and test a CouchDB connector for Debezium.
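Debezium connectors build on Kafka Connect, so one possible starting point is a plain Kafka Connect source task that follows the _changes feed. Everything below is an illustrative skeleton under that assumption: the config key, topic naming and offset handling are placeholders, and a real implementation would use Debezium's connector framework:

import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Skeleton only: a source task that would poll the CouchDB _changes feed and
// emit one record per change.
public class CouchDbSourceTaskSketch extends SourceTask {

    private String couchDbUrl;
    private String lastSeq = "0";   // resume point within the _changes feed

    @Override
    public void start(Map<String, String> props) {
        couchDbUrl = props.get("couchdb.url");
        // Offsets previously committed by Kafka Connect could be restored here
        // (via context.offsetStorageReader()) so we continue from lastSeq.
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // Pseudo-step: fetch the next batch from GET {couchDbUrl}/{db}/_changes?since={lastSeq},
        // convert each change into a SourceRecord keyed by document id, and
        // remember the new last_seq as the source offset.
        Thread.sleep(1000);            // placeholder for the HTTP long-poll
        return Collections.emptyList(); // no changes fetched in this sketch
    }

    @Override
    public void stop() {
        // close the HTTP client / release resources
    }

    @Override
    public String version() {
        return "0.1-sketch";
    }
}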


    Required skills:

    • Java

    Nice-to-have skills:

    • Erlang

Difficulty: Major
Potential mentors:
Balázs Donát Bessenyei, mail: bessbd (at) apache.org
Project Devs, mail: dev (at) couchdb.apache.org

    CloudStack

CloudStack GSoC 2021 - Clone a Virtual Machine (with all the data disks)

Hi there,

Here is the background of the proposed improvement in CloudStack.

Currently, there is no straightforward way to clone / create a copy of a VM (with all its data disks) in CloudStack. The operator/admin has to follow a series of steps/API commands to achieve that, and it also takes considerable time (waiting for and checking each command's response before proceeding to the next step). Some hypervisors (e.g. VMware) already support a clone VM operation, and CloudStack can leverage that.

The support for this new functionality can be integrated by introducing a new (admin-only) API to clone the VM, something like cloneVirtualMachine, which provides a direct way to clone / create a copy of the VM (with all the data disks). CloudStack internally performs all the required operations to create the copy of the VM (leveraging the relevant hypervisor operations if necessary), and returns the VM as the response on success, otherwise it throws the relevant error message.

This improvement will be a good addition to the VM operations supported in CloudStack. It requires some virtualization/cloud domain knowledge & usage.

More details here: https://github.com/apache/cloudstack/issues/4818

Skills Required:

• Java and Python
• Vue.js (for UI integration)

Difficulty: Major
Potential mentors:
Suresh Kumar Anaparti, mail: sureshkumar.anaparti (at) apache.org
Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2021 Ideas

    Hello Students! We are the Apache CloudStack project. From our project website: "Apache CloudStack is open source software designed to deploy and manage large networks of virtual machines, as a highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. CloudStack is used by a number of service providers to offer public cloud services, and by many companies to provide an on-premises (private) cloud offering, or as part of a hybrid cloud solution."

    2-min video on the Apache CloudStack project - https://www.youtube.com/watch?v=oJ4b8HFmFTc 

    Here's about an hour-long intro to what is CloudStack - https://www.youtube.com/watch?v=4qFFwyK9hos 

The general skills a student would need are Java, Python, JavaScript/Vue. Idea-specific requirements are mentioned on the idea issue. We're a diverse and welcoming community and we encourage interested students to join the dev ML: http://cloudstack.apache.org/mailing-lists.html (dev@cloudstack.apache.org)

    All our Apache CloudStack GSoC2021 ideas are tracked on the project's Github issue: https://github.com/apache/cloudstack/issues?q=is%3Aissue+is%3Aopen+label%3Agsoc2021



Feature | Skills Required | Difficulty Level | Potential Mentor(s) | Details and Discussion
Support Multiple SSH Keys for VMs | Java, Javascript/Vue | Medium | David Jumani (david.jumani@shapeblue.com) | https://github.com/apache/cloudstack/issues/4813
Clone a Virtual Machine | Java, Javascript/Vue | Medium | Suresh Anaparti (sureshanaparti@apache.org) | https://github.com/apache/cloudstack/issues/4818
UI Shortcuts (UX improvements in the UI) | Javascript, Vue | Easy | Boris Stoyanov (boris.stoyanov@shapeblue.com), David Jumani (david.jumani@shapeblue.com) | https://github.com/apache/cloudstack/issues/4798
CloudStack OAuth2 Plugin | Java, Javascript/Vue | Medium | Nicolas Vazquez (nicovazquez90@gmail.com), Rohit Yadav (rohit@apache.org) | https://github.com/apache/cloudstack/issues/4834
Synchronization of network devices on newly added hosts for Persistent Networks | Java | Medium | Pearl Dsilva (pearl.dsilva@shapeblue.com) | https://github.com/apache/cloudstack/issues/4814
Add SPICE console for vms on KVM/XenServer | Java, Python, Javascript | Hard | Wei Zhou (ustcweizhou@gmail.com) | https://github.com/apache/cloudstack/issues/4803
Configuration parameters and APIs mappings | Java, Python | Hard | Harikrishna Patnala (harikrishna@apache.org) | https://github.com/apache/cloudstack/issues/4825
Add virt-v2v support in CloudStack for VM import to KVM | Java, Python, libvirt, libguestfs | Hard | Rohit Yadav (rohit@apache.org) | https://github.com/apache/cloudstack/issues/4696


    We have an onboarding course for students to learn and get started with CloudStack:
    https://github.com/shapeblue/hackerbook

    Project wiki and other resources:
    https://cwiki.apache.org/confluence/display/CLOUDSTACK

    https://github.com/apache/cloudstack

    http://docs.cloudstack.apache.org/

    Difficulty: Major
    Potential mentors:
    Rohit Yadav, mail: bhaisaab (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    ...

    Prevent and fail-fast any attempts to incremental repair cdc/mv tables

    Running incremental repairs on CDC or MV tables breaks them.

Attempting to run incremental repair on such tables should fail fast and be prevented, with a clear error message.

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Add ability to ttl snapshots

    It should be possible to add a TTL to snapshots, after which it automatically cleans itself up.

This will be useful together with the auto_snapshot option, where you want to keep an emergency snapshot in case of accidental drop or truncation but automatically remove it after a specified period when it's no longer useful. So in addition to allowing a user to specify a snapshot TTL on nodetool snapshot, we should have an auto_snapshot_ttl option that allows a user to set a TTL for automatic snapshots on drop/truncate.

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

Add nodetool command to display or export the contents of a virtual table

Several virtual tables were recently added, but they're currently only accessible via cqlsh or programmatically. While this is valuable for many use cases, operators are accustomed to the convenience of querying system metrics with a simple nodetool command.

In addition to that, a relatively common request is to provide nodetool output in different formats (JSON, YAML and even XML) (CASSANDRA-5977, CASSANDRA-12035, CASSANDRA-12486, CASSANDRA-12698, CASSANDRA-12503). However, this requires lots of manual labor as each nodetool subcommand needs to be adapted to support new output formats.

I propose adding a new nodetool command that will consistently print to the standard output the contents of a virtual table. By default the command will print the output in a human-readable tabular format similar to cqlsh, but a "--format" parameter can be specified to change the output to some other format like JSON or YAML.

It should be possible to add a limit to the amount of rows displayed and to filter the output to rows from a specific keyspace or table. The command should be flexible and provide simple hooks for registration and customization of new virtual tables.

I propose calling this command nodetool show <virtualtable> (naming bikeshedding welcome), for example:

    nodetool show --list
    caches
    clients
    internode_inbound
    internode_outbound
    settings
    sstable_tasks
    system_properties
    thread_pools

    nodetool show clients --format yaml
    ...
    nodetool show internode_outbound --format json
    ...
    nodetool show sstable_tasks --keyspace my_ks --table my_table
    ...

Difficulty: Normal
Potential mentors:
paulo, mail: paulo (at) apache.org
Project Devs, mail: dev (at) cassandra.apache.org

Expose application_name and application_version in virtual table system_views.clients

Recent java-driver's com.datastax.oss.driver.api.core.session.SessionBuilder respects the properties ApplicationName and ApplicationVersion.

It would be helpful to expose this information via the virtual table system_views.clients and with nodetool clientstats.

Potential mentors:
paulo, mail: paulo (at) apache.org
Project Devs, mail: dev (at) cassandra.apache.org


    Add ability to disable schema changes, repairs, bootstraps, etc (during upgrades)

    There are a lot of operations that aren't supposed to be run in a mixed version cluster: schema changes, repairs, topology changes, etc. However, it's easily possible for these operations to be accidentally run by a script, another user unaware of the upgrade, or an operator that's not aware of these rules.

    We should make it easy to follow the rules by making it possible to prevent/disable all of these operations through nodetool commands. At the start of an upgrade, an operator can disable all of these until the upgrade has been completed.

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Allow table property defaults (e.g. compaction, compression) to be specified for a cluster/keyspace

    During an IRC discussion in cassandra-dev it was proposed that we could have table property defaults stored on a Keyspace or globally within the cluster. For example, this would allow users to specify "All new tables on this cluster should default to LCS with SSTable size of 320MiB" or "all new tables in Keyspace XYZ should have Zstd commpression with a 8 KiB block size" or "default_time_to_live should default to 3 days" etc ... This way operators can choose the default that makes sense for their organization once (e.g. LCS if they are running on fast SSDs), rather than requiring developers creating the Keyspaces/Tables to make the decision on every creation (often without context of which choices are right).

    A few implementation options were discussed including:

    • A YAML option
    • Schema provided at the Keyspace level that would be inherited by any tables automatically
    • Schema provided at the Cluster level that would be inherited by any Keyspaces or Tables automatically

    In IRC it appears that rough consensus was found in having global -> keyspace -> table defaults which would be stored in schema (no YAML configuration since this isn't node level really, it's a cluster level config).

    Difficulty: Challenging
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Global configuration parameter to reject repairs with anti-compaction

We have moved from Cassandra 2.1 to 3.0 and, from an operational aspect, the Cassandra repair area changed significantly / got more complex. Besides incremental repairs not working reliably, full repairs (the -full command-line option) also run into anti-compaction code paths, splitting repaired / non-repaired data into separate SSTables.

Cassandra 4.x (with repair enhancements) is quite far away for us (for production usage), thus we want to avoid anti-compactions with Cassandra 3.x at any cost. Especially for our on-premise installations at customer sites, with less control over how e.g. nodetool is used, we simply want to have a configuration parameter in e.g. cassandra.yaml, which we could use to reject any repair invocation that results in anti-compaction being active.

    I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough to reject anti-compaction repairs, e.g. if someone executes nodetool repair ... the wrong way (accidentally).

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Expose application_name and application_version in virtual table system_views.clients

    Recent versions of the java-driver's com.datastax.oss.driver.api.core.session.SessionBuilder respect the ApplicationName and ApplicationVersion properties.

    It would be helpful to expose this information via the virtual table system_views.clients and via nodetool clientstats.
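
    For context, the client side already looks roughly like the following with a recent java-driver (a minimal sketch; the builder methods are part of the 4.x SessionBuilder API, but exact availability depends on the driver version):

        import com.datastax.oss.driver.api.core.CqlSession;

        public class ClientMetadataExample {
            public static void main(String[] args) {
                // The application sets its own name/version; the proposal is for the
                // server to surface these values in system_views.clients and in
                // nodetool clientstats output.
                try (CqlSession session = CqlSession.builder()
                        .withApplicationName("billing-service")
                        .withApplicationVersion("1.4.2")
                        .build()) {
                    // ... run queries as usual
                }
            }
        }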

    Add nodetool command to display or export the contents of a virtual table

    Several virtual tables were recently added, but they're currently only accessible via cqlsh or programmatically. While this is valuable for many use cases, operators are accustomed to the convenience of querying system metrics with a simple nodetool command.

    In addition to that, a relatively common request is to provide nodetool output in different formats (JSON, YAML and even XML) (CASSANDRA-5977, CASSANDRA-12035, CASSANDRA-12486, CASSANDRA-12698, CASSANDRA-12503). However, this requires a lot of manual labor, as each nodetool subcommand needs to be adapted to support new output formats.

    I propose adding a new nodetool command that will consistently print to the standard output the contents of a virtual table. By default the command will print the output in a human-readable tabular format similar to cqlsh, but a "--format" parameter can be specified to modify the output to some other format like JSON or YAML.

    It should be possible to add a limit to the amount of rows displayed and filter to display only rows from a specific keyspace or table. The command should be flexible and provide simple hooks for registration and customization of new virtual tables.

    I propose calling this command nodetool show <virtualtable> (naming bikeshedding welcome), for example:

    nodetool show --list
    caches clients internode_inbound internode_outbound settings sstable_tasks system_properties thread_pools

    nodetool show clients --format yaml
    ...

    nodetool show internode_outbound --format json
    ...

    nodetool show sstable_tasks --keyspace my_ks --table my_table
    ...
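
    As an illustration only, the YAML output for the clients table might look like the following (the exact column set and formatting are assumptions to be settled during implementation):

        clients:
          - address: 10.0.0.12
            port: 51042
            username: app_user
            driver_name: DataStax Java Driver
            driver_version: 4.13.0
            protocol_version: 5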

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Per-node overrides for table settings

    There are a few cases where it's convenient to set some table parameters on only one or a few nodes. For instance, it's useful for experimenting with settings like caching options, compaction, compression, read repair chance, gcGrace, ... Another case is when you want to completely migrate to a new setting, but want to do it node per node (mainly useful when switching compaction strategy, see CASSANDRA-10898).

    I'll note that we can already do some of this through JMX for some of the settings as we have methods like ColumnFamilyStoreMBean.setCompactionParameters(), but:

    1. these parameter settings are initially set in CQL. Having to go to JMX for this seems less consistent to me. The fact that we have both a ColumnFamilyStoreMBean.setCompactionParameters() and a ColumnFamilyStoreMBean.setCompactionParametersJson() (as I assume the former is inconvenient to use) is also proof to me that JMX isn't terribly appropriate (a JMX-based sketch follows this list).
    2. I think this can be potentially useful for almost all table settings, but we don't expose JMX methods for all of them, and it would be annoying to have to. The approach suggested below wouldn't have to be updated every time we add a new setting (if done right).
    3. Changing options through JMX is not persistent across restarts. This may arguably be fine in some cases, but if you're trying to migrate your compaction strategy node per node, or want to experiment with a setting over a medium-ish time period, it's mostly a pain.
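
    For reference, the JMX route mentioned in point 1 looks roughly like this today (a sketch; the MBean object name pattern is an assumption and differs between Cassandra versions, and the change does not survive a restart):

        import javax.management.MBeanServerConnection;
        import javax.management.ObjectName;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        public class JmxCompactionOverride {
            public static void main(String[] args) throws Exception {
                // Connect to a single node's JMX port (7199 by default).
                JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://192.168.0.1:7199/jmxrmi");
                try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    // Assumed object name pattern for the table's MBean.
                    ObjectName table = new ObjectName(
                            "org.apache.cassandra.db:type=Tables,keyspace=foo_ks,table=foo");
                    // setCompactionParametersJson is the MBean operation mentioned above;
                    // the new parameters are lost when the node restarts.
                    mbs.invoke(table, "setCompactionParametersJson",
                            new Object[] { "{\"class\":\"LeveledCompactionStrategy\"}" },
                            new String[] { String.class.getName() });
                }
            }
        }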

    So what I suggest would be to add node overrides to the normal table settings (which would be part of the schema like any other setting). In other words, if you wanted to set LCS for only one specific node, you'd do:

    ALTER TABLE foo WITH node_overrides = {
        '192.168.0.1' : { 'compaction' : { 'class' : 'LeveledCompactionStrategy' } }
    }
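
    Building on the hypothetical node_overrides syntax above, a node-per-node compaction migration could then be a plain sequence of schema changes (sketch only):

        -- switch the first node and observe it for a while
        ALTER TABLE foo WITH node_overrides = {
            '192.168.0.1' : { 'compaction' : { 'class' : 'LeveledCompactionStrategy' } }
        };

        -- once satisfied, make it the table-wide default and drop the override
        ALTER TABLE foo WITH compaction = { 'class' : 'LeveledCompactionStrategy' }
                        AND node_overrides = { };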

    I'll note that I already suggested this idea on CASSANDRA-10898, but as it's more generic than what that ticket is about, I'm creating a separate ticket for it.

    Difficulty: Challenging
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Add ability to ttl snapshots

    It should be possible to add a TTL to snapshots, after which they are automatically cleaned up.

    This will be useful together with the auto_snapshot option, where you want to keep an emergency snapshot in case of an accidental drop or truncation but automatically remove it after a specified period once it's no longer useful. So, in addition to allowing a user to specify a snapshot TTL on nodetool snapshot, we should have an auto_snapshot_ttl option that lets a user set a TTL for the automatic snapshots taken on drop/truncate.
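
    A sketch of how the two pieces could fit together; the flag and option names below are hypothetical at this point:

        # operator-initiated snapshot that removes itself after three days (hypothetical flag)
        nodetool snapshot --ttl 3d -t pre_migration my_keyspace

        # cassandra.yaml (hypothetical option): TTL applied to the snapshots taken
        # automatically on DROP/TRUNCATE when auto_snapshot is enabled
        auto_snapshot: true
        auto_snapshot_ttl: 3d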

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    ...