
Introduce event windowing to the StreamPipes core/sdk

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.

Background

Currently, windowing logic is defined individually per pipeline element. The whole windowing logic needs to be declared in the controller, and the runtime logic needs to be added separately based on the selected runtime wrapper (Java, Siddhi, Flink, etc.).

As many data processors benefit from using window functions (e.g., pipeline elements such as Event Counter, Count Aggregation, Rate Limiter), windowing logic is often duplicated because it needs to be implemented for every new pipeline element. In addition, the feature set of supported window operators differs (and often depends on the developer), as it is unclear which windows and parameters should or can be offered.

Therefore, adding support for explicit window semantics to the SDK/Core would make implementing data processors and sinks using windows much easier and less error-prone.

Tasks

  1. Design and introduce new processor and controller classes for windowed event processors (e.g., WindowedDataProcessor) which handle the windowing logic internally and only expose higher-level methods to users (e.g., onCurrentEvent, onExpiredEvent); a sketch of such an interface is shown after this list.
  2. Implement internal logic for a few window functions (e.g., TimeWindow, LengthWindow, TimeBatchWindow, LengthBatchWindow)
  3. Write a few sample pipeline-elements using your new API!
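To make the intended API more concrete, below is a minimal sketch in plain Java of what such a windowed processor interface could look like. All names here (WindowedDataProcessor, WindowContext and the callback methods) are hypothetical illustrations of task 1, not existing StreamPipes SDK classes.

import java.util.Map;

// Hypothetical sketch of a windowed processor API; none of these names exist in the SDK yet.
public interface WindowedDataProcessor {

    // Called for every event entering the window.
    void onCurrentEvent(Map<String, Object> event, WindowContext context);

    // Called for every event leaving (expiring from) the window.
    void onExpiredEvent(Map<String, Object> event, WindowContext context);

    // Context handed to user code; the window state itself (time-, length- and batch-based
    // variants) would be created and maintained by the core/SDK, not by the pipeline element.
    interface WindowContext {

        // Events currently held in the window.
        Iterable<Map<String, Object>> currentWindow();

        // Forward a result event downstream.
        void emit(Map<String, Object> event);
    }
}

A runtime wrapper (plain JVM, Siddhi, Flink, ...) would then be free to map these callbacks onto its native window operators.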

Relevant Skills

  • Basic knowledge in StreamPipes core (cloning the repo, going through the codebase/documents would do).
  • Basic knowledge of stream analytics window functions (this is not a must, but it's awesome if you know your way around analytics window functions).
  • Some Java experience.

Learning Material

For StreamPipes:

For Streaming Analytics:

For the context of the issue:

Mentor

  • Grainier Perera (grainier [at] apache.org).
Difficulty: Major
Potential mentors:
Grainier Perera, mail: grainier (at) apache.org
Project Devs, mail:


New Python Wrapper

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.


Background

Current wrappers such as standalone (JVM, Siddhi) or distributed (Flink) already allow developing new processors in the given runtime environment. The idea is to extend the list of standalone runtime wrappers to also support pure Python processors. We already have a minimal working version which, however, is pretty inflexible and still relies on Java as a proxy to the pipeline management in the backend service for the model declaration in the setup phase (capabilities, requirements, static properties of a processor) as well as the actual invocation in the execution phase (receiving the specific configuration from pipeline management when a pipeline is started). This issue is to track the status of the development.

Tasks

  1. Add API endpoints as an interface for registration/invocation (partly done)
  2. Port relevant model classes over to Python (declaration + invocation descriptions)
  3. Implement support for various transport protocols and transport formats
  4. Implement a developer-friendly alternative to the Java builder pattern for model declaration
  5. Implement the overall runtime logic for the Python wrapper



Relevant Skills

0. Don't be afraid! We'll guide you through your first steps with StreamPipes.

  1. Excellent Python skills
  2. Excellent understanding of the stream processing paradigm incl. message brokers such as Kafka, MQTT, etc.
  3. Good Understanding of RESTful web services (HTTP, etc.)
  4. Basic Java skills to understand existing wrapper logic

Info

Mentor

Patrick Wiener, PPMC Apache StreamPipes (wiener@apache.org)

Difficulty: Major
Potential mentors:
Patrick Wiener, mail: wiener (at) apache.org
Project Devs, mail:

More powerful real-time visualizations for StreamPipes

Apache StreamPipes

Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.

Background

Currently, the live dashboard (implemented in Angular) offers an initial set of simple visualizations, such as line charts, gauges, tables and single values. More advanced visualizations, especially those relevant for condition monitoring tasks (e.g., monitoring sensor measurements from industrial machines), are currently missing. Visualizations can be flexibly created by users and there is an SDK that allows to express requirements (e.g., based on data type or semantic type) for visualizations to better guide users through the creation process.

Tasks

  1. Extend the set of real-time visualizations in StreamPipes, e.g., by integrating existing visualizations from Apache ECharts.
  2. Improve the existing dashboard, e.g., by introducing better filtering or more advanced customization options.

Relevant Skills

0. Don't be afraid! We'll guide you through your first steps with StreamPipes.

  1. Angular
  2. Basic knowledge of Apache ECharts

Mentor

Dominik Riemer, PPMC Apache StreamPipes (riemer@apache.org)

Difficulty: Major
Potential mentors:
Dominik Riemer, mail: riemer (at) apache.org
Project Devs, mail:

Spatial Information Systems

Coordinate operation methods to implement

This is an umbrella task for some coordinate operation methods not yet supported in Apache SIS. Coordinate operations include map projections (e.g. Transverse Mercator, Lambert Conic Conformal, etc.), datum shifts (e.g. transformations from NAD27 to NAD83 in United States), transformation of vertical coordinates, etc. We can of course not list all possible formulas that we do not support, but this JIRA task lists at least some of the operations listed in the EPSG guidance notes.

The main material for this work is the EPSG guidance notes, which can be downloaded freely from the following site:

IOGP Publication 373-7-2 – Geomatics Guidance Note number 7, part 2
Coordinate Conversions and Transformations including Formulas
http://www.epsg.org/GuidanceNotes

Google Summer of Code students interested in this work would need to be reasonably comfortable with the Java language (but not necessarily with the JDK library at large, since this work uses relatively few JDK classes outside Math), and in mathematics. In particular, this work requires a good understanding of affine transforms: their representation as a matrix, and how to map a term in a formula to a coefficient in the affine transform matrix.

Apache SIS has one advanced feature which is not easily found in popular geospatial software or text books: the capability to compute the derivative (or more precisely, the Jacobian) of a transformation at a given point. Implementation of this feature requires the capability to find the analytic derivative of a non-linear formula and to simplify it.
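As a small, hedged illustration of the two capabilities mentioned above (applying a coordinate operation and asking for its Jacobian at a point), the following sketch uses public Apache SIS / GeoAPI methods (CRS.forCode, CRS.findOperation, MathTransform.transform, MathTransform.derivative). The EPSG codes are arbitrary examples and the snippet assumes the EPSG dataset is available on the classpath.

import org.apache.sis.geometry.DirectPosition2D;
import org.apache.sis.referencing.CRS;
import org.opengis.referencing.crs.CoordinateReferenceSystem;
import org.opengis.referencing.operation.CoordinateOperation;
import org.opengis.referencing.operation.MathTransform;
import org.opengis.referencing.operation.Matrix;

public class JacobianDemo {
    public static void main(String[] args) throws Exception {
        // Geographic coordinates (EPSG:4326) to a Transverse Mercator projection (example: EPSG:32631).
        CoordinateReferenceSystem source = CRS.forCode("EPSG:4326");
        CoordinateReferenceSystem target = CRS.forCode("EPSG:32631");
        CoordinateOperation op = CRS.findOperation(source, target, null);
        MathTransform tr = op.getMathTransform();

        DirectPosition2D point = new DirectPosition2D(45, 3);   // latitude, longitude (axis order of EPSG:4326)
        System.out.println("Projected: " + tr.transform(point, null));

        // The Jacobian (partial derivatives) of the operation at that point.
        Matrix jacobian = tr.derivative(point);
        System.out.println("Jacobian:\n" + jacobian);
    }
}

New operation methods implemented for this task would be expected to provide both the transform() and the derivative() parts.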

Implementations of those formulas take place in one of the org.apache.sis.referencing.operation sub-packages (projection or transform). Implementations of JUnit tests happen partially in Apache SIS, and partially in the "conformance module" of the GeoAPI project, if possible through the Geospatial Integrity of Geoscience Software (GIGS) tests.

Difficulty: Major
Potential mentors:
Martin Desruisseaux, mail: desruisseaux (at) apache.org
Project Devs, mail:

Create metadata, CRS and tabular data editors in JavaFX

Create the foundation of a GUI application for Apache SIS based on JavaFX. This application should leverage the functionalities available in Apache SIS 0.8. In particular:

  • Read metadata from files in various formats (currently ISO 19139, GeoTIFF, NetCDF, LANDSAT, GPX, Moving Features)
  • Get Coordinate Reference System from a registry or from GML or WKT definitions and apply coordinate transformations.
  • Show vector data in a tabular format.

Since SIS does not yet have a renderer engine, we can not yet show maps in the application. However the application should be designed with this goal in mind.

This project should create a metadata editor showing the ISO 19115 metadata. We should provide a simplified view with only the essential information, and an advanced view showing all information. The information to show should be customizable. The user should be able to edit the metadata and save them in ISO 19139 format.

The project should also create the necessary widgets for showing a Coordinate Reference System (CRS) definition and allow the user to edit it. Another widget should use the CRS definitions for applying coordinate operations (map projections) using the existing Apache SIS referencing engine, and show the result in a table with information about accuracy and domain of validity.

Edit (March 2021): A JavaFX application has been created. It has widgets for metadata and vector data, but we still need a widget for Coordinate Reference System definitions. See the SIS wiki for screenshots.

Difficulty: Major
Potential mentors:
Martin Desruisseaux, mail: desruisseaux (at) apache.org
Project Devs, mail:


Solr

Refactor test infra to work with a managed SolrClient; ditch TestHarness

This is a proposal to substantially refactor SolrTestCaseJ4 and some of its intermediate subclasses in the hierarchy.  In essence, I envision that tests should work with a SolrClient typed "solrClient" field managed by the test infrastructure. With only a few lines of code, a test should be able to pick between an instance based on EmbeddedSolrServer (lighter tests), HttpSolrClient (tests HTTP/Jetty behavior directly or indirectly), SolrCloud, and perhaps a special one for our distributed search tests. STCJ4 would refactor its methods to use the solrClient field instead of TestHarness. TestHarness would disappear as-such; bits of its existing code would migrate elsewhere, such as to manage an EmbeddedSolrServer for testing.

I think we can do a transition like this in stages while minimally affecting most tests by adding some deprecated shims. Perhaps STCJ4 should become the deprecated shim so that users can still use it during 7.x and to help us with the transition internally too. More specifically, we'd add a new superclass to STCJ4 that is the future – "SolrTestCase".

Additionally, there are a bunch of methods on SolrTestCaseJ4 that I question the design of, especially ones that return XML strings like delI (generates a delete-by-id XML string) and adoc. Perhaps that used to be a fine idea before there was a convenient SolrClient API but we've got one now and a test shouldn't be building XML unless it's trying to test exactly that.

For consulting work I once developed a JUnit4 TestRule managing a SolrClient that is declared in a test with an annotation of @ClassRule. I had a variation for SolrCloud and EmbeddedSolrServer that was easy for a test to choose. Since TestRule is an interface, I was able to make a special delegating SolrClient subclass that implements TestRule. This isn't essential but makes use of it easier since otherwise you'd be forced to call something like getSolrClient(). We could go the TestRule route here, which I prefer (with or without having it subclass SolrClient), or we could alternatively do TestCase subclassing to manage the lifecycle.
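To illustrate the TestRule idea described above, here is a rough sketch (not the actual proposal, just one possible shape) of a JUnit 4 class rule that owns an EmbeddedSolrServer-backed SolrClient; the solr home path and core name are placeholders.

import java.nio.file.Path;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.junit.rules.ExternalResource;

// Sketch of a test rule that owns the SolrClient lifecycle for a test class.
public class EmbeddedSolrClientRule extends ExternalResource {

    private final Path solrHome;
    private final String coreName;
    private SolrClient client;

    public EmbeddedSolrClientRule(Path solrHome, String coreName) {
        this.solrHome = solrHome;
        this.coreName = coreName;
    }

    @Override
    protected void before() {
        // Starts an embedded core container backed by the given solr home.
        client = new EmbeddedSolrServer(solrHome, coreName);
    }

    @Override
    protected void after() {
        try {
            client.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public SolrClient getSolrClient() {
        return client;
    }
}

// Usage in a test class (paths and core name are placeholders):
//   @ClassRule
//   public static EmbeddedSolrClientRule solr =
//       new EmbeddedSolrClientRule(java.nio.file.Paths.get("src/test/resources/solr"), "collection1");

A SolrCloud or HttpSolrClient variant of the same rule would let a test pick its client flavor with a one-line change.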

Initially I'm just looking for agreement and refinement of the approach. After that, sub-tasks ought to be added. I won't have time to work on this for some time.


Difficulty: Major
Potential mentors:
David Smiley, mail: dsmiley (at) apache.org
Project Devs, mail:


Pulsar

Integration with Apache Ranger

Currently, Pulsar only supports storing authorization policies in local ZooKeeper. Is it possible to support [Ranger](https://github.com/apache/ranger)? Apache Ranger can provide a framework for central administration of security policies and monitoring of user access.

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:

Throttle the ledger rollover for the broker

In Pulsar, a ledger rollover splits the data of a topic into multiple segments. For each ledger rollover operation, the metadata of the topic needs to be updated in ZooKeeper. A high ledger rollover frequency may put the ZooKeeper cluster under heavy load. In order to make ZooKeeper run more stably, we should limit the ledger rollover rate.
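One possible direction, sketched below as plain Java rather than actual broker code, is to enforce a minimum interval between rollovers per topic, so that rollover attempts are simply skipped while the topic is still inside its cool-down period.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Illustrative throttle: allow at most one ledger rollover per topic within a minimum interval.
public class LedgerRolloverThrottle {

    private final long minIntervalNanos;
    private final Map<String, Long> lastRollover = new HashMap<>();

    public LedgerRolloverThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    /** Returns true if the given topic may roll over its current ledger now. */
    public synchronized boolean tryAcquire(String topic) {
        long now = System.nanoTime();
        Long last = lastRollover.get(topic);
        if (last != null && now - last < minIntervalNanos) {
            return false; // still in the cool-down period, defer the rollover
        }
        lastRollover.put(topic, now);
        return true;
    }
}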

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:

Support reset cursor by message index

Currently, Pulsar supports resetting the cursor according to time and message-id, e.g. you can reset the cursor to 3 hours ago or reset the cursor to a specific message-id. For cases where users want to reset the cursor to, say, 10,000 messages earlier, Pulsar does not support this operation yet.

PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic). This will provide the ability to reset the cursor according to the message index.
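For context, the sketch below shows the existing admin-client calls for resetting a cursor by time or message-id, plus, as a comment, the kind of index-based method this project could add on top of PIP-70. The resetCursorByIndex name is hypothetical, and the service URL and topic are placeholders.

import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.MessageId;

public class ResetCursorExample {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // placeholder admin URL
                .build();

        String topic = "persistent://public/default/my-topic";
        String subscription = "my-sub";

        // Existing behaviour: reset by time (3 hours ago) or to a specific message id.
        long threeHoursAgo = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(3);
        admin.topics().resetCursor(topic, subscription, threeHoursAgo);
        admin.topics().resetCursor(topic, subscription, MessageId.earliest);

        // Hypothetical API this project could add, built on the PIP-70 message index:
        // admin.topics().resetCursorByIndex(topic, subscription, currentIndex - 10_000);

        admin.close();
    }
}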

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:

Support publish and consume avro objects in pulsar-perf

We should extend the pulsar-perf tool so that it can benchmark producing and consuming messages using a Schema.
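To clarify what "using Schema" means here, the sketch below shows the client-side pattern that pulsar-perf should be able to exercise: producing and consuming a POJO with Schema.AVRO. The service URL, topic name and POJO are placeholders.

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class AvroSchemaExample {

    // Simple POJO used to derive the Avro schema.
    public static class SensorReading {
        public String sensorId;
        public double value;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder
                .build();

        Producer<SensorReading> producer = client.newProducer(Schema.AVRO(SensorReading.class))
                .topic("perf-avro-topic")
                .create();

        Consumer<SensorReading> consumer = client.newConsumer(Schema.AVRO(SensorReading.class))
                .topic("perf-avro-topic")
                .subscriptionName("perf-sub")
                .subscribe();

        SensorReading reading = new SensorReading();
        reading.sensorId = "sensor-1";
        reading.value = 42.0;
        producer.send(reading);

        SensorReading received = consumer.receive().getValue();
        System.out.println(received.sensorId + " = " + received.value);

        producer.close();
        consumer.close();
        client.close();
    }
}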


Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:


Expose the broker level message metadata to the client

PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata and already supports adding a message index and a broker timestamp for each message. But currently, the client can't get the broker level message metadata, since the broker skips this information when dispatching messages to the client. Provide a way to expose the broker level message metadata to the client.
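A rough sketch of what the consumer-facing side could look like once the metadata is passed through; everything uses the existing client API except the two accessors in the comments, which are hypothetical and would be part of this project's design.

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class BrokerMetadataExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/my-topic")
                .subscriptionName("my-sub")
                .subscribe();

        Message<byte[]> msg = consumer.receive();
        System.out.println("Client-side publish time: " + msg.getPublishTime());

        // Hypothetical accessors to be designed in this project, backed by the
        // PIP-70 broker entry metadata (which the broker skips today):
        // long index = msg.getIndex();                  // broker-assigned message index
        // long brokerTime = msg.getBrokerPublishTime(); // broker-added timestamp

        consumer.acknowledge(msg);
        client.close();
    }
}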

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:


Improve the message backlogs for the topic

In Pulsar, the client usually sends several messages in a batch. On the broker side, the broker receives a batch and writes the batched messages to the storage layer.

The message backlog tracks how many messages still need to be handled for a subscription. Unfortunately, the current backlog is based on batches, not messages. This confuses users: they may have pushed 1000 messages to the topic, but checking the backlog on the subscription side returns a lower value, such as 100 batches. The message-based backlog is not available because it is so expensive to calculate the number of messages in each batch.


PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic). This will provide the ability to calculate the number of messages between one message index and another, so we can leverage PIP-70 to improve the message backlog implementation and get a message-based backlog.


For the Exclusive or Failover subscription, it is easy to implement by calculating the messages between the mark-delete position and the LAC position. But for the Shared and Key_Shared subscriptions, the individual acknowledgments bring some complexity. We can cache the individual acknowledgment count in the broker memory, so the way to calculate the message backlog for the Shared and Key_Shared subscriptions is `backlogOfTheMarkdeletePosition` - `IndividualAckCount`.
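The calculation in the previous paragraph can be written down as a small helper, sketched here outside of the real broker code; the inputs (message index of the LAC, of the mark-delete position, and the cached individual-ack count) are assumed to be obtainable via PIP-70 and the broker's in-memory state.

// Illustrative helper for the message-based backlog calculation described above.
public final class MessageBacklogCalculator {

    private MessageBacklogCalculator() {
    }

    /** Exclusive / Failover: messages between the mark-delete position and the LAC. */
    public static long backlog(long lacIndex, long markDeleteIndex) {
        return Math.max(0, lacIndex - markDeleteIndex);
    }

    /** Shared / Key_Shared: subtract the individually acknowledged messages cached in memory. */
    public static long backlog(long lacIndex, long markDeleteIndex, long individualAckCount) {
        return Math.max(0, backlog(lacIndex, markDeleteIndex) - individualAckCount);
    }
}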

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:
Improve the message written count metrics for the topic

Currently, Pulsar exposes the message written count metrics through the Prometheus endpoint, and the metrics are maintained in the broker without being persisted. So if the topic ownership changes or the broker restarts, the message written count of the topic is reset to 0. This confuses users and makes it impossible to get the correct message written count metrics.

PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic). This provides the ability to calculate the precise message written count for a topic, so we can leverage PIP-70 to improve the message written count metrics for the topic.

Difficulty: Major
Potential mentors:
Penghui Li, mail: penghui (at) apache.org
Project Devs, mail:

OODT

Improve OPSUI React.js UI with advanced functionalities

In GSoC 2019, we implemented a new OPSUI UI based on React.js. See the related blog posts [1] [2]. Several advanced features remain to be implemented, including:

  • Implement querying functionality at OPSUI side (scope can be determined)
  • Show progress of workflows and file ingestions
  • Introduce a proper REST API for resource manager component
  • Introduce proper packaging (with configurable external REST API URLs) and deployment mechanism (as a docker deployment or an npm package)

In this project, the student will have to work on the UI with React.js and will have to implement several REST APIs using JAX-RS. Furthermore, they will have to work on making OPSUI easy to deploy.

The existing wicket based OPSUI will be replaced by the new React.js based OPSUI at the end of this project. And the linked blog posts will be a good start to understand what the new React.js based OPSUI is capable of doing.

[1] https://medium.com/faun/gsoc-2019-apache-oodt-react-based-opsui-dashboard-d93a9083981c
[2] https://medium.com/faun/whats-new-in-apache-oodt-react-opsui-dashboard-4cc6701628a9
[3] https://medium.com/faun/apache-oodt-with-docker-84d32525c798

Difficulty: Major
Potential mentors:
Imesha Sudasingha, mail: imesha (at) apache.org
Project Devs, mail:


James Server

[GSOC-2021] Implement Thread support for JMAP

Why?

Mail user agents generally allow displaying emails grouped by conversations (replies, forwards, etc...).

As part of the JMAP RFC-8621 implementation, there is a dedicated concept: threads. We did implement JMAP Threads in a rather naive way: each email is a thread of its own.

This naive implementation is specification compliant but defeats the overall purpose of threads.

I propose myself to mentor the implementation of Threads as part of the James JMAP implementation.

See: https://jmap.io/spec-mail.html#threads
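To make the intended behaviour concrete, here is a stand-alone toy sketch (plain Java, not James code) of the common heuristic: emails belong to the same thread when their Message-ID / In-Reply-To / References headers connect them. A real implementation would of course operate on the JMAP/mailbox data model and persist the thread ids.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy thread grouping by Message-ID / In-Reply-To / References overlap (union-find).
public class ThreadGrouping {

    private final Map<String, String> parent = new HashMap<>();

    private String find(String id) {
        parent.putIfAbsent(id, id);
        String p = parent.get(id);
        if (!p.equals(id)) {
            p = find(p);
            parent.put(id, p);   // path compression
        }
        return p;
    }

    private void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    /** Each email is {messageId, referencedIds...}; returns messageId -> thread root id. */
    public Map<String, String> group(List<List<String>> emails) {
        for (List<String> email : emails) {
            String messageId = email.get(0);
            for (String ref : email.subList(1, email.size())) {
                union(messageId, ref);
            }
            find(messageId); // stand-alone emails get their own thread
        }
        Map<String, String> threadOf = new HashMap<>();
        for (List<String> email : emails) {
            threadOf.put(email.get(0), find(email.get(0)));
        }
        return threadOf;
    }

    public static void main(String[] args) {
        List<List<String>> emails = new ArrayList<>();
        emails.add(List.of("<a@example.org>"));                                        // original message
        emails.add(List.of("<b@example.org>", "<a@example.org>"));                     // reply to a
        emails.add(List.of("<c@example.org>", "<b@example.org>", "<a@example.org>"));  // reply to b
        emails.add(List.of("<d@example.org>"));                                        // unrelated message
        System.out.println(new ThreadGrouping().group(emails));
    }
}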

Difficulty: Major
Potential mentors:
Benoit Tellier, mail: btellier (at) apache.org
Project Devs, mail:

Fineract Cloud Native

Machine Learning Scorecard for Credit Risk Assessment Phase 4

Mentors

Overview & Objectives

Financial organizations using Mifos/Fineract depend on external agencies or their past experience for evaluating credit scores and identifying potential NPAs. Though information from external agencies is required, financial organizations can have an internal scorecard for evaluating loans so that preventive/proactive actions can be taken alongside external agencies' reports. In industry, organizations are using rule-based, statistical and machine learning methods for credit scoring, predicting potential NPAs, fraud detection and other activities. This project aims to implement a scorecard based on statistical and ML methods for credit scoring and identification of potential NPAs.

Description

The approach should factor in and improve last year's GSoC work (https://gist.github.com/SupreethSudhakaranMenon/a20251271adb341f949dbfeb035191f7) on Features/Characteristics, Criteria and evaluation (link). The design and implementation of the screens should follow Mifos application standards. It should implement statistical and ML methods with explainability in decision making, and should also be extensible for adding other functionalities such as fraud detection, cross-sell and up-sell, etc.

Helpful Skills

JAVA, Integrating Backend Service, MIFOS X, Apache Fineract, AngularJS, ORM, ML, Statistical Methods, Django

Impact

Streamlined Operations, Better RISK Management, Automated Response Mechanism

Other Resources

2019 Progress: https://gist.github.com/SupreethSudhakaranMenon/a20251271adb341f949dbfeb035191f7

https://gist.github.com/lalitsanagavarapu

Difficulty: Major
Potential mentors:
Ed Cable, mail: edcable (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Create Open Banking Layer for Fineract CN Self-Service App

Mentors

Overview & Objectives

Across our ecosystem we're seeing more and more adoption and innovation from fintechs. A huge democratizing force across the financial services sector is the Open Banking movement providing Open Banking APIs to enable third parties to directly interact with customers of financial institutions. We have recently started providing an Open Banking API layer that will allow financial institutions using Mifos and Fineract to offer third parties access to requesting account information and initiating payments via these APIs. Most recently the Mojaloop community, led by Google, has led the development of a centralized PISP API. We have chosen to follow the comprehensive UK Open Banking API standard which is being followed and adopted by a number of countries throughout Sub-Saharan Africa and Latin America.

Tremendous impact can be had at the Base of the Pyramid by enabling third parties to establish consent with customers to authorize transactions to be initiated or information to be accessed from accounts at their financial institution. This Open Banking API layer would enable any institution using Mifos or Fineract to provide a UK Open Banking API layer to third parties and fintechs.

The API Gateway to connect to is still being chosen (WSO2, Gravitee, etc.)

Description

The APIs that are consumed by the reference Fineract 1.x mobile banking application have been documented in the spreadsheet below. The APIs have also been categorized according to whether they are an existing self-service API or back-office API, whether they have an equivalent Open Banking API and, if so, a link to the corresponding Open Banking API.

For each API with an equivalent Open Banking API, the interns must: take the REST API, upload the Swagger definition, do the transformation in the Open Banking Adapter, and publish it on the API gateway.

For back-office and/or self-service APIs with no equivalent Open Banking API, the process is: take the REST API, upload the Swagger definition, and publish it on the API gateway.

For example:

Mifos Mobile CN API Matrix (completed by Garvit)
https://docs.google.com/spreadsheets/d/1-HrfPKhh1kO7ojK15Ylf6uzejQmaz72eXf5MzCBCE3M/edit#gid=0
https://docs.google.com/document/d/15LbxVoQQRoa4uU7QiV7FpJFVjkyyNb9_HJwFvS47O4I/edit?pli=1#
Mobile Wallet API Matrix (completed by Devansh)
https://docs.google.com/spreadsheets/d/1VgpIwN2JsljWWytk_Qb49kKzmWvwh6xa1oRgMNIAv3g/edit#gid=0

Helpful Skills

Android development, SQL, Java, Javascript, Git, Spring, OpenJPA, Rest, Kotlin, Gravitee, WSO2

Impact

By providing a standard UK Open Banking API layer we can provide both a secure way for our trusted first party apps to allow customers to authenticate and access their accounts as well as an API layer for third party fintechs to securely access Fineract and request information or initiate transactions with the consent of customers.

Other Resources

CGAP Research on Open Banking: https://www.cgap.org/research/publication/open-banking-how-design-financial-inclusion
Docs: https://mifos.gitbook.io/docs/wso2-1/setup-openbanking-apis
Self-Service APIs: https://demo.mifos.io/api-docs/apiLive.htm#selfbasicauth

Reference Open Banking Fintech App:
  • Backend: https://github.com/openMF/openbanking-tpp-server
  • GUI: https://github.com/openMF/openbanking-tpp-client

Open Banking Adapter: https://github.com/openMF/openbanking-adapter
  • Transforms Open Banking API to Fineract API
  • Works with both Fineract 1.x and Fineract CN
  • Can connect to different API gateways and can transform against different API standards.

Customer Self-Service Phase 2: https://cwiki.apache.org/confluence/display/FINERACT/Customer+Self-Service+Phase+2

Google Whitepaper on 3PPI: https://static.googleusercontent.com/media/nextbillionusers.google/en//tools/3PPI-2021-whitepaper.pdf

UK Open Banking API Standard: https://standards.openbanking.org.uk/

Open Banking Developer Zone: https://openbanking.atlassian.net/wiki/spaces/DZ/overview

Examples of Open Banking Apps: https://www.ft.com/content/a5f0af78-133e-11e9-a581-4ff78404524e

See https://openmf.github.io/mobileapps.github.io/

Difficulty: Major
Potential mentors:
Ed Cable, mail: edcable (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Functional Enhancements to Fineract CN Mobile

Mentors

Overview & Objectives

Just as we have a mobile field operations app on Apache Fineract 1.x, we have recently built out on top of the brand new Apache Fineract CN micro-services architecture, an initial version of a mobile field operations app with an MVP architecture and material design. Given the flexibility of the new architecture and its ability to support different methodologies - MFIs, credit unions, cooperatives, savings groups, agent banking, etc - this mobile app will have different flavors and workflows and functionalities.

Description

In 2020, our Google Summer of Code intern worked on additional functionality in the Fineract CN mobile app. In 2021, the student will work on the following tasks:

  • Integrate with Payment Hub to enable disbursements via the Mobile Money API
  • Improve the task management features in the app
  • Create UI for creating new accounts and displaying account details
  • Create UI for creating tellers and displaying teller details
  • Improve GIS features like location tracking and dropping of pins in the app
  • Improve offline mode via Couchbase support
  • Write unit tests, integration tests and UI tests

Helpful Skills

Android Development, Kotlin, Java, Git, OpenJPA, Rest API

Impact

Allows staff to go directly into the field to connect to the client. Reduces cost of operations by enabling organizations to go paperless and be more efficient.

Other Resources

  1. Repo on Github: https://github.com/apache/fineract-cn-mobile
  2. Fineract CN API documentation: https://izakey.github.io/fineract-cn-api-docs-site/
  3. https://github.com/aasaru/fineract-cn-api-docs
  4. https://cwiki.apache.org/confluence/display/FINERACT/Fineract+CN
  5. How to install and run Couchbase: https://gist.github.com/jawidMuhammadi/af6cd34058cacf20b100d335639b3ad8
  6. GSMA mobile money API: https://developer.mobilemoneyapi.io/1.1/oas3/22466
  7. Payment Hub: https://github.com/search?q=openMF%2Fph-ee&ref=opensearch
  8. Some UI designs: https://www.figma.com/file/KHXtZPdIpC3TqvdIVZu8CW/fineract-cn-mobile?node-id=0%3A1
  9. 2020 GSoC progress report: https://gist.github.com/jawidMuhammadi/9fa91d37b1cbe43d9cdfe165ad8f2102
  10. JIRA Task: https://issues.apache.org/jira/browse/FINCN-241?filter=-2&jql=project%20%3D%20FINCN%20order%20by%20created%20DESC

Difficulty: Major
Potential mentors:
Ed Cable, mail: edcable (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org


    SkyWalking

    Apache SkyWalking: Python agent supports profiling

    Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.

    SkyWalking is based on agent to instrument (automatically) monitored services, for now, we have many agents for different languages, Python agent [2] is one of them, which supports automatic instrumentations.

    The goal of this project is to extend the agent's features by supporting profiling [3] a function's invocation stack, help the users to analyze which method costs the most major time in a cross-services call.

To complete this task, you must be comfortable with Python and have some knowledge of tracing systems, otherwise you'll have a hard time coming up to speed.

[1] http://skywalking.apache.org
[2] http://github.com/apache/skywalking-python
[3] https://thenewstack.io/apache-skywalking-use-profiling-to-fix-the-blind-spot-of-distributed-tracing/


    Difficulty: Major
    Potential mentors:
    Zhenxu Ke, mail: kezhenxu94 (at) apache.org
    Project Devs, mail: dev (at) skywalking.apache.org

    Apache SkyWalking: Python agent collects and reports PVM metrics to backend

    Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.

    Tracing distributed systems is one of the main features of SkyWalking, with those traces, it can analyze some service metrics such as CPM, success rate, error rate, apdex, etc. SkyWalking also supports receiving metrics from the agent side directly.

In this task, we expect the Python agent to report its Python Virtual Machine (PVM) metrics, including (but not limited to; whatever metrics are useful are also acceptable) CPU usage (%), memory used (MB), (active) thread/coroutine counts, garbage collection count, etc.

    To complete this task, you must be comfortable with Python and gRPC, otherwise you'll have a hard time coming up to speed.

    Live demo to play around: http://122.112.182.72:8080 (under reconstruction, maybe unavailable but latest demo address can be found at the GitHub index page http://github.com/apache/skywalking)

    [1] http://skywalking.apache.org


    Difficulty: Major
    Potential mentors:
Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

    ...

    ShardingSphere


    ShardingSphere: Proofread the DDL/TCL SQL definitions for ShardingSphere Parser

    Apache ShardingSphere

    Apache ShardingSphere is a distributed database middleware ecosystem, including 2 independent products, ShardingSphere JDBC and ShardingSphere Proxy presently. They all provide functions of data sharding, distributed transaction, and database orchestration.
    Page: https://shardingsphere.apache.org
    Github: https://github.com/apache/shardingsphere

    Background

    ShardingSphere parser engine helps users parse a SQL to get the AST (Abstract Syntax Tree) and visit this tree to get SQLStatement (Java Object). At present, this parser engine can handle SQLs for `MySQL`, `PostgreSQL`, `SQLServer` and `Oracle`, which means we have to understand different database dialect SQLs.
    More details: https://shardingsphere.apache.org/document/current/en/features/sharding/principle/parse/

    Task

    This issue is to proofread the following definitions,

    • All the DDL SQL definitions for Oracle except for ALTER, DROP, CREATE and TRUNCATE.
    • All the TCL (Transaction Control Language) SQL definitions for Oracle

    You can learn more here.

As we have basic Oracle SQL syntax definitions that do not keep in line with the Oracle documentation, we need you to find the vague SQL grammar definitions and correct them by referring to the Oracle documentation.

    Notice, when you review these target SQLs above, you will find that these definitions will involve some basic elements of Oracle SQL. No doubt, these elements are included in this task as well.

    Relevant Skills

    1. Master JAVA language
    2. Have a basic understanding of Antlr g4 file
    3. Be familiar with Oracle SQLs

    Targets files

    1. DDL SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/DDLStatement.g4
    2. TCL SQLs g4 file:
    https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/TCLStatement.g4
    3. Basic elements g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/BaseRule.g4

    References

    1. Oracle SQL quick reference: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/SQL-Statements.html#GUID-1FA35EAD-AED2-4619-BFEE-348FF05D1F4A
    2. Detailed Oracle SQL info: https://docs.oracle.com/pls/topic/lookup?ctx=en/database/oracle/oracle-database/19/sqlqr&id=SQLRF008

    Mentor

Juan Pan, PMC of Apache ShardingSphere, panjuan@apache.org

    Difficulty: Major
    Potential mentors:
Juan Pan, mail: panjuan (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

    Apache SkyWalking: Python agent collects and reports PVM metrics to backend

    Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.

Tracing distributed systems is one of the main features of SkyWalking; with those traces, it can analyze service metrics such as CPM, success rate, error rate, apdex, etc. SkyWalking also supports receiving metrics from the agent side directly.

In this task, we expect the Python agent to report its Python Virtual Machine (PVM) metrics, including (but not limited to; any other useful metrics are also acceptable) CPU usage (%), memory used (MB), (active) thread/coroutine counts, garbage collection count, etc.

    To complete this task, you must be comfortable with Python and gRPC, otherwise you'll have a hard time coming up to speed.

    Live demo to play around: http://122.112.182.72:8080 (under reconstruction, maybe unavailable but latest demo address can be found at the GitHub index page http://github.com/apache/skywalking)

    [1] http://skywalking.apache.org

    Difficulty: Major
    Potential mentors:
    Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

Apache ShardingSphere: Add unit test for example

Apache ShardingSphere

Apache ShardingSphere is a distributed database middleware ecosystem, currently consisting of two independent products: ShardingSphere JDBC and ShardingSphere Proxy. Both provide functions of data sharding, distributed transaction, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The examples of ShardingSphere do not have test cases.
After mvn install, a developer only knows that the examples compile; there is no guarantee that the code behaves correctly, especially the configuration for YAML, Spring namespace and the Spring Boot starter.

Task

This issue is to add automated test cases with JUnit that assert the examples start up successfully and that their code logic is correct (see the sketch below).

Notice, the code of the current examples may need to be refactored to make them easy to test.
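A minimal sketch of what such a test could look like for a Spring Boot starter example, assuming JUnit 4 and Spring Boot's test support are available and the test lives inside the example module (so the example's own @SpringBootApplication class is picked up automatically); this is illustrative, not existing example code:

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.Statement;

import static org.junit.Assert.assertNotNull;

// Boots the example's application context and checks that the configured
// (sharding) DataSource can actually execute a statement, instead of only
// checking that the module compiles.
@RunWith(SpringRunner.class)
@SpringBootTest
public class SpringBootStarterExampleTest {

    @Autowired
    private DataSource dataSource;

    @Test
    public void contextStartsAndDataSourceWorks() throws Exception {
        assertNotNull(dataSource);
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement()) {
            // A trivial round trip proves the routing/config wiring is usable.
            statement.execute("SELECT 1");
        }
    }
}

The same pattern (start the configuration, run one statement, assert no exception) applies to the YAML and Spring namespace examples as well.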

Relevant Skills

1. Master JAVA language
2. Be familiar with spring framework
3. Have a basic understanding of JUnit

Targets files

Example repo: https://github.com/apache/shardingsphere/tree/master/examples

    Mentor
    Liang Zhang, PMC Chair of Apache ShardingSphere, zhangliang@apache.org

    Difficulty: Major
    Potential mentors:
    Liang Zhang, mail: zhangliang (at) apache.org
    Project Devs, mail: dev (at) shardingsphere.apache.org

Apache ShardingSphere: Proofread the DML SQL definitions for ShardingSphere Parser

Apache ShardingSphere

Apache ShardingSphere is a distributed database middleware ecosystem, currently consisting of two independent products: ShardingSphere JDBC and ShardingSphere Proxy. Both provide functions of data sharding, distributed transaction, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

ShardingSphere parser engine helps users parse a SQL to get the AST (Abstract Syntax Tree) and visit this tree to get SQLStatement (Java Object). At present, this parser engine can handle SQLs for `MySQL`, `PostgreSQL`, `SQLServer` and `Oracle`, which means we have to understand different database dialect SQLs.
More details: https://shardingsphere.apache.org/document/current/en/features/sharding/principle/parse/

Task

This issue is to proofread the DML (SELECT/UPDATE/DELETE/INSERT) SQL definitions for Oracle. As we have basic Oracle SQL syntax definitions that do not keep in line with the Oracle documentation, we need you to find the vague SQL grammar definitions and correct them by referring to the Oracle documentation.

Notice, when you review these DML (SELECT/UPDATE/DELETE/INSERT) SQLs, you will find that these definitions involve some basic elements of Oracle SQL. No doubt, these elements are included in this task as well.

Relevant Skills

1. Master JAVA language
2. Have a basic understanding of Antlr g4 file
3. Be familiar with Oracle SQLs

Targets files

1. DML SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/DMLStatement.g4
2. Basic elements g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/BaseRule.g4

References

1. Oracle SQL quick reference: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/SQL-Statements.html#GUID-1FA35EAD-AED2-4619-BFEE-348FF05D1F4A
2. Detailed Oracle SQL info: https://docs.oracle.com/pls/topic/lookup?ctx=en/database/oracle/oracle-database/19/sqlqr&id=SQLRF008

Mentor

Juan Pan, PMC of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Potential mentors:
Juan Pan, mail: panjuan (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org

    IoTDB

    Implement PISA index in Apache IoTDB

    Apache IoTDB is a highly efficient time series database, which supports high speed query process, including aggregation query.

Currently, IoTDB pre-calculates the aggregation info, also called the summary info (sum, count, max_time, min_time, max_value, min_value), for each page and each Chunk. The info is helpful for aggregation operations and some query filters. For example, if the query filter is value > 10 and the max value of a page is 9, we can skip the page. For another example, if the query is select max(value) and the max values of 3 chunks are 5, 10, 20, then the max(value) is 20.

    However, there are two drawbacks:

1. The summary info actually reduces the data that needs to be scanned to 1/k (suppose each page has k data points). However, the time complexity is still O(N). If we store long historical data, e.g., 2 years of data at 500 kHz, then the aggregation operation may still be time-consuming. So, a tree-based index that reduces the time complexity from O(N) to O(log N) is a good choice. Some basic ideas have been published in [1], but they can only handle data with a fixed frequency. So, improving that approach and implementing it in IoTDB is a good choice (a generic illustration of the O(log N) idea is sketched below).

2. The summary info is helpless for evaluating a query like where value > 8 if the max value = 10. If we can enrich the summary info, e.g., by storing a data histogram, we can use the histogram to evaluate how many points we can return.

    This proposal is mainly for adding an index for speeding up the aggregation query. Besides, if we can let the summary info be more useful, it could be better.

Notice that the premise is that the insertion speed should not be slowed down too much!

    By the way, IoTDB provides an index framework already. So, the PISA index should be compatible with the index framework.
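To make the O(N) vs. O(log N) argument concrete, here is a generic, self-contained illustration of a pre-aggregation tree over page summaries. This is not the PISA algorithm itself, only the complexity idea it relies on:

// Every tree node stores the max of its range, so a range-max query touches
// O(log N) nodes instead of scanning all N page summaries.
public final class RangeMaxTree {

    private final long[] tree;   // 1-based segment tree over per-page maxima
    private final int size;

    public RangeMaxTree(long[] pageMax) {
        this.size = pageMax.length;
        this.tree = new long[2 * size];
        System.arraycopy(pageMax, 0, tree, size, size);
        for (int i = size - 1; i > 0; i--) {
            tree[i] = Math.max(tree[2 * i], tree[2 * i + 1]);
        }
    }

    // max over pages [left, right), O(log N)
    public long max(int left, int right) {
        long result = Long.MIN_VALUE;
        for (int l = left + size, r = right + size; l < r; l >>= 1, r >>= 1) {
            if ((l & 1) == 1) {
                result = Math.max(result, tree[l++]);
            }
            if ((r & 1) == 1) {
                result = Math.max(result, tree[--r]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RangeMaxTree index = new RangeMaxTree(new long[]{5, 10, 20, 7});
        System.out.println(index.max(0, 3)); // 20, without scanning every page
    }
}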

    You should know:
    • IoTDB query process
    • TsFile structure and organization
    • Basic index knowledge
    • Java 

Reference:

[1] https://www.sciencedirect.com/science/article/pii/S0306437918305489

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB Integration Test

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

Now, IoTDB uses JUnit for its UT/IT tests.

However, there are two drawbacks:

1. There are many singleton class instances in IoTDB. Therefore, modifying something in a test may impact others, and it requires us to do a lot of cleanup work after a test.

Especially, after we open a server socket (by Thrift), though we have called socket.close, the socket may not be closed quickly (this is controlled by Thrift). But if the next test begins, then a "the port is already used" error will occur.

2. When testing IoTDB's cluster module, we may need to start at least 3 IoTDB instances on one server.
Using JUnit, the 3 instances will be in one JVM, which conflicts with the reality that "IoTDB has many singleton instances".

So, next, we want to use TestContainers [2], which combines Docker and JUnit; a minimal example is sketched below.

This task is for:

1. using TestContainers to re-implement all IT codes of IoTDB;
2. using TestContainers to add some IT codes for IoTDB's cluster module.

Needed skills:

• Java
• Docker (Docker-Compose better)
• Know or learn JUnit and TestContainers

[1] iotdb.apache.org
[2] https://www.testcontainers.org/
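A minimal sketch of what such an IT could look like with JUnit 4 and TestContainers; the Docker image tag, the RPC port 6667 and the root/root credentials are assumptions based on IoTDB defaults and should be adjusted to what the project actually ships:

import org.junit.Rule;
import org.junit.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import static org.junit.Assert.assertTrue;

// Starts a disposable IoTDB server in Docker for the test, so tests no longer
// share singletons or fight over ports inside one JVM.
public class IoTDBContainerIT {

    // Image name/tag and exposed port are assumptions; adjust as needed.
    @Rule
    public GenericContainer<?> iotdb =
            new GenericContainer<>(DockerImageName.parse("apache/iotdb:latest"))
                    .withExposedPorts(6667);

    @Test
    public void serverAcceptsJdbcConnections() throws Exception {
        String url = String.format("jdbc:iotdb://%s:%d/",
                iotdb.getHost(), iotdb.getMappedPort(6667));
        Class.forName("org.apache.iotdb.jdbc.IoTDBDriver");
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             Statement stmt = conn.createStatement()) {
            // A query result proves the containerized server is really up.
            assertTrue(stmt.execute("SHOW VERSION"));
        }
    }
}

For the cluster module, the same idea extends to starting several containers (or a Docker Compose file) so each node runs in its own JVM and network namespace.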

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB C# library

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

IoTDB has two kinds of client interfaces: SQL and the native API (also called the session API).

This task is for the native API.

IoTDB uses Apache Thrift [2] as its RPC framework, so all native APIs can be generated by Thrift. However, to accelerate performance, we may use some byte arrays in Thrift, rather than a Struct, which is not quite friendly to users.

That is why we provide our session API. The session API just wraps the interfaces of the generated Thrift code. Now we have Java [4], Python and C++ versions [3]. The C# version is left.

This task hopes you can provide a C# library for IoTDB.

Needed skills:

• Thrift
• C#
• know Java

[1] iotdb.apache.org
[2] http://thrift.apache.org/
[3] https://iotdb.apache.org/UserGuide/Master/Client/Programming%20-%20Other%20Languages.html
[4] https://iotdb.apache.org/UserGuide/Master/Client/Programming%20-%20Native%20API.html

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB: Metadata (Schema) Storage Engine

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

Different from traditional relational databases, IoTDB uses a tree-based structure in memory to manage the schema (a.k.a. metadata), and uses a Write-Ahead-Log-like file structure to persist the schema.

Now, each time series takes about 300 bytes in memory. However, an IoTDB instance may manage more than 100 million time series, which may take more than 30 GB of memory.

Therefore, we'd like to re-design the schema management module:
1. File: persist the tree on disk like a B-tree.
2. WAL: implement a WAL for the metadata, so we can update the tree on disk in batches, rather than one operation at a time.
3. Cache: we may not have enough memory to load the whole tree, so a cache is needed and queries must be able to read from the tree on disk.

A rough interface sketch is given below.

What knowledge you need to know:
1. Java
2. Basic design ideas about databases [2]

[1] iotdb.apache.org
[2] http://pages.cs.wisc.edu/~dbbook/openAccess/firstEdition/slides/pdfslides/mod2l1.pdf
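A rough shape of the store the three points above describe; every name here is hypothetical and illustrative, not an existing IoTDB class:

// Hypothetical interface only: it groups the File / WAL / Cache concerns of
// the redesigned schema store in one place.
public interface PersistentSchemaStore extends AutoCloseable {

    // Appends the operation to the metadata WAL first, then applies it to the
    // on-disk tree (possibly batched with other logged operations).
    void createTimeseries(String fullPath, byte dataType, byte encoding) throws Exception;

    void deleteTimeseries(String fullPath) throws Exception;

    // Served from an in-memory cache of hot tree nodes; on a miss, the node is
    // loaded from the B-tree-like file and inserted into the cache (with
    // eviction when the configured memory budget is exceeded).
    SchemaNode getNode(String fullPath) throws Exception;

    // Replays the WAL tail on restart so the on-disk tree catches up with the
    // operations that were logged but not yet flushed.
    void recover() throws Exception;

    interface SchemaNode {
        String name();
        boolean isMeasurement();
    }
}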

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB: GUI workbench

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

As a database, it is good to have a workbench to operate IoTDB using a GUI.

For example, there is a 3rd-party web-based workbench for Apache Cassandra [2]. MySQL supports a more complex workbench application [3].

We also want IoTDB to have a workbench.

Task:
1. Execute SQL and show results in a table or chart.
2. View the schema of IoTDB (how many storage groups, how many time series, etc.).
3. View and modify IoTDB's configuration.
4. View IoTDB's dynamic status (e.g., info that JMX can get).

(As we have integrated IoTDB with Apache Zeppelin, task 1 is done. So, we hope this workbench can be more lightweight than using Zeppelin.)

Better to use Java (Python or some others are also OK).

Needed Skills:

• Java
• Web application development

[1] iotdb.apache.org
[2] https://github.com/avalanche123/cassandra-web
[3] https://www.mysql.com/cn/products/workbench/

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org


    Apache IoTDB: Complex Arithmetic Operations in SELECT Clauses

    Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

    We have recently been working to improve the ease of use of IoTDB. For queries, we hope that IoTDB can provide more powerful analysis capabilities.

    IOTDB supports many types of queries: raw data queries, function queries (including UDF queries), and so on. However, currently there is no easy way to combine the results of multiple queries. Therefore, we hope that IoTDB can support complex arithmetic operations in the SELECT clause, which will greatly improve the analysis capabilities.

    Function description:
    Applied to: raw time series, literal numbers and function outputs.
    Applicable data types: all types except TIMESTAMP and TEXT.
Applicable operators: at least five binary operators (+, -, *, /, %) and two unary operators (+, -).

    Usage examples:

    1. raw queries
      SELECT -a FROM root.sg.d;
      SELECT a, b, c, b * b - 4 * a * c FROM root.sg.d WHERE b > 0;
      SELECT a, b, -(bool_value * (a - b)) FROM root.sg.d;
      SELECT -3.14 + a / 15 + 926 FROM root.sg.d;
      SELECT +a % 3.14 FROM root.sg.d WHERE a < 0;
2. function queries
  SELECT a + abs(a), sin(a) * cos(a) FROM root.sg.d;
  SELECT a, b, sqrt(a) * sqrt(b) / (a * b) FROM root.sg.d WHERE a < 0;
3. nested queries
      select a, b, a + b + udf(sin(a) * sin(b), cos(a) * cos(b)) FROM root.sg.d;
      select a, a + a, sin(sin(sin(a + a))) FROM root.sg.d WHERE a < 0;

    Additional requirements:
    1. For performance reasons, it's better to perform as few disk read operations as possible.
    Example:
    SELECT a, sin(a + a) FROM root.sg.d WHERE a < 0;
    The series root.sg.d.a should be read only once during the query.

2. For performance reasons, it's better to reuse intermediate calculation results as much as possible (see the sketch after this list).
    Example:
    SELECT a + a, sin(a + a) FROM root.sg.d WHERE a < 0;
    The intermediate calculation result a + a should only be evaluated once during the query.

    3. Need to consider memory-constrained scenarios.
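A generic illustration of requirement 2: identical sub-expressions such as the two occurrences of "a + a" can be evaluated once per row by caching results under a canonical textual form. All names here are illustrative, not IoTDB code:

import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

// Caches intermediate results per row so common sub-expressions are computed once.
final class ExpressionEvaluator {

    private final Map<String, Double> memo = new HashMap<>();

    double binary(String canonicalForm, double left, double right, DoubleBinaryOperator op) {
        return memo.computeIfAbsent(canonicalForm, k -> op.applyAsDouble(left, right));
    }

    void nextRow() {
        memo.clear(); // intermediate results are only shared within one row
    }

    public static void main(String[] args) {
        ExpressionEvaluator eval = new ExpressionEvaluator();
        double a = 3.0;
        // SELECT a + a, sin(a + a) ...: "a + a" is computed a single time per row.
        double sum = eval.binary("a + a", a, a, Double::sum);
        double sinOfSum = Math.sin(eval.binary("a + a", a, a, Double::sum));
        System.out.println(sum + " " + sinOfSum);
        eval.nextRow();
    }
}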

    What knowledge you need to know:
    1. Java
    2. Basic database knowledge (such as SQL, etc.)
    3. ANTLR
    4. IoTDB query process

    Links:
    [1] iotdb.apache.org

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org


    Apache IoTDB: integration with Chaos Mesh

    Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.


    Chaos Mesh [2] is a versatile chaos engineering solution that features all-around fault injection methods for complex systems on Kubernetes [3], covering faults in Pod, network, file system, and even the kernel.


We hope that Chaos Mesh can be used as a versatile chaos testing tool for the IoTDB cluster module, so as to verify the reliability of the IoTDB cluster module in a production environment.


    You should define a series of failure simulations for the cluster using Chaos Mesh, such as Network partition, Network packet loss and Node collapse, and then define a series of operations and the expected results of those operations.


This task hopes that you can set up an automated framework for chaos testing of the IoTDB cluster module, so that we can detect potential problems in the cluster module and iteratively fix them.


    Needed skills:

    • Java
    • Go
    • Kubernetes
• Chaos Mesh
    • Know iotdb-benchmark [4]


    [1] https://iotdb.apache.org

    [2] https://chaos-mesh.org

    [3] https://kubernetes.io

    [4] https://github.com/thulab/iotdb-benchmark

    Difficulty: Major
    Potential mentors:
    Xiangdong Huang, mail: hxd (at) apache.org
    Project Devs, mail: dev (at) iotdb.apache.org

    ...

    GSOC: Varnish Cache support in Apache Traffic Control

    Background
    Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

    Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

    There are multiple aspects to this project:

    • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
    • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
    • Testing: Adding automated tests for new code

    Skills:

    • Proficiency in Go is required
    • A basic knowledge of HTTP and caching is preferred, but not required for this project.
    Difficulty: Major
    Potential mentors:
    Eric Friedrich, mail: friede (at) apache.org
    Project Devs, mail: dev (at) trafficcontrol.apache.org

    ...

    DolphinScheduler

    Apache DolphinScheduler-Parameter coverage

    Apache DolphinScheduler

    Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces, dedicated to solving complex job dependencies in the data pipeline and providing various types of jobs available out of the box.

Page: https://dolphinscheduler.apache.org
    GitHub: https://github.com/apache/incubator-dolphinscheduler

    Background:
    Configuration parameter override

At present, our parameter configuration is mainly based on configuration files; you can refer to PropertiesUtils.

But usually important parameters will also be injected as Java runtime virtual machine parameters, so we need to support this way of parameter injection. At the same time, because different ways of parameter injection have different priorities, we need to implement configuration overriding. There are two main sources at present, SystemProperties and LocalFile. The priority of SystemProperties should be the highest, followed by LocalFile (that is, our various configuration files, such as master.properties). A minimal sketch of this lookup order follows the example below.

    issue:
    https://github.com/apache/incubator-dolphinscheduler/issues/5164

    for example:
1: Configure master.max.cpuload.avg=-1 in master.properties

    2: Java runtime virtual machine parameters -Dmaster.max.cpuload.avg=1

    3:PropertiesUtils.get("master.max.cpuload.avg") = 1
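A minimal sketch of the intended lookup order (system properties first, then the local file); the class shape is illustrative and does not reflect the existing PropertiesUtils implementation:

import java.io.InputStream;
import java.util.Properties;

// JVM system properties (-Dkey=value) take precedence over values loaded from
// a local configuration file such as master.properties.
public final class ConfigLookupSketch {

    private static final Properties LOCAL_FILE = new Properties();

    static {
        try (InputStream in = ConfigLookupSketch.class
                .getResourceAsStream("/master.properties")) {
            if (in != null) {
                LOCAL_FILE.load(in);
            }
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Returns the system property if present, otherwise the file value,
    // otherwise the supplied default.
    public static String get(String key, String defaultValue) {
        String fromJvm = System.getProperty(key);
        if (fromJvm != null) {
            return fromJvm;
        }
        return LOCAL_FILE.getProperty(key, defaultValue);
    }

    public static void main(String[] args) {
        // With -Dmaster.max.cpuload.avg=1 on the command line, this prints 1
        // even if master.properties contains master.max.cpuload.avg=-1.
        System.out.println(get("master.max.cpuload.avg", "-1"));
    }
}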

    Task: realize configuration parameter coverage

    Mentor: CalvinKirs kirs@apache.org

    Difficulty: Major
    Potential mentors:
    Calvin Kirs, mail: kirs (at) apache.org
Project Devs, mail: dev (at) dolphinscheduler.apache.org

    CouchDB

    GSoC: Apache CouchDB and Debezium integration

    Apache CouchDB software is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.

    Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.


    CouchDB has a change capture feed as a public HTTP API endpoint. Integrating with Debezium would provide an easy way to translate the _changes feed into a Kafka topic which plugs us into a much larger ecosystem of tools and alleviates the need for every consumer of data in CouchDB to build a bespoke “follower” of the _changes feed.


    The project for GSoC 2021 here is to design, implement and test a CouchDB connector for Debezium.
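Debezium connectors build on Kafka Connect, so one possible starting point is a plain Kafka Connect source task that follows the _changes feed. Everything below is an illustrative skeleton under that assumption: the config key, topic naming and offset handling are placeholders, and a real implementation would use Debezium's connector framework:

import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Skeleton only: a source task that would poll the CouchDB _changes feed and
// emit one record per change.
public class CouchDbSourceTaskSketch extends SourceTask {

    private String couchDbUrl;
    private String lastSeq = "0";   // resume point within the _changes feed

    @Override
    public void start(Map<String, String> props) {
        couchDbUrl = props.get("couchdb.url");
        // Offsets previously committed by Kafka Connect could be restored here
        // (via context.offsetStorageReader()) so we continue from lastSeq.
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // Pseudo-step: fetch the next batch from GET {couchDbUrl}/{db}/_changes?since={lastSeq},
        // convert each change into a SourceRecord keyed by document id, and
        // remember the new last_seq as the source offset.
        Thread.sleep(1000);            // placeholder for the HTTP long-poll
        return Collections.emptyList(); // no changes fetched in this sketch
    }

    @Override
    public void stop() {
        // close the HTTP client / release resources
    }

    @Override
    public String version() {
        return "0.1-sketch";
    }
}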


    Required skills:

    • Java

    Nice-to-have skills:

    • Erlang

Difficulty: Major
Potential mentors:
Balázs Donát Bessenyei, mail: bessbd (at) apache.org
Project Devs, mail: dev (at) couchdb.apache.org

    CloudStack

CloudStack GSoC 2021 - Clone a Virtual Machine (with all the data disks)

Hi there,

Here is the background of the proposed improvement in CloudStack.

Currently, there is no straightforward way to clone / create a copy of a VM (with all its data disks) in CloudStack. The operator/admin has to follow a series of steps/API commands to achieve that, and it also takes considerable time (waiting for and checking each command's response before proceeding to the next step). Some hypervisors (e.g. VMware) already support a clone VM operation, and CloudStack can leverage that.

The support for this new functionality can be integrated by introducing a new (admin-only) API to clone the VM, something like cloneVirtualMachine, which provides a direct way to clone / create a copy of the VM (with all the data disks). CloudStack internally performs all the required operations to create the copy of the VM (leveraging the relevant hypervisor operations if necessary), and returns the VM as the response on success, otherwise it throws the relevant error message.

This improvement will be a good addition to the VM operations supported in CloudStack. It requires some virtualization/cloud domain knowledge & usage.

More details here: https://github.com/apache/cloudstack/issues/4818

Skills Required:

• Java and Python
• Vue.js (for UI integration)

Difficulty: Major
Potential mentors:
Suresh Kumar Anaparti, mail: sureshkumar.anaparti (at) apache.org
Project Devs, mail: dev (at) cloudstack.apache.org

    CloudStack GSoC 2021 Ideas

    Hello Students! We are the Apache CloudStack project. From our project website: "Apache CloudStack is open source software designed to deploy and manage large networks of virtual machines, as a highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. CloudStack is used by a number of service providers to offer public cloud services, and by many companies to provide an on-premises (private) cloud offering, or as part of a hybrid cloud solution."

    2-min video on the Apache CloudStack project - https://www.youtube.com/watch?v=oJ4b8HFmFTc 

    Here's about an hour-long intro to what is CloudStack - https://www.youtube.com/watch?v=4qFFwyK9hos 

The general skills a student would need are Java, Python, JavaScript/Vue. Idea-specific requirements are mentioned on the idea issue. We're a diverse and welcoming community and we encourage interested students to join the dev ML: http://cloudstack.apache.org/mailing-lists.html (dev@cloudstack.apache.org)

    All our Apache CloudStack GSoC2021 ideas are tracked on the project's Github issue: https://github.com/apache/cloudstack/issues?q=is%3Aissue+is%3Aopen+label%3Agsoc2021



Feature | Skills Required | Difficulty Level | Potential Mentor(s) | Details and Discussion
Support Multiple SSH Keys for VMs | Java, Javascript/Vue | Medium | David Jumani (david.jumani@shapeblue.com) | https://github.com/apache/cloudstack/issues/4813
Clone a Virtual Machine | Java, Javascript/Vue | Medium | Suresh Anaparti (sureshanaparti@apache.org) | https://github.com/apache/cloudstack/issues/4818
UI Shortcuts (UX improvements in the UI) | Javascript, Vue | Easy | Boris Stoyanov (boris.stoyanov@shapeblue.com), David Jumani (david.jumani@shapeblue.com) | https://github.com/apache/cloudstack/issues/4798
CloudStack OAuth2 Plugin | Java, Javascript/Vue | Medium | Nicolas Vazquez (nicovazquez90@gmail.com), Rohit Yadav (rohit@apache.org) | https://github.com/apache/cloudstack/issues/4834
Synchronization of network devices on newly added hosts for Persistent Networks | Java | Medium | Pearl Dsilva (pearl.dsilva@shapeblue.com) | https://github.com/apache/cloudstack/issues/4814
Add SPICE console for vms on KVM/XenServer | Java, Python, Javascript | Hard | Wei Zhou (ustcweizhou@gmail.com) | https://github.com/apache/cloudstack/issues/4803
Configuration parameters and APIs mappings | Java, Python | Hard | Harikrishna Patnala (harikrishna@apache.org) | https://github.com/apache/cloudstack/issues/4825
Add virt-v2v support in CloudStack for VM import to KVM | Java, Python, libvirt, libguestfs | Hard | Rohit Yadav (rohit@apache.org) | https://github.com/apache/cloudstack/issues/4696


    We have an onboarding course for students to learn and get started with CloudStack:
    https://github.com/shapeblue/hackerbook

    Project wiki and other resources:
    https://cwiki.apache.org/confluence/display/CLOUDSTACK

    https://github.com/apache/cloudstack

    http://docs.cloudstack.apache.org/

    Difficulty: Major
    Potential mentors:
    Rohit Yadav, mail: bhaisaab (at) apache.org
    Project Devs, mail: dev (at) cloudstack.apache.org

    ...

    Prevent and fail-fast any attempts to incremental repair cdc/mv tables

    Running incremental repairs on CDC or MV tables breaks them.

Attempting to run incremental repair on such tables should fail fast and be prevented, with a clear error message.

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Add ability to ttl snapshots

    It should be possible to add a TTL to snapshots, after which it automatically cleans itself up.

This will be useful together with the auto_snapshot option, where you want to keep an emergency snapshot in case of accidental drop or truncation but automatically remove it after a specified period when it's no longer useful. So in addition to allowing a user to specify a snapshot TTL on nodetool snapshot, we should have an auto_snapshot_ttl option that allows a user to set a TTL for automatic snapshots on drop/truncate.

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

Add nodetool command to display or export the contents of a virtual table

Several virtual tables were recently added, but they're currently only accessible via cqlsh or programmatically. While this is valuable for many use cases, operators are accustomed to the convenience of querying system metrics with a simple nodetool command.

In addition to that, a relatively common request is to provide nodetool output in different formats (JSON, YAML and even XML) (CASSANDRA-5977, CASSANDRA-12035, CASSANDRA-12486, CASSANDRA-12698, CASSANDRA-12503). However, this requires lots of manual labor as each nodetool subcommand needs to be adapted to support new output formats.

I propose adding a new nodetool command that will consistently print to the standard output the contents of a virtual table. By default the command will print the output in a human-readable tabular format similar to cqlsh, but a "--format" parameter can be specified to change the output to some other format like JSON or YAML.

It should be possible to add a limit to the amount of rows displayed and to filter the output to rows from a specific keyspace or table. The command should be flexible and provide simple hooks for registration and customization of new virtual tables.

I propose calling this command nodetool show <virtualtable> (naming bikeshedding welcome), for example:

    nodetool show --list
    caches
    clients
    internode_inbound
    internode_outbound
    settings
    sstable_tasks
    system_properties
    thread_pools

    nodetool show clients --format yaml
    ...
    nodetool show internode_outbound --format json
    ...
    nodetool show sstable_tasks --keyspace my_ks --table my_table
    ...

Difficulty: Normal
Potential mentors:
paulo, mail: paulo (at) apache.org
Project Devs, mail: dev (at) cassandra.apache.org

Expose application_name and application_version in virtual table system_views.clients

Recent java-driver's com.datastax.oss.driver.api.core.session.SessionBuilder respects the properties ApplicationName and ApplicationVersion.

It would be helpful to expose this information via the virtual table system_views.clients and with nodetool clientstats.

Potential mentors:
paulo, mail: paulo (at) apache.org
Project Devs, mail: dev (at) cassandra.apache.org


    Add ability to disable schema changes, repairs, bootstraps, etc (during upgrades)

    There are a lot of operations that aren't supposed to be run in a mixed version cluster: schema changes, repairs, topology changes, etc. However, it's easily possible for these operations to be accidentally run by a script, another user unaware of the upgrade, or an operator that's not aware of these rules.

    We should make it easy to follow the rules by making it possible to prevent/disable all of these operations through nodetool commands. At the start of an upgrade, an operator can disable all of these until the upgrade has been completed.

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Allow table property defaults (e.g. compaction, compression) to be specified for a cluster/keyspace

    During an IRC discussion in cassandra-dev it was proposed that we could have table property defaults stored on a Keyspace or globally within the cluster. For example, this would allow users to specify "All new tables on this cluster should default to LCS with SSTable size of 320MiB" or "all new tables in Keyspace XYZ should have Zstd commpression with a 8 KiB block size" or "default_time_to_live should default to 3 days" etc ... This way operators can choose the default that makes sense for their organization once (e.g. LCS if they are running on fast SSDs), rather than requiring developers creating the Keyspaces/Tables to make the decision on every creation (often without context of which choices are right).

    A few implementation options were discussed including:

    • A YAML option
    • Schema provided at the Keyspace level that would be inherited by any tables automatically
    • Schema provided at the Cluster level that would be inherited by any Keyspaces or Tables automatically

    In IRC it appears that rough consensus was found in having global -> keyspace -> table defaults which would be stored in schema (no YAML configuration since this isn't node level really, it's a cluster level config).

    Difficulty: Challenging
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Global configuration parameter to reject repairs with anti-compaction

We have moved from Cassandra 2.1 to 3.0 and, from an operational aspect, the Cassandra repair area changed significantly / got more complex. Besides incremental repairs not working reliably, full repairs (the -full command-line option) also run into anti-compaction code paths, splitting repaired / non-repaired data into separate SSTables.

Cassandra 4.x (with repair enhancements) is quite far away for us (for production usage), thus we want to avoid anti-compactions with Cassandra 3.x at any cost. Especially for our on-premise installations at customer sites, with less control over how e.g. nodetool is used, we simply want to have a configuration parameter in e.g. cassandra.yaml, which we could use to reject any repair invocation that results in anti-compaction being active.

    I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough to reject anti-compaction repairs, e.g. if someone executes nodetool repair ... the wrong way (accidentally).

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Expose application_name and application_version in virtual table system_views.clients

    Recent versions of the java-driver's com.datastax.oss.driver.api.core.session.SessionBuilder respect the ApplicationName and ApplicationVersion properties.

    It would be helpful to expose this information via the virtual table system_views.clients and via nodetool clientstats.
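
    For context, the client side already looks roughly like the following with a recent java-driver (a minimal sketch; the builder methods are part of the 4.x SessionBuilder API, but exact availability depends on the driver version):

        import com.datastax.oss.driver.api.core.CqlSession;

        public class ClientMetadataExample {
            public static void main(String[] args) {
                // The application sets its own name/version; the proposal is for the
                // server to surface these values in system_views.clients and in
                // nodetool clientstats output.
                try (CqlSession session = CqlSession.builder()
                        .withApplicationName("billing-service")
                        .withApplicationVersion("1.4.2")
                        .build()) {
                    // ... run queries as usual
                }
            }
        }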

    Add nodetool command to display or export the contents of a virtual table

    Several virtual tables were recently added, but they're currently only accessible via cqlsh or programmatically. While this is valuable for many use cases, operators are accustomed to the convenience of querying system metrics with a simple nodetool command.

    In addition to that, a relatively common request is to provide nodetool output in different formats (JSON, YAML and even XML) (CASSANDRA-5977, CASSANDRA-12035, CASSANDRA-12486, CASSANDRA-12698, CASSANDRA-12503). However, this requires a lot of manual labor, as each nodetool subcommand needs to be adapted to support new output formats.

    I propose adding a new nodetool command that will consistently print to the standard output the contents of a virtual table. By default the command will print the output in a human-readable tabular format similar to cqlsh, but a "--format" parameter can be specified to modify the output to some other format like JSON or YAML.

    It should be possible to add a limit to the amount of rows displayed and filter to display only rows from a specific keyspace or table. The command should be flexible and provide simple hooks for registration and customization of new virtual tables.

    I propose calling this command nodetool show <virtualtable> (naming bikeshedding welcome), for example:

    nodetool show --list
    caches clients internode_inbound internode_outbound settings sstable_tasks system_properties thread_pools

    nodetool show clients --format yaml
    ...

    nodetool show internode_outbound --format json
    ...

    nodetool show sstable_tasks --keyspace my_ks --table my_table
    ...
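
    As an illustration only, the YAML output for the clients table might look like the following (the exact column set and formatting are assumptions to be settled during implementation):

        clients:
          - address: 10.0.0.12
            port: 51042
            username: app_user
            driver_name: DataStax Java Driver
            driver_version: 4.13.0
            protocol_version: 5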

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Per-node overrides for table settings

    There are a few cases where it's convenient to set some table parameters on only one or a few nodes. For instance, it's useful for experimenting with settings like caching options, compaction, compression, read repair chance, gcGrace, ... Another case is when you want to completely migrate to a new setting, but want to do it node per node (mainly useful when switching compaction strategy, see CASSANDRA-10898).

    I'll note that we can already do some of this through JMX for some of the settings as we have methods like ColumnFamilyStoreMBean.setCompactionParameters(), but:

    1. these parameter settings are initially set in CQL. Having to go to JMX for this seems less consistent to me. The fact that we have both a ColumnFamilyStoreMBean.setCompactionParameters() and a ColumnFamilyStoreMBean.setCompactionParametersJson() (as I assume the former is inconvenient to use) is also proof to me that JMX isn't terribly appropriate (a JMX-based sketch follows this list).
    2. I think this can be potentially useful for almost all table settings, but we don't expose JMX methods for all of them, and it would be annoying to have to. The approach suggested below wouldn't have to be updated every time we add a new setting (if done right).
    3. Changing options through JMX is not persistent across restarts. This may arguably be fine in some cases, but if you're trying to migrate your compaction strategy node per node, or want to experiment with a setting over a medium-ish time period, it's mostly a pain.
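
    For reference, the JMX route mentioned in point 1 looks roughly like this today (a sketch; the MBean object name pattern is an assumption and differs between Cassandra versions, and the change does not survive a restart):

        import javax.management.MBeanServerConnection;
        import javax.management.ObjectName;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        public class JmxCompactionOverride {
            public static void main(String[] args) throws Exception {
                // Connect to a single node's JMX port (7199 by default).
                JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://192.168.0.1:7199/jmxrmi");
                try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    // Assumed object name pattern for the table's MBean.
                    ObjectName table = new ObjectName(
                            "org.apache.cassandra.db:type=Tables,keyspace=foo_ks,table=foo");
                    // setCompactionParametersJson is the MBean operation mentioned above;
                    // the new parameters are lost when the node restarts.
                    mbs.invoke(table, "setCompactionParametersJson",
                            new Object[] { "{\"class\":\"LeveledCompactionStrategy\"}" },
                            new String[] { String.class.getName() });
                }
            }
        }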

    So what I suggest would be to add node overrides to the normal table settings (which would be part of the schema like any other setting). In other words, if you wanted to set LCS for only one specific node, you'd do:

    ALTER TABLE foo WITH node_overrides = {
        '192.168.0.1' : { 'compaction' : { 'class' : 'LeveledCompactionStrategy' } }
    }
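
    Building on the hypothetical node_overrides syntax above, a node-per-node compaction migration could then be a plain sequence of schema changes (sketch only):

        -- switch the first node and observe it for a while
        ALTER TABLE foo WITH node_overrides = {
            '192.168.0.1' : { 'compaction' : { 'class' : 'LeveledCompactionStrategy' } }
        };

        -- once satisfied, make it the table-wide default and drop the override
        ALTER TABLE foo WITH compaction = { 'class' : 'LeveledCompactionStrategy' }
                        AND node_overrides = { };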

    I'll note that I already suggested this idea on CASSANDRA-10898, but as it's more generic than what that ticket is about, I'm creating a separate ticket for it.

    Difficulty: Challenging
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    Add ability to ttl snapshots

    It should be possible to add a TTL to snapshots, after which they are automatically cleaned up.

    This will be useful together with the auto_snapshot option, where you want to keep an emergency snapshot in case of an accidental drop or truncation but automatically remove it after a specified period once it's no longer useful. So, in addition to allowing a user to specify a snapshot TTL on nodetool snapshot, we should have an auto_snapshot_ttl option that lets a user set a TTL for the automatic snapshots taken on drop/truncate.
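
    A sketch of how the two pieces could fit together; the flag and option names below are hypothetical at this point:

        # operator-initiated snapshot that removes itself after three days (hypothetical flag)
        nodetool snapshot --ttl 3d -t pre_migration my_keyspace

        # cassandra.yaml (hypothetical option): TTL applied to the snapshots taken
        # automatically on DROP/TRUNCATE when auto_snapshot is enabled
        auto_snapshot: true
        auto_snapshot_ttl: 3d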

    Difficulty: Normal
    Potential mentors:
    paulo, mail: paulo (at) apache.org
    Project Devs, mail: dev (at) cassandra.apache.org

    ...