
Commons Numbers

Userguide and reports

Review the contents of the component with a view to providing an up-to-date user guide, and write benchmarking code (JMH) for generating performance reports.
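For the benchmarking part, a report generator could start from a minimal JMH benchmark such as the sketch below. The Complex.ofCartesian/multiply calls are just one plausible target taken from commons-numbers-complex and would be swapped for whichever operations the performance report should actually cover.

    import org.apache.commons.numbers.complex.Complex;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class ComplexMultiplyBenchmark {
        // Non-final fields so the JIT cannot constant-fold the benchmarked call away.
        private Complex a = Complex.ofCartesian(1.25, -0.5);
        private Complex b = Complex.ofCartesian(-3.0, 2.0);

        @Benchmark
        public Complex multiply() {
            // Returning the result prevents dead-code elimination.
            return a.multiply(b);
        }
    }

Run it with the standard JMH runner (org.openjdk.jmh.Main or the Runner API) to produce the numbers that would feed the report.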

Difficulty: Minor
Potential mentors:
Gilles Sadowski, mail: erans (at) apache.org
Project Devs, mail: dev (at) commons.apache.org

Apache IoTDB

Apache IoTDB website supports static documents and search engine

Apache IoTDB currently uses Vue to develop the website (iotdb.apache.org) and renders the Markdown documents from GitHub on the website using JavaScript.

However, there are two drawbacks now:

  1. if the documents from GitHub are rendered on the website using JavaScript, the Google crawler will never index the content of the documents.
  2. when users read the documents on the website, they may not know where a given topic is located. For example, someone looking for the syntax of 'show timeseries' may not know whether it is in chapter 5-1 or 5-4. So a search engine embedded in the website would be a good addition.

You should learn:

  • Vue.js
  • Other website development technologies.

Mentors:

hxd@apache.org

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org

Stream-based utilities

Since it is possible to release different modules with different language level requirements, we could consider creating a commons-numbers-complex-stream module to hold the utilities currently in class ComplexUtils.

From a management point of view, this would avoid carrying the maintenance burden of an outdated API once the whole component switches to Java 8.

Release 1.0 should not ship with ComplexUtils.
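A minimal sketch of what a stream-based utility in such a module might look like, assuming the Complex.ofCartesian factory from commons-numbers-complex; the class name and method are placeholders, not an agreed API.

    import java.util.stream.IntStream;
    import java.util.stream.Stream;
    import org.apache.commons.numbers.complex.Complex;

    /** Hypothetical stream-based replacement for array-oriented ComplexUtils helpers. */
    public final class ComplexStreams {
        private ComplexStreams() {}

        /**
         * Views interleaved (re, im) pairs as a stream of Complex numbers,
         * e.g. {1, 2, 3, 4} becomes (1 + 2i), (3 + 4i).
         */
        public static Stream<Complex> fromInterleaved(double[] data) {
            return IntStream.range(0, data.length / 2)
                            .mapToObj(i -> Complex.ofCartesian(data[2 * i], data[2 * i + 1]));
        }
    }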

Difficulty: Minor
Potential mentors:
Gilles Sadowski, mail: erans (at) apache.org
Project Devs, mail: dev (at) commons.apache.org

RocketMQ

Apache IoTDB integration with Prometheus

IoTDB is a highly efficient time series database.

Prometheus is a monitoring and alerting toolkit that supports collecting data from other systems, servers, and IoT devices, saving the data into a database, visualizing it, and querying it through APIs.


Prometheus allows users to plug in their own database, rather than just the built-in Prometheus storage, for storing time series data.

This proposal is for integrating IoTDB with Prometheus.
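One common integration path is a remote-storage adapter that receives Prometheus remote-write requests over HTTP and forwards the samples to IoTDB. The sketch below uses only the JDK's built-in HTTP server to show the shape of such an adapter; decoding the snappy-compressed protobuf payload and the actual IoTDB writes are left as placeholders because they depend on the Prometheus remote.proto definitions and on the IoTDB client API.

    import com.sun.net.httpserver.HttpServer;
    import java.io.InputStream;
    import java.net.InetSocketAddress;

    /** Skeleton of a Prometheus remote-write endpoint backed by IoTDB (port is arbitrary). */
    public class PrometheusRemoteWriteAdapter {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(1234), 0);
            server.createContext("/receive", exchange -> {
                try (InputStream body = exchange.getRequestBody()) {
                    byte[] payload = body.readAllBytes();
                    // TODO: snappy-decompress and parse the WriteRequest protobuf, then map
                    // each time series/sample to an IoTDB insert (e.g. via the session API).
                }
                exchange.sendResponseHeaders(200, -1); // empty 200 OK
                exchange.close();
            });
            server.start();
        }
    }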


You should know:

  • How to use Prometheus
  • How to use IoTDB
  • Java and Go language

difficulty: Major

mentors:

hxd@apache.org

RocketMQ Connect Cassandra

Content

The Cassandra sink connector allows writing data to Apache Cassandra. In this project, you need to implement a Cassandra sink connector based on the OpenMessaging connect API and run it on the RocketMQ connect runtime.
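A rough sketch of the shape such a sink task could take is shown below. The SinkRecord/put names are stand-ins modelled on typical connect-style APIs rather than the exact OpenMessaging connect interfaces, and the Cassandra write itself is only indicated in a comment (e.g. a CQL INSERT issued through a Cassandra client).

    import java.util.List;
    import java.util.Map;

    /** Stand-in for a connect-style record; the real project would use the OpenMessaging connect API types. */
    interface SinkRecord {
        String topic();
        Map<String, Object> payload();
    }

    public class CassandraSinkTaskSketch {

        /** Called by the connect runtime with a batch of records pulled from RocketMQ. */
        public void put(List<SinkRecord> records) {
            for (SinkRecord record : records) {
                // TODO: map record.payload() to a CQL INSERT and execute it with a
                // Cassandra client, batching and retrying where appropriate.
                System.out.printf("would write %s from topic %s%n", record.payload(), record.topic());
            }
        }
    }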
You should learn before applying for this topic

Cassandra/[Apache RocketMQ|https://rocketmq.apache.org/]/[Apache RocketMQ Connect|https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect]/ OpenMessaging Connect API

Mentor

duhengforever@apache.org, vongosling@apache.org

 
 

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect InfluxDB

Content

The InfluxDB sink connector allows moving data from Apache RocketMQ to InfluxDB: it writes data from a topic in Apache RocketMQ to InfluxDB. The InfluxDB source connector, in turn, is used to export data from an InfluxDB server to RocketMQ.

In this project, you need to implement an InfluxDB sink connector (the source connector is optional) based on the OpenMessaging connect API.

You should learn before applying for this topic

InfluxDB/[Apache RocketMQ|https://rocketmq.apache.org/]/[Apache RocketMQ Connect|https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect]/ OpenMessaging Connect API

Mentor

duhengforever@apache.org, wlliqipeng@apache.org, vongosling@apache.org

Apache IoTDB integration with MiNiFi/NiFi

IoTDB is a database for storing time series data.

MiNiFi is a data flow engine that transfers data from A to B, e.g., from PLC4X to IoTDB.

This proposal is for integrating IoTDB with MiNiFi/NiFi:

  • let MiNiFi/NiFi support writing data into IoTDB (a processor sketch follows below).
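A NiFi processor for this could start from a skeleton like the one below; AbstractProcessor, ProcessSession and Relationship are NiFi's standard extension points, while the IoTDB write itself is left as a placeholder since it depends on the IoTDB session or JDBC API chosen (property descriptors and relationship registration are omitted for brevity).

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    /** Skeleton of a NiFi processor that writes incoming FlowFiles into IoTDB. */
    public class PutIoTDB extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success").description("FlowFiles written to IoTDB").build();

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return; // nothing queued in this scheduling round
            }
            session.read(flowFile, in -> {
                // TODO: parse the FlowFile content (e.g. CSV or line protocol)
                // and insert the records through the IoTDB session API.
            });
            session.transfer(flowFile, REL_SUCCESS);
        }
    }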


Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB Database Connection Pool and integration with some web framework

IoTDB is a time series database.

When using a database in an application, a database connection pool helps a lot with performance and with saving resources.

Besides, when developing a website with Spring or another web framework, many developers no longer manage database connections manually. Instead, they just declare which database they will use and the framework handles the rest.

This proposal is for

  • letting IoTDB support database connection pools such as Apache Commons DBCP and C3P0 (a minimal DBCP sketch follows below);
  • integrating IoTDB with one web framework (e.g., Spring).
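As a rough illustration of the connection-pool side, pooling IoTDB JDBC connections with Apache Commons DBCP2 could look like the sketch below; the driver class name, JDBC URL and default credentials are assumptions based on IoTDB's JDBC module and should be checked against the actual driver.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.commons.dbcp2.BasicDataSource;

    public class IoTDBPoolExample {
        public static void main(String[] args) throws Exception {
            BasicDataSource pool = new BasicDataSource();
            // Driver class, URL and credentials are assumed from the IoTDB JDBC module.
            pool.setDriverClassName("org.apache.iotdb.jdbc.IoTDBDriver");
            pool.setUrl("jdbc:iotdb://127.0.0.1:6667/");
            pool.setUsername("root");
            pool.setPassword("root");
            pool.setMaxTotal(8); // at most 8 pooled connections

            try (Connection conn = pool.getConnection();
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TIMESERIES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }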


You should know:

  • IoTDB
  • At least one DB connection pool
  • Spring or some other web framework

mentors:

hxd@apache.org

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

The Operator for RocketMQ Exporter

The exporter exposes the endpoint for monitoring-data collection to the Prometheus server in the form of an HTTP service. The Prometheus server obtains the monitoring data to be collected by accessing the endpoint provided by the exporter. RocketMQ Exporter is such an exporter: it first collects data from the RocketMQ cluster, and then normalizes the collected data to meet the requirements of the Prometheus system with the help of the third-party client library provided by Prometheus. Prometheus regularly pulls data from the exporter. This topic requires implementing an operator for the RocketMQ Exporter to facilitate deploying the exporter on the Kubernetes platform.

You should learn before applying for this topic

RocketMQ-Exporter Repo
RocketMQ-Exporter Overview
Kubernetes Operator
RocketMQ-Operator

Mentor

wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache IoTDB trigger module for streaming computing

IoTDB is a time-series data management system and the data usually comes in a streaming way.

In the IoT area, when a data point comes, a trigger can be called because of the following scenario:

  • (single data point calculation) the data point is an outlier, or its value reaches a warning threshold; IoTDB needs to publish the data point to those who subscribed to the event.
  • (multiple time series data point calculation) a device sends several metrics to IoTDB, e.g., vehicle d1 sends its average speed and running time. Users may then want the mileage of the vehicle (speed x time); IoTDB needs to calculate the result and save it to another time series.
  • (time window calculation) a device reports its temperature every second. Even if the temperature is never too high, if it keeps increasing for 5 seconds, IoTDB needs to report the event to those who subscribe to it.


As there are many streaming computing projects already, we can integrate one of them into IoTDB.

  • If IoTDB runs on Edge, we can integrate Apache StreamPipes or Apache Edgent.
  • If IoTDB runs on a server, the above options also work, and Apache Flink is also a good choice.

The process is:

  • A user registers a trigger with IoTDB.
  • When a data point arrives, IoTDB saves it and checks whether any triggers are registered on it.
  • If so, it calls a streaming computing framework to do something (a minimal trigger sketch follows below).
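To make the time-window scenario above concrete, the per-point check of such a trigger could look roughly like this; the class is purely illustrative and is not an existing IoTDB API.

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Illustrative trigger: fires once N consecutive points form a strictly increasing run. */
    public class IncreasingWindowTrigger {
        private final int windowSize;
        private final Deque<Double> window = new ArrayDeque<>();

        public IncreasingWindowTrigger(int windowSize) {
            this.windowSize = windowSize;
        }

        /** Called for every incoming data point; returns true when the alert should fire. */
        public boolean onPoint(double value) {
            if (!window.isEmpty() && value <= window.peekLast()) {
                window.clear(); // the increasing run is broken, start over
            }
            window.addLast(value);
            if (window.size() > windowSize) {
                window.removeFirst();
            }
            return window.size() == windowSize;
        }
    }

With one point per second, new IncreasingWindowTrigger(5) corresponds to the "5 seconds of increasing temperature" example.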


You may need to know:

  • At least one streaming computing project.
  • SQL parser or some other DSL parser tool.

You will have to modify the source code of the IoTDB server engine module.

Difficulty: A little hard

mentors:

RocketMQ Connect IoTDB

Content

The IoTDB sink connector allows moving data from Apache RocketMQ to IoTDB. It writes data from a topic in Apache RocketMQ to IoTDB.

IoTDB (Internet of Things Database) is a data management system for time series data, which can provide users with specific services such as data collection, storage and analysis. Thanks to its lightweight structure, high performance and usable features, together with its seamless integration with the Hadoop and Spark ecosystems, IoTDB meets the requirements of massive dataset storage, high-throughput data ingestion and complex data analysis in the industrial IoT field.

In this project, there are some update operations on historical data, so it is necessary to ensure the sequential transmission and consumption of data via RocketMQ. If no update operations are used, then there is no need to guarantee the order of the data; IoTDB will handle data that arrives out of order.

So, in this project, you need to implement an IoTDB sink connector based on OpenMessaging connect API, and run it on RocketMQ connect runtime.

You should learn before applying for this topic

IoTDB/[Apache RocketMQ|https://rocketmq.apache.org/]/[Apache RocketMQ Connect|https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect]/ OpenMessaging Connect API

Mentor

duhengforever@apache.org, wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

A complete Apache IoTDB JDBC driver and integration with JDBC driver based tools (DBeaver and Apache Zeppelin)

Apache IoTDB is a database for time series data management that is written in Java. It provides a SQL-like query language and a JDBC driver for users. The current IoTDB JDBC driver implements some important interfaces of Statement, Connection, ResultSet, etc., which works well for most users' requirements.

However, we know there are many tools that can integrate with a database if the database has a standard JDBC driver, e.g., DBeaver, Apache Zeppelin, Tableau, etc.


This proposal is for implementing a standard JDBC driver for IoTDB, and using the driver to integrate with DBeaver and Apache Zeppelin.


Because Apache Zeppelin supports customized Interpreter, we can also implement an IoTDB interpreter for integration with Zeppelin.
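As a hedged sketch, the core of such an interpreter could be a small class like the one below. In the real implementation this logic would live inside a subclass of Zeppelin's Interpreter base class and return an InterpreterResult; the JDBC URL and credentials are assumptions based on IoTDB's JDBC module.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    /** Core of an IoTDB interpreter for Zeppelin: runs one paragraph of SQL and renders a %table string. */
    public class IoTDBInterpreterCore {

        public String interpret(String sql) throws Exception {
            Class.forName("org.apache.iotdb.jdbc.IoTDBDriver"); // assumed driver class
            try (Connection conn = DriverManager.getConnection("jdbc:iotdb://127.0.0.1:6667/", "root", "root");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                int columns = rs.getMetaData().getColumnCount();
                StringBuilder table = new StringBuilder("%table\n");
                for (int i = 1; i <= columns; i++) {
                    table.append(rs.getMetaData().getColumnName(i)).append(i < columns ? "\t" : "\n");
                }
                while (rs.next()) {
                    for (int i = 1; i <= columns; i++) {
                        table.append(rs.getString(i)).append(i < columns ? "\t" : "\n");
                    }
                }
                return table.toString(); // Zeppelin renders %table output as a grid
            }
        }
    }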


You should know:

  • how JDBC works.
  • learn to use IoTDB session API.
  • understand Zeppelin Interpreter interface. 


Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache Fineract

Upgrade Fineract 1.x to Java 11

Upgrade Fineract 1.x from Java 8 to Java 11 so we can start using the latest LTS Java version and features.


Difficulty: Major
Potential mentors:
Awasum Yannick, mail: awasum (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

RocketMQ


Apache RocketMQ CLI Admin Tool Developed by Golang

Apache RocketMQ provides a CLI admin tool developed in Java for querying, managing and diagnosing various problems. It also provides a set of API interfaces that can be called by Java application programs to create, delete and query topics, query messages, and perform other functions. This topic requires implementing the CLI management tool and a corresponding set of API interfaces in the Go language, through which Go applications can create and query topics and perform other operations.

You should learn before applying for this topic

Apache RocketMQ
RocketMQ Go Client

Mentor

wlliqipeng@apache.org, vongosling@apache.org

 
 

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Channel for Knative

Context

Knative is a Kubernetes-based platform for building, deploying and managing modern serverless applications. Knative provides a set of middleware components that are essential to building modern, source-centric, container-based applications that can run anywhere: on premises, in the cloud, or even in a third-party data centre. Knative consists of the Serving and Eventing components. Eventing is a system designed to address a common need for cloud-native development; it provides composable primitives to enable late-binding event sources and event consumers. Eventing also defines an event forwarding and persistence layer called a Channel, and each channel is a separate Kubernetes custom resource. This topic requires you to implement a RocketMQ channel based on Apache RocketMQ.

You should learn before applying for this topic

How Knative works
RocketMQSource for Knative
Apache RocketMQ Operator

Mentor

duhengforever@apache.org, wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org
Apache RocketMQ Ingestion for Druid

Context

Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets. In this topic, you should develop a RocketMQ indexing service that enables the configuration of supervisors on the Overlord, which facilitates ingestion from RocketMQ by managing the creation and lifetime of RocketMQ indexing tasks. These indexing tasks read events using RocketMQ's own partition and offset mechanism. The supervisor oversees the state of the indexing tasks to coordinate handoffs, manage failures, and ensure that the scalability and replication requirements are maintained.

You should learn before applying for this topic

Apache Druid Data Ingestion

Mentor

vongosling@apache.org, duhengforever@apache.org

RocketMQ Connect Elasticsearch

Content

The Elasticsearch sink connector allows moving data from Apache RocketMQ to Elasticsearch 6.x, and 7.x. It writes data from a topic in Apache RocketMQ to an index in Elasticsearch and all data for a topic have the same type.

Elasticsearch is often used for text queries, analytics and as a key-value store (use cases). The connector covers both the analytics and key-value store use cases.

For the analytics use case, each message in RocketMQ is treated as an event, and the connector uses topic + message queue + offset as a unique identifier for events, which are then converted to unique documents in Elasticsearch. For the key-value store use case, it supports using keys from RocketMQ messages as document ids in Elasticsearch and provides configurations ensuring that updates to a key are written to Elasticsearch in order.

So, in this project, you need to implement a sink connector based on the OpenMessaging connect API, and it will be executed on the RocketMQ connect runtime.

You should learn before applying for this topic

Elasticsearch/[Apache RocketMQ|https://rocketmq.apache.org/]/[Apache RocketMQ Connect|https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect]/ OpenMessaging Connect API

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Connect Hudi

Context

Hudi can ingest and manage the storage of large analytical datasets over DFS (HDFS or cloud stores). It can act as either a source or a sink for a stream processing platform such as Apache RocketMQ, and it can also be used as a state store inside a processing DAG (similar to how RocksDB is used by Flink). This is an item on the Apache RocketMQ roadmap. In this project, you should implement a full Hudi source and sink based on the RocketMQ connect framework, which is the most important implementation of OpenConnect.

You should learn before applying for this topic

Apache RocketMQ Connect Framework
Apache Hudi

Mentor

vongosling@apache.org, duhengforever@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Schema Registry

Content

In order to help RocketMQ improve its event management capabilities, better decouple producers and receivers, and keep events forward compatible, we need a service for event metadata management, called a schema registry.

Schema registry will provide a GraphQL interface for developers to define standard schemas for their events, share them across the organization and safely evolve them in a way that is backward compatible and future proof.
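As a thought experiment rather than an existing RocketMQ API, the core of such a registry is a mapping from subjects to ordered schema versions plus a compatibility check on registration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Toy in-memory schema registry illustrating versioning and a compatibility hook. */
    public class InMemorySchemaRegistry {
        private final Map<String, List<String>> subjects = new ConcurrentHashMap<>();

        /** Registers a new schema version if it is compatible with the latest one. */
        public synchronized int register(String subject, String schema) {
            List<String> versions = subjects.computeIfAbsent(subject, s -> new ArrayList<>());
            if (!versions.isEmpty() && !isBackwardCompatible(versions.get(versions.size() - 1), schema)) {
                throw new IllegalArgumentException("Incompatible schema for subject " + subject);
            }
            versions.add(schema);
            return versions.size(); // version numbers start at 1
        }

        /** Placeholder: a real registry would compare the fields of the two schema definitions. */
        private boolean isBackwardCompatible(String previous, String proposed) {
            return true;
        }
    }

The GraphQL interface mentioned above would then sit in front of operations like register, exposing them to developers.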

You should learn before applying for this topic

Apache RocketMQ/Apache RocketMQ SDK/

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

CloudEvents support for RocketMQ

Context

Events are everywhere. However, event producers tend to describe events differently.

The lack of a common way of describing events means developers must constantly re-learn how to consume events. This also limits the potential for libraries, tooling and infrastructure to aid the delivery of event data across environments, like SDKs, event routers or tracing systems. The portability and productivity we can achieve from event data is hindered overall.

CloudEvents is a specification for describing event data in common formats to provide interoperability across services, platforms and systems.
RocketMQ, as an event streaming platform, also hopes to improve the interoperability of different event platforms by being compatible with the CloudEvents standard and supporting the CloudEvents SDK. In this topic, you need to improve the binding spec and implement the RocketMQ CloudEvents SDK (Java, Golang or others).
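A hedged sketch of what mapping a CloudEvent onto a RocketMQ message could look like, using the CloudEvents Java SDK builder and RocketMQ's Message class; the attribute-to-property mapping shown is illustrative only and is not the binding spec itself.

    import io.cloudevents.CloudEvent;
    import io.cloudevents.core.builder.CloudEventBuilder;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.rocketmq.common.message.Message;

    public class CloudEventToRocketMQ {

        /** Copies the required CloudEvents context attributes into RocketMQ user properties. */
        public static Message toMessage(String topic, CloudEvent event) {
            byte[] body = event.getData() == null ? new byte[0] : event.getData().toBytes();
            Message message = new Message(topic, body);
            message.putUserProperty("ce_id", event.getId());
            message.putUserProperty("ce_source", event.getSource().toString());
            message.putUserProperty("ce_type", event.getType());
            message.putUserProperty("ce_specversion", event.getSpecVersion().toString());
            return message;
        }

        public static void main(String[] args) {
            CloudEvent event = CloudEventBuilder.v1()
                    .withId("1234")
                    .withSource(URI.create("/sensors/device-42"))
                    .withType("com.example.temperature")
                    .withData("application/json", "{\"t\":21.5}".getBytes(StandardCharsets.UTF_8))
                    .build();
            System.out.println(toMessage("SENSOR_EVENTS", event).getProperties());
        }
    }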

You should learn before applying for this topic

Apache RocketMQ/Apache RocketMQ SDK/CloudEvents

Mentor

duhengforever@apache.org

Apache RocketMQ Connect Flink

Context

There are many ways that Apache Flink and Apache RocketMQ can integrate to provide elastic data processing at a large scale. RocketMQ can be used as a streaming source and streaming sink in Flink DataStream applications, which is the main implementation and popular usage in RocketMQ community. Developers can ingest data from RocketMQ into a Flink job that makes computations and processes real-time data, to then send the data back to a RocketMQ topic as a streaming sink. More details you could see from https://github.com/apache/rocketmq-externals/tree/master/rocketmq-flink.

With more and more DW or OLAP engineers using RocketMQ for their data processing work, another potential integration need arose: developers could take advantage of RocketMQ as both a streaming source and a streaming table sink for Flink SQL or Table API queries. Also, Flink 1.9.0 makes the Table API a first-class citizen. It's time to support SQL in RocketMQ. This is the topic for Apache RocketMQ connect Flink.
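A minimal DataStream-style sketch of the kind of job the existing connector already enables, and on which the Table API work would build; the RocketMQ source and sink from rocketmq-externals/rocketmq-flink are only indicated in comments because their exact class names and configuration should be taken from that repository.

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocketMQFlinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for the RocketMQ source from rocketmq-flink; the real job would
            // add that source here and a RocketMQ sink instead of print() at the end.
            env.fromElements("order-1", "order-2", "order-3")
               .map(value -> "processed " + value)
               .returns(Types.STRING)
               .print();

            env.execute("rocketmq-flink sketch");
        }
    }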

You should learn before applying for this topic

Apache RocketMQ Flink Connector
Apache Flink Table API

Extension

For students with expertise in the streaming field, you could go further and implement an exactly-once streaming source and an at-least-once (or exactly-once) streaming sink, as issue #500 describes.

Mentor

nicholasjiang@apache.org, duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Scaler for KEDA

Context

KEDA allows for fine-grained autoscaling (including to/from zero) for event-driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition. KEDA has a number of “scalers” that can both detect if a deployment should be activated or deactivated, and feed custom metrics for a specific event source. In this topic, you need to implement the RocketMQ scalers.

You should learn before applying for this topic

Helm/Apache RocketMQ Operator/Apache RocketMQ Docker Image
Apache RocketMQ multi-replica mechanism (based on DLedger)
How KEDA works

Mentor

wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Camel

camel-minio - Component to store/load files from blob store

min.io is an S3-like blob store, so users have more freedom than being locked into AWS.

We can create a camel-minio component for it
https://github.com/minio/minio-java
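A hedged sketch of how such a component might be used from a route once it exists; the minio: URI syntax and option names below are hypothetical, since the component is exactly what this issue proposes to create.

    import org.apache.camel.builder.RouteBuilder;

    public class MinioRouteSketch extends RouteBuilder {
        @Override
        public void configure() {
            // Hypothetical endpoint URI; the real option names would be defined by the new component.
            from("file:data/inbox")
                .to("minio:my-bucket?endpoint=http://localhost:9000&accessKey=minio&secretKey=RAW(minio123)");
        }
    }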

Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org


Camel grpc component doesn't transfer the Message headers

Headers that are added to the Message in the camel Exchange before making a call to the camel-grpc component are not received at the grpc consumer. The expectation is that these headers would be added to the grpcStub before sending over the wire (like other components like http4 etc).

Our team has come up with a workaround for this but it is extremely cumbersome. We had to extend the GrpcProducer to introduce a custom GrpcExchangeForwarder that would copy header from exchange to the stub before invoking the sync/async method.

At the consumer side we had to extend the GrpcConsumer to use a custom ServerInterceptor to capture the grpc headers and custom MethodHandler to transfer the grpc headers to the Camel exchange headers.

Difficulty: Major
Potential mentors:
Vishal Vijayan, mail: vijayanv (at) apache.org
Project Devs, mail: dev (at) camel.apache.org


Link to 'easy to resolve' issues in ASF Jira does not work

Contributor guidelines page here has a link pointing to ASF Jira. The link targets a filter, owned and published by Daniel Kulp. Something in the data structures appears to have changed over the course of time and Jira is erroring out. This could be a turn-off for potential new contributors.

There are two ways to fix this.
1. If this filter could be fixed by its owner, that's the easiest possible option.

2. Alternatively, someone with access to create filters in ASF Jira should create an easy-to-pick filter, and the link to the filter should be updated at <root>/docs/user-manual/modules/ROOT/pages/contributing.adoc : Line 9

Difficulty: Minor
Potential mentors:
Praveen Kottarathil, mail: praveenkottarathil@gmail.com
Project Devs, mail: dev (at) camel.apache.org


java-dsl - Add support for method references bean::methodName

Hi

This is not related only to spring integration.
I would like to be able to use a Spring service-annotated class or bean directly from a route, but without using the method name as a string, i.e. .bean(instance, "<method name>"), and instead use a method reference: .bean(instance::method)

But why?: 
1. not being able to navigate quickly (open) that method from the IDE; some intermediary steps are needed to reach that method.
2. the use of reflection internally by Camel to call that method.
3. not being able to rename the method without breaking the route.
4. not being able to see quickly (Alt+F7) who calls a method in the IDE.
5. using strings to reference a method when we have method references seems not right.

As a workaround I had to add a helper class to simulate passing of method references and then internally to translate to method.

In case it helps explaining I am attaching the helper Bean.java class (you can use it for free or to do better).

You can use the class in any route like this:

from (X)
.bean(call(cancelSubscriptionService::buildSalesforceCase))
.to(Y)
.routeId(Z);

As you see I am forced to use the intermediary helper 'call' in order to translate to an Expression.
I would like to not have to use my helper and have the support built directly into Camel if possible. Let me know if there is a better solution to my problem.

Thanks
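The attached Bean.java is not reproduced here; as a rough illustration of the idea (not the attached class and not an existing Camel API), a helper can wrap a method reference in a Camel Expression like this:

    import java.util.function.Function;
    import org.apache.camel.Exchange;
    import org.apache.camel.Expression;

    /** Illustrative helper: wraps a method reference that takes the message body and returns a result. */
    public final class MethodRefs {
        private MethodRefs() {}

        public static <T, R> Expression call(Class<T> bodyType, Function<T, R> method) {
            return new Expression() {
                @Override
                public <X> X evaluate(Exchange exchange, Class<X> type) {
                    R result = method.apply(exchange.getIn().getBody(bodyType));
                    return exchange.getContext().getTypeConverter().convertTo(type, result);
                }
            };
        }
    }

A route could then use, for instance, .transform(MethodRefs.call(String.class, cancelSubscriptionService::buildSalesforceCase)) instead of a string method name (assuming the method takes the message body), which is roughly the convenience this issue asks Camel to provide natively.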

Difficulty: Major
Potential mentors:
Cristian Donoiu, mail: doncristiano (at) apache.org
Project Devs, mail: dev (at) camel.apache.org


Expose OData4 based service as consumer endpoint

Right now, only a polling consumer is available for the olingo4 component. It would be better to have a real listening consumer for this.

The method may have a name like 'listen' to be able to create a listening consumer.

Difficulty: Major
Potential mentors:
Dmitry Volodin, mail: dmvolod (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-snmp - Support for multiple security mechanisms in SNMP v3

Allow to add multiple users for SNMP v3 i.e. the SnmpTrapConsumer should support multiple combinations of authentication and privacy protocols and different passphrases. We cannot have a route per security mechanism.


Consider the below scenario.

I have multiple SNMP devices which have multiple authentication protocols and privacy protocols with different passphrases. Moreover, they can send any version of SNMP traps from v1 to v3. I must be able to configure those in a properties file or a DSL (i.e. the snmp version, the USM users etc).

Example:


            snmp.getUSM().addUser(
            new OctetString("MD5DES"),
            new UsmUser(new OctetString("MD5DES"),
            AuthMD5.ID,
            new OctetString("UserName"), PrivDES.ID,
            new OctetString("PasswordUser")));
            snmp.getUSM().addUser(
            new OctetString("MD5DES"),
            new UsmUser(new OctetString("MD5DES"),
            null, null, null,
            null));
             
            

.. other users with different auth, priv mechanisms (i.e. different security mechanisms). I must be able to receive traps from all of them.

Difficulty: Minor
Potential mentors:
Gowtham Gutha, mail: gowthamgutha (at) apache.org
Project Devs, mail: dev (at) camel.apache.org


camel-restdsl-swagger-plugin - create camel routes for generated rest DSL

camel-restdsl-swagger-plugin can generate CamelRoutes.java from a Swagger / OpenAPI spec, which includes the REST DSL with to("direct:restN") stubs. Would be nice if it also autogenerated the equivalent from("direct:restN").log() placeholders to help jump start coding.

Difficulty: Major
Potential mentors:
Scott Cranton, mail: scranton (at) apache.org
Project Devs, mail: dev (at) camel.apache.org


Add tool to generate swagger doc at build time

At the moment we do not have a tool that can generate the swagger doc at build time. However, I think it would be possible to develop such a tool. We have existing tooling that parses the Java or XML source code (Camel routes), which we use for validating endpoints or doing route-coverage reports, etc.
 
https://github.com/apache/camel/blob/master/tooling/maven/camel-maven-plugin/src/main/docs/camel-maven-plugin.adoc
 
We could then make that tool parse the rest-dsl and build up that model behind the scene and feed that into the swagger-java library for it to spit out the generated swagger doc.
 
We could make it as a goal on the existing camel-maven-plugin, or build a new maven plugin: camel-maven-swagger or something. Then people could use it during build time to generate the swagger doc etc. 
 
We should maybe allow overriding/configuring things from the tooling as well, so you can set/mask the hostname, set descriptions, or other things that may not all be detailed in the rest-dsl.

Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Add call options to the camel-grpc component

Add advanced call options that relate to a single operation and do not override the channel options:

  • deadline
  • compression
  • etc.
Difficulty: Major
Potential mentors:
Dmitry Volodin, mail: dmvolod (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Upgrade to JUnit 5

See http://junit.org/junit5/

Note: it provides a junit-vintage module so we should be able to migrate stuffs easily

Most users should now be able to write JUnit 5 tests using the modules created in CAMEL-13342.
Concerning the migration of Camel's own tests to JUnit 5, the last blocker is that migrating flaky tests to JUnit 5 is not handled until maven-surefire 3 has been released or until the open discussions in the JUnit team have converged.

Difficulty: Major
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Camel website

This is an issue to track the work on the new Camel website.

If you wish to contribute to building the new Camel website please look in the website component issues labelled with help-wanted.

Difficulty: Major
Potential mentors:
Zoran Regvart, mail: zregvart (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Create a component for Kafka-Stream

Difficulty: Minor
Potential mentors:
Andrea Cosentino, mail: acosentino (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Ability to load an SSLContextParameter with a Keystore containing multiple keys (aliases)

Hello,
I wish I could initialize a single SSLContextParameters at Camel startup containing my truststore.jks (> 1 alias) and my keystore.jks (> 1 alias) in order to refer to it in routes (FTPS, HTTPS) without having to redefine a new SSLContextParameters for each endpoint.

<camel:sslContextParameters id="sslIContextParameters">
<camel:trustManagers>
<camel:keyStore password="${truststore.jks.file.password}"
resource="${truststore.jks.file.location}" />
</camel:trustManagers>
<camel:keyManagers >
<camel:keyStore password="${keystore.jks.file.password}"
resource="${keystore.jks.file.location}" />
</camel:keyManagers>
</camel:sslContextParameters>

When my Keystore contains more than 1 alias, I have the following error when creating the Route at startup : 

Caused by: org.apache.camel.ResolveEndpointFailedException: Failed to resolve endpoint: https4://<host>:<port>/<address>?authPassword=RAW(password)&authUsername=login&authenticationPreemptive=true&bridgeEndpoint=true&sslContextParameters=sslContextParameters&throwExceptionOnFailure=true due to: Cannot recover key

due to

Caused by: java.security.UnrecoverableKeyException: Cannot recover key


When my keystore contains only one key, it works very well.

<camel:sslContextParameters id="sslIContextParameters">
<camel:trustManagers>
<camel:keyStore password="${truststore.jks.file.password}"
resource="${truststore.jks.file.location}" />
</camel:trustManagers>
<camel:keyManagers keyPassword="keyPassword">
<camel:keyStore password="${keystore.jks.file.password}"
resource="${keystore.jks.file.location}" />
</camel:keyManagers>
</camel:sslContextParameters>


So I would like to be able to use my SSLContextParameters for different endpoints by specifying (if necessary) the alias of the keystore entry needed (by specifying the alias and/or the password of the key).


Objectif in my project :

  • 1 TrustStore.jks 
  • 1 Keystore.jsk
  • 1 unique SSLContextParameter
  • > 200 Camel routes FTPS/HTTPS (SSL one-way or two-way)


Thank a lot



Difficulty: Major
Potential mentors:
Florian B., mail: Boosy (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Create a camel component for etcd v3

Difficulty: Minor
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Beam

BeamSQL aggregation analytics functionality

BeamSQL has a long list of aggregation/aggregation analytics functionalities to support.

To begin with, you will need to support this syntax:

            analytic_function_name ( [ argument_list ] )
            OVER (
            [ PARTITION BY partition_expression_list ]
            [ ORDER BY expression [{ ASC | DESC }] [, ...] ]
            [ window_frame_clause ]
            )
            

This will require touching core components of BeamSQL:
1. SQL parser to support the syntax above.
2. SQL core to implement physical relational operator.
3. Distributed algorithms to implement a list of functions in a distributed manner.
4. Build benchmarks to measure performance of your implementation.

To understand what SQL analytics functionality is, you could check this great explanation doc: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts.

To know about Beam's programming model, check: https://beam.apache.org/documentation/programming-guide/#overview
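For orientation, this is roughly how an analytic query would be expressed against BeamSQL from the Java SDK; the OVER clause below is exactly the kind of query that is rejected today and that this project would make work.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    public class AnalyticFunctionExample {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            Schema schema = Schema.builder()
                    .addStringField("item")
                    .addInt64Field("purchases")
                    .build();
            PCollection<Row> orders = p.apply(Create.of(
                    Row.withSchema(schema).addValues("apple", 8L).build(),
                    Row.withSchema(schema).addValues("banana", 2L).build())
                .withRowSchema(schema));

            // Running total over an ORDER BY window -- unsupported today, the goal of this project.
            orders.apply(SqlTransform.query(
                    "SELECT item, SUM(purchases) OVER (ORDER BY purchases) AS running_total FROM PCOLLECTION"));

            p.run().waitUntilFinish();
        }
    }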

Difficulty: Major
Potential mentors:
Rui Wang, mail: amaliujia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Add Daffodil IO for Apache Beam

From https://daffodil.apache.org/:

Daffodil is an open source implementation of the DFDL specification that uses these DFDL schemas to parse fixed format data into an infoset, which is most commonly represented as either XML or JSON. This allows the use of well-established XML or JSON technologies and libraries to consume, inspect, and manipulate fixed format data in existing solutions. Daffodil is also capable of the reverse by serializing or “unparsing” an XML or JSON infoset back to the original data format.

We should create a Beam IO that accepts a DFDL schema as an argument and can then produce and consume data in the specified format. I think it would be most natural for Beam users if this IO could produce Beam Rows, but an initial version that just operates with Infosets could be useful as well.
Difficulty: Major
Potential mentors:
Brian Hulette, mail: bhulette (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Implement Nexmark (benchmark suite) in Python and integrate it with Spark and Flink runners

Apache Beam [1] is a unified and portable programming model for data processing jobs (pipelines). The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

Nexmark [5] is a benchmark for streaming jobs, basically a set of jobs (queries) to test different use cases of the execution system. Beam implemented Nexmark for Java [6, 7] and it has been successfully used to improve the features of multiple Beam runners and discover performance regressions.

Thanks to the work on portability [8] we can now run Beam pipelines on top of open source systems like Apache Spark [9] and Apache Flink [10]. The goal of this issue/project is to implement the Nexmark queries on Python and configure them to run on our CI on top of open source systems like Apache Spark and Apache Flink. The goal is that it helps the project to track and improve the evolution of portable open source runners and our python implementation as we do for Java.

Because of the time constraints of GSoC we will adjust the goals (sub-tasks) depending on progress.

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
[5] https://web.archive.org/web/20100620010601/http://datalab.cs.pdx.edu/niagaraST/NEXMark/
[6] https://beam.apache.org/documentation/sdks/java/testing/nexmark/
[7] https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark
[8] https://beam.apache.org/roadmap/portability/
[9] https://spark.apache.org/
[10] https://flink.apache.org/

Difficulty: Minor
Potential mentors:
Ismaël Mejía, mail: iemejia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Apache Airflow

Upgrade to JUnit 5

See http://junit.org/junit5/

Note: it provides a junit-vintage module so we should be able to migrate stuffs easily

Most users should now be able to write JUnit 5 tests using the modules created in CAMEL-13342.
Concerning the migration of camel own tests to JUnit5, the last blocker is that migrating flaky tests to JUnit 5 is not handled until mavensurefire 3 has been released or until open discussions in the junit team has converged.

Difficulty: Major
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Camel website

This is an issue to track the work on the new Camel website.

If you wish to contribute to building the new Camel website please look in the website component issues labelled with help-wanted.

Difficulty: Major
Potential mentors:
Zoran Regvart, mail: zregvart (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Create a component for Kafka-Stream


Difficulty: Minor
Potential mentors:
Andrea Cosentino, mail: acosentino (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Create a camel component for etcd v3

Difficulty: Minor
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Beam

Implement Nexmark (benchmark suite) in Python and integrate it with Spark and Flink runners

Apache Beam [1] is a unified and portable programming model for data processing jobs (pipelines). The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

Nexmark [5] is a benchmark for streaming jobs, basically a set of jobs (queries) to test different use cases of the execution system. Beam implemented Nexmark for Java [6, 7] and it has been successfully used to improve the features of multiple Beam runners and to discover performance regressions.

Thanks to the work on portability [8] we can now run Beam pipelines on top of open source systems like Apache Spark [9] and Apache Flink [10]. The goal of this issue/project is to implement the Nexmark queries in Python and configure them to run on our CI on top of open source systems like Apache Spark and Apache Flink, so that we can track and improve the evolution of the portable open source runners and of our Python implementation as we already do for Java.

Because of the time constraints of GSoC we will adjust the goals (sub-tasks) depending on progress.

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
[5] https://web.archive.org/web/20100620010601/http://datalab.cs.pdx.edu/niagaraST/NEXMark/
[6] https://beam.apache.org/documentation/sdks/java/testing/nexmark/
[7] https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark
[8] https://beam.apache.org/roadmap/portability/
[9] https://spark.apache.org/
[10] https://flink.apache.org/
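
As a minimal sketch of the plumbing involved (not a Nexmark query), the snippet below runs a trivial Beam Python pipeline against the portable Flink runner; it assumes a recent Beam Python SDK where --runner=FlinkRunner resolves to the portable runner and can start a local job server:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Toy pipeline standing in for a Nexmark query; the interesting GSoC work is the
    # event generator and the query transforms that would replace it.
    options = PipelineOptions([
        "--runner=FlinkRunner",         # portable Flink runner
        "--environment_type=LOOPBACK",  # run the Python SDK harness in-process for local runs
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | "CreateBids" >> beam.Create([("bidder-1", 10), ("bidder-2", 25), ("bidder-1", 5)])
         | "SumPerBidder" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))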

Difficulty: Minor
Potential mentors:
Ismaël Mejía, mail: iemejia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Implement an Azure blobstore filesystem for Python SDK

This is similar to BEAM-2572, but for Azure's blobstore.
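
A very rough sketch of where such a contribution would start, assuming it mirrors the existing GCSFileSystem/S3FileSystem implementations and sits on top of the azure-storage-blob client (the class name and the azfs scheme are assumptions here, not an existing API):

    from apache_beam.io.filesystem import FileSystem

    class BlobStorageFileSystem(FileSystem):
        """Hypothetical FileSystem for Azure Blob Storage, addressed as azfs://container/blob.

        The real work is implementing FileSystem's abstract operations (matching/listing,
        create/open streams, copy, rename, delete, size, checksum, ...) the way the
        existing GCS and S3 filesystems do.
        """

        @classmethod
        def scheme(cls):
            # Beam routes a path to a FileSystem implementation by its URL scheme.
            return "azfs"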

Difficulty: Major
Potential mentors:
Pablo Estrada, mail: pabloem (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Apache Airflow

CLONE - Allow filtering by all columns in Browse Logs view

The "Browse Logs" UI currently allows filtering by "DAG ID", "Task ID", "Execution Date", and "Extra".

For consistency and flexibility, it would be good to allow filtering by any of the available columns, specifically "Datetime", "Event", "Execution Date", and "Owner". 

Difficulty: Minor
Potential mentors:
Ebrima Jallow, mail: maubeh1 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

UI - Show count of tasks in each dag on the main dags page

The main DAGs page in the UI would benefit from a new column showing the number of tasks for each DAG ID.

Difficulty: Minor

list_dag_runs cli command should allow exec_date between start/end range and print start/end times

1. Accept exec_date_from and exec_date_to arguments to filter the execution dates returned, i.e. show DAG runs with an execution date between 20190901 and 20190930.
2. Separately from that, print the start_date and end_date of each DAG run in the output (i.e. the run for execution date 20190907 had start_date 2019-09-08 04:23 and end_date 2019-09-08 04:38).
3. The dag_id argument should be optional (see the sketch below).

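A small argparse sketch of the proposed interface (argument names are assumptions, not the actual Airflow CLI code):

    import argparse

    parser = argparse.ArgumentParser(prog="airflow list_dag_runs")
    parser.add_argument("--dag_id", help="optional, per point 3")
    parser.add_argument("--exec_date_from", help="e.g. 2019-09-01")
    parser.add_argument("--exec_date_to", help="e.g. 2019-09-30")

    # Example invocation: only runs with an execution date inside the window are listed,
    # and the output would also print each run's start_date and end_date.
    args = parser.parse_args(["--exec_date_from", "2019-09-01", "--exec_date_to", "2019-09-30"])
    print(args)
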
Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

One of Celery executor tests is flaky

tests/executors/test_celery_executor.py::TestCeleryExecutor::test_celery_integration_0_amqp_guest_guest_rabbitmq_5672


Log attached.

Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

add GDriveToGcsOperator

There is a GcsToGDriveOperator, but there is no equivalent in the other direction; see the skeleton below.

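A hedged skeleton of what such an operator could look like (class name taken from the title; the hook wiring is omitted and left as comments, so this is a shape sketch rather than working transfer code):

    from airflow.models import BaseOperator

    class GDriveToGcsOperator(BaseOperator):
        """Copies a single file from Google Drive to a GCS bucket (skeleton only)."""

        def __init__(self, source_file_id, destination_bucket, destination_object, **kwargs):
            super().__init__(**kwargs)
            self.source_file_id = source_file_id
            self.destination_bucket = destination_bucket
            self.destination_object = destination_object

        def execute(self, context):
            # 1. Download the file content from Drive via a Drive hook.
            # 2. Upload the bytes to GCS via a GCS hook, mirroring GcsToGDriveOperator.
            raise NotImplementedError("skeleton only")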


Difficulty: Major
Potential mentors:
lovk korm, mail: lovk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

prevent autocomplete of username in login UI

The login page of the UI has autocomplete enabled for the username field. This should be disabled for security reasons.

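For illustration only (this is not Airflow's actual login form, just the generic WTForms mechanism that could be applied to it), render_kw can add autocomplete="off" to the generated input elements:

    from flask_wtf import FlaskForm
    from wtforms import PasswordField, StringField

    class LoginForm(FlaskForm):
        # render_kw injects extra HTML attributes into the rendered <input> tags.
        username = StringField("Username", render_kw={"autocomplete": "off"})
        password = PasswordField("Password", render_kw={"autocomplete": "off"})
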
Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Mock Cassandra in tests

Cassandra consumes 1.173GiB of memory. Travis does not have very powerful machines, so we should limit the system/integration tests of components that do not require much attention, e.g. those that are not changed often. Cassandra is a good candidate for this. This will free up machine power for more needed work.

    CONTAINER ID   NAME                                  CPU %   MEM USAGE / LIMIT     MEM %    NET I/O          BLOCK I/O         PIDS
    8aa37ca50f7c   ci_airflow-testing_run_1f3aeb6d1052   0.00%   5.715MiB / 3.855GiB   0.14%    1.14kB / 0B      2.36MB / 0B       2
    f2b3be15558f   ci_cassandra_1                        0.69%   1.173GiB / 3.855GiB   30.42%   2.39kB / 0B      75.3MB / 9.95MB   50
    ef1de3981ca6   ci_krb5-kdc-server_1                  0.02%   12.15MiB / 3.855GiB   0.31%    2.46kB / 0B      18.9MB / 184kB    4
    be808233eb91   ci_mongo_1                            0.31%   36.71MiB / 3.855GiB   0.93%    2.39kB / 0B      43.2MB / 19.1MB   24
    667e047be097   ci_rabbitmq_1                         0.77%   69.95MiB / 3.855GiB   1.77%    2.39kB / 0B      43.2MB / 508kB    92
    2453dd6e7cca   ci_postgres_1                         0.00%   7.547MiB / 3.855GiB   0.19%    1.05MB / 889kB   35.4MB / 145MB    6
    78050c5c61cc   ci_redis_1                            0.29%   1.695MiB / 3.855GiB   0.04%    2.46kB / 0B      6.94MB / 0B       4
    c117eb0a0d43   ci_mysql_1                            0.13%   452MiB / 3.855GiB     11.45%   2.21kB / 0B      33.9MB / 548MB    21
    131427b19282   ci_openldap_1                         0.00%   45.68MiB / 3.855GiB   1.16%    2.64kB / 0B      32.8MB / 16.1MB   4
    8c2549c010b1   ci_docker_1                           0.59%   22.06MiB / 3.855GiB   0.56%    2.39kB / 0B      95.9MB / 291kB    30
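
One possible direction, sketched below on the assumption that the cassandra-driver package is importable in the test environment: patch the driver's Cluster class so the tests never need a real Cassandra container. This only illustrates the mocking approach and is not existing Airflow test code.

    from unittest import mock

    # Replace the DataStax driver's Cluster so no real Cassandra service is required.
    with mock.patch("cassandra.cluster.Cluster") as mock_cluster:
        session = mock_cluster.return_value.connect.return_value
        session.execute.return_value = [("my_keyspace", "my_table")]

        # The code under test would call Cluster(...).connect().execute(...) and can be
        # asserted against the canned result above.
        rows = mock_cluster(["127.0.0.1"]).connect().execute("SELECT ...")
        assert list(rows) == [("my_keyspace", "my_table")]
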
Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Pool name - increase length > 50, make cli give error if too large

Create some pool names (using the CLI) with a length of 70 or 80 characters.

1. The UI does not allow creating names longer than 50 characters, so why does the CLI?

Click on one of the pool names listed (the link is cut to the 50-character name: https://domain:8080/admin/airflow/task?flt1_pool_equals=qjfdal_CRCE_INTERCONNECTION_FORECAST_TNC_EJFLSA_LP)

If you click 'edit', it shows the full 80 characters in Description but only the first 50 characters in Pool.

2. Why limit the length to 50 at all? It should be increased, say to 256.

3. If trying to create a name longer than the model's column length, the CLI should give an error (see the sketch below).

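A sketch of the kind of CLI-side check point 3 asks for (the 256 limit and the function are illustrative assumptions, not Airflow's actual code):

    POOL_NAME_MAX_LEN = 256  # assumed to match the length of the pool column in the model

    def validate_pool_name(name: str) -> str:
        if len(name) > POOL_NAME_MAX_LEN:
            raise SystemExit(
                "Pool name is %d characters long; the maximum is %d."
                % (len(name), POOL_NAME_MAX_LEN)
            )
        return name
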
Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

...

SageMakerEndpointOperator is not idempotent

The SageMakerEndpointOperator currently takes an argument "operation" with value "create"/"update", which determines whether to create or update a SageMaker endpoint. However, this doesn't work in the following situation:

  • DAG run #1 creates the endpoint (you have to provide operation="create" here)
  • Following DAG runs update the endpoint created by DAG run #1 (you would have to edit the DAG and set operation="update" here)

This should be a very valid use case IMO.

The SageMakerEndpointOperator should itself check whether an endpoint with name X already exists and overwrite it (configurable by the user), as sketched below.

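A minimal sketch of the proposed behaviour using the boto3 SageMaker client directly (this is not the operator's code; inside the operator the same check would go through its hook):

    import boto3
    import botocore.exceptions

    def deploy_endpoint(endpoint_name, endpoint_config_name, region_name="us-east-1"):
        """Create the endpoint if it does not exist yet, otherwise update it in place."""
        client = boto3.client("sagemaker", region_name=region_name)
        try:
            client.describe_endpoint(EndpointName=endpoint_name)
            exists = True
        except botocore.exceptions.ClientError:
            exists = False

        if exists:
            client.update_endpoint(EndpointName=endpoint_name,
                                   EndpointConfigName=endpoint_config_name)
        else:
            client.create_endpoint(EndpointName=endpoint_name,
                                   EndpointConfigName=endpoint_config_name)
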
Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Check and document that docker-compose >= 1.20 is needed to run breeze

Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Airflow UI should also display dag_concurrency reached

Currently, in the main view, the schedule column box is highlighted in red if the maximum number of DAG runs is reached. In this case no more DAG runs can be started until a DAG run completes.

It should also be displayed in red when dag_concurrency (i.e. the maximum number of concurrent tasks) is reached. In that case too, no more tasks can be started until a task completes. However, there is currently nothing in the UI showing this (currently running 1.10.5).

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

...

Show lineage in visualization


Difficulty: Major
Potential mentors:
Bolke de Bruin, mail: bolke (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add additional quick start to INSTALL

Difficulty: Blocker
Potential mentors:
Bolke de Bruin, mail: bolke (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

...