

Apache APISIX

Apache APISIX is a cloud-native microservices API gateway, delivering the ultimate performance and security on an open-source, scalable platform for all your APIs and microservices.

add commit message checker

To ensure the quality of every commit message, please add a commit message checker, similar to https://github.com/vuejs/vue-next/blob/master/scripts/verifyCommit.js

Difficulty: Minor

mentors: juzhiyuan@apache.org
Potential mentors:
Project Devs, mail: dev (at) apisix.apache.org

Check the API version for every request

In order to make sure the dashboard is using the correct API version, we should include the APISIX version in every API response.

Please

  1. Add the dashboard API version variable in the config file.
  2. Check every API response in the request.ts file, and show an alert when the dashboard version is not compatible with the APISIX version.

Difficulty: Minor

mentors: juzhiyuan@apache.org
Potential mentors:
Project Devs, mail: dev (at) apisix.apache.org

Adding X-API-KEY for api request

The dashboard API uses an API key to authenticate requests; please add the API key header in the global request handler, like [1] for reference. Please note that this key should be provided via a config file, such as an .env file. I recommend fetching this key with the fetchAPIKey API.

[1] b3b3065#diff-084c3d9c2786b7cd963be84e40a38725R32

Difficulty: Minor

mentors: juzhiyuan@apache.org
Potential mentors:
Project Devs, mail: dev (at) apisix.apache.org

Complete the License information for the next dashboard

We are building the next dashboard based on Ant Design Pro, which is an open-source project for building awesome dashboards. We need your help to add the License to the following files under the next branch.

Difficulty: Minor

mentors: juzhiyuan@apache.org
Potential mentors:

Project Devs, mail: dev (at) apisix.apache.org

support etcd cluster in Apache APISIX

The etcd cluster is highly available, so Apache APISIX should allow multiple etcd addresses to be configured to connect to the etcd cluster.

etcd:
  host: 
      - "http://127.0.0.1:2379"   # multiple etcd address
      - "http://127.0.0.1:2380"  
  prefix: "/apisix"               # apisix configurations prefix
  timeout: 3                      # 3 seconds

By the way, the lua-resty-etcd driver already supports this feature.

https://github.com/iresty/lua-resty-etcd/blob/master/t/v2/cluster.t#L52

Difficulty: Minor

mentors: agile6v@apache.org, wenming@apache.org, yousa@apache.org
Potential mentors:
Project Devs, mail: dev (at) apisix.apache.org

implement Apache APISIX echo plugin

APISIX currently provides a simple example plugin, but it does not offer useful functionality.

So we can provide a useful plugin to help users understand as fully as possible how to develop an APISIX plugin.

This plugin could implement the corresponding functionality in the common phases such as init, rewrite, access, balancer, header filter, body filter and log. The specific functionality is still being considered.

Difficulty: Major

mentors: agile6v@apache.org, wenming@apache.org, yousa@apache.org
Potential mentors:
Project Devs, mail: dev (at) apisix.apache.org

feature: Support follow redirect

When a client request passes through APISIX to the upstream and the upstream returns a 301 or 302, APISIX by default returns the response directly to the client. The client receives the 301/302 response and then initiates the request again based on the address specified in the Location header. Sometimes the client wants APISIX to follow the redirect on its behalf, so APISIX could provide this capability to support more scenarios.

Difficulty: Major

mentors: agile6v@apache.org, wenming@apache.org, yousa@apache.org
Potential mentors:
Project Devs, mail: dev (at) apisix.apache.org

Apache IoTDB

Apache IoTDB integration with more powerful aggregation index

IoTDB is a highly efficient time series database, which supports high-speed query processing, including aggregation queries.

Currently, IoTDB pre-calculates the aggregation info, also called summary info (sum, count, max_time, min_time, max_value, min_value), for each page and each chunk. This info is helpful for aggregation operations and some query filters. For example, if the query filter is value > 10 and the max value of a page is 9, we can skip the page. For another example, if the query is select max(value) and the max values of 3 chunks are 5, 10 and 20, then max(value) is 20.

However, there are two drawbacks:

1. The summary info reduces the data that needs to be scanned to 1/k (suppose each page has k data points). However, the time complexity is still O(N). If we store long historical data, e.g., 2 years of data at 500 kHz, the aggregation operation may still be time-consuming. So, a tree-based index that reduces the time complexity from O(N) to O(log N) is a good choice. Some basic ideas have been published in [1], but that approach can only handle data with a fixed frequency, so improving it and implementing it in IoTDB is a good choice.

2. The summary info does not help evaluate a query like where value > 8 if the max value is 10. If we enrich the summary info, e.g., by storing a data histogram, we can use the histogram to estimate how many points will be returned.

This proposal is mainly for adding an index to speed up aggregation queries. Besides, making the summary info more useful would be even better.

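To make the O(log N) idea concrete, here is a minimal, self-contained sketch (plain Java, not IoTDB code) of a binary tree built over per-chunk max summaries; a range-max query then touches only O(log N) tree nodes instead of scanning every chunk summary:

    // Illustration only: a segment tree over per-chunk max summaries.
    public class ChunkMaxIndex {

        private final double[] tree; // 1-based segment tree storage
        private final int size;      // number of chunk summaries

        public ChunkMaxIndex(double[] chunkMax) {
            this.size = chunkMax.length;
            this.tree = new double[2 * size];
            System.arraycopy(chunkMax, 0, tree, size, size);
            for (int i = size - 1; i > 0; i--) {
                tree[i] = Math.max(tree[2 * i], tree[2 * i + 1]);
            }
        }

        /** Max over the chunk range [from, to), visiting O(log N) nodes. */
        public double rangeMax(int from, int to) {
            double best = Double.NEGATIVE_INFINITY;
            for (int l = from + size, r = to + size; l < r; l >>= 1, r >>= 1) {
                if ((l & 1) == 1) best = Math.max(best, tree[l++]);
                if ((r & 1) == 1) best = Math.max(best, tree[--r]);
            }
            return best;
        }

        public static void main(String[] args) {
            ChunkMaxIndex idx = new ChunkMaxIndex(new double[] {5, 10, 20, 7});
            System.out.println(idx.rangeMax(0, 3)); // 20.0, as in the example above
        }
    }

The real index would of course have to be persisted inside TsFile and maintained on flush, but the query-side saving is the same idea.
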
Note that the premise is that the insertion speed should not be slowed down too much!

You should know:
• IoTDB query process
• TsFile structure and organization
• Basic index knowledge
• Java 

difficulty: Major
mentors:
hxd@apache.org

Reference:

[1] https://www.sciencedirect.com/science/article/pii/S0306437918305489
 
 
 

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB website supports static documents and search engine

Apache IoTDB currently uses Vue to develop the website (iotdb.apache.org) and renders the Markdown documents from GitHub on the website using JS.

However, there are two drawbacks now:

  1. If we display documents from GitHub on the website using JS, the Google crawler will never index the content of the documents.
  2. When users read the documents on the website, they may not know where the content is. For example, someone who wants to find the syntax of 'show timeseries' may not know whether it is in chapter 5-1 or 5-4. So, a search engine embedded in the website is a good choice.

You should learn:

  • Vue
  • Other website development technologies.

Mentors:

hxd@apache.org

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB integration with Prometheus

IoTDB is a highly efficient time series database.

Prometheus is a monitoring and alerting toolkit, which supports collecting data from other systems, servers, and IoT devices, saving the data into a DB, visualizing the data, and providing some query APIs.


Prometheus allows users to use their own database, rather than just the Prometheus DB, for storing time series data.

This proposal is for integrating IoTDB with Prometheus.


You should know:

  • How to use Prometheus
  • How to use IoTDB
  • Java and Go language

difficulty: Major

mentors:

hxd@apache.org

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB integration with MiNiFI/NiFi

IoTDB is a database for storing time series data.

MiNiFi is a data flow engine to transfer data from A to B, e.g., from PLC4X to IoTDB.

This proposal is for integrating IoTDB with MiNiFi/NiFi:

  • let MiNiFi/NiFi support writing data into IoTDB.


Difficulty:  major

mentors:

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB Database Connection Pool and integration with some web framework

IoTDB is a time series database.

When using a database in an application, a database connection pool is very helpful for high performance and for saving resources.

Besides, when developing a website using Spring or some other web framework, many developers no longer manage database connections manually. Instead, they just declare which database they will use and the web framework handles everything.

This proposal is for

  • letting IoTDB support some database connection pools like Apache Commons DBCP and C3P0;
  • integrating IoTDB with one web framework (e.g., Spring).


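As a rough sketch of the first item, this is how pooled IoTDB connections could be obtained through Apache Commons DBCP2; the driver class name and JDBC URL below follow the usual IoTDB conventions and should be double-checked against the targeted IoTDB version:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.commons.dbcp2.BasicDataSource;

    public class IoTDBPoolSketch {
        public static void main(String[] args) throws Exception {
            BasicDataSource ds = new BasicDataSource();
            ds.setDriverClassName("org.apache.iotdb.jdbc.IoTDBDriver"); // assumed driver class
            ds.setUrl("jdbc:iotdb://127.0.0.1:6667/");                  // assumed URL format
            ds.setUsername("root");
            ds.setPassword("root");
            ds.setInitialSize(5);  // connections created at startup
            ds.setMaxTotal(20);    // upper bound on pooled connections

            try (Connection conn = ds.getConnection();
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TIMESERIES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
            ds.close();
        }
    }

The Spring integration would then mostly be a matter of exposing such a DataSource as a bean and verifying that JdbcTemplate-style access works against it.
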
You should know:

  • IoTDB
  • At least one DB connection pool
  • Know Spring or some other web framework

mentors:

hxd@apache.org

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB trigger module for streaming computing

IoTDB is a time-series data management system and the data usually comes in a streaming way.

In the IoT area, when a data point arrives, a trigger may need to be called in the following scenarios:

  • (single data point calculation) the data point is an outlier, or the data value reaches a warning threshold. IoTDB needs to publish the data point to those who subscribed to the event.
  • (multiple time series data point calculation) a device sends several metrics to IoTDB, e.g., vehicle d1 sends its average speed and running time to IoTDB. Users may then want to get the mileage of the vehicle (speed x time). IoTDB needs to calculate the result and save it to another time series.
  • (time window calculation) a device reports its temperature every second. Though the temperature is not too high, if it keeps increasing for 5 seconds, IoTDB needs to report the event to those who subscribe to it.


As there are many streaming computing projects already, we can integrate one of them into IoTDB.

  • If IoTDB runs on the edge, we can integrate Apache StreamPipes or Apache Edgent.
  • If IoTDB runs on a server, the above also work, and Apache Flink is also a good choice.

The process is:

  • A user registers a trigger in IoTDB.
  • When a data point comes, IoTDB saves it and checks whether there are triggers on it.
  • If so, IoTDB calls a streaming computing framework to do something.

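A purely hypothetical sketch of what a registered trigger might look like; the interface and names below are illustrations only, since designing the real trigger API is part of this project:

    // Hypothetical trigger API, not existing IoTDB code.
    interface Trigger {
        /** Called by the storage engine after a data point is written. */
        void onDataPoint(String timeseries, long timestamp, double value);
    }

    public class ThresholdAlertTrigger implements Trigger {
        private final double threshold;

        public ThresholdAlertTrigger(double threshold) {
            this.threshold = threshold;
        }

        @Override
        public void onDataPoint(String timeseries, long timestamp, double value) {
            if (value > threshold) {
                // Here the trigger could hand the event to Flink/StreamPipes/Edgent,
                // or publish it to subscribers of the warning event.
                System.out.printf("ALERT %s value=%.2f at %d%n", timeseries, value, timestamp);
            }
        }
    }
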

You may need to know:

  • At least one streaming computing project.
  • SQL parser or some other DSL parser tool.

You will have to modify the source code of the IoTDB server engine module.

Difficulty: A little hard

mentors:

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

A complete Apache IoTDB JDBC driver and integration with JDBC driver based tools (DBeaver and Apache Zeppelin)

Apache IoTDB is a database for time series data management that is written in Java. It provides a SQL-like query language and a JDBC driver for users. The current IoTDB JDBC driver implements some important interfaces of Statement, Connection, ResultSet, etc., which works well for most users' requirements.

However, there are many tools that can integrate with a database if the database has a standard JDBC driver, e.g., DBeaver, Apache Zeppelin, Tableau, etc.


This proposal is for implementing a standard JDBC driver for IoTDB, and using the driver to integrate with DBeaver and Apache Zeppelin.

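The reason a standard driver matters is that tools such as DBeaver discover schemas purely through the java.sql interfaces. A small sketch of the kind of calls they issue (the driver class, URL and credentials are the usual IoTDB defaults and may differ; the current driver may not implement all of these methods yet, which is exactly the gap this project closes):

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    public class MetadataProbe {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.iotdb.jdbc.IoTDBDriver"); // assumed driver class name
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:iotdb://127.0.0.1:6667/", "root", "root")) {
                DatabaseMetaData meta = conn.getMetaData();
                System.out.println(meta.getDatabaseProductName());
                // DBeaver and Zeppelin's JDBC interpreter build their schema tree
                // from calls like this one.
                try (ResultSet tables = meta.getTables(null, null, "%", null)) {
                    while (tables.next()) {
                        System.out.println(tables.getString("TABLE_NAME"));
                    }
                }
            }
        }
    }
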

Because Apache Zeppelin supports customized Interpreter, we can also implement an IoTDB interpreter for integration with Zeppelin.


You should know:

  • how JDBC works.
  • how to use the IoTDB session API.
  • the Zeppelin Interpreter interface.


Difficulty: Major 

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache Fineract

Upgrade Fineract 1.x to Java 11 and Upgrade Dependencies to latest versions

Upgrade Fineract 1.x from Java 8 to Java 11 so we can start using the latest LTS Java version and features.

This will also require you to upgrade other Fineract 1.x dependencies from their current versions to the latest possible versions.

Difficulty: Major
Potential mentors:
Awasum Yannick, mail: awasum (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

RocketMQ

RocketMQ Connect Hive

Content

The Hive sink connector allows you to export data from Apache RocketMQ topics to HDFS files in a variety of formats and integrates with Hive to make the data immediately available for querying with HiveQL. The connector periodically polls data from RocketMQ and writes it to HDFS.

The data from each RocketMQ topic is partitioned by the provided partitioner and divided into chunks. Each chunk of data is represented as an HDFS file with topic, queueName, start and end offsets of this data chunk in the filename.

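As a small illustration of the naming scheme described above (the exact separator and file extension are up to the implementation):

    // Illustration only: derive an HDFS file name for one data chunk from the
    // topic, queue id and the start/end offsets covered by that chunk.
    public class ChunkFileName {
        static String of(String topic, int queueId, long startOffset, long endOffset) {
            return String.format("%s-%d-%d-%d.parquet", topic, queueId, startOffset, endOffset);
        }

        public static void main(String[] args) {
            System.out.println(of("orders", 3, 1000, 1999)); // orders-3-1000-1999.parquet
        }
    }
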
So, in this project, you need to implement a Hive sink connector based on the OpenMessaging Connect API, and run it on the RocketMQ Connect runtime.

You should learn before applying for this topic
Hive/Apache RocketMQ/Apache RocketMQ Connect/ OpenMessaging Connect API

Mentor

chenguangsheng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect Hbase

Content

The HBase sink connector allows moving data from Apache RocketMQ to HBase. It writes data from a topic in RocketMQ to a table in the specified HBase instance. Auto-creation of tables and auto-creation of column families are also supported.

So, in this project, you need to implement an HBase sink connector based on the OpenMessaging Connect API, which will execute on the RocketMQ Connect runtime.

You should learn before applying for this topic
Hbase/Apache RocketMQ/Apache RocketMQ Connect/ OpenMessaging Connect API

Mentor

chenguangsheng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect Cassandra

Content

The Cassandra sink connector allows writing data to Apache Cassandra. In this project, you need to implement a Cassandra sink connector based on the OpenMessaging Connect API, and run it on the RocketMQ Connect runtime.

You should learn before applying for this topic
Cassandra/Apache RocketMQ (https://rocketmq.apache.org/)/Apache RocketMQ Connect (https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect)/ OpenMessaging Connect API

Mentor

duhengforever@apache.org, vongosling@apache.org

 
 

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect InfluxDB

Content

The InfluxDB sink connector allows moving data from Apache RocketMQ to InfluxDB. It writes data from a topic in Apache RocketMQ to InfluxDB, while the InfluxDB source connector is used to export data from an InfluxDB server to RocketMQ.

In this project, you need to implement an InfluxDB sink connector (the source connector is optional) based on the OpenMessaging Connect API.

You should learn before applying for this topic

InfluxDB/Apache RocketMQ (https://rocketmq.apache.org/)/Apache RocketMQ Connect (https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect)/ OpenMessaging Connect API

Mentor

duhengforever@apache.org, wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

The Operator for RocketMQ Exporter

The exporter exposes an endpoint for monitoring data collection to the Prometheus server in the form of an HTTP service. The Prometheus server can obtain the monitoring data to be collected by accessing the endpoint provided by the exporter. RocketMQ Exporter is such an exporter. It first collects data from the RocketMQ cluster, and then normalizes the collected data to meet the requirements of the Prometheus system with the help of the third-party client library provided by Prometheus. Prometheus regularly pulls data from the exporter. This topic requires implementing an operator for the RocketMQ exporter to facilitate deploying the exporter on the Kubernetes platform.

You should learn before applying for this topic

RocketMQ-Exporter Repo
RocketMQ-Exporter Overview
Kubernetes Operator
RocketMQ-Operator

Mentor

wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect IoTDB

Content

The IoTDB sink connector allows moving data from Apache RocketMQ to IoTDB. It writes data from a topic in Apache RocketMQ to IoTDB.

IoTDB (Internet of Things Database) is a data management system for time series data, which can provide users with specific services such as data collection, storage and analysis. Due to its lightweight structure, high performance and usable features, together with its seamless integration with the Hadoop and Spark ecosystems, IoTDB meets the requirements of massive dataset storage, high-throughput data input and complex data analysis in the industrial IoT field.

In this project, there may be update operations on historical data, so it is necessary to ensure sequential transmission and consumption of data via RocketMQ. If no update operations are used, there is no need to guarantee the order of the data; IoTDB will process the data even if it arrives out of order.

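For the ordered case, the RocketMQ client already provides an orderly listener. A hedged sketch of how the sink side could consume messages in per-queue order (topic, consumer group and name server address are placeholders):

    import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
    import org.apache.rocketmq.client.consumer.listener.ConsumeOrderlyStatus;
    import org.apache.rocketmq.client.consumer.listener.MessageListenerOrderly;
    import org.apache.rocketmq.common.message.MessageExt;

    public class OrderedIoTDBSinkSketch {
        public static void main(String[] args) throws Exception {
            DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("iotdb-sink-group");
            consumer.setNamesrvAddr("127.0.0.1:9876");
            consumer.subscribe("sensor-data", "*");
            // An orderly listener consumes each queue one message at a time,
            // preserving the per-queue order needed when updates are involved.
            consumer.registerMessageListener((MessageListenerOrderly) (msgs, context) -> {
                for (MessageExt msg : msgs) {
                    // write msg.getBody() into IoTDB here, e.g. via the session API
                }
                return ConsumeOrderlyStatus.SUCCESS;
            });
            consumer.start();
        }
    }
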
So, in this project, you need to implement an IoTDB sink connector based on the OpenMessaging Connect API, and run it on the RocketMQ Connect runtime.

You should learn before applying for this topic

IoTDB/Apache RocketMQ (https://rocketmq.apache.org/)/Apache RocketMQ Connect (https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect)/ OpenMessaging Connect API

Mentor

hxd@apache.org, duhengforever@apache.org, wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect Elasticsearch

Content

The Elasticsearch sink connector allows moving data from Apache RocketMQ to Elasticsearch 6.x and 7.x. It writes data from a topic in Apache RocketMQ to an index in Elasticsearch, and all data for a topic have the same type.

Elasticsearch is often used for text queries, analytics and as a key-value store (use cases). The connector covers both the analytics and key-value store use cases.

For the analytics use case, each message in RocketMQ is treated as an event and the connector uses topic+message queue+offset as a unique identifier for events, which are then converted to unique documents in Elasticsearch. For the key-value store use case, it supports using keys from RocketMQ messages as document ids in Elasticsearch and provides configurations ensuring that updates to a key are written to Elasticsearch in order.

So, in this project, you need to implement a sink connector based on the OpenMessaging Connect API, which will be executed on the RocketMQ Connect runtime.

You should learn before applying for this topic

Elasticsearch/Apache RocketMQ (https://rocketmq.apache.org/)/Apache RocketMQ Connect (https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect)/ OpenMessaging Connect API

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ CLI Admin Tool Developed by Golang

Apache RocketMQ provides a CLI admin tool developed in Java for querying, managing and diagnosing various problems. It also provides a set of API interfaces which can be called by Java applications to create, delete and query topics, query messages, and perform other functions. This topic requires implementing a CLI management tool and a set of API interfaces in the Go language, through which Go applications can create and query topics and perform other operations.

You should learn before applying for this topic

Apache RocketMQ
Apache RocketMQ Go Client

Mentor

wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Schema Registry

Content

In order to help RocketMQ improve its event management capabilities, and at the same time better decouple producers and receivers and keep events forward compatible, we need a service for event metadata management called a schema registry.

Schema registry will provide a GraphQL interface for developers to define standard schemas for their events, share them across the organization and safely evolve them in a way that is backward compatible and future proof.

You should learn before applying for this topic

Apache RocketMQ/Apache RocketMQ SDK/

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

CloudEvents support for RocketMQ

Context

Events are everywhere. However, event producers tend to describe events differently.

The lack of a common way of describing events means developers must constantly re-learn how to consume events. This also limits the potential for libraries, tooling and infrastructure to aid the delivery of event data across environments, like SDKs, event routers or tracing systems. The portability and productivity we can achieve from event data is hindered overall.

CloudEvents is a specification for describing event data in common formats to provide interoperability across services, platforms and systems.
RocketMQ, as an event streaming platform, also hopes to improve the interoperability of different event platforms by being compatible with the CloudEvents standard and supporting the CloudEvents SDK. In this topic, you need to improve the binding spec and implement the RocketMQ CloudEvents SDK (in Java, Golang or other languages).

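A minimal sketch of the "binary" content mode such a binding spec could define, where the CloudEvents context attributes become RocketMQ user properties and the event data becomes the message body (the ce_ property prefix here is an assumption, not an agreed spec):

    import org.apache.rocketmq.common.message.Message;

    public class CloudEventToRocketMQ {
        static Message toMessage(String topic, String id, String source,
                                 String type, byte[] data) {
            Message msg = new Message(topic, data);
            // Required CloudEvents 1.0 context attributes mapped to user properties.
            msg.putUserProperty("ce_specversion", "1.0");
            msg.putUserProperty("ce_id", id);
            msg.putUserProperty("ce_source", source);
            msg.putUserProperty("ce_type", type);
            return msg;
        }
    }
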
You should learn before applying for this topic

Apache RocketMQ/Apache RocketMQ SDK/CloudEvents

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Channel for Knative

Context

Knative is a Kubernetes-based platform for building, deploying and managing modern serverless applications. Knative provides a set of middleware components that are essential to building modern, source-centric, and container-based applications that can run anywhere: on-premises, in the cloud, or even in a third-party data centre. Knative consists of the Serving and Eventing components. Eventing is a system designed to address a common need for cloud-native development; it provides composable primitives to enable late-binding event sources and event consumers. Eventing also defines an event forwarding and persistence layer, called a Channel. Each channel is a separate Kubernetes Custom Resource. This topic requires you to implement a RocketMQChannel based on Apache RocketMQ.

You should learn before applying for this topic

How Knative works
RocketMQSource for Knative
Apache RocketMQ Operator

Mentor

wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Ingestion for Druid

Context

Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets. In this topic, you should develop a RocketMQ indexing service that enables the configuration of supervisors on the Overlord, which facilitates ingestion from RocketMQ by managing the creation and lifetime of RocketMQ indexing tasks. These indexing tasks read events using RocketMQ's own partition and offset mechanism. The supervisor oversees the state of the indexing tasks to coordinate handoffs, manage failures, and ensure that the scalability and replication requirements are maintained.

You should learn before applying for this topic

Apache Druid Data Ingestion

Mentor

vongosling@apache.org, duhengforever@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Connect Hudi

Context

Hudi can ingest and manage the storage of large analytical datasets over DFS (HDFS or cloud stores). It can act as either a source or a sink for a stream processing platform such as Apache RocketMQ. It can also be used as a state store inside a processing DAG (similar to how RocksDB is used by Flink). This is an item on the Apache RocketMQ roadmap. In this project, you should implement a full Hudi source and sink based on the RocketMQ Connect framework, which is one of the most important implementations of OpenConnect.

You should learn before applying for this topic

Apache RocketMQ Connect Framework
Apache Hudi

Mentor

vongosling@apache.org, duhengforever@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Connect Flink

Context

There are many ways that Apache Flink and Apache RocketMQ can integrate to provide elastic data processing at a large scale. RocketMQ can be used as a streaming source and streaming sink in Flink DataStream applications, which is the main implementation and most popular usage in the RocketMQ community. Developers can ingest data from RocketMQ into a Flink job that performs computations and processes real-time data, and then send the data back to a RocketMQ topic as a streaming sink. For more details, see https://github.com/apache/rocketmq-externals/tree/master/rocketmq-flink.

With more and more DW or OLAP engineers using RocketMQ for their data processing work, another potential integration need arose. Developers can take advantage of RocketMQ as both a streaming source and a streaming table sink for Flink SQL or Table API queries. Also, Flink 1.9.0 makes the Table API a first-class citizen. It's time to support SQL in RocketMQ. This is the topic for Apache RocketMQ Connect Flink.

You should learn before applying for this topic

Apache RocketMQ Flink Connector
Apache Flink Table API

Extension

For students with expertise in the streaming field, you could continue to implement and provide an exactly-once streaming source and an at-least-once (or exactly-once) streaming sink, as described in issue #500.

Mentor

nicholasjiang@apache.org, duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Scaler for KEDA

Context

KEDA allows for fine-grained autoscaling (including to/from zero) for event-driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition. KEDA has a number of “scalers” that can both detect if a deployment should be activated or deactivated, and feed custom metrics for a specific event source. In this topic, you need to implement the RocketMQ scalers.

You should learn before applying for this topic

Helm/Apache RocketMQ Operator/Apache RocketMQ Docker Image
Apache RocketMQ multi-replica mechanism(based on DLedger)
How KEDA works

Mentor

wlliqipeng@apache.org, vongosling@apache.org


Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Camel

camel-pulsar : Support asynchronous processing

Currently, Pulsar messages are processed synchronously. Add support for asynchronous processing using the asynchronous routing engine.


Difficulty: Major
Potential mentors:
Masa, mail: Horiyama (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel component options - Favour annotation based options

We should favour annotation-based options on component classes, e.g. with @Metadata, so we mark up only the options that really are options, as other delegates and getters/setters may otherwise get mixed up.

Then in the future we will drop support and only require marked up options, just like endpoints where you must use @UriParam etc.

At first we can make our tool log a WARN and then we can see how many of our own components suffer from this.

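For illustration, a component option marked up with @Metadata might look like the sketch below (package names are for Camel 3; Camel 2 uses org.apache.camel.impl.DefaultComponent):

    import java.util.Map;

    import org.apache.camel.Endpoint;
    import org.apache.camel.spi.Metadata;
    import org.apache.camel.support.DefaultComponent;

    public class FooComponent extends DefaultComponent {

        @Metadata(label = "security", secret = true,
                  description = "API token used to authenticate requests")
        private String token;

        public String getToken() {
            return token;
        }

        public void setToken(String token) {
            this.token = token;
        }

        @Override
        protected Endpoint createEndpoint(String uri, String remaining,
                                          Map<String, Object> parameters) throws Exception {
            throw new UnsupportedOperationException("illustration only");
        }
    }

Unannotated getters/setters (delegates, internal wiring) would then simply be ignored by the tooling.
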
Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Introduce a SPI to automatic bind data format to transports

Some data formats, such as the future CloudEvents one (https://issues.apache.org/jira/browse/CAMEL-13335), have specifications that describe how to bind them to specific transports (https://github.com/cloudevents/spec), so we should introduce an SPI to make this binding automatic, so that in a route like:

            from("undertow://http://0.0.0.0:8080")
                .unmarshal().cloudEvents()
            .to("kafka:my-topic");
            

the exchange gets automatically translated to a Kafka message according to the CloudEvent binding specs for Kafka.

Difficulty: Minor
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-microprofile-opentracing

A camel module for this spec
https://github.com/eclipse/microprofile-opentracing

Its likely using the existing camel-opentracing and then implement the spec API and use smallrye implementation
https://github.com/smallrye/smallrye-opentracing

Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Support for OpenTelemetry

OpenTelemetry is becoming more and more relevant and it would be nice to support it in Camel.

Difficulty: Minor
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-minio - Component to store/load files from blob store

min.io is an S3-like blob store, so users have more freedom than being locked into AWS.

We can create a camel-minio component for it
https://github.com/minio/minio-java

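A sketch of how such a component might be used once it exists; the URI scheme and option names are hypothetical and only illustrate the idea:

    import org.apache.camel.builder.RouteBuilder;

    public class MinioBackupRoute extends RouteBuilder {
        @Override
        public void configure() {
            // Hypothetical endpoint: camel-minio does not exist yet, so the scheme
            // and options below are a possible design, not a real component.
            from("file:data/outbox")
                .to("minio:my-bucket?endpoint=http://localhost:9000"
                    + "&accessKey=RAW({{minio.accessKey}})"
                    + "&secretKey=RAW({{minio.secretKey}})");
        }
    }
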
Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Camel grpc component doesn't transfer the Message headers

Headers that are added to the Message in the Camel Exchange before making a call to the camel-grpc component are not received at the gRPC consumer. The expectation is that these headers would be added to the grpcStub before sending over the wire (as other components such as http4 do).

Our team has come up with a workaround for this but it is extremely cumbersome. We had to extend the GrpcProducer to introduce a custom GrpcExchangeForwarder that would copy headers from the exchange to the stub before invoking the sync/async method.

On the consumer side we had to extend the GrpcConsumer to use a custom ServerInterceptor to capture the gRPC headers and a custom MethodHandler to transfer the gRPC headers to the Camel exchange headers.

Difficulty: Major
Potential mentors:
Vishal Vijayan, mail: vijayanv (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-snmp - Support for multiple security mechanisms in SNMP v3

Allow adding multiple users for SNMP v3, i.e. the SnmpTrapConsumer should support multiple combinations of authentication and privacy protocols and different passphrases. We cannot have a route per security mechanism.


Consider the below scenario.

I have multiple SNMP devices which have multiple authentication protocols and privacy protocols with different passphrases. Moreover, they can send any version of SNMP traps from v1 to v3. I must be able to configure those in a properties file or a DSL (i.e. the snmp version, the USM users etc).

Example:


            snmp.getUSM().addUser(
            new OctetString("MD5DES"),
            new UsmUser(new OctetString("MD5DES"),
            AuthMD5.ID,
            new OctetString("UserName"), PrivDES.ID,
            new OctetString("PasswordUser")));
            snmp.getUSM().addUser(
            new OctetString("MD5DES"),
            new UsmUser(new OctetString("MD5DES"),
            null, null, null,
            null));
             
            

.. other users with different auth, priv mechanisms (i.e. different security mechanisms). I must be able to receive traps from all of them.

Difficulty: Minor
Potential mentors:
Gowtham Gutha, mail: gowthamgutha (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

java-dsl - Add support for method references bean::methodName

Hi

This is not related only to spring integration.
I would like to be able to use a Spring service-annotated class or bean directly from a route, but without using the method name as a string, i.e. .bean(instance, "<method name>"), and instead use a method reference: .bean(instance::method)

But why?
1. Not being able to navigate quickly to (open) that method from the IDE; intermediary steps are needed to reach it.
2. Camel uses reflection internally to call that method.
3. Not being able to rename the method without breaking the route.
4. Not being able to quickly see (Alt+F7) who calls a method in the IDE.
5. Using strings to reference a method when we have method references seems not right.

As a workaround I had to add a helper class to simulate passing of method references and then internally to translate to method.

In case it helps explaining I am attaching the helper Bean.java class (you can use it for free or to do better).

You can use the class in any route like this:

from (X)
.bean(call(cancelSubscriptionService::buildSalesforceCase))
.to(Y)
.routeId(Z);

As you see I am forced to use the intermediary helper 'call' in order to translate to an Expression.
I would like to not have to use my helper and have the support built directly into Camel if possible. Let me know if there is a better solution to my problem.

Thanks

Difficulty: Major
Potential mentors:
Cristian Donoiu, mail: doncristiano (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-restdsl-swagger-plugin - create camel routes for generated rest DSL

camel-restdsl-swagger-plugin can generate CamelRoutes.java from a Swagger / OpenAPI spec, which includes the REST DSL with to("direct:restN") stubs. It would be nice if it also autogenerated the equivalent from("direct:restN").log() placeholders to help jump-start coding.

Difficulty: Major
Potential mentors:
Scott Cranton, mail: scranton (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Expose OData4 based service as consumer endpoint

Right now, only a polling consumer is available for the olingo4 component. It would be better to have a real listening consumer for this.

The method may have a name like 'listen' to be able to create a listening consumer.

Difficulty: Major
Potential mentors:
Dmitry Volodin, mail: dmvolod (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Ability to load an SSLContextParameter with a Keystore containing multiple keys (aliases)

Hello,
I wish I could initialize a single SSLContextParameters at Camel startup containing my truststore.jks (> 1 alias) and my keystore.jks (> 1 alias) in order to refer to it in routes (FTPS, HTTPS) without having to redefine a new SSLContextParameters for each endpoint.

<camel:sslContextParameters id="sslIContextParameters">
<camel:trustManagers>
<camel:keyStore password="${truststore.jks.file.password}"
resource="${truststore.jks.file.location}" />
</camel:trustManagers>
<camel:keyManagers >
<camel:keyStore password="${keystore.jks.file.password}"
resource="${keystore.jks.file.location}" />
</camel:keyManagers>
</camel:sslContextParameters>

When my Keystore contains more than 1 alias, I have the following error when creating the Route at startup : 

Caused by: org.apache.camel.ResolveEndpointFailedException: Failed to resolve endpoint: https4://<host>:<port>/<address>?authPassword=RAW(password)&authUsername=login&authenticationPreemptive=true&bridgeEndpoint=true&sslContextParameters=sslContextParameters&throwExceptionOnFailure=true due to: Cannot recover key

due to

Caused by: java.security.UnrecoverableKeyException: Cannot recover key


When my keystore contains only one key, it works very well.

<camel:sslContextParameters id="sslIContextParameters">
<camel:trustManagers>
<camel:keyStore password="${truststore.jks.file.password}"
resource="${truststore.jks.file.location}" />
</camel:trustManagers>
<camel:keyManagers keyPassword="keyPassword">
<camel:keyStore password="${keystore.jks.file.password}"
resource="${keystore.jks.file.location}" />
</camel:keyManagers>
</camel:sslContextParameters>


So I would like to be able to use my SSLContextParameters for different endpoints by specifying (if necessary) the alias of the keystore entry needed (by specifying the alias and/or password of the key).


Objective in my project:

  • 1 TrustStore.jks
  • 1 Keystore.jks
  • 1 unique SSLContextParameters
  • > 200 Camel routes FTPS/HTTPS (SSL one-way or two-way)


Thanks a lot



Difficulty: Major
Potential mentors:
Florian B., mail: Boosy (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Add tool to generate swagger doc at build time

We do not have at this moment a tool that can generate the swagger doc at build time. However, I think it would be possible to develop such a tool. We have existing tooling that parses the Java or XML source code (Camel routes), which we use for validating endpoints, doing route-coverage reports, etc.
 
https://github.com/apache/camel/blob/master/tooling/maven/camel-maven-plugin/src/main/docs/camel-maven-plugin.adoc
 
We could then make that tool parse the rest-dsl and build up that model behind the scene and feed that into the swagger-java library for it to spit out the generated swagger doc.
 
We could make it as a goal on the existing camel-maven-plugin, or build a new maven plugin: camel-maven-swagger or something. Then people could use it during build time to generate the swagger doc etc. 
 
We maybe should allow to override/configure things as well from the tooling, so you can maybe set/mask hostname or set descriptions or other things that may not all be detailed in the rest-dsl.

Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Add call options to the camel-grpc component

Add advanced call options that relate to a single operation and do not override channel options, for example (see the sketch below):

  • deadline
  • compression
  • etc.
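A rough sketch of what applying such per-call options looks like with the plain gRPC API; how these get exposed as camel-grpc endpoint options is exactly what this task would define:

    import java.util.concurrent.TimeUnit;

    import io.grpc.stub.AbstractStub;

    public class CallOptionsSketch {
        // Works for any generated stub type: options are applied per call on the
        // stub rather than on the channel.
        static <T extends AbstractStub<T>> T withCallOptions(T stub) {
            return stub.withDeadlineAfter(5, TimeUnit.SECONDS)
                       .withCompression("gzip");
        }
    }
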
Difficulty: Major
Potential mentors:
Dmitry Volodin, mail: dmvolod (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Upgrade to JUnit 5

See http://junit.org/junit5/

Note: it provides a junit-vintage module so we should be able to migrate stuff easily.

Most users should now be able to write JUnit 5 tests using the modules created in CAMEL-13342.
Concerning the migration of Camel's own tests to JUnit 5, the last blocker is that migrating flaky tests to JUnit 5 is not handled until Maven Surefire 3 has been released or until open discussions in the JUnit team have converged.

Difficulty: Major
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Camel website

This is an issue to track the work on the new Camel website.

If you wish to contribute to building the new Camel website please look in the website component issues labelled with help-wanted.

Difficulty: Major
Potential mentors:
Zoran Regvart, mail: zregvart (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Create a component for Kafka-Stream


Difficulty: Minor
Potential mentors:
Andrea Cosentino, mail: acosentino (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Create a camel component for etcd v3

Difficulty: Minor
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Beam

BeamSQL aggregation analytics functionality

Mentor email: ruwang@google.com. Feel free to send emails for your questions.

Project Information
---------------------
BeamSQL has a long list of aggregation/aggregation analytics functionalities to support.

To begin with, you will need to support this syntax:

            analytic_function_name ( [ argument_list ] )
            OVER (
              [ PARTITION BY partition_expression_list ]
              [ ORDER BY expression [ { ASC | DESC } ] [, ...] ]
              [ window_frame_clause ]
            )

As there is a long list of analytics functions, a good starting point is to support rank() first.

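For orientation, this is the kind of Java pipeline that should work once the parser and physical operator support analytic functions; today BeamSQL would reject the OVER clause, and PCOLLECTION is the implicit table name of the input:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    public class RankOverExample {
        public static void main(String[] args) {
            Schema schema = Schema.builder()
                .addStringField("item")
                .addInt64Field("price")
                .build();
            Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
            PCollection<Row> rows = p.apply(Create.of(
                    Row.withSchema(schema).addValues("a", 3L).build(),
                    Row.withSchema(schema).addValues("b", 5L).build())
                .withRowSchema(schema));
            // Target query for this project: a rank per row ordered by price.
            rows.apply(SqlTransform.query(
                "SELECT item, RANK() OVER (ORDER BY price) AS r FROM PCOLLECTION"));
            p.run().waitUntilFinish();
        }
    }
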
This will require touching core components of BeamSQL:
1. the SQL parser, to support the syntax above;
2. the SQL core, to implement the physical relational operator;
3. distributed algorithms, to implement a list of functions in a distributed manner;
4. benchmarks, to measure the performance of your implementation.

To understand what SQL analytics functionality is, you could check this great explanation doc: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts.

To know about Beam's programming model, check: https://beam.apache.org/documentation/programming-guide/#overview

Difficulty: Major
Potential mentors:
Rui Wang, mail: amaliujia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Implement Nexmark (benchmark suite) in Python and integrate it with Spark and Flink runners

Apache Beam [1] is a unified and portable programming model for data processing jobs (pipelines). The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

Nexmark [5] is a benchmark for streaming jobs, basically a set of jobs (queries) to test different use cases of the execution system. Beam implemented Nexmark for Java [6, 7] and it has been successfully used to improve the features of multiple Beam runners and discover performance regressions.

Thanks to the work on portability [8] we can now run Beam pipelines on top of open source systems like Apache Spark [9] and Apache Flink [10]. The goal of this issue/project is to implement the Nexmark queries in Python and configure them to run on our CI on top of open source systems like Apache Spark and Apache Flink, so that they help the project track and improve the evolution of portable open source runners and our Python implementation as we do for Java.

Because of the time constraints of GSoC we will adjust the goals (sub-tasks) depending on progress.

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
[5] https://web.archive.org/web/20100620010601/http://datalab.cs.pdx.edu/niagaraST/NEXMark/
[6] https://beam.apache.org/documentation/sdks/java/testing/nexmark/
[7] https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark
[8] https://beam.apache.org/roadmap/portability/
[9] https://spark.apache.org/
[10] https://flink.apache.org/

Difficulty: Minor
Potential mentors:
Ismaël Mejía, mail: iemejia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Implement an Azure blobstore filesystem for Python SDK

This is similar to BEAM-2572, but for Azure's blobstore.

Difficulty: Major
Potential mentors:
Pablo Estrada, mail: pabloem (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Apache Airflow

UI - Show count of tasks in each dag on the main dags page

The main DAGs page in the UI would benefit from showing a new column: the number of tasks for each DAG id.

Difficulty: Minor
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

One of Celery executor tests is flaky

tests/executors/test_celery_executor.py::TestCeleryExecutor::test_celery_integration_0_amqp_guest_guest_rabbitmq_5672


Log attached.

Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

add GDriveToGcsOperator

There is a GcsToGDriveOperator but there isn't an equivalent in the other direction.



Difficulty: Major
Potential mentors:
lovk korm, mail: lovk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

prevent autocomplete of username in login UI

The login page of the UI has autocomplete enabled for the username field. This should be disabled for security.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Mock Cassandra in tests

Cassandra consumes 1.173 GiB of memory. Travis does not have very efficient machines, so we should limit system/integration tests of components that do not require much attention, e.g. because they are not changed often. Cassandra is a good candidate for this. This will allow the machine power to be used for more needed work.

            CONTAINER ID        NAME                                  CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
            8aa37ca50f7c        ci_airflow-testing_run_1f3aeb6d1052   0.00%     5.715MiB / 3.855GiB   0.14%     1.14kB / 0B       2.36MB / 0B       2
            f2b3be15558f        ci_cassandra_1                        0.69%     1.173GiB / 3.855GiB   30.42%    2.39kB / 0B       75.3MB / 9.95MB   50
            ef1de3981ca6        ci_krb5-kdc-server_1                  0.02%     12.15MiB / 3.855GiB   0.31%     2.46kB / 0B       18.9MB / 184kB    4
            be808233eb91        ci_mongo_1                            0.31%     36.71MiB / 3.855GiB   0.93%     2.39kB / 0B       43.2MB / 19.1MB   24
            667e047be097        ci_rabbitmq_1                         0.77%     69.95MiB / 3.855GiB   1.77%     2.39kB / 0B       43.2MB / 508kB    92
            2453dd6e7cca        ci_postgres_1                         0.00%     7.547MiB / 3.855GiB   0.19%     1.05MB / 889kB    35.4MB / 145MB    6
            78050c5c61cc        ci_redis_1                            0.29%     1.695MiB / 3.855GiB   0.04%     2.46kB / 0B       6.94MB / 0B       4
            c117eb0a0d43        ci_mysql_1                            0.13%     452MiB / 3.855GiB     11.45%    2.21kB / 0B       33.9MB / 548MB    21
            131427b19282        ci_openldap_1                         0.00%     45.68MiB / 3.855GiB   1.16%     2.64kB / 0B       32.8MB / 16.1MB   4
            8c2549c010b1        ci_docker_1                           0.59%     22.06MiB / 3.855GiB   0.56%     2.39kB / 0B       95.9MB / 291kB    30
Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Pool name - increase length > 50, make cli give error if too large

Create some pool names (using the CLI) with 70 or 80 character length.


1. The UI does not allow creating names longer than 50 characters, but why does the CLI?

Click on one of the pool names listed (the link is cut to the 50-character name: https://domain:8080/admin/airflow/task?flt1_pool_equals=qjfdal_CRCE_INTERCONNECTION_FORECAST_TNC_EJFLSA_LP)

If you click 'edit', it shows the full 80 characters in Description but only 50 characters in Pool.

2. Why limit the length to 50 at all? It should be increased, say to 256.

3. If trying to create a really long name (longer than the model's length), the CLI should give an error.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add SalesForce connection to UI

Airflow has SalesForceHook but it doesn't have a distinct connection.

In order to create a Connection one must expose its secret token as text:

https://stackoverflow.com/questions/53510980/salesforce-connection-using-apache-airflow-ui

Also, it's not very intuitive that the Conn Type should remain blank.

It would be easier and more user-friendly if there were a Salesforce connection type in the UI with an encrypted security_token field.

Difficulty: Major
Potential mentors:
Elad, mail: eladk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

dag_processor_manager/webserver/scheduler logs should be created under date folder

DAG-level logs are written under separate date folders. This is great because the old dates are not 'modified/accessed', so they can be easily purged by utilities like tmpwatch.

This JIRA is about making other logs (such as dag_processor_manager/webserver/scheduler, etc.) go under separate date folders to allow easy purging. The log from redirecting 'airflow scheduler' to stdout grows by over 100 MB a day in my environment.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Request for OktopostToGoogleStorageOperator

Difficulty: Major
Potential mentors:
HaloKu, mail: HaloKu (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

clear cli command needs a 'conf' option

Key-value pairs of conf can be passed into the trigger_dag command, i.e.
--conf '{"ric":"amzn"}'

The clear command needs this feature too,

i.e. in case the execution date is important and there was a failure halfway through the 1st DAG run due to bad conf being sent on the trigger_dag command, and you want to run the same execution date but with new conf on a 2nd DAG run.

An alternative solution would be a new delete_dag_run CLI command, so you never need to 'clear' but can do a 2nd DagRun for the same execution date.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Improve multiline output in admin gui

Multiline attributes, rendered templates, or Xcom variables are not well supported in the admin GUI at present. Any values are treated as native HTML text() blocks and as such all formatting is lost. When passing structured data such as YAML in these variables, it makes a real mess of them.

Ideally, these values should keep their line-breaks and indentation.

This should only require having these code blocks wrapped in a <pre> block or setting `white-space: pre` on the class for the block.

Difficulty: Major
Potential mentors:
Paul Rhodes, mail: withnale (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Snowflake Connector cannot run more than one sql from a sql file

I am getting an error when passing in a SQL file with multiple SQL statements to snowflake operator

            snowflake.connector.errors.ProgrammingError: 000006 (0A000): 01908236-01a3-b2c4-0000-f36100052686: Multiple SQL statements
            in a single API call are not supported; use one API call per statement instead.
            

It only fails if you pass a file with multiple statements. A file with just one statement, or a list of statements passed to the operator, works fine.

After looking at the current Snowflake operator implementation, it seems that a list of SQL statements works because the operator executes one statement at a time, whereas multiple statements in a SQL file fail because all of them are read as one continuous string.


How can we fix this:

There is an API call in Snowflake python connector that supports multiple SQL statements.

https://docs.snowflake.net/manuals/user-guide/python-connector-api.html#execute_string

This can be fixed by overriding the run function in the Snowflake hook to support multiple SQL statements in a file.

Difficulty: Major
Potential mentors:
Saad, mail: saadk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

SageMakerEndpointOperator is not idempotent

The SageMakerEndpointOperator currently takes an argument "operation" with value "create"/"update" which determines whether to create or update a SageMaker endpoint. However this doesn't work in the following situation:

  • DAG run #1 create the endpoint (have to provide operation="create" here)
  • Following DAG runs will update the endpoint created by DAG run #1 (would have to edit DAG and set operation="update" here)

Which should be a very valid use case IMO.

The SageMakerEndpointOperator should itself check whether an endpoint with name X already exists and overwrite it (configurable as desired by the user).

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Check and document that docker-compose >= 1.20 is needed to run breeze


Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Airflow UI should also display dag_concurrency reached

Currently, in the main view, the schedule column box is highlighted in red if the max. number of DAG runs is achieved. In this case no more DAG runs can be started until a DAG run completes.

I think it should also display in red when the dag_concurrency (i.e. max concurrent tasks) is achieved. In this case also, no more tasks can be started until a task completes. However there is currently nothing in the UI showing that (currently running 1.10.5).

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add ability to specify a maximum modified time for objects in GoogleCloudStorageToGoogleCloudStorageOperator

The fact that I can specify a minimum modified time to filter objects on in GoogleCloudStorageToGoogleCloudStorageOperator but not a maximum seems rather arbitrary. Especially considering the typical usage scenario of running a copy on a schedule, I would like to be able to find objects created within a particular schedule interval for my execution, and not just copy all of the latest objects.

Difficulty: Major
Potential mentors:
Joel Croteau, mail: TV4Fun (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add ability to specify multiple objects to copy to GoogleCloudStorageToGoogleCloudStorageOperator

The restriction in GoogleCloudStorageToGoogleCloudStorageOperator that I am only allowed to specify a single object to list is rather arbitrary. If I specify a wildcard, all it does is split at the wildcard and use that to get a prefix and delimiter. Why not just let me do this search myself and return a list of objects?

Difficulty: Major
Potential mentors:
Joel Croteau, mail: TV4Fun (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

security - hide all password/secret/credentials/tokens from log

I am proposing a new config flag. It will enforce a generic override in all Airflow logging to suppress printing any lines containing a case-insensitive match on any of: password|secret|credential|token


If you do a

            grep -iE 'password|secret|credential|token' -R <airflow_logs_folder>

you may be surprised with what you find :O


Ideally it could replace only the sensitive value, but there are various formats like:

            key=value, key'=value, key value, key"=value, key = value, key"="value, key:value

..etc

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

AWS Batch Operator improvement to support batch job parameters

AWSBatchOperator does not currently support AWS Batch Job parameters.

When creating an AWS Batch Job Definition and when submitting a job to AWS Batch, it's possible to define and supply job parameters. Most of our AWS Batch jobs take parameters but we are not able to pass them using the AWSBatchOperator.

In order to support batch job parameters, a new argument called job_parameters could be added to __init__(self), saved to an instance variable and supplied to self.client.submit_job() in the execute() method:

            self.client.submit_job(
            jobName=self.job_name,
            jobQueue=self.job_queue,
            jobDefinition=self.job_definition,
            containerOverrides=self.overrides,
            parameters=self.job_parameters)
            

See https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch.html#Batch.Client.submit_job

Difficulty: Major
Potential mentors:
Tim Mottershead, mail: TimJim (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add docs how to integrate with grafana and prometheus

I'm not sure how this is doable, but one of the key components missing in Airflow is the ability to notify about detected anomalies, using something like Grafana (https://grafana.com/).

It would be great if Airflow could add support for such tools.


I'm talking here about Airflow itself. For example: if a DAG run normally takes 5 minutes but for some reason is now running for over 30 minutes, then we want an alert to be sent with a graph that shows that anomaly.

Difficulty: Major
Potential mentors:
lovk korm, mail: lovk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add support for dmypy (Mypy daemon) to Breeze environment

Per discussion in https://github.com/apache/airflow/pull/5664 we might use dmypy for local development speedups.

Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow GoogleCloudStorageToBigQueryOperator to accept source_objects as a string or otherwise take input from XCom

`GoogleCloudStorageToBigQueryOperator` should be able to have its `source_objects` dynamically determined by the results of a previous workflow. This is hard to do with it expecting a list, as any template expansion will render as a string. This could be implemented either as a check for whether `source_objects` is a string, and trying to parse it as a list if it is, or a separate argument for a string encoded as a list.

My particular use case for this is as follows:

  1. A daily DAG scans a GCS bucket for all objects created in the last day and loads them into BigQuery.
  2. To find these objects, a `PythonOperator` scans the bucket and returns a list of object names.
  3. A `GoogleCloudStorageToBigQueryOperator` is used to load these objects into BigQuery.

The operator should be able to have its list of objects provided by XCom, but there is no functionality to do this, and trying to do a template expansion along the lines of `source_objects='{{ task_instance.xcom_pull(key="KEY") }}'` doesn't work because this is rendered as a string, which `GoogleCloudStorageToBigQueryOperator` will try to treat as a list, with each character being a single item.
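
A small sketch of the first option, assuming the rendered XCom value looks like a Python/JSON style list literal such as "['a.csv', 'b.csv']"; the helper name is made up for illustration:

            import ast


            def coerce_source_objects(source_objects):
                """Accept either a list or a string that renders as a list literal."""
                if isinstance(source_objects, str):
                    parsed = ast.literal_eval(source_objects)
                    if not isinstance(parsed, list):
                        raise ValueError("source_objects string did not parse to a list")
                    return parsed
                return source_objects


            assert coerce_source_objects("['a.csv', 'b.csv']") == ["a.csv", "b.csv"]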

Difficulty: Major
Potential mentors:
Joel Croteau, mail: TV4Fun (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Make AWS Operators Pylint compatible

Make AWS Operators Pylint compatible.

Difficulty: Major
Potential mentors:
Ishan Rastogi, mail: gto (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Google Search Ads 360 integration

Hi

This project lacks integration with the Google Search Ads 360 service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://developers.google.com/search-ads/
API Documentation: https://developers.google.com/resources/api-libraries/documentation/dfareporting/v3.3/python/latest/

Lots of love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Cloud AutoML Tables integration

Hi

This project lacks integration with the Cloud AutoML Tables service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://cloud.google.com/automl-tables/docs/
API Documentation: https://googleapis.github.io/google-cloud-python/latest/automl/index.html

Love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Cloud AutoML NL Sentiment integration

Hi

This project lacks integration with the Cloud AutoML NL Sentiment service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://cloud.google.com/natural-language/automl/sentiment/docs/
API Documentation: https://googleapis.github.io/google-cloud-python/latest/automl/index.html

Lots of love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Cloud AutoML NL Entity Extraction integration

Hi

This project lacks integration with the Cloud AutoML NL Entity Extraction service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://cloud.google.com/natural-language/automl/entity-analysis/docs/
API Documentation: https://googleapis.github.io/google-cloud-python/latest/automl/index.html

Love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Cloud AutoML NL Classification integration

Hi

This project lacks integration with the Cloud AutoML NL Classification service. I would be happy if Airflow had proper operators and hooks that integrate with this service.

Product Documentation: https://cloud.google.com/natural-language/automl/docs/
API Documentation: https://googleapis.github.io/google-cloud-python/latest/automl/index.html

Love

Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add Gzip compression to S3_hook

Allow loading a compressed file in the load_file function.

We have similar logic in GoogleCloudStorageHook.
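
A rough sketch of the helper logic, mirroring the gzip flag in GoogleCloudStorageHook's upload: compress to a temporary file first and hand that file to the existing upload path (the function name is illustrative only):

            import gzip
            import shutil
            from tempfile import NamedTemporaryFile


            def gzip_to_tempfile(filename):
                """Compress `filename` with gzip and return the path of the compressed copy."""
                tmp = NamedTemporaryFile(suffix=".gz", delete=False)
                with open(filename, "rb") as f_in, gzip.open(tmp.name, "wb") as f_out:
                    shutil.copyfileobj(f_in, f_out)
                return tmp.name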

Difficulty: Major
Potential mentors:
jack, mail: jackjack10 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add FacebookAdsHook

Add hook to interact with FacebookAds

Difficulty: Major
Potential mentors:
jack, mail: jackjack10 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

S3Hook load_file should support ACL policy parameter

We have a use case where we are uploading files to an S3 bucket in a different AWS account to the one Airflow is running in. AWS S3 supports this situation using pre-canned ACL policies, specifically bucket-owner-full-control.

However, the current implementations of the S3Hook.load_*() and S3Hook.copy_object() methods do not allow us to supply any ACL policy for the file being uploaded/copied to S3.

It would be good to add another optional parameter to the S3Hook methods called acl_policy, which would then be passed into the boto3 client method calls like so:


            # load_file
            ...
            if encrypt:
                extra_args['ServerSideEncryption'] = "AES256"
            if acl_policy:
                extra_args['ACL'] = acl_policy

            client.upload_file(filename, bucket_name, key, ExtraArgs=extra_args)

            # load_bytes
            ...
            if encrypt:
                extra_args['ServerSideEncryption'] = "AES256"
            if acl_policy:
                extra_args['ACL'] = acl_policy

            client.upload_file(filename, bucket_name, key, ExtraArgs=extra_args)

            # copy_object
            self.get_conn().copy_object(Bucket=dest_bucket_name,
                                        Key=dest_bucket_key,
                                        CopySource=CopySource,
                                        ACL=acl_policy)
Difficulty: Major
Potential mentors:
Keith O'Brien, mail: kbobrien@gmail.com (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow filtering by all columns in Browse Logs view

The "Browse Logs" UI currently allows filtering by "DAG ID", "Task ID", "Execution Date", and "Extra".

For consistency and flexibility, it would be good to allow filtering by any of the available columns, specifically "Datetime", "Event", "Execution Date", and "Owner". 

Difficulty: Minor
Potential mentors:
Brylie Christopher Oxley, mail: brylie (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Support for emptyDir volume in KubernetesExecutor

Currently it seems that the Kubernetes Executor expects dags_volume_claim or git_repo to always be defined through airflow.cfg; otherwise it does not come up.
Although there is support for an "emptyDir" volume in worker_configuration.py, kubernetes_executor fails in the _validate function if these configs are not defined.
Our DAG files are stored in a remote location which can be synced to the worker pod through an init/side-car container. We are exploring whether it makes sense to allow the Kubernetes Executor to come up for cases where dags_volume_claim and git_repo are not defined. In such cases the worker pod would look for the DAGs in an emptyDir volume at the worker_airflow_dags path (like it does for git-sync), and DAG files could be made available there through an init/side-car container.
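
A rough sketch of the fallback volume the worker pod could use in that case, written as plain dicts in the style of worker_configuration.py; the volume name and mount path are assumptions, not the actual Airflow code:

            # emptyDir volume for DAGs, to be filled by an init or side-car container
            dags_volume = {
                "name": "airflow-dags",
                "emptyDir": {},
            }

            # mounted at the configured worker_airflow_dags path inside the worker container
            dags_volume_mount = {
                "name": "airflow-dags",
                "mountPath": "/usr/local/airflow/dags",
                "readOnly": True,
            }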


Difficulty: Major
Potential mentors:
raman, mail: ramandumcs (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Improve performance of cc1e65623dc7_add_max_tries_column_to_task_instance migration

The cc1e65623dc7_add_max_tries_column_to_task_instance migration creates a DagBag for the corresponding DAG for every single task instance. This is highly redundant and unnecessary.

Hence, there are discussions on Slack like these:

murquizo   [Jan 17th at 1:33 AM]
            Why does the airflow upgradedb command loop through all of the dags?
            
            ....
            
            murquizo   [14 days ago]
            NICE, @BasPH! that is exactly the migration that I was referring to.  We have about 600k task instances and have a several
            python files that generate multiple DAGs, so looping through all of the task_instances to update max_tries was too slow. 
            It took 3 hours and didnt even complete! i pulled the plug and manually executed the migration.   Thanks for your response.
            

An easy improvement is to parse each DAG only once and reuse it when setting the task instance try_number. I created a branch for it (https://github.com/BasPH/incubator-airflow/tree/bash-optimise-db-upgrade), am currently running tests and will make a PR when done.
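
The core of the idea is a simple cache, sketched below (not the actual patch in that branch):

            from airflow.models import DagBag

            _dagbag_cache = {}


            def get_dag_cached(dag_folder, dag_id):
                """Build a DagBag at most once per folder and reuse it for every task instance."""
                if dag_folder not in _dagbag_cache:
                    _dagbag_cache[dag_folder] = DagBag(dag_folder)
                return _dagbag_cache[dag_folder].get_dag(dag_id)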

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

KubernetesPodOperator: Use secretKeyRef or configMapKeyRef in env_vars

The env_vars attribute of the KubernetesPodOperator allows passing environment variables as strings, but it doesn't allow passing a value from a ConfigMap or a Secret.

I'd like to be able to do:

            modeling = KubernetesPodOperator(
                ...
                env_vars={
                    'MY_ENV_VAR': {
                        'valueFrom': {
                            'secretKeyRef': {
                                'name': 'an-already-existing-secret',
                                'key': 'key',
                            }
                        }
                    },
                },
                ...
            )
            

Right now if I do that, Airflow generates the following config

            - name: MY_ENV_VAR
              value:
                valueFrom:
                  configMapKeyRef:
                    name: an-already-existing-secret
                    key: key
            

instead of 

            - name: MY_ENV_VAR
              valueFrom:
                configMapKeyRef:
                  name: an-already-existing-secret
                  key: key

The extract_env_and_secrets method of the KubernetesRequestFactory could check if the value is a dictionary and use it directly.
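
A minimal, self-contained sketch of that check (not the actual Airflow code): a dict value is passed through as-is, anything else is wrapped in a plain value entry.

            def build_env_entries(env_vars):
                """Turn the operator's env_vars mapping into Kubernetes container env entries."""
                entries = []
                for name, value in env_vars.items():
                    if isinstance(value, dict):
                        # assume a full spec, e.g. {'valueFrom': {'secretKeyRef': {...}}}
                        entry = {'name': name}
                        entry.update(value)
                    else:
                        entry = {'name': name, 'value': value}
                    entries.append(entry)
                return entries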


Difficulty: Major
Potential mentors:
Arthur Brenaut, mail: abrenaut (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Support for Passing Custom Env variables while launching k8 Pod

Is there a way to provide environment variables while launching a Kubernetes pod through the Kubernetes Executor? We need to pass some env variables which are referenced inside our Airflow operator, so can we provide custom env variables to the container while launching the task pod? Currently it seems that only predefined env variables are supported.

worker_configuration.py

            def get_environment(self):
                """Defines any necessary environment variables for the pod executor"""
                env = {
                    'AIRFLOW__CORE__DAGS_FOLDER': '/tmp/dags',
                    'AIRFLOW__CORE__EXECUTOR': 'LocalExecutor'
                }
                if self.kube_config.airflow_configmap:
                    env['AIRFLOW__CORE__AIRFLOW_HOME'] = self.worker_airflow_home
                return env


Possible solution

At the moment there is no way to configure environment variables on a per-task basis, but it shouldn't be too hard to add that functionality. Extra config options can be passed through the `executor_config` on any operator:

https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L2423-L2437

Which are eventually used here to construct the kubernetes pod for the task:

https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/kubernetes/worker_configuration.py#L186
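
A hypothetical DAG-level usage once such support exists; the "env_vars" key below is exactly what would need to be added and is not supported today:

            from datetime import datetime

            from airflow import DAG
            from airflow.operators.bash_operator import BashOperator

            dag = DAG("k8s_env_example", start_date=datetime(2019, 1, 1), schedule_interval=None)

            task = BashOperator(
                task_id="print_env",
                bash_command="echo $MY_SERVICE_URL",
                executor_config={
                    "KubernetesExecutor": {
                        # proposed, not yet existing, per-task environment variables
                        "env_vars": {"MY_SERVICE_URL": "https://example.internal"},
                    }
                },
                dag=dag,
            )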


Difficulty: Major
Potential mentors:
raman, mail: ramandumcs (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Drop snakebite in favour of pyarrow

The current HdfsHook relies on the snakebite library, which is unfortunately not compatible with Python 3. Adding Python 3 support to the HdfsHook requires switching to a different library for interacting with HDFS. The hdfs3 library is an attractive alternative, as it supports Python 3 and seems to be stable and relatively well supported.

Update: hdfs3 no longer receives updates. The best library right now seems to be pyarrow: https://arrow.apache.org/docs/python/filesystems.html
Therefore I would like to upgrade to pyarrow instead of hdfs3.
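
A rough sketch of what a pyarrow-based replacement could look like; the class and method names are illustrative, Airflow connection handling is omitted, and the host/port values are placeholders:

            import pyarrow as pa


            class PyArrowHdfsHook(object):
                """Minimal HDFS hook built on pyarrow's libhdfs-based filesystem API."""

                def __init__(self, host="namenode", port=8020, user=None):
                    self.host = host
                    self.port = port
                    self.user = user
                    self._conn = None

                def get_conn(self):
                    if self._conn is None:
                        self._conn = pa.hdfs.connect(host=self.host, port=self.port, user=self.user)
                    return self._conn

                def check_for_path(self, hdfs_path):
                    """Return True if the given path exists on HDFS."""
                    return self.get_conn().exists(hdfs_path)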

Difficulty: Blocker
Potential mentors:
Julian de Ruiter, mail: jrderuiter (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow wild-cards in the search box in the UI

In the UI there is a search box.

If you search for a DAG name, you will see results for the search as you type.

Please add support for wild-cards, mainly for: *


So if I have a DAG called abcd and I search for ab*, I will see it in the list.


This is very helpful for systems with 100+ DAGs.

Difficulty: Major
Potential mentors:
jack, mail: jackjack10 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Show lineage in visualization


Difficulty: Major
Potential mentors:
Bolke de Bruin, mail: bolke (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add additional quick start to INSTALL


Difficulty: Blocker
Potential mentors:
Bolke de Bruin, mail: bolke (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

HttpHook shall be configurable to non-status errors

When using HttpSensor, which uses HttpHook under the hood to perform the request, the task fails immediately if the target service is down and refuses the connection.

It would be great if this behaviour were configurable, so the sensor would keep polling until the service is up again.

Traceback of the error:

            [2017-04-29 02:00:31,248] {base_task_runner.py:95} INFO - Subtask: requests.exceptions.ConnectionError: HTTPConnectionPool(host='xxxx', port=123): Max retries exceeded with url: /xxxx (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f94b64b44e0>: Failed to establish a new connection: [Errno 111] Connection refused',))
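
As a workaround sketch until this is configurable, the sensor could treat a refused connection as "not ready yet" rather than a failure; the import path below follows Airflow 1.10 (older releases expose the sensor under airflow.operators.sensors):

            import requests

            from airflow.sensors.http_sensor import HttpSensor


            class TolerantHttpSensor(HttpSensor):
                """Keep poking instead of failing when the target service refuses the connection."""

                def poke(self, context):
                    try:
                        return super(TolerantHttpSensor, self).poke(context)
                    except requests.exceptions.ConnectionError:
                        self.log.info("Service not reachable yet, will retry")
                        return False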

Difficulty: Major
Potential mentors:
Deo, mail: jy00520336 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

`/health` endpoint on each component

Please provide a /health endpoint for each of the following components:

  • webserver (to avoid pinging the / root endpoint)
  • worker
  • scheduler

This would ease integration with the Mesos/Marathon framework.

If you agree, I volunteer to add this change.
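
For the webserver, something close to this can already be prototyped with an Airflow plugin; this is a minimal sketch, and the worker and scheduler would still need their own probes:

            from flask import Blueprint, jsonify

            from airflow.plugins_manager import AirflowPlugin

            health_bp = Blueprint("health", __name__)


            @health_bp.route("/health")
            def health():
                return jsonify(status="OK")


            class HealthEndpointPlugin(AirflowPlugin):
                name = "health_endpoint"
                flask_blueprints = [health_bp]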

Difficulty: Major
Potential mentors:
gsemet, mail: gaetan@xeberon.net (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Custom parameters for DockerOperator

Add the ability to specify custom parameters for the docker CLI, e.g. --volume-driver=... or --net="bridge" or any other.

Difficulty: Major
Potential mentors:
Alexandr Nikitin, mail: alexandrnikitin (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org