Contents

APISIX

Complete the License information for the next dashboard

We are building the next dashboard based on Ant Design Pro, which is an open-source project for building dashboards. We need your help to add the Apache License header to the files under the next branch.

Difficulty: Minor

Mentors: juzhiyuan@apache.org
Potential mentors:
Ming Wen, mail: wenming (at) apache.org
Project Devs, mail: dev (at) apisix.apache.org

Check the API version for every request

To make sure the dashboard is using the correct API version, we should include the APISIX version in every API response.

Please

  1. Add the dashboard api version variable in the config file.
  2. Check every API response in the request.ts file and show an alert when the dashboard version is not compatible with the APISIX version.

Difficulty: Major
Mentors: juzhiyuan@apache.org
Potential mentors:
Ming Wen, mail: wenming (at) apache.org
Project Devs, mail: dev (at) apisix.apache.org

add commit message checker

To ensure the quality of every commit message, please add a commit message checker similar to https://github.com/vuejs/vue-next/blob/master/scripts/verifyCommit.js

Difficulty: Minor

Mentors: juzhiyuan@apache.org
Potential mentors:
Ming Wen, mail: wenming (at) apache.org
Project Devs, mail: dev (at) apisix.apache.org

implement Apache APISIX echo plugin

APISIX currently provides a simple example plugin, but it does not provide useful functionality.

So we could provide a genuinely useful plugin to help users understand, as fully as possible, how to develop an APISIX plugin.

This plugin could implement the corresponding functionality in the common phases such as init, rewrite, access, balancer, header filter, body filter, and log. The specific functionality is still being considered.

Difficulty: Major

Mentors: agile6v@apache.org, wenming@apache.org, yousa@apache.org
Potential mentors:
Ming Wen, mail: wenming (at) apache.org
Project Devs, mail: dev (at) apisix.apache.org

feature: Support follow redirect

When a client request passes through APISIX to an upstream and the upstream returns a 301 or 302, APISIX by default returns the response directly to the client. The client receives the 301/302 response and then initiates a new request to the address specified in the Location header. Sometimes the client wants APISIX to follow the redirect on its behalf, so APISIX could provide this capability to support more scenarios.

Difficulty: Major

Mentors: agile6v@apache.org, wenming@apache.org, yousa@apache.org
Potential mentors:
Ming Wen, mail: wenming (at) apache.org
Project Devs, mail: dev (at) apisix.apache.org

Airavata

Create an admin portal for Airavata Managed File Transfer Agents

Managed File Transfer is a newer capability of Apache Airavata - https://github.com/apache/airavata-mft

Recently, a UX student drafted a graphical mockup of how a potential dashboard could list all agents. This EPIC is to track development of Django-based user interfaces integrating with Custos Security - https://github.com/apache/airavata-custos

Difficulty: Medium

Mentors: Dimuthu, Suresh
Potential mentors:
Suresh Marru, mail: smarru (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

Implement Storage Quotas for multiple types of storages

Airavata-based science gateways store data in gateway storage, typically mounted on the portal's hosting server. Each user's data is organized within user directories on these storage devices. As storage fills up, this often creates issues in rationing disk space.

This Epic is to track a new capability to implement user-specific quotas within Airavata and track usage. This will involve:


Mentors: Sudhakar, Suresh, Dimuthu

Difficulty: Major
Potential mentors:
Suresh Marru, mail: smarru (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

Manage gateway users allocation usage

Once the allocation manager defines allocation policies per gateway and per gateway user, the next step is to calculate individual users' allocation usage.
1. A gateway admin should be able to assign an allocation to an individual user.
2. If an allocation is not assigned, the user can use unlimited CUs in the gateway.
3. When an SU allocation is assigned, the gateway admin should also be able to configure whether the assigned allocation applies across all the HPC resources the user can use, or select specific HPC resources for the assigned allocation.
4. As the user consumes CUs, the number of CUs used needs to be calculated against the assigned allocation.
5. When a user submits an experiment, a pre-calculation has to be done to check whether the requested number of CUs is available for the simulation execution.

Difficulty: Major
Potential mentors:
Eroma, mail: eroma_a (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

SciGaP adminportal monitoring module

The SciGaP gateway platform is multi-tenanted and serves multiple gateways at the same time. In order to manage gateway features such as
compute resource creation,
storage resource creation, and
deploying new gateways,
the platform has a super gateway portal: https://scigap.org/. This portal has two main user groups: gateway providers, who create new gateway requests, and SciGaP admins (the SGRC team). SciGaP admins process the new gateway requests and create the compute and storage resources required for the gateways.

The proposed monitoring module is to serve both SciGaP admins and individual gateway admins in generating the reports they need for various reporting and planning purposes. This documentation explains the monitoring requirements of SciGaP admins and gateway admins.

Another main aspect of the monitoring module would be an audit trail. An audit trail is required both at the parent SciGaP portal level and at the individual gateway level. The audit should generate reports stating who has changed what at the gateway Settings level. The audit is required for all aspects of Admin Settings and should display who has created, updated, or deleted records within the gateway.

Difficulty: Major
Potential mentors:
Eroma, mail: eroma_a (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

SciGaP admins, Gateway admins/PIs communicating with gateway users

This task is to handle and implement communication between SciGaP team (Gateway providers) and gateway users.

  1. Multiple groups of gateway users are available.
  2. Gateway user groups include admins, gateway users, groups created by gateway users, etc.
  3. SciGaP admins/team need a way to communicate with either all the user groups or selected user groups, for a specific gateway or for all gateways.
  4. The communication could be through email or, if users have provided a phone number and opted in, text messages. Sending text messages could be the second phase of this implementation; users' email addresses are already in the system, so we could start with email.
  5. The communication could be to inform users about system unavailability, maintenance of specific compute resources, new application releases, etc.
  6. Currently, notices are only shown to users when they log in to the gateway portal.
  7. Similar to the SciGaP team reaching out to users, gateway PIs should be able to email selected users or user groups through the same method.
Difficulty: Major
Potential mentors:
Eroma, mail: eroma_a (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

Implement user friendly actionable and informative messages in Airavata

Currently, raw ERROR logs are presented to the user, and they may not be clear or directly actionable. Going forward, apart from the ERROR message, a more actionable message will be presented. The change will cover the internal components of Airavata as well as the gateway bridge code.

By looking at a message, a gateway user should be able to understand what has gone wrong and what action they need to take.

Difficulty: Major
Potential mentors:
Eroma, mail: eroma_a (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

Experiments fail to submit jobs to HPC cluster queues due to queue reaching the max job limit per user.

Currently experiments fail when:

  1. The HPC queue reaches the maximum job number for the queue.
  2. The job submission fails and the HPC sends a job submission response [1]; Airavata tags the experiment as FAILED.
  3. The only option for the gateway user is to submit the experiment again.

The required fix is for Airavata to have internal queues, or another way to manage such experiments until the HPC queue is available for jobs, instead of failing the experiment.

When enabling internal Airavata queues, we need to focus on keeping queues per gateway, per HPC resource, per gateway login user, etc. These implementation details need to be discussed and finalized, and input will also be required from HPC system administrators.


[1]

This example is from Stampede2:

            -----------------------------------------------------------------
                      Welcome to the Stampede2 Supercomputer
            -----------------------------------------------------------------
            No reservation for this job
            --> Verifying valid submit host (login3)...OK
            --> Verifying valid jobname...OK
            --> Enforcing max jobs per user...FAILED
                [*] Too many simultaneous jobs in queue.
                --> Max job limits for us3 = 50 jobs


Difficulty: Major
Potential mentors:
Eroma, mail: eroma_a (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

Implement quick job queue in HPCs e.g.: Jetstream

1. Gateways submit jobs with varying execution times.
2. Some gateways have small jobs which would complete within a matter of seconds, maybe even less than 10 seconds.
3. For such jobs, the gateway and Airavata middleware need to support quick queues for executions and results.

Difficulty: Major
Potential mentors:
Eroma, mail: eroma_a (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

Implement Application input editor in experiment creation

1. Currently the Django gateway portal supports input file upload and text file viewing.
2. Once the files are uploaded, the user can view the file but cannot edit it.
3. The implementation should allow the input file to be edited and updated after it has been uploaded to the portal.
4. The user should also have the option of saving a local copy of the latest file if needed.

Difficulty: Major
Potential mentors:
Eroma, mail: eroma_a (at) apache.org
Project Devs, mail: dev (at) airavata.apache.org

Apache Airflow

HttpHook shall be configurable to non-status errors

When using HttpSensor, which under the hood uses HttpHook to perform the request, the task fails immediately if the target service is down and refuses the connection.

It would be great if this behaviour were configurable, so the sensor would keep poking until the service is up again.

Traceback of the error:

[2017-04-29 02:00:31,248] {base_task_runner.py:95} INFO - Subtask: requests.exceptions.ConnectionError: HTTPConnectionPool(host='xxxx', port=123): Max retries exceeded with url: /xxxx (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f94b64b44e0>: Failed to establish a new connection: [Errno 111] Connection refused',))
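A minimal sketch of the requested behaviour (the import path follows the newer provider package and varies by Airflow version; the subclass name is illustrative): treat a refused connection as "not yet ready" instead of failing the task.

    import requests
    from airflow.providers.http.sensors.http import HttpSensor  # path varies by Airflow version

    class TolerantHttpSensor(HttpSensor):
        """Keeps poking while the target service refuses connections."""

        def poke(self, context):
            try:
                return super().poke(context)
            except requests.exceptions.ConnectionError as exc:
                self.log.info("Service unreachable, will retry: %s", exc)
                return False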

Difficulty: Major
Potential mentors:
Deo, mail: jy00520336 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Improve performance of cc1e65623dc7_add_max_tries_column_to_task_instance migration

The cc1e65623dc7_add_max_tries_column_to_task_instance migration creates a DagBag for the corresponding DAG for every single task instance. This is redundant and unnecessary.

Hence, there are discussions on Slack like these:

murquizo   [Jan 17th at 1:33 AM]
            Why does the airflow upgradedb command loop through all of the dags?
            
            ....
            
            murquizo   [14 days ago]
            NICE, @BasPH! that is exactly the migration that I was referring to.  We have about 600k task instances and have a several
            python files that generate multiple DAGs, so looping through all of the task_instances to update max_tries was too slow. 
            It took 3 hours and didnt even complete! i pulled the plug and manually executed the migration.   Thanks for your response.
            

An easy improvement is to parse each DAG only once and after that set the task instances' try_number. I created a branch for it (https://github.com/BasPH/incubator-airflow/tree/bash-optimise-db-upgrade), am currently running tests, and will make a PR when done.
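A minimal sketch of that idea (not the actual migration code; the helper name is illustrative): build the DagBag once and look each task instance's DAG up from it, instead of re-parsing the DAG folder per TaskInstance row.

    from airflow.models import DagBag

    dagbag = DagBag()  # parse the DAG folder a single time

    def backfill_max_tries(task_instances):
        for ti in task_instances:
            dag = dagbag.get_dag(ti.dag_id)
            if dag is not None and dag.has_task(ti.task_id):
                ti.max_tries = dag.get_task(ti.task_id).retries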

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Show lineage in visualization


Difficulty: Major
Potential mentors:
Bolke de Bruin, mail: bolke (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add support for dmypy (Mypy daemon) to Breeze environment

Per discussion in https://github.com/apache/airflow/pull/5664 we might use dmypy for local development speedups.

Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Make AWS Operators Pylint compatible

Make AWS Operators Pylint compatible.

Difficulty: Major
Potential mentors:
Ishan Rastogi, mail: gto (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

`/health` endpoint on each component

Please provide a /health endpoint for each of the following components:

  • webservice (to avoid pinging the / root endpoint)
  • worker
  • scheduler

This would ease integration with the Mesos/Marathon framework.

If you agree, I volunteer to add this change.
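A minimal sketch of the idea (plain Flask, not Airflow's actual webserver code; the blueprint and payload are illustrative):

    from flask import Blueprint, jsonify

    health_bp = Blueprint("health", __name__)

    @health_bp.route("/health")
    def health():
        # A real implementation would also check scheduler heartbeats, DB access, etc.
        return jsonify(status="healthy")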

Difficulty: Major
Potential mentors:
gsemet, mail: gaetan@xeberon.net (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

AWS Batch Operator improvement to support batch job parameters

AWSBatchOperator does not currently support AWS Batch job parameters.

When creating an AWS Batch job definition and when submitting a job to AWS Batch, it is possible to define and supply job parameters. Most of our AWS Batch jobs take parameters, but we are not able to pass them using the AWSBatchOperator.

In order to support batch job parameters, a new argument called job_parameters could be added to __init__(self), saved to an instance variable, and supplied to self.client.submit_job() in the execute() method:

            self.client.submit_job(
                jobName=self.job_name,
                jobQueue=self.job_queue,
                jobDefinition=self.job_definition,
                containerOverrides=self.overrides,
                parameters=self.job_parameters)

See https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch.html#Batch.Client.submit_job
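A hedged sketch of where job_parameters could be accepted (the import path and subclass are illustrative only and vary by Airflow version; the real change would go into AWSBatchOperator itself):

    from airflow.contrib.operators.awsbatch_operator import AWSBatchOperator  # 1.10.x path

    class AWSBatchOperatorWithParams(AWSBatchOperator):
        def __init__(self, job_parameters=None, **kwargs):
            super().__init__(**kwargs)
            # Stored here so execute() can pass parameters=self.job_parameters to submit_job().
            self.job_parameters = job_parameters or {}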

Difficulty: Major
Potential mentors:
Tim Mottershead, mail: TimJim (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add docs how to integrate with grafana and prometheus

I'm not sure how doable this is, but one of the key components missing in Airflow is the ability to notify about detected anomalies, using something like Grafana: https://grafana.com/

It would be great if Airflow could add support for such tools.


I'm talking here about Airflow itself. For example: if a DAG run normally takes 5 minutes but is now, for whatever reason, running over 30 minutes, we want an alert to be sent with a graph that shows the anomaly.

Difficulty: Major
Potential mentors:
lovk korm, mail: lovk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Support for Passing Custom Env variables while launching k8 Pod

Is there a way to provide environment variables when launching a Kubernetes pod through the Kubernetes executor? We need to pass some environment variables which are referenced inside our Airflow operator, so can we provide custom environment variables to the container when launching the task pod? Currently it seems that only predefined environment variables are supported.

worker_configuration.py

    def get_environment(self):
        """Defines any necessary environment variables for the pod executor"""
        env = {
            'AIRFLOW__CORE__DAGS_FOLDER': '/tmp/dags',
            'AIRFLOW__CORE__EXECUTOR': 'LocalExecutor'
        }
        if self.kube_config.airflow_configmap:
            env['AIRFLOW__CORE__AIRFLOW_HOME'] = self.worker_airflow_home
        return env


Possible solution

At the moment there is not a way to configure environment variables on a per-task basis, but it shouldn't be too hard to add that functionality. Extra config options can be passed through the `executor_config` on any operator:

https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L2423-L2437

Which are eventually used here to construct the kubernetes pod for the task:

https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/kubernetes/worker_configuration.py#L186
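A hedged sketch of what the feature might look like from the DAG author's side once added (the "env" key inside executor_config is hypothetical; import paths follow Airflow 2.x):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG("env_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
        task = PythonOperator(
            task_id="print_env",
            python_callable=lambda: print("hello"),
            # "env" is the proposed (hypothetical) key, not an existing executor_config option.
            executor_config={"KubernetesExecutor": {"env": {"MY_CUSTOM_VAR": "value"}}},
        )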


Difficulty: Major
Potential mentors:
raman, mail: ramandumcs (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Drop snakebite in favour of pyarrow

The current HdfsHook relies on the snakebite library, which is unfortunately not compatible with Python 3. Adding Python 3 support to the HdfsHook requires switching to a different library for interacting with HDFS. The hdfs3 library used to be an attractive alternative, as it supports Python 3 and seemed stable and relatively well supported.

Update: hdfs3 doesn't get any updates anymore. The best library right now seems to be pyarrow: https://arrow.apache.org/docs/python/filesystems.html
Therefore I would like to switch to pyarrow instead of hdfs3.
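A minimal sketch of what a pyarrow-backed hook might call internally (connection parameters are illustrative; a working libhdfs installation is assumed):

    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host="namenode", port=8020, user="airflow")
    with hdfs.open_input_stream("/data/example.csv") as f:
        print(f.read(100))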

Difficulty: Blocker
Potential mentors:
Julian de Ruiter, mail: jrderuiter (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

UI - Show count of tasks in each dag on the main dags page

The main DAGs page in the UI would benefit from showing a new column: the number of tasks for each DAG ID.

Difficulty: Minor
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

clear cli command needs a 'conf' option

Key-value pairs of conf can be passed into the trigger_dag command, e.g.:

--conf '{"ric": "amzn"}'

The clear command needs this feature too,

e.g. in case exec_date is important and there was a failure halfway through the first dagrun due to bad conf being sent on the trigger_dag command, and you want to re-run the same execution date but with new conf on a second dagrun.

An alternative solution would be a new delete_dag_run cli command, so you never need to 'clear' but can do a second DagRun for the same execution date.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Improve multiline output in admin gui

Multiline attributes, rendered templates, or Xcom variables are not well supported in the admin GUI at present. Any values are treated as native HTML text() blocks and as such all formatting is lost. When passing structured data such as YAML in these variables, it makes a real mess of them.

Ideally, these values should keep their line-breaks and indentation.

This should only require having these code blocks wrapped in a <pre> block or setting `white-space: pre` on the class for the block.

Difficulty: Major
Potential mentors:
Paul Rhodes, mail: withnale (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Mock Cassandra in tests

Cassandra consumes 1.173 GiB of memory. Travis does not have very efficient machines, so we should limit system/integration tests of components that do not require much attention, e.g. those that are not changed often. Cassandra is a good candidate for this. This will allow the machine power to be used for more needed work.

            CONTAINER ID        NAME                                  CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O          PIDS
            8aa37ca50f7c        ci_airflow-testing_run_1f3aeb6d1052   0.00%     5.715MiB / 3.855GiB   0.14%     1.14kB / 0B      2.36MB / 0B        2
            f2b3be15558f        ci_cassandra_1                        0.69%     1.173GiB / 3.855GiB   30.42%    2.39kB / 0B      75.3MB / 9.95MB    50
            ef1de3981ca6        ci_krb5-kdc-server_1                  0.02%     12.15MiB / 3.855GiB   0.31%     2.46kB / 0B      18.9MB / 184kB     4
            be808233eb91        ci_mongo_1                            0.31%     36.71MiB / 3.855GiB   0.93%     2.39kB / 0B      43.2MB / 19.1MB    24
            667e047be097        ci_rabbitmq_1                         0.77%     69.95MiB / 3.855GiB   1.77%     2.39kB / 0B      43.2MB / 508kB     92
            2453dd6e7cca        ci_postgres_1                         0.00%     7.547MiB / 3.855GiB   0.19%     1.05MB / 889kB   35.4MB / 145MB     6
            78050c5c61cc        ci_redis_1                            0.29%     1.695MiB / 3.855GiB   0.04%     2.46kB / 0B      6.94MB / 0B        4
            c117eb0a0d43        ci_mysql_1                            0.13%     452MiB / 3.855GiB     11.45%    2.21kB / 0B      33.9MB / 548MB     21
            131427b19282        ci_openldap_1                         0.00%     45.68MiB / 3.855GiB   1.16%     2.64kB / 0B      32.8MB / 16.1MB    4
            8c2549c010b1        ci_docker_1                           0.59%     22.06MiB / 3.855GiB   0.56%     2.39kB / 0B      95.9MB / 291kB     30
Difficulty: Major
Potential mentors:
Kamil Bregula, mail: kamil.bregula (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add SalesForce connection to UI

Airflow has a SalesforceHook but it doesn't have a distinct connection type.

In order to create a Connection, one must expose its secret token as plain text:

https://stackoverflow.com/questions/53510980/salesforce-connection-using-apache-airflow-ui

Also, it's not very intuitive that the Conn Type should remain blank.

It would be easier and more user-friendly if there were a Salesforce connection type in the UI with a security_token field that is encrypted.
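A hedged sketch of one way a dedicated connection form could be declared on the hook, following the Airflow 2.x provider convention of get_connection_form_widgets / get_ui_field_behaviour (field naming and widget choices are assumptions; older versions may require an extra__<conn type>__ prefix on custom field names):

    from flask_appbuilder.fieldwidgets import BS3PasswordFieldWidget
    from flask_babel import lazy_gettext
    from wtforms import PasswordField

    class SalesforceConnectionForm:
        @staticmethod
        def get_connection_form_widgets():
            # Renders the security token as a password field on the connection form.
            return {
                "security_token": PasswordField(
                    lazy_gettext("Security Token"), widget=BS3PasswordFieldWidget()
                ),
            }

        @staticmethod
        def get_ui_field_behaviour():
            return {"hidden_fields": ["port"], "relabeling": {}, "placeholders": {}}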

Difficulty: Major
Potential mentors:
Elad, mail: eladk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

One of Celery executor tests is flaky

tests/executors/test_celery_executor.py::TestCeleryExecutor::test_celery_integration_0_amqp_guest_guest_rabbitmq_5672


Log attached.

Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Request for OktopostToGoogleStorageOperator

Difficulty: Major
Potential mentors:
HaloKu, mail: HaloKu (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

add GDriveToGcsOperator

There is a GcsToGDriveOperator, but there isn't an equivalent in the other direction.



Difficulty: Major
Potential mentors:
lovk korm, mail: lovk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

dag_processor_manager/webserver/scheduler logs should be created under date folder

DAG-level logs are written under separate date folders. This is great because the old dates are not 'modified/accessed', so they can be easily purged by utilities like tmpwatch.

This JIRA is about making the other logs (such as dag_processor_manager/webserver/scheduler, etc.) also go under separate date folders to allow easy purging. The log from redirecting 'airflow scheduler' to stdout grows by over 100 MB a day in my environment.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Check and document that docker-compose >= 1.20 is needed to run breeze


Difficulty: Major
Potential mentors:
Jarek Potiuk, mail: potiuk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Airflow UI should also display dag_concurrency reached

Currently, in the main view, the schedule column box is highlighted in red if the maximum number of DAG runs is reached. In this case no more DAG runs can be started until a DAG run completes.

I think it should also display in red when dag_concurrency (i.e. the maximum number of concurrent tasks) is reached. In this case too, no more tasks can be started until a task completes. However, there is currently nothing in the UI showing that (currently running 1.10.5).

Difficulty: Major
Potential mentors:
Bas Harenslak, mail: basph (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Support for emptyDir volume in KubernetesExecutor

Currently it seems that the Kubernetes executor expects dags_volume_claim or git_repo to always be defined through airflow.cfg; otherwise it does not come up.
Though there is support for an "emptyDir" volume in worker_configuration.py, kubernetes_executor fails in the _validate function if these configs are not defined.
Our DAG files are stored in a remote location and can be synced to the worker pod through an init/side-car container. We are exploring whether it makes sense to allow the Kubernetes executor to come up when dags_volume_claim and git_repo are not defined. In such cases the worker pod would look for the DAGs in an emptyDir volume at the worker_airflow_dags path (like it does for git-sync). DAG files can be made available at the worker_airflow_dags path through an init/side-car container.


Difficulty: Major
Potential mentors:
raman, mail: ramandumcs (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow filtering by all columns in Browse Logs view

The "Browse Logs" UI currently allows filtering by "DAG ID", "Task ID", "Execution Date", and "Extra".

For consistency and flexibility, it would be good to allow filtering by any of the available columns, specifically "Datetime", "Event", "Execution Date", and "Owner". 

Difficulty: Minor
Potential mentors:
Brylie Christopher Oxley, mail: brylie (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Snowflake Connector cannot run more than one sql from a sql file

I am getting an error when passing a SQL file with multiple SQL statements to the Snowflake operator:

            snowflake.connector.errors.ProgrammingError: 000006 (0A000): 01908236-01a3-b2c4-0000-f36100052686: Multiple SQL statements
            in a single API call are not supported; use one API call per statement instead.
            

It only fails if you pass a file with multiple statements. A file with just one statement, or a list of statements passed to the operator, works fine.

After looking at the current Snowflake operator implementation, it seems a list of SQL statements works because the operator executes one statement at a time, whereas multiple statements in a SQL file fail because they are all read as one continuous string.


How can we fix this?

There is an API call in the Snowflake Python connector that supports multiple SQL statements:

https://docs.snowflake.net/manuals/user-guide/python-connector-api.html#execute_string

This can be fixed by overriding the run function in SnowflakeHook to support multiple SQL statements in a file.
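A hedged sketch of that override (the import path varies by Airflow version; error handling is simplified): a plain string with several statements is handed to the connector's execute_string(), while everything else falls back to the existing behaviour.

    from airflow.contrib.hooks.snowflake_hook import SnowflakeHook  # 1.10.x path

    class MultiStatementSnowflakeHook(SnowflakeHook):
        def run(self, sql, autocommit=False, parameters=None):
            if isinstance(sql, str):
                conn = self.get_conn()
                try:
                    # execute_string runs each ';'-separated statement in turn.
                    for cursor in conn.execute_string(sql):
                        self.log.info("Statement done, rowcount=%s", cursor.rowcount)
                    conn.commit()
                finally:
                    conn.close()
            else:
                super().run(sql, autocommit=autocommit, parameters=parameters)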

Difficulty: Major
Potential mentors:
Saad, mail: saadk (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

prevent autocomplete of username in login UI

The login page of the UI has autocomplete enabled for the username field. This should be disabled for security.

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add ability to sort dag list in the UI by Dag, Owner, Last Run

Currently the DAG list is sorted by DAG name.

It should be possible to sort the DAG list in the UI by Dag, Owner, and Last Run, and to allow ASC/DESC ordering.

This functionality already exists elsewhere in the UI, for example in Security -> List Users,
where you can sort the table by any column.

Difficulty: Major
Potential mentors:
Roster, mail: RosterInn (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add ability to specify a maximum modified time for objects in GoogleCloudStorageToGoogleCloudStorageOperator

The fact that I can specify a minimum modified time to filter objects on in GoogleCloudStorageToGoogleCloudStorageOperator, but not a maximum, seems rather arbitrary. Especially considering the typical usage scenario of running a copy on a schedule, I would like to be able to find objects created within a particular schedule interval for my execution, rather than just copying all of the latest objects.

Difficulty: Major
Potential mentors:
Joel Croteau, mail: TV4Fun (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Expose generate_presigned_url of boto3 to S3Hook

boto3 has generate_presigned_url which should be exposed in the Hook:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.generate_presigned_url

generate_presigned_url(ClientMethod, Params=None, ExpiresIn=3600, HttpMethod=None)

Generate a presigned url given a client, its method, and arguments

Parameters

  • ClientMethod (string) – The client method to presign for
  • Params (dict) – The parameters normally passed to ClientMethod.
  • ExpiresIn (int) – The number of seconds the presigned url is valid for. By default it expires in an hour (3600 seconds)
  • HttpMethod (string) – The http method to use on the generated url. By default, the http method is whatever is used in the method's model.

Returns The presigned url
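A hedged sketch of how the hook could expose it, delegating to the underlying boto3 client (import path and subclass name are illustrative and vary by Airflow version):

    from airflow.providers.amazon.aws.hooks.s3 import S3Hook  # 2.x provider path

    class S3HookWithPresign(S3Hook):
        def generate_presigned_url(self, client_method, params=None, expires_in=3600, http_method=None):
            # Delegates straight to boto3's generate_presigned_url.
            return self.get_conn().generate_presigned_url(
                ClientMethod=client_method,
                Params=params,
                ExpiresIn=expires_in,
                HttpMethod=http_method,
            )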

Difficulty: Major
Potential mentors:
korni, mail: korni (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

security - hide all password/secret/credentials/tokens from log

I am proposing a new config flag. It would enforce a generic override in all Airflow logging to suppress printing any line containing a case-insensitive match on any of: password|secret|credential|token


If you do a

            grep -iE 'password|secret|credential|token' -R <airflow_logs_folder>

you may be surprised with what you find :O


Ideally it would replace only the sensitive value, but values appear in various formats, e.g.:

            key=value, key'=value, key value, key"=value, key = value, key"="value, key:value

etc.
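A minimal sketch of the line-suppression idea (not Airflow's implementation): a logging filter that redacts any record whose message matches the sensitive-keyword pattern.

    import logging
    import re

    SENSITIVE = re.compile(r"password|secret|credential|token", re.IGNORECASE)

    class RedactingFilter(logging.Filter):
        def filter(self, record):
            if SENSITIVE.search(record.getMessage()):
                record.msg = "*** redacted: message contained a sensitive keyword ***"
                record.args = ()
            return True

    logging.basicConfig(level=logging.INFO)
    for handler in logging.getLogger().handlers:
        handler.addFilter(RedactingFilter())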

Difficulty: Major
Potential mentors:
t oo, mail: toopt4 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Add FacebookAdsHook

Add hook to interact with FacebookAds

Difficulty: Major
Potential mentors:
jack, mail: jackjack10 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow GoogleCloudStorageToBigQueryOperator to accept source_objects as a string or otherwise take input from XCom

`GoogleCloudStorageToBigQueryOperator` should be able to have its `source_objects` dynamically determined by the results of a previous workflow step. This is hard to do while it expects a list, as any template expansion renders as a string. This could be implemented either as a check for whether `source_objects` is a string, trying to parse it as a list if it is, or as a separate argument that takes a string encoded as a list.

My particular use case for this is as follows:

  1. A daily DAG scans a GCS bucket for all objects created in the last day and loads them into BigQuery.
  2. To find these objects, a `PythonOperator` scans the bucket and returns a list of object names.
  3. A `GoogleCloudStorageToBigQueryOperator` is used to load these objects into BigQuery.

The operator should be able to have its list of objects provided by XCom, but there is no functionality to do this, and trying to do a template expansion along the lines of `source_objects='{{ task_instance.xcom_pull(key="KEY") }}'` doesn't work because this is rendered as a string, which `GoogleCloudStorageToBigQueryOperator` will try to treat as a list, with each character being a single item.
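A hedged sketch of the first option (a helper the operator could call; the function name is illustrative): accept a string, try to parse it as a list, and otherwise treat it as a single object name.

    import ast

    def normalize_source_objects(source_objects):
        if isinstance(source_objects, str):
            try:
                parsed = ast.literal_eval(source_objects)
                if isinstance(parsed, (list, tuple)):
                    return list(parsed)
            except (ValueError, SyntaxError):
                pass
            return [source_objects]
        return list(source_objects)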

Difficulty: Major
Potential mentors:
Joel Croteau, mail: TV4Fun (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Allow wild-cards in the search box in the UI

In the UI there is a search box.

If you search by DAG name, you will see the results for the search as you type.

Please add support for wildcards, mainly *.


So if I have a DAG called abcd and I search for ab*, it will appear in the list.


This is very helpful for systems with 100+ DAGs.

Difficulty: Major
Potential mentors:
jack, mail: jackjack10 (at) apache.org
Project Devs, mail: dev (at) airflow.apache.org

Apache Fineract

Upgrade Fineract 1.x to Java 11 and Upgrade Dependencies to latest versions

Upgrade Fineract 1.x from Java 8 to Java 11 so we can start using the latest LTS Java version and features.

This will also require you to upgrade other Fineract 1.x dependencies from their current to latest possible versions.

Difficulty: Major
Potential mentors:
Awasum Yannick, mail: awasum (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Strengthen/Harden Fineract 1.x to LTS Version by Upgrading Java & Improving Code Coverage of Tests

Overview & Objectives: The Fineract 1.x code base powering Mifos X and dozens of cloud-based core banking and fintech solutions around the world, supporting millions of clients, is very robust and feature-rich. With the wide functional footprint comes greater complexity in the code, which makes maintainability more difficult. Additionally, as new features have been added, the test coverage hasn't been extended at the same rate. The combination of these factors - a large and varied user base reliant upon this vast codebase, the high maintenance burden, the need for increased test coverage, and the need for a stable point for migration to the new architecture - motivates this very important project, which will consist of the major tasks documented in the following issues:
Description:

Helpful Skills: Spring, Hibernate, REST, Java, SQL
Impact: Improved functionality and increased stability of the core Fineract platform
Other Resources: Getting Started with Apache Fineract: https://cwiki.apache.org/confluence/display/FINERACT/Getting+Started+Docs

Difficulty: Major
Potential mentors:
Sanyam Goel, mail: sanyam96 (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Scalability & Performance Enhancements for Supporting Millions of Clients, High TPS, and Concurrent Users

Overview & Objectives
As Mifos X has matured as a core banking platform, it has been adopted and used by larger institutions serving hundreds of thousands and even millions of clients. Partners operating cloud-hosted subscription models are also supporting hundreds of thousands of clients across their multi-tenant installations. Most recently, more and more digital-first fintechs are using the platform for highly scalable wallet accounts needing hundreds or thousands of TPS. We need to benchmark, analyze, and improve the performance and scalability of the system.
Description
Enhancements to the back-end platform will include parallelizing all the jobs with a configurable amount of concurrency, reviewing the explain plans of the queries used in the jobs, paginating input queries for jobs, adding lazy fetching where required, a node-aware scheduler and cache, and office-wise configurable jobs to distribute job load across servers, as well as writing tests to prove that the concurrency will work at a decent amount of scale.
In addition, you'll provide some metrics which can help mid-sized MFIs (those having around a million active loans) in adopting Mifos X.
 
Helpful Skills
Java, Javascript, Spring, JAX-RS, JPA,
Impact
Higher outreach to the unbanked by supporting larger institutions and scaling more rapidly.
Difficulty: Major
Potential mentors:
Sanyam Goel, mail: sanyam96 (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

In line with the rationale for choosing EclipseLink as the ORM replacement for Hibernate in Fineract CN, we have broad consensus across the community to swap out OpenJPA for EclipseLink.

OpenJPA seems to have reached its end of life, with community activity withering, and the trade-offs between Hibernate and EclipseLink are much lower. We also have community members who are migrating Fineract 1.x to PostgreSQL and would benefit from the increased performance of EclipseLink.

Difficulty: Major
Potential mentors:
Ed Cable, mail: edcable (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Improve Robustness of Mifos X and Apache Fineract by Fixing Issues/Feature Requests in Backlog

Overview & Objectives
Mifos X and Apache Fineract are widely used by financial institutions of all sizes and methodologies around the world. With that widespread user base comes a vast array of different processes and procedures that institutions would like supported as slight modifications of the common functionality provided. Over the past several years, we have captured these minor enhancements in our issue tracker as feature requests. Also included in this backlog are additional minor and less critical bugs that have been reported but not yet fixed. This backlog has grown, and it would be a very impactful project for an intern to complete as many of these bug fixes and minor enhancements as possible.
The difficulty level of these issues ranges from low to high, and they touch all components of the platform - most don't require much domain knowledge, but some will.
Description
We have groomed the backlog and tagged issues and feature requests that are relevant for this project with the labels gsoc and/or Volunteer. The priority level of tasks is indicated by p1 being the highest priority. Tasks with an assigned fix version of either 1.4.0 or 1.5.0 have a higher priority.
There are more than 120 tickets in the saved filter. You are not expected to complete all of the tasks in the backlog, but throughout the internship you should fix as many issues/feature requests as possible. You will work with your mentor to deliver a plan for each sprint and adjust velocity as you get scaled up.
Issues to be worked on can be found at https://issues.apache.org/jira/issues/?filter=12345785 - the saved filter is named 2019 Intern Project. 
Helpful Skills:
HTML, Spring, Hibernate, REST, Java, AngularJS, Javascript, SQL
Impact:
Better internal control and financial transparency
Other Resources:
Getting Started with Apache Fineract: https://cwiki.apache.org/confluence/display/FINERACT/Getting+Started+Docs
Difficulty: Major
Potential mentors:
Sanyam Goel, mail: sanyam96 (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Apache Gora

Hazelcast IMap backed datastore

The current implementation of the JCache datastore is written so that it will work with any JCache provider, even though we have explicitly made the Hazelcast JCache provider available on the classpath. This implementation should instead be based on the native interfaces of IMap.

Difficulty: Major
Potential mentors:
Kevin Ratnasekera, mail: djkevincr (at) apache.org
Project Devs, mail: dev (at) gora.apache.org

Add datastore for ArangoDB

Maybe we should consider extending our multi-model datastore support with ArangoDB. [1]

[1] https://www.arangodb.com/why-arangodb/multi-model/

Difficulty: Major
Potential mentors:
Kevin Ratnasekera, mail: djkevincr (at) apache.org
Project Devs, mail: dev (at) gora.apache.org

Implement RethinkDB datastore module

A technical comparison with MongoDB is available. [1]

[1] https://rethinkdb.com/docs/comparison-tables/

Difficulty: Major
Potential mentors:
Kevin Ratnasekera, mail: djkevincr (at) apache.org
Project Devs, mail: dev (at) gora.apache.org

Implement ScyllaDB Datastore module

Difficulty: Major
Potential mentors:
Madhawa Kasun Gunasekara, mail: madhawa (at) apache.org
Project Devs, mail: dev (at) gora.apache.org

Apache IoTDB

Apache IoTDB integration with more powerful aggregation index

IoTDB is a highly efficient time series database which supports high-speed query processing, including aggregation queries.

Currently, IoTDB pre-calculates aggregation info, also called summary info (sum, count, max_time, min_time, max_value, min_value), for each page and each Chunk. This info is helpful for aggregation operations and some query filters. For example, if the query filter is value > 10 and the max value of a page is 9, we can skip the page. For another example, if the query is select max(value) and the max values of 3 chunks are 5, 10, and 20, then max(value) is 20.

However, there are two drawbacks:

1. The summary info reduces the data that needs to be scanned to roughly 1/k (supposing each page has k data points). However, the time complexity is still O(N). If we store long historical data, e.g., 2 years of data at 500 kHz, then an aggregation operation may still be time-consuming. So, a tree-based index that reduces the time complexity from O(N) to O(log N) is a good choice. Some basic ideas have been published in [1], but that approach can only handle data with a fixed frequency, so improving it and implementing it in IoTDB is a good choice.

2. The summary info does not help evaluate a query like where value > 8 if the max value is 10. If we enrich the summary info, e.g., by storing a data histogram, we can use the histogram to estimate how many points will be returned.

This proposal is mainly for adding an index to speed up aggregation queries. Besides, making the summary info more useful would be a bonus.

Notice that the premise is that the insertion speed should not be slowed down too much!

You should know:
• IoTDB query process
• TsFile structure and organization
• Basic index knowledge
• Java 

difficulty: Major
mentors:
hxd@apache.org

Reference:

[1] https://www.sciencedirect.com/science/article/pii/S0306437918305489

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

A complete Apache IoTDB JDBC driver and integration with JDBC driver based tools (DBeaver and Apache Zeppelin)

Apache IoTDB is a database for time series data management that is written in Java. It provides a SQL-like query language and a JDBC driver for users. The current IoTDB JDBC driver implements some of the important interfaces (Statement, Connection, ResultSet, etc.), which works well for most users' requirements.

However, many tools support integrating with a database if it has a standard JDBC driver, e.g., DBeaver, Apache Zeppelin, Tableau, etc.


This proposal is for implementing a standard JDBC driver for IoTDB, and using the driver to integrate with DBeaver and Apache Zeppelin.


Because Apache Zeppelin supports customized interpreters, we can also implement an IoTDB interpreter for integration with Zeppelin.


You should know:

  • how JDBC works;
  • how to use the IoTDB session API;
  • the Zeppelin Interpreter interface.


Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB integration with MiNiFI/NiFi

IoTDB is a database for storing time series data.

MiNiFi is a data flow engine for transferring data from A to B, e.g., from PLC4X to IoTDB.

This proposal is for integrating IoTDB with MiNiFi/NiFi:

  • let MiNiFi/NiFi support writing data into IoTDB.



Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB website supports static documents and search engine

Apache IoTDB currently uses Vue to develop the website (iotdb.apache.org) and renders the Markdown documents from GitHub on the website using JS.

However, there are two drawbacks now:

  1. If we render documents from GitHub on the website using JS, the Google crawler will never index the content of the documents.
  2. When users read the documents on the website, they may not know where specific content is. For example, someone who wants to find the syntax of 'show timeseries' may not know whether it is in chapter 5-1 or 5-4. So, a search engine embedded in the website is a good choice.

You should learn:

  • Vue
  • Other website development technologies.

Mentors:

hxd@apache.org

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB Database Connection Pool and integration with some web framework

IoTDB is a time series database.

When using a database in an application, a database connection pool is very helpful for high performance and for saving resources.

Besides, when developing a website using Spring or another web framework, many developers no longer manage the database connection manually. Instead, developers just declare which database they will use and the web framework handles everything.

This proposal is for:

  • letting IoTDB support database connection pools like Apache Commons DBCP and C3P0;
  • integrating IoTDB with one web framework (e.g., Spring).


You should know:

  • IoTDB
  • At least one DB connection pool
  • Know Spring or some other web framework

mentors:

hxd@apache.org

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB's Peer Tracker for The Raft Protocol [distributed]

IoTDB is a highly efficient time-series database which supports high-speed query processing, including aggregation queries.

Currently, clustered IoTDB is under active development. It now supports leader election, log replication, cluster membership change, and log compaction. We are testing and optimizing these features these days.

However, we have not yet implemented log status tracking for peers, which means that when sending logs, the logs required by a peer may not be sent correctly right away, resulting in wasted bandwidth and possible errors.

So there are two improvements to the peer tracker that need to be done:

1. Implement a peer tracker to track each follower's log status. You can borrow from other projects or design your own, as long as it is correct.

2. Dynamically maintain the peer tracker in the current design and handle possible conflicts and inconsistencies; this requires a little understanding of IoTDB's raft log module.

This proposal is mainly for implementing and maintaining a peer tracker in clustered IoTDB.
It is necessary for you to understand that correctness is the most important thing.

You should know:
 - IoTDB cluster structure
 - IoTDB raft RPC module
 - IoTDB raft log module
 - Raft
 - Java

difficulty: Major

Mentor:

jt2594838@163.com, hxd@apache.org



Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB raft log persistence in the distributed version

IoTDB is a highly efficient time series database which supports high-speed query processing, including aggregation queries.

Currently, IoTDB supports a shared-nothing cluster that uses the Raft mechanism and Raft logs to communicate among all nodes. So Raft logs are very important for communication, consistency keeping, and fail-over.

However, the current logs are stored only in memory, which means Raft logs are lost when a node goes down and then recovers. Secondly, the Raft logs may overlap with the current WAL, which means we may be doing some unnecessary log writing work.

So there are two improvements to Raft logs that need to be done:

1. Store the Raft logs on durable media such as disk. You need to design a serialized form for the logs and then write them to disk.

2. Find a way to use Raft logs in the IoTDB recovery process. That means we write only Raft logs rather than both Raft logs and the WAL. This will avoid some unnecessary log writing work and improve insertion performance.

This proposal is mainly for improving Raft logs in clustered IoTDB.

Notice that the premise is that the Raft log writing process should not be slowed down too much. That means the serialized form should be efficient enough.

You should know:
• IoTDB cluster structure
• IoTDB WAL
• IoTDB insertion process
• Raft
• Java

difficulty: Major
mentors:
jt2594838@163.com

Difficulty: Major
Potential mentors:
Tian Jiang, mail: jt2594838 (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB Kubernetes Deployment Feature Design

Apache IoTDB is a time-series database, and provides many components for connecting with systems like Grafana. 

IoTDB also provides a tool for synchronizing the data from one IoTDB instance to another. 


Currently, IoTDB has a docker image for deploying an IoTDB instance, while lacking the following features:

  1. If the IoTDB instance goes down, we can automatically restart a new one. As IoTDB is a database and requires efficient disk IO, we'd like to use a stateful service to start IoTDB, i.e., writing data locally rather than on an NFS.
  2. Start up two IoTDB instances and sync data from one to the other. Only one accepts data insertions, and once the writable instance goes down, the remaining IoTDB instance becomes writable.
  3. Make IoTDB's configuration files easier to modify for a Docker/K8s based container.
  4. Build an IoTDB-Grafana Docker image. (We may have more middleware in the following three months, so it is better to build images for all middleware.)


This task is for running components along with IoTDB, and for supporting a double-active instance deployment using the file sync module. (Though IoTDB's cluster mode is in progress, the above deployment is still meaningful in some cases.)


You may accomplish this using Docker Compose or K8s. Some service discovery may be needed; for a better implementation, a K8s operator may be needed.


Skills:

  • Docker
  • K8S (K8S operator)
  • Java
  • Shell


As I am also not very familiar with K8s, I'd like to find someone who is interested in this task to serve as a co-mentor.


Mentors:

hxd@apache.org

Difficulty: Minor
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB integration with Prometheus

IoTDB is a highly efficient time series database.

Prometheus is a monitoring and alerting toolkit, which supports collecting data from other systems, servers, and IoT devices, saving the data into a DB, visualizing the data, and providing some query APIs.


Prometheus allows users to use their own database, rather than just the built-in Prometheus storage, for storing time series data.

This proposal is for integrating IoTDB with Prometheus.


You should know:

  • How to use Prometheus
  • How to use IoTDB
  • Java and Go language

difficulty: Major

mentors:

hxd@apache.org

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB trigger module for streaming computing

IoTDB is a time-series data management system, and the data usually arrives in a streaming way.

In the IoT area, when a data point comes in, a trigger can be called in the following scenarios:

  • (single data point calculation) The data point is an outlier, or the data value reaches a warning threshold. IoTDB needs to publish the data point to those who subscribed to the event.
  • (multiple time series data point calculation) A device sends several metrics to IoTDB, e.g., vehicle d1 sends its average speed and running time. Users may then want the mileage of the vehicle (speed x time). IoTDB needs to calculate the result and save it to another time series.
  • (time window calculation) A device reports its temperature every second. Though the temperature is not too high, if it keeps increasing for 5 seconds, IoTDB needs to report the event to those who subscribe to it.


As there are many streaming computing projects already, we can integrate one of them into IoTDB.

  • If IoTDB runs on the edge, we can integrate Apache StreamPipes or Apache Edgent.
  • If IoTDB runs on a server, the above also work, and Apache Flink is also a good choice.

The process is:

  • A user registers a trigger in IoTDB.
  • When a data point comes in, IoTDB saves it and checks whether there are triggers on it.
  • If so, IoTDB calls a streaming computing framework to do something.


You may need to know:

  • at least one streaming computing project;
  • a SQL parser or some other DSL parser tool.

You will have to modify the source code of the IoTDB server engine module.

Difficulty: A little hard

mentors:

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB with Heterogeneous Replica [distributed]

Apache IoTDB is a high-performance time-series database. Its cluster mode is in development.

As we know, IoTDB uses a columnar file format, called TsFile, which is similar to Parquet. In such a columnar file, the order of columns hugely impacts query performance. We call the order of columns in the file the physical layout of the file.


In the distributed version of IoTDB, the data is replicated multiple times for reliability, and a read operation can be routed to any one of the replicas so that the query load is spread across the nodes.

If a query runs slowly on one node because of an unsuitable physical layout, rather than because of the load on the node, routing the query to other nodes is of no use, because the physical layout of the data on disk is the same on all nodes.


The proposal is for:

accelerating queries by organizing different replicas into different layouts according to the query history.

Then we need to:

  • collect the query history and find out which queries are frequent;
  • find an algorithm to get the best physical layout for the queries.

It is quite predictable that this feature will improve the performance of IoTDB and make it unique among distributed systems.


You need to know:

  • Java
  • Quorum based replica control
  • Some stream algorithms

Mentor:

hxd@apache.org



Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache Nemo

Beam portability layer support for Apache Nemo

The Beam portability layer is being developed on the Spark and Flink runners. The portability layer enables programming in Java, Go, and Python, and allows Beam programs written in these different languages to run on Nemo. This would be great for attracting more users to Apache Nemo. Related link: https://beam.apache.org/roadmap/portability/

  • Watch https://www.youtube.com/watch?v=I2ZqWAbbjUk for an overview and to get an idea
  • Refer to files under /runners/portability package, check requirements for the PortableRunner.java class
  • Flink is the system that provides most functionalities regarding portability, we should refer to PortableExecutionTest class, as well as FlinkPortableClientEntryPoint, FlinkPortablePipelineTranslator (for both batch and streaming), FlinkPortableRunnerResult classes, under /runners/flink, which implement the portability layer.
  • The Spark runner also has points that we could refer to: PortableBatchMode
  • We will need to make our own Nemo PortablePipelineOptions, as well as ClientEntryPoint, PipelineTranslator, and PortableRunnerResult, just as the Flink and Spark runners do
Difficulty: Major
Potential mentors:
Won Wook Song, mail: wonook (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

Dynamic Task Sizing on Nemo

This is an umbrella issue to keep track of the issues related to the dynamic task sizing feature on Nemo.

Dynamic task sizing needs to consider a workload and try to decide on the optimal task size based on runtime metrics and characteristics. It should affect the parallelism and the partitioning, i.e., how many partitions intermediate data should be divided/shuffled into, and should effectively handle skew in the meantime.

Difficulty: Major
Potential mentors:
Won Wook Song, mail: wonook (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

Optimize Parallelism Of SourceVertex

While intermediate vertices use the partition concept to split data into tasks, source vertices use Readables instead. Extend the sampling logic to cover source vertices; this will take the effect of DTS to another level.

Difficulty: Major
Potential mentors:
Hwarim Hyun, mail: hwarim (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

Efficient Caching and Spilling on Nemo

In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs one.

  • Identify and persist frequently used data, and unpersist it when its usage has ended
  • Spill in-memory data to disk upon memory pressure
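
The sketch below only illustrates the eviction-plus-spill idea under assumed names; it is not Nemo's actual block store API. It is an access-ordered LRU map that writes evicted entries to local files.

            import java.io.IOException;
            import java.io.UncheckedIOException;
            import java.nio.file.Files;
            import java.nio.file.Path;
            import java.util.LinkedHashMap;
            import java.util.Map;

            /** LRU cache of serialized blocks; evicted entries are spilled to disk instead of being lost. */
            class SpillingCache extends LinkedHashMap<String, byte[]> {
                private final int maxEntriesInMemory;
                private final Path spillDir; // block ids are assumed to be valid file names

                SpillingCache(int maxEntriesInMemory, Path spillDir) {
                    super(16, 0.75f, true); // access order, i.e. LRU semantics
                    this.maxEntriesInMemory = maxEntriesInMemory;
                    this.spillDir = spillDir;
                }

                @Override
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    if (size() <= maxEntriesInMemory) {
                        return false;
                    }
                    try {
                        // Spill the least recently used entry to disk before evicting it from memory.
                        Files.write(spillDir.resolve(eldest.getKey()), eldest.getValue());
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    return true;
                }

                /** Reads from memory first, then falls back to the spill directory. */
                byte[] read(String blockId) throws IOException {
                    byte[] inMemory = get(blockId);
                    if (inMemory != null) {
                        return inMemory;
                    }
                    Path spilled = spillDir.resolve(blockId);
                    return Files.exists(spilled) ? Files.readAllBytes(spilled) : null;
                }
            }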
Difficulty: Major
Potential mentors:
Jeongyoon Eo, mail: jeongyoon (at) apache.org
Project Devs, mail: dev (at) nemo.apache.org

Beam

Implement an Azure blobstore filesystem for Python SDK

This is similar to BEAM-2572, but for Azure's blobstore.

Difficulty: Major
Potential mentors:
Pablo Estrada, mail: pabloem (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Add Daffodil IO for Apache Beam

From https://daffodil.apache.org/:

Daffodil is an open source implementation of the DFDL specification that uses these DFDL schemas to parse fixed format data into an infoset, which is most commonly represented as either XML or JSON. This allows the use of well-established XML or JSON technologies and libraries to consume, inspect, and manipulate fixed format data in existing solutions. Daffodil is also capable of the reverse by serializing or “unparsing” an XML or JSON infoset back to the original data format.

We should create a Beam IO that accepts a DFDL schema as an argument and can then produce and consume data in the specified format. I think it would be most natural for Beam users if this IO could produce Beam Rows, but an initial version that just operates with Infosets could be useful as well.

Difficulty: Major
Potential mentors:
Brian Hulette, mail: bhulette (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Kickstart the C# .net SDK for Beam

The idea of this GSoC project is to kickstart the creation of the Beam SDK for C# (.net). The goal is to create the minimal set of pieces required to allow a user to write and execute a WordCount type of pipeline in C# with Beam.

To do this we will need to implement the minimum set of abstractions of the SDK: ParDo (Beam’s Map with super powers) + GroupByKey, as well as a Harness capable of writing to the data channel, and some internal data representations (WindowedValue and others) to be able to run the pipeline using portable runners.

Don't worry if the Beam-specific details are not clear; familiarity with the Big Data WordCount concepts is a prerequisite, as well as probably reading some of the Beam introductory material [1-3]. Good knowledge of C# and its idioms, as well as familiarity with the recent .net ecosystem, are required for the student who wants to apply for this project.

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102


Difficulty: Minor
Potential mentors:
Ismaël Mejía, mail: iemejia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Implement Nexmark (benchmark suite) in Python and integrate it with Spark and Flink runners

Apache Beam [1] is a unified and portable programming model for data processing jobs (pipelines). The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

Nexmark [5] is a benchmark for streaming jobs, basically a set of jobs (queries) to test different use cases of the execution system. Beam implemented Nexmark for Java [6, 7] and it has been successfully used to improve the features of multiple Beam runners and discover performance regressions.

Thanks to the work on portability [8] we can now run Beam pipelines on top of open source systems like Apache Spark [9] and Apache Flink [10]. The goal of this issue/project is to implement the Nexmark queries in Python and configure them to run on our CI on top of open source systems like Apache Spark and Apache Flink, so that they help the project track and improve the evolution of the portable open source runners and our Python implementation as we do for Java.

Because of the time constraints of GSoC we will adjust the goals (sub-tasks) depending on progress.

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
[5] https://web.archive.org/web/20100620010601/http://datalab.cs.pdx.edu/niagaraST/NEXMark/
[6] https://beam.apache.org/documentation/sdks/java/testing/nexmark/
[7] https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark
[8] https://beam.apache.org/roadmap/portability/
[9] https://spark.apache.org/
[10] https://flink.apache.org/

Difficulty: Minor
Potential mentors:
Ismaël Mejía, mail: iemejia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

BeamSQL Pattern Recognition Functionality

The goal of this Jira is to support the following syntax in BeamSQL:

            SELECT T.aid, T.bid, T.cid
            FROM MyTable
            MATCH_RECOGNIZE (
            PARTITION BY userid
            ORDER BY proctime
            MEASURES
            A.id AS aid,
            B.id AS bid,
            C.id AS cid
            PATTERN (A B C)
            DEFINE
            A AS name = 'a',
            B AS name = 'b',
            C AS name = 'c'
            ) AS T
            

MATCH_RECOGNIZE is part of the SQL:2016 standard, and Calcite already supports it. A good reference for match_recognize is [1].

This will require touching core components of BeamSQL:
1. SQL parser to support the syntax above.
2. SQL core to implement physical relational operator.
3. Distributed algorithms to implement a list of functions in a distributed manner.

other references:
Calcite match_recognize syntax [2]

[1]: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/match_recognize.html
[2]: https://calcite.apache.org/docs/reference.html#syntax-1

Difficulty: Major
Potential mentors:
Rui Wang, mail: amaliujia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

BeamSQL aggregation analytics functionality

Mentor email: ruwang@google.com. Feel free to send emails for your questions.

Project Information
---------------------
BeamSQL has a long list of aggregation/aggregation analytics functionalities to support.

To begin with, you will need to support this syntax:

            analytic_function_name ( [ argument_list ] )
            OVER (
            [ PARTITION BY partition_expression_list ]
            [ ORDER BY expression [{ ASC | DESC }] [, ...] ]
            [ window_frame_clause ]
            )
            

As there is a long list of analytics functions, a good starting point is to support RANK() first.

This will require touching core components of BeamSQL:
1. SQL parser to support the syntax above.
2. SQL core to implement physical relational operator.
3. Distributed algorithms to implement a list of functions in a distributed manner.
4. Enable in ZetaSQL dialect.

To understand what SQL analytics functionality is, you could check this great explanation doc: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts.

To know about Beam's programming model, check: https://beam.apache.org/documentation/programming-guide/#overview

Difficulty: Major
Potential mentors:
Rui Wang, mail: amaliujia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

Camel

Expose OData4 based service as consumer endpoint

Right now, only a polling consumer is available for the olingo4 component. It would be better to have a real listening consumer for this.

The method may have a name like 'listen' to be able to create a listening consumer.

Difficulty: Major
Potential mentors:
Dmitry Volodin, mail: dmvolod (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Camel grpc component doesn't transfer the Message headers

Headers that are added to the Message in the Camel Exchange before making a call to the camel-grpc component are not received at the gRPC consumer. The expectation is that these headers would be added to the grpcStub before sending over the wire (as other components such as http4 do).

Our team has come up with a workaround for this but it is extremely cumbersome. We had to extend the GrpcProducer to introduce a custom GrpcExchangeForwarder that would copy header from exchange to the stub before invoking the sync/async method.

At the consumer side we had to extend the GrpcConsumer to use a custom ServerInterceptor to capture the grpc headers and custom MethodHandler to transfer the grpc headers to the Camel exchange headers.

Difficulty: Major
Potential mentors:
Vishal Vijayan, mail: vijayanv (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-snmp - Support for multiple security mechanisms in SNMP v3

Allow adding multiple users for SNMP v3, i.e. the SnmpTrapConsumer should support multiple combinations of authentication and privacy protocols with different passphrases. We cannot have a route per security mechanism.


Consider the below scenario.

I have multiple SNMP devices which have multiple authentication protocols and privacy protocols with different passphrases. Moreover, they can send any version of SNMP traps from v1 to v3. I must be able to configure those in a properties file or a DSL (i.e. the snmp version, the USM users etc).

Example:


            snmp.getUSM().addUser(
                new OctetString("MD5DES"),
                new UsmUser(new OctetString("MD5DES"),
                    AuthMD5.ID,
                    new OctetString("UserName"),
                    PrivDES.ID,
                    new OctetString("PasswordUser")));
            snmp.getUSM().addUser(
                new OctetString("MD5DES"),
                new UsmUser(new OctetString("MD5DES"),
                    null, null, null, null));

.. other users with different auth, priv mechanisms (i.e. different security mechanisms). I must be able to receive traps from all of them.

Difficulty: Minor
Potential mentors:
Gowtham Gutha, mail: gowthamgutha (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

java-dsl - Add support for method references bean::methodName

Hi

This is not related only to spring integration.
I would like to be able to use a Spring service-annotated class or bean directly from a route, but without using the method name as a string, i.e. .bean(instance, "<methodName>"), and instead use a method reference: .bean(instance::method)

But why?
1. Not being able to navigate quickly to (open) that method from the IDE; intermediary steps are needed to reach that method.
2. Camel uses reflection internally to call that method.
3. Not being able to rename the method without breaking the route.
4. Not being able to see quickly (Alt+F7) who calls a method in the IDE.
5. Using strings to reference a method when we have method references seems wrong.

As a workaround I had to add a helper class that accepts a method reference and internally translates it to a method call.

In case it helps explain the idea, I am attaching the helper Bean.java class (you can use it freely or improve on it).

You can use the class in any route like this:

from (X)
.bean(call(cancelSubscriptionService::buildSalesforceCase))
.to(Y)
.routeId(Z);

As you see I am forced to use the intermediary helper 'call' in order to translate to an Expression.
I would like to not have to use my helper and have the support built directly into Camel if possible. Let me know if there is a better solution to my problem.
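
For reference, a helper along these lines can adapt a method reference into a Camel Expression. This is only a sketch (the names are hypothetical and it assumes Camel 3's org.apache.camel.support.ExpressionAdapter), not the attached Bean.java:

            import java.util.function.Function;

            import org.apache.camel.Exchange;
            import org.apache.camel.Expression;
            import org.apache.camel.support.ExpressionAdapter;

            /** Hypothetical helper that turns a body-transforming method reference into a Camel Expression. */
            final class MethodRefs {

                private MethodRefs() {
                }

                static <T, R> Expression call(Class<T> bodyType, Function<T, R> method) {
                    return new ExpressionAdapter() {
                        @Override
                        public Object evaluate(Exchange exchange) {
                            // Convert the message body to the expected type and apply the referenced method.
                            return method.apply(exchange.getIn().getBody(bodyType));
                        }
                    };
                }
            }

With such a helper a route could use, for example, .setBody(MethodRefs.call(Case.class, cancelSubscriptionService::buildSalesforceCase)) (Case being a hypothetical body type); built-in support in the Java DSL would remove the need for any wrapper.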

Thanks

Difficulty: Major
Potential mentors:
Cristian Donoiu, mail: doncristiano (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Add tool to generate swagger doc at build time

We do not have at this moment a tool that can generate the swagger doc at build time. However I think it would be possible to develop such a tool. We have existing tooling that parses the Java or XML source code (Camel routes), which we use for validating endpoints, doing route-coverage reports etc.
 
https://github.com/apache/camel/blob/master/tooling/maven/camel-maven-plugin/src/main/docs/camel-maven-plugin.adoc
 
We could then make that tool parse the rest-dsl, build up that model behind the scenes, and feed it into the swagger-java library for it to spit out the generated swagger doc.
 
We could make it as a goal on the existing camel-maven-plugin, or build a new maven plugin: camel-maven-swagger or something. Then people could use it during build time to generate the swagger doc etc. 
 
We should maybe also allow overriding/configuring things from the tooling, so you can set/mask the hostname, set descriptions, or provide other details that may not all be present in the rest-dsl.

Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Ability to load an SSLContextParameter with a Keystore containing multiple keys (aliases)

Hello,
I wish I could initialize a single SSLContextParameters at Camel startup, containing my truststore.jks (> 1 alias) and my keystore.jks (> 1 alias), in order to refer to it in routes (FTPS, HTTPS) without having to redefine a new SSLContextParameters for each endpoint.

<camel:sslContextParameters id="sslIContextParameters">
<camel:trustManagers>
<camel:keyStore password="${truststore.jks.file.password}"
resource="${truststore.jks.file.location}" />
</camel:trustManagers>
<camel:keyManagers >
<camel:keyStore password="${keystore.jks.file.password}"
resource="${keystore.jks.file.location}" />
</camel:keyManagers>
</camel:sslContextParameters>

When my Keystore contains more than 1 alias, I have the following error when creating the Route at startup : 

Caused by: org.apache.camel.ResolveEndpointFailedException: Failed to resolve endpoint: https4://<host>:<port>/<address>?authPassword=RAW(password)&authUsername=login&authenticationPreemptive=true&bridgeEndpoint=true&sslContextParameters=sslContextParameters&throwExceptionOnFailure=true due to: Cannot recover key

due to

Caused by: java.security.UnrecoverableKeyException: Cannot recover key


When my keystore contains only one key, it works very well.

<camel:sslContextParameters id="sslIContextParameters">
<camel:trustManagers>
<camel:keyStore password="${truststore.jks.file.password}"
resource="${truststore.jks.file.location}" />
</camel:trustManagers>
<camel:keyManagers keyPassword="keyPassword">
<camel:keyStore password="${keystore.jks.file.password}"
resource="${keystore.jks.file.location}" />
</camel:keyManagers>
</camel:sslContextParameters>


So I would like to be able to use my SSLContextParameters for different endpoints by specifying (if necessary) the alias of the keystore key needed (by specifying the alias and/or the password of the key).


Objective in my project:

  • 1 TrustStore.jks
  • 1 Keystore.jks
  • 1 unique SSLContextParameters
  • > 200 Camel routes FTPS/HTTPS (SSL one-way or two-way)


Thanks a lot



Difficulty: Major
Potential mentors:
Florian B., mail: Boosy (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Introduce a SPI to automatic bind data format to transports

Some data formats, such as the future CloudEvents one (https://issues.apache.org/jira/browse/CAMEL-13335), have specifications that describe how to bind them to specific transports (https://github.com/cloudevents/spec), so we should introduce an SPI to make this binding automatic, so that in a route like:

            from("undertow://http://0.0.0.0:8080")
                .unmarshal().cloudEvents()
            .to("kafka:my-topic");
            

the exchange gets automatically translated to a Kafka message according to the CloudEvent binding specs for Kafka.
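
A minimal sketch of what such an SPI could look like follows; every name here is hypothetical and not existing Camel API:

            import org.apache.camel.Exchange;

            /**
             * Hypothetical SPI: maps a data format onto the wire representation of a given transport,
             * e.g. CloudEvents attributes onto Kafka headers, following the binding specs.
             */
            interface TransportBinding {

                /** Whether this binding can map the given data format onto the given component scheme. */
                boolean supports(String dataFormatName, String componentScheme);

                /** Rewrites headers/body of the exchange to follow the binding spec of the target transport. */
                void apply(Exchange exchange, String componentScheme);
            }

Implementations could then be discovered on the classpath so that producers such as the kafka endpoint apply the right binding automatically.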

Difficulty: Minor
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Create a camel component for etcd v3

Difficulty: Minor
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Camel website

This is an issue to track the work on the new Camel website.

If you wish to contribute to building the new Camel website please look in the website component issues labelled with help-wanted.

Difficulty: Major
Potential mentors:
Zoran Regvart, mail: zregvart (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-minio - Component to store/load files from blob store

MinIO is an S3-like blob store, so users have more freedom than being locked into AWS.

We can create a camel-minio component for it
https://github.com/minio/minio-java
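
To get a feel for what the component would wrap, here is a minimal upload using minio-java's builder-style client; this assumes a recent minio-java release (7.x/8.x), and the endpoint, bucket and credentials are placeholders:

            import java.io.ByteArrayInputStream;
            import java.nio.charset.StandardCharsets;

            import io.minio.MinioClient;
            import io.minio.PutObjectArgs;

            public class MinioUploadSketch {
                public static void main(String[] args) throws Exception {
                    MinioClient client = MinioClient.builder()
                            .endpoint("https://play.min.io")
                            .credentials("accessKey", "secretKey")
                            .build();

                    byte[] body = "hello from camel-minio".getBytes(StandardCharsets.UTF_8);
                    // A camel-minio producer would do roughly this with the exchange body.
                    client.putObject(PutObjectArgs.builder()
                            .bucket("my-bucket")
                            .object("hello.txt")
                            .stream(new ByteArrayInputStream(body), body.length, -1)
                            .contentType("text/plain")
                            .build());
                }
            }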

Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel component options - Favour annotation based options

We should favour making options on component classes annotation based, e.g. with @Metadata, so that only the real options are marked up, since other delegates and getters/setters may otherwise get mixed in.

Then in the future we will drop support and only require marked up options, just like endpoints where you must use @UriParam etc.

At first we can make our tool log a WARN and then we can see how many of our own components suffer from this.
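
For illustration, a component option marked up with @Metadata could look like the sketch below (a simplified, hypothetical component; endpoint creation is stubbed out):

            import java.util.Map;

            import org.apache.camel.Endpoint;
            import org.apache.camel.spi.Metadata;
            import org.apache.camel.support.DefaultComponent;

            public class MyComponent extends DefaultComponent {

                // Marked up explicitly, so tooling knows this getter/setter pair really is a component option.
                @Metadata(label = "producer", description = "Region to use if none is configured on the endpoint")
                private String region;

                public String getRegion() {
                    return region;
                }

                public void setRegion(String region) {
                    this.region = region;
                }

                @Override
                protected Endpoint createEndpoint(String uri, String remaining, Map<String, Object> parameters) throws Exception {
                    throw new UnsupportedOperationException("not relevant for this sketch");
                }
            }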

Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-restdsl-swagger-plugin - create camel routes for generated rest DSL

camel-restdsl-swagger-plugin can generate CamelRoutes.java from a Swagger / OpenAPI spec, which includes the REST DSL with to("direct:restN") stubs. Would be nice if it also autogenerated the equivalent from("direct:restN").log() placeholders to help jump start coding.
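
The auto-generated placeholders might look roughly like the sketch below (illustrative only; the endpoint names simply mirror the direct:restN stubs already produced by the plugin):

            import org.apache.camel.builder.RouteBuilder;

            public class GeneratedRestStubs extends RouteBuilder {
                @Override
                public void configure() {
                    // One placeholder per to("direct:restN") emitted in the generated CamelRoutes.java.
                    from("direct:rest1")
                            .log("rest1 called with body: ${body}");

                    from("direct:rest2")
                            .log("rest2 called with body: ${body}");
                }
            }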

Difficulty: Major
Potential mentors:
Scott Cranton, mail: scranton (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Add call options to the camel-grpc component

Add advanced call options related to the one operation and not overriding channel option:

  • deadline
  • compression
  • etc.
Difficulty: Major
Potential mentors:
Dmitry Volodin, mail: dmvolod (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

camel-microprofile-opentracing

A camel module for this spec
https://github.com/eclipse/microprofile-opentracing

It is likely a matter of using the existing camel-opentracing component, implementing the spec API and using the SmallRye implementation
https://github.com/smallrye/smallrye-opentracing

Difficulty: Major
Potential mentors:
Claus Ibsen, mail: davsclaus (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Support for OpenTelemetry

OpenTelemetry is becoming more and more relevant, and it would be nice to support it in Camel.

Difficulty: Major
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Create a component for Kafka-Stream


Difficulty: Minor
Potential mentors:
Andrea Cosentino, mail: acosentino (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

Upgrade to JUnit 5

See http://junit.org/junit5/

Note: it provides a junit-vintage module so we should be able to migrate things easily

Most users should now be able to write JUnit 5 tests using the modules created in CAMEL-13342.
Concerning the migration of Camel's own tests to JUnit 5, the last blocker is that migrating flaky tests to JUnit 5 cannot be handled until Maven Surefire 3 has been released or until the open discussions in the JUnit team have converged.
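
For most tests the migration is mechanical; a minimal example of a migrated test (the JUnit 4 equivalents are noted in the comments):

            import org.junit.jupiter.api.Test;                           // was: org.junit.Test
            import static org.junit.jupiter.api.Assertions.assertEquals; // was: org.junit.Assert.assertEquals

            class SimpleMigratedTest {

                @Test // test classes and methods no longer need to be public in JUnit 5
                void addsNumbers() {
                    assertEquals(4, 2 + 2);
                }
            }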

Difficulty: Major
Potential mentors:
Luca Burgazzoli, mail: lb (at) apache.org
Project Devs, mail: dev (at) camel.apache.org

RocketMQ

RocketMQ Connect Hive

Content

The Hive sink connector allows you to export data from Apache RocketMQ topics to HDFS files in a variety of formats and integrates with Hive to make data immediately available for querying with HiveQL. The connector periodically polls data from RocketMQ and writes them to HDFS.

The data from each RocketMQ topic is partitioned by the provided partitioner and divided into chunks. Each chunk of data is represented as an HDFS file with topic, queueName, start and end offsets of this data chunk in the filename.

So, in this project, you need to implement a Hive sink connector based on OpenMessaging connect API, and run it on RocketMQ connect runtime.

You should learn before applying for this topic
Hive/Apache RocketMQ/Apache RocketMQ Connect/ OpenMessaging Connect API

Mentor

chenguangsheng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect Hbase

Content

The Hbase sink connector allows moving data from Apache RocketMQ to Hbase. It writes data from a topic in RocketMQ to a table in the specified HBase instance. Auto-creation of tables and the auto-creation of column families are also supported.

So, in this project, you need to implement an HBase sink connector based on OpenMessaging connect API, which will execute on the RocketMQ connect runtime.

You should learn before applying for this topic
Hbase/Apache RocketMQ/Apache RocketMQ Connect/ OpenMessaging Connect API

Mentor

chenguangsheng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect Cassandra

Content

The Cassandra sink connector allows writing data to Apache Cassandra. In this project, you need to implement a Cassandra sink connector based on OpenMessaging connect API, and run it on RocketMQ connect runtime.

You should learn before applying for this topic

Cassandra / Apache RocketMQ (https://rocketmq.apache.org/) / Apache RocketMQ Connect (https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect) / OpenMessaging Connect API

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect InfluxDB

Content

The InfluxDB sink connector allows moving data from Apache RocketMQ to InfluxDB: it writes data from a topic in Apache RocketMQ to InfluxDB. The InfluxDB source connector is used to export data from an InfluxDB server to RocketMQ.

In this project, you need to implement an InfluxDB sink connector (the source connector is optional) based on OpenMessaging connect API.

You should learn before applying for this topic

InfluxDB / Apache RocketMQ (https://rocketmq.apache.org/) / Apache RocketMQ Connect (https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect) / OpenMessaging Connect API

Mentor

duhengforever@apache.org, wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

The Operator for RocketMQ Exporter

The exporter exposes an endpoint for monitoring data collection to the Prometheus server in the form of an HTTP service; the Prometheus server obtains the monitoring data to be collected by accessing the endpoint provided by the exporter. RocketMQ exporter is such an exporter: it first collects data from the RocketMQ cluster, then normalizes the collected data to meet the requirements of the Prometheus system with the help of the third-party client library provided by Prometheus, and Prometheus regularly pulls data from the exporter. This topic needs to implement an operator for the RocketMQ exporter to facilitate deploying the exporter on the Kubernetes platform.

You should learn before applying for this topic

RocketMQ-Exporter Repo
RocketMQ-Exporter Overview
Kubernetes Operator
RocketMQ-Operator

Mentor

wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect IoTDB

Content

The IoTDB sink connector allows moving data from Apache RocketMQ to IoTDB. It writes data from a topic in Apache RocketMQ to IoTDB.

IoTDB (Internet of Things Database) is a data management system for time series data, which provides users with services such as data collection, storage and analysis. Thanks to its lightweight structure, high performance and usable features, together with its seamless integration with the Hadoop and Spark ecosystems, IoTDB meets the requirements of massive dataset storage, high-throughput data input and complex data analysis in the industrial IoT field.

In this project, there may be update operations on historical data, so it is necessary to ensure the sequential transmission and consumption of data via RocketMQ. If no update operations are used, there is no need to guarantee the order of the data; IoTDB can process data that arrives out of order.

So, in this project, you need to implement an IoTDB sink connector based on OpenMessaging connect API, and run it on RocketMQ connect runtime.

You should learn before applying for this topic

IoTDB / Apache RocketMQ (https://rocketmq.apache.org/) / Apache RocketMQ Connect (https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect) / OpenMessaging Connect API

Mentor

hxd@apache.org, duhengforever@apache.org, wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Schema Registry

Content

In order to help RocketMQ improve its event management capabilities, better decouple producers and receivers, and keep events forward compatible, we need a service for event metadata management, called a schema registry.

Schema registry will provide a GraphQL interface for developers to define standard schemas for their events, share them across the organization and safely evolve them in a way that is backward compatible and future proof.

You should learn before applying for this topic

Apache RocketMQ/Apache RocketMQ SDK/

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Channel for Knative

Context

Knative is a Kubernetes-based platform for building, deploying and managing modern serverless applications. Knative provides a set of middleware components that are essential to building modern, source-centric, and container-based applications that can run anywhere: on-premises, in the cloud, or even in a third-party data centre. Knative consists of the Serving and Eventing components. Eventing is a system that is designed to address a common need for cloud-native development and provides composable primitives to enable late-binding event sources and event consumers. Eventing also defines an event forwarding and persistence layer, called a Channel. Each channel is a separate Kubernetes Custom Resource. This topic requires you to implement a RocketMQChannel based on Apache RocketMQ.

You should learn before applying for this topic

How Knative works
RocketMQSource for Knative
Apache RocketMQ Operator

Mentor

wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Ingestion for Druid

Context

Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets. In this topic, you should develop a RocketMQ indexing service that enables the configuration of supervisors on the Overlord, which facilitate ingestion from RocketMQ by managing the creation and lifetime of RocketMQ indexing tasks. These indexing tasks read events using RocketMQ's own partition and offset mechanism. The supervisor oversees the state of the indexing tasks to coordinate handoffs, manage failures, and ensure that the scalability and replication requirements are maintained.

You should learn before applying for this topic

Apache Druid Data Ingestion

Mentor

vongosling@apache.org, duhengforever@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ CLI Admin Tool Developed by Golang

Apache RocketMQ provides a CLI admin tool developed in Java for querying, managing and diagnosing various problems. At the same time, it also provides a set of API interfaces which can be called by Java application programs to create and delete topics, query messages and perform other functions. This topic requires implementing the CLI management tool and a set of API interfaces in the Go language, through which Go applications can create and query topics and perform other operations.

You should learn before applying for this topic

Apache RocketMQ
Apache RocketMQ Go Client

Mentor

wlliqipeng@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Connect Elasticsearch

Content

The Elasticsearch sink connector allows moving data from Apache RocketMQ to Elasticsearch 6.x, and 7.x. It writes data from a topic in Apache RocketMQ to an index in Elasticsearch and all data for a topic have the same type.

Elasticsearch is often used for text queries, analytics and as a key-value store (use cases). The connector covers both the analytics and key-value store use cases.

For the analytics use case, each message in RocketMQ is treated as an event, and the connector uses topic + message queue + offset as a unique identifier for events, which are then converted to unique documents in Elasticsearch. For the key-value store use case, it supports using keys from RocketMQ messages as document ids in Elasticsearch and provides configurations ensuring that updates to a key are written to Elasticsearch in order.

So, in this project, you need to implement a sink connector based on OpenMessaging connect API, which will be executed on the RocketMQ connect runtime.
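
As a small illustration of the analytics case described above (this is not the connector itself), the unique document id can be built from topic + message queue + offset and written with the Elasticsearch high-level REST client, assuming Elasticsearch 7.x and a local cluster:

            import java.util.Map;

            import org.apache.http.HttpHost;
            import org.elasticsearch.action.index.IndexRequest;
            import org.elasticsearch.client.RequestOptions;
            import org.elasticsearch.client.RestClient;
            import org.elasticsearch.client.RestHighLevelClient;

            public class EsSinkSketch {
                public static void main(String[] args) throws Exception {
                    try (RestHighLevelClient client = new RestHighLevelClient(
                            RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

                        // topic + message queue id + offset uniquely identify the RocketMQ event.
                        String topic = "orders";
                        int queueId = 3;
                        long offset = 42L;
                        String docId = topic + "+" + queueId + "+" + offset;

                        IndexRequest request = new IndexRequest(topic) // one index per topic
                                .id(docId)
                                .source(Map.of("payload", "...", "queueId", queueId, "offset", offset));

                        client.index(request, RequestOptions.DEFAULT);
                    }
                }
            }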

You should learn before applying for this topic

Elasticsearch / Apache RocketMQ (https://rocketmq.apache.org/) / Apache RocketMQ Connect (https://github.com/apache/rocketmq-externals/tree/master/rocketmq-connect) / OpenMessaging Connect API

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

CloudEvents support for RocketMQ

Context

Events are everywhere. However, event producers tend to describe events differently.

The lack of a common way of describing events means developers must constantly re-learn how to consume events. This also limits the potential for libraries, tooling and infrastructure to aid the delivery of event data across environments, like SDKs, event routers or tracing systems. The portability and productivity we can achieve from event data is hindered overall.

CloudEvents is a specification for describing event data in common formats to provide interoperability across services, platforms and systems.
RocketMQ, as an event streaming platform, also hopes to improve the interoperability of different event platforms by being compatible with the CloudEvents standard and supporting the CloudEvents SDK. In this topic, you need to improve the binding spec and implement the RocketMQ CloudEvents SDK (Java, Golang or others).
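
A rough sketch of one direction of such a binding (a binary-style content mode) is shown below, assuming CloudEvents Java SDK 2.x and the RocketMQ client's Message class; the ce_-prefixed property names are an assumption, since defining the actual binding spec is part of this topic:

            import java.net.URI;
            import java.nio.charset.StandardCharsets;

            import io.cloudevents.CloudEvent;
            import io.cloudevents.core.builder.CloudEventBuilder;
            import org.apache.rocketmq.common.message.Message;

            public class CloudEventToRocketMqSketch {

                static Message toRocketMqMessage(String topic, CloudEvent event) {
                    byte[] body = event.getData() == null ? new byte[0] : event.getData().toBytes();
                    Message message = new Message(topic, body);
                    // Carry the CloudEvents context attributes as user properties (assumed ce_ prefix).
                    message.putUserProperty("ce_id", event.getId());
                    message.putUserProperty("ce_type", event.getType());
                    message.putUserProperty("ce_source", event.getSource().toString());
                    if (event.getDataContentType() != null) {
                        message.putUserProperty("ce_datacontenttype", event.getDataContentType());
                    }
                    return message;
                }

                public static void main(String[] args) {
                    CloudEvent event = CloudEventBuilder.v1()
                            .withId("1234")
                            .withType("com.example.order.created")
                            .withSource(URI.create("/orders"))
                            .withDataContentType("application/json")
                            .withData("{\"orderId\":1}".getBytes(StandardCharsets.UTF_8))
                            .build();

                    Message message = toRocketMqMessage("orders", event);
                    System.out.println(message.getUserProperty("ce_id"));
                }
            }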

You should learn before applying for this topic

Apache RocketMQ/Apache RocketMQ SDK/CloudEvents

Mentor

duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Connect Hudi

Context

Hudi can ingest and manage the storage of large analytical datasets over DFS (HDFS or cloud stores). It can act as either a source or sink for a stream processing platform such as Apache RocketMQ, and it can also be used as a state store inside a processing DAG (similar to how RocksDB is used by Flink). This is an item on the roadmap of Apache RocketMQ. In this project, you should implement a fully functional Hudi source and sink based on the RocketMQ connect framework, which is the most important implementation of OpenConnect.

You should learn before applying for this topic

Apache RocketMQ Connect Framework
Apache Hudi

Mentor

vongosling@apache.org, duhengforever@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Connect Flink

Context

There are many ways that Apache Flink and Apache RocketMQ can integrate to provide elastic data processing at a large scale. RocketMQ can be used as a streaming source and streaming sink in Flink DataStream applications, which is the main implementation and the most popular usage in the RocketMQ community. Developers can ingest data from RocketMQ into a Flink job that makes computations and processes real-time data, and then send the data back to a RocketMQ topic as a streaming sink. More details can be found at https://github.com/apache/rocketmq-externals/tree/master/rocketmq-flink.

With more and more DW or OLAP engineers using RocketMQ for their data processing work, another potential integration need arose: developers could take advantage of RocketMQ as both a streaming source and a streaming table sink for Flink SQL or Table API queries. Also, Flink 1.9.0 makes the Table API a first-class citizen. It's time to support SQL in RocketMQ. This is the topic for Apache RocketMQ Connect Flink.

You should learn before applying for this topic

Apache RocketMQ Flink Connector
Apache Flink Table API

Extension

For students with expertise in the streaming field, you could go further and implement an exactly-once streaming source and an at-least-once (or exactly-once) streaming sink, as described in issue #500.

Mentor

nicholasjiang@apache.org, duhengforever@apache.org, vongosling@apache.org

Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

RocketMQ Source Connect Cassandra


The Cassandra source connector allows reading data from Apache Cassandra and writing data to Apache RocketMQ. In this project, you need to implement a Cassandra source connector based on OpenMessaging connect API, and run it on RocketMQ connect runtime.

You should learn before applying for this topic

Cassandra / Apache RocketMQ (https://rocketmq.apache.org/) / OpenMessaging Connect API (https://github.com/openmessaging/openmessaging-connect)

Difficulty: Major
Potential mentors:
Ding Lei, mail: dinglei (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Apache RocketMQ Scaler for KEDA

Context

KEDA allows for fine-grained autoscaling (including to/from zero) for event-driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition. KEDA has a number of “scalers” that can both detect if a deployment should be activated or deactivated, and feed custom metrics for a specific event source. In this topic, you need to implement the RocketMQ scalers.

You should learn before applying for this topic

Helm/Apache RocketMQ Operator/Apache RocketMQ Docker Image
Apache RocketMQ multi-replica mechanism(based on DLedger)
How KEDA works

Mentor

wlliqipeng@apache.org, vongosling@apache.org


Difficulty: Major
Potential mentors:
duheng, mail: duheng (at) apache.org
Project Devs, mail: dev (at) rocketmq.apache.org

Fineract Cloud Native

Static Analysis and Vulnerability Scanning of Apache Fineract CN

Overview & Objectives
As our product is a core banking platform and our clients are financial institutions, we strive hard to make our code base as secure as possible. However, due to ever-increasing security threats and vulnerabilities, it is the need of the hour that we analyze our code base in depth for security vulnerabilities. During the pull request merge process, we have a process in place wherein we do peer code review, QA and integration tests. This practice has been very effective and our community is already reaping the benefits of such a strong code review process. However, we should also test our code against the standard vulnerabilities identified by reputed organisations like MITRE to gain more confidence. This has become a critical part of independent and partner-led deployments.
Description
We can make use of open source tools like Jlint, FindBugs and SonarQube, or frameworks like the Tool Output Integration Framework (TOIF), used by companies dedicated to producing military-grade secure systems. As our environments become more containerized, we can also utilize tools like Anchore, Snyk.io, and Docker Bench for Security.
It would be worthwhile if we could dedicate one GSoC project to this analysis. The student would be responsible for analysing the findings, generating reports, identifying whether a finding is really a bug, and then submitting a fix after consultation with the community. Of course, the student needs to demonstrate some basic understanding of security vulnerabilities (like buffer overflows etc.) and should have some academic-level experience working with static analysis tools.
 
Helpful Skills
Java (Spring/JPA/Jersey), SQL , JavaScript , Git, Apache POI
Impact
Improved security, keeping the integrity and privacy of the underbanked's financial data intact.
Other Resources
Static Analysis of Apache Fineract Project - A GSoC project idea
https://mifosforge.jira.com/wiki/spaces/projects/pages/183063580/Static+Analysis+of+Apache+Fineract+Project-+A+GSOC+project+idea
Difficulty: Major
Potential mentors:
Sanyam Goel, mail: sanyam96 (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

Fineract CN Mobile 4.0

Overview & Objectives
Just as we have a mobile field operations app on Apache Fineract 1.0, we have recently built, on top of the brand new Apache Fineract CN micro-services architecture, an initial version of a mobile field operations app with an MVP architecture and material design. Given the flexibility of the new architecture and its ability to support different methodologies - MFIs, credit unions, cooperatives, savings groups, agent banking, etc. - this mobile app will have different flavors, workflows and functionalities.
Description
In 2019, our Google Summer of Code intern worked on additional functionality in the Fineract CN mobile app. In 2020, the student will work on the following tasks:
  • Add support for creation of Centers
  • Extend Kotlin support in the app and continue converting the Retrofit models to Kotlin.
  • Offline mode via Couchbase support
  • Integrate with Payment Hub to enable disbursement via Mobile Money API
  • Add GIS features like location tracking, dropping of pin into the app
  • Add Task management features into the app. 
  • Enable collection of data in the field via the app.
  • Build and design interface for bulk collections 
Helpful Skills
Android development, SQL, Java, Javascript, Git, Spring, OpenJPA, Rest,
Impact
Allows staff to go directly into the field to connect to the client. Reduces cost of operations by enabling organizations to go paperless and be more efficient.
Other Resources
https://github.com/apache/fineract-cn-mobile
Difficulty: Major
Potential mentors:
Sanyam Goel, mail: sanyam96 (at) apache.org
Project Devs, mail: dev (at) fineract.apache.org

OpenWebBeans

Implement lightweight CDI-centric HTTP server + allow build-time CDI proxy generation

Apache OpenWebBeans (OWB) is an IoC container implementing the CDI (Contexts and Dependency Injection for Java) specification.



With the rise of Kubernetes and cloud adoption more generally, it becomes more and more important to have fast, light and reliable servers.
That ecosystem is mainly composed of MicroProfile servers.
However, their stack is quite heavy for most applications, and the OpenWebBeans-based MicroProfile servers are not CDI-centric (Meecrowave and TomEE are Tomcat-centric).

This is why the need arises for a light HTTP server (likely Netty based) that is embeddable in a CDI context (as a bean).
It will be close to a light embedded servlet container but likely more reactive in the way the server will need to scale.
It must handle fixed size payload (with Content-Length header) but also chunking.
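
Purely as a starting point for discussion, the API could be shaped along the lines of the sketch below; every name in it is hypothetical:

            import java.nio.ByteBuffer;
            import java.util.Map;
            import java.util.concurrent.CompletionStage;
            import java.util.concurrent.Flow;

            /** Hypothetical request view; chunked bodies are exposed reactively instead of as a blocking stream. */
            interface HttpRequest {
                String method();
                String path();
                Map<String, String> headers();
                Flow.Publisher<ByteBuffer> body();
            }

            /** Hypothetical response; a fixed-size payload implies a Content-Length header, otherwise chunking is used. */
            interface HttpResponse {
                int status();
                Map<String, String> headers();
                Flow.Publisher<ByteBuffer> body();
            }

            /** A CDI bean implementing this would be picked up by the embedded server. */
            interface HttpHandler {
                CompletionStage<HttpResponse> handle(HttpRequest request);
            }

            /** Filter-like interception, kept reactive so it can compose with the handler chain. */
            interface HttpFilter {
                CompletionStage<HttpResponse> filter(HttpRequest request, HttpHandler next);
            }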

This task will require:

  1. define a lightweight HTTP API
    • start with the most essential features
    • at least supporting filter-like interception, possibly interceptor based, but in a reactive fashion (CompletionStage)
    • optional, if there is enough time left: add features like file upload support
  2. implement the API (marry our API / CDI / OWB / Netty)
  3. make it configurable
    • via code / builder pattern
    • optional, if there is enough time left: via MicroProfile Config



Once this light server is ready, the next step for a Java application to embrace the cloud is to make it native runnable.
Today OWB generates the class proxies, which are required per CDI specs to enable features like Interception and Decoration, lazy in runtime-mode.
A native image can be generated via the "native-image" cmd tool from GraalVM, where you can include the classpath. This classpath must contain the generated class proxies, as the generated native image can't generate bytecode anymore.
It's not a trivial task to enable OWB to create proxies in buildtime.

This task will require:

  1. change the "dynamic" classname generation to "static", otherwise we can't rely on the classname when looking up the proxy class
  2. implement a proxy SPI in OWB, to enable to load pre-generated proxies instead of generate them in runtime
  3. implement a BuildTimeProxyGenerator class
    • it should accept a list of beans (bean class + interceptor classes + decorator classes)
    • optional, if there is enough time left: later we can add a more complex solution which also scans the classpath for beans

In scope of this project, it's enough to manually call the BuildTimeProxyGenerator via a Runnable (with a companion main(String[])) and add the generated proxies in the classpath of the "native-image" cmd.


 
You should know:
• Java
• HTTP


Difficulty: Major

mentors: tandraschko@apache.org, rmannibucau@apache.org
Potential mentors:
Project Devs, mail: dev (at) openwebbeans.apache.org

Difficulty: Major
Potential mentors:
Thomas Andraschko, mail: tandraschko (at) apache.org
Project Devs, mail: dev (at) openwebbeans.apache.org