
Apache ShardingSphere: Proofread the DML SQL definitions for ShardingSphere Parser

Apache ShardingSphere

Apache ShardingSphere is a distributed database middleware ecosystem that currently consists of two independent products: ShardingSphere-JDBC and ShardingSphere-Proxy. Both provide data sharding, distributed transactions, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The ShardingSphere parser engine helps users parse a SQL statement to get the AST (Abstract Syntax Tree) and visit this tree to get a SQLStatement (a Java object). At present, this parser engine can handle SQL for `MySQL`, `PostgreSQL`, `SQLServer` and `Oracle`, which means we have to understand the different database dialects.
More details: https://shardingsphere.apache.org/document/current/en/features/sharding/principle/parse/

Task

This issue is to proofread the DML (SELECT/UPDATE/DELETE/INSERT) SQL definitions for Oracle. We have basic Oracle SQL syntax definitions, but they do not keep in line with the Oracle documentation, so we need you to find the vague SQL grammar definitions and correct them by referring to the Oracle docs.

Note that when you review these DML (SELECT/UPDATE/DELETE/INSERT) SQLs, you will find that their definitions involve some basic elements of Oracle SQL. These elements are, of course, included in this task as well.
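
A typical proofreading fix means tightening a rule against the Oracle grammar. For example, Oracle's DELETE statement allows both the FROM keyword and the table alias to be optional. A hypothetical, simplified fragment of what such a corrected rule could look like (rule names here are illustrative, not taken from the actual g4 files; the real definitions live in DMLStatement.g4):

```antlr
// Hypothetical simplified rule: Oracle permits omitting FROM and the alias.
delete
    : DELETE FROM? tableReference alias? whereClause?
    ;
```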

Relevant Skills

1. Proficiency in the Java language
2. A basic understanding of ANTLR g4 files
3. Familiarity with Oracle SQL

Target files

1. DML SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/DMLStatement.g4
2. Basic elements g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/BaseRule.g4

References

1. Oracle SQL quick reference: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/SQL-Statements.html#GUID-1FA35EAD-AED2-4619-BFEE-348FF05D1F4A
2. Detailed Oracle SQL info: https://docs.oracle.com/pls/topic/lookup?ctx=en/database/oracle/oracle-database/19/sqlqr&id=SQLRF008

Mentor

Juan Pan, PMC member of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Potential mentors:
Juan Pan, mail: panjuan (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org


IoTDB

Implement PISA index in Apache IoTDB

Apache IoTDB is a highly efficient time series database that supports high-speed query processing, including aggregation queries.

Currently, IoTDB pre-calculates aggregation info, also called summary info (sum, count, max_time, min_time, max_value, min_value), for each page and each chunk. This info is helpful for aggregation operations and some query filters. For example, if the query filter is value > 10 and the max value of a page is 9, we can skip the page. For another example, if the query is select max(value) and the max values of 3 chunks are 5, 10 and 20, then max(value) is 20.
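
As a minimal sketch of how such summary info is used (class, field and method names here are illustrative, not IoTDB's actual API):

```java
// Sketch: per-page summary info used to skip pages during filtering,
// mirroring how IoTDB keeps (max, min, count, ...) per page and chunk.
// All names are illustrative, not IoTDB's real classes.
public class PageSummarySketch {
    public static class PageSummary {
        public final double maxValue;
        public final double minValue;
        public PageSummary(double max, double min) { this.maxValue = max; this.minValue = min; }
    }

    // A page whose max is at most the threshold cannot satisfy "value > threshold".
    public static boolean canSkip(PageSummary s, double threshold) {
        return s.maxValue <= threshold;
    }

    // The max over chunk summaries answers select max(value) without scanning points.
    public static double maxOfChunks(double[] chunkMaxes) {
        double m = Double.NEGATIVE_INFINITY;
        for (double v : chunkMaxes) m = Math.max(m, v);
        return m;
    }

    public static void main(String[] args) {
        // Page with max 9 can be skipped for "value > 10".
        System.out.println(canSkip(new PageSummary(9, 1), 10));
        // Chunk maxes 5, 10, 20 -> max(value) is 20.
        System.out.println(maxOfChunks(new double[]{5, 10, 20}));
    }
}
```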

However, there are two drawbacks:

1. The summary info reduces the data that needs to be scanned to 1/k (suppose each page has k data points), but the time complexity is still O(N). If we store long historical data, e.g., 2 years of data collected at 500 kHz, an aggregation operation may still be time-consuming. So a tree-based index that reduces the time complexity from O(N) to O(log N) is a good choice. Some basic ideas have been published in [1], but that approach can only handle data with a fixed frequency. Improving it and implementing it in IoTDB is a good choice.

2. The summary info is of no help for evaluating a query filter like where value > 8 when the max value is 10. If we enrich the summary info, e.g., by storing a data histogram, we can use the histogram to estimate how many points will be returned.
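
The tree-based index mentioned in drawback 1 can be sketched with a classic segment tree built over per-page maxima; this only illustrates the O(log N) range-aggregation idea, not the PISA algorithm itself:

```java
// Sketch: a segment tree over page summaries answers range-max aggregation
// in O(log N) instead of scanning all N summaries. Illustrative only.
public class SegmentTreeSketch {
    private final double[] tree;
    private final int n;

    public SegmentTreeSketch(double[] values) {
        n = values.length;
        tree = new double[2 * n];
        System.arraycopy(values, 0, tree, n, n);
        // Build internal nodes bottom-up: each parent holds the max of its children.
        for (int i = n - 1; i > 0; i--) tree[i] = Math.max(tree[2 * i], tree[2 * i + 1]);
    }

    // Max over values[l..r), O(log n).
    public double rangeMax(int l, int r) {
        double m = Double.NEGATIVE_INFINITY;
        for (l += n, r += n; l < r; l >>= 1, r >>= 1) {
            if ((l & 1) == 1) m = Math.max(m, tree[l++]);
            if ((r & 1) == 1) m = Math.max(m, tree[--r]);
        }
        return m;
    }

    public static void main(String[] args) {
        SegmentTreeSketch t = new SegmentTreeSketch(new double[]{5, 10, 20, 7});
        System.out.println(t.rangeMax(0, 3)); // max of {5, 10, 20} -> 20.0
    }
}
```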

This proposal is mainly about adding an index to speed up aggregation queries. Beyond that, making the summary info more useful would be even better.

Note that the premise is that insertion speed must not be slowed down too much!

By the way, IoTDB provides an index framework already. So, the PISA index should be compatible with the index framework.

You should know:
• IoTDB query process
• TsFile structure and organization
• Basic index knowledge
• Java 


Reference:

[1] https://www.sciencedirect.com/science/article/pii/S0306437918305489
 
 
 

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB Integration Test

Apache IoTDB is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

Now, IoTDB uses JUnit for its UT/IT tests.

However, there are two drawbacks:

1. There are many singleton class instances in IoTDB. Therefore, modifying something in one test may impact others, and it requires us to do a lot of cleanup work after each test.

In particular, after we open a server socket (via Thrift), the socket may not be closed quickly even though we have called socket.close (this is controlled by Thrift). If the next test begins before the socket is released, a "port is already in use" error will occur.

2. When testing IoTDB's cluster module, we may need to start at least 3 IoTDB instances on one server. Using JUnit, the 3 instances run in one JVM, which conflicts with the reality that IoTDB has many singleton instances.

So, next, we want to use Testcontainers, which combines Docker and JUnit.

This task is for:

1. using Testcontainers to re-implement all IT code of IoTDB;
2. using Testcontainers to add some IT code for IoTDB's cluster module.

Needed skills:

  • Java
  • Docker (Docker Compose is a plus)
  • Know, or be willing to learn, JUnit and Testcontainers

[1] iotdb.apache.org
[2] https://www.testcontainers.org/

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB C# library

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

IoTDB has two kinds of client interfaces: SQL and the native API (also called the session API).

This task is for the native API.

IoTDB uses Apache Thrift [2] as its RPC framework, so all native APIs can be generated by Thrift. However, to improve performance, we sometimes use raw byte arrays in Thrift rather than a struct, which is not very friendly to users.

That is why we provide our session API, which simply wraps the interfaces of the generated Thrift code. We now have Java [4], Python and C++ [3] versions; the C# version is still missing.

This task asks you to provide a C# library for IoTDB.

Needed skills:

  • Thrift
  • C#
  • Know Java

[1] iotdb.apache.org
[2] http://thrift.apache.org/
[3] https://iotdb.apache.org/UserGuide/Master/Client/Programming%20-%20Other%20Languages.html
[4] https://iotdb.apache.org/UserGuide/Master/Client/Programming%20-%20Native%20API.html

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB: Metadata (Schema) Storage Engine

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

Different from traditional relational databases, IoTDB uses a tree-based structure in memory to manage the schema (a.k.a. metadata), and uses a Write-Ahead-Log-like file structure to persist the schema.

Currently, each time series takes about 300 bytes in memory. However, an IoTDB instance may manage more than 100 million time series, which may take more than 30 GB of memory.

Therefore, we'd like to re-design the schema management module:
1. File: persist the tree on disk, like a B-tree.
2. WAL: implement a WAL for the metadata, so we can update the tree on disk in batches rather than one operation at a time.
3. Cache: we may not have enough memory to load the whole tree, so a cache is needed, and queries must be able to read from the tree on disk.
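
For point 3, a minimal sketch of an LRU cache for schema-tree nodes, built on the JDK's LinkedHashMap in access order (class name, keys and eviction policy are illustrative; a real cache would fall back to the on-disk tree on a miss):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: LRU cache keeping only hot schema nodes in memory.
// A cache miss would trigger a read from the persisted tree (not shown).
public class SchemaNodeCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public SchemaNodeCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict least-recently-used entry beyond capacity
    }

    public static void main(String[] args) {
        SchemaNodeCache<String, String> cache = new SchemaNodeCache<>(2);
        cache.put("root.sg1.d1.s1", "FLOAT");
        cache.put("root.sg1.d1.s2", "INT32");
        cache.get("root.sg1.d1.s1");           // touch s1 so s2 becomes eldest
        cache.put("root.sg1.d1.s3", "DOUBLE"); // evicts s2
        System.out.println(cache.containsKey("root.sg1.d1.s2")); // false
    }
}
```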

What knowledge you need to know:
1. Java
2. Basic design idea about Database [2]

[1] https://iotdb.apache.org
[2] http://pages.cs.wisc.edu/~dbbook/openAccess/firstEdition/slides/pdfslides/mod2l1.pdf

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Integrating Apache IoTDB and Apache Superset

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

Apache Superset [2] is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts.

We hope that Superset can be used as a data display and analysis tool of IoTDB, which will bring great convenience to analysts of the IoT and IIoT.

For a database engine to be supported in Superset, it needs a Python-compliant SQLAlchemy dialect [3] as well as a DB-API driver [4]. The current Python client of IoTDB is packaged from Apache Thrift generated code and does not follow a standard interface specification. Therefore, the first thing you need to do is to implement a standard SQLAlchemy connector based on the current Python client (or on new interfaces defined and generated by Thrift).

Next, you need to explore how to integrate IoTDB and Superset and document the usage in a user-friendly way. The integration documentation for Apache Kylin and Superset is here [5] for your reference.

What knowledge you need to know:

  • Basic database knowledge (SQL)
  • Python

[1] https://iotdb.apache.org
[2] https://superset.apache.org/
[3] https://docs.sqlalchemy.org/en/13/dialects/
[4] https://www.python.org/dev/peps/pep-0249/
[5] http://kylin.apache.org/blog/2018/01/01/kylin-and-superset/

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

Apache IoTDB: GUI workbench

Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.

As a database, it is good to have a workbench to operate IoTDB using a GUI.

For example, there is a 3rd-party web-based workbench for Apache Cassandra [2]. MySQL provides a more complex workbench application [3].

We would like IoTDB to have a workbench as well.

Task:
1. Execute SQL and show results in a table or chart.
2. View the schema of IoTDB (how many storage groups, how many time series, etc.).
3. View and modify IoTDB's configuration.
4. View IoTDB's dynamic status (e.g., info that JMX can get).

(As we have integrated IoTDB with Apache Zeppelin, task 1 is already done. So we hope this workbench can be more lightweight than using Zeppelin.)
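
For task 4, the JDK's standard MXBeans illustrate the kind of dynamic status a workbench could read over JMX. This minimal sketch reads from the local JVM; a real workbench would connect to IoTDB's remote JMX endpoint instead:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.RuntimeMXBean;

// Sketch: reading JVM status via the standard platform MXBeans,
// the same mechanism JMX exposes remotely.
public class JmxStatusSketch {
    public static long usedHeapBytes() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        return mem.getHeapMemoryUsage().getUsed();
    }

    public static long uptimeMillis() {
        RuntimeMXBean rt = ManagementFactory.getRuntimeMXBean();
        return rt.getUptime();
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + usedHeapBytes());
        System.out.println("uptime (ms): " + uptimeMillis());
    }
}
```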

Java is preferred (Python or other languages are also OK).

Needed Skills:

  • Java
  • Web application development

[1] iotdb.apache.org
[2] https://github.com/avalanche123/cassandra-web
[3] https://www.mysql.com/cn/products/workbench/

Difficulty: Major
Potential mentors:
Xiangdong Huang, mail: hxd (at) apache.org
Project Devs, mail: dev (at) iotdb.apache.org

TrafficControl

GSOC: Varnish Cache support in Apache Traffic Control

Background
Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.

Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.

There are multiple aspects to this project:

  • Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
  • Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
  • Testing: Add automated tests for the new code
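
The configuration-generation aspect above amounts to template-style emission of VCL text from CDN topology data. The actual Traffic Control code would be written in Go; this Java sketch only illustrates the idea, and the VCL fields shown are a simplified subset (names are illustrative):

```java
// Sketch: emitting a minimal Varnish VCL backend definition for one origin.
// A real generator (in Go, inside Traffic Ops / the cache client utilities)
// would cover directors, probes, and the full topology.
public class VclGeneratorSketch {
    public static String backend(String name, String host, int port) {
        return "backend " + name + " {\n"
             + "    .host = \"" + host + "\";\n"
             + "    .port = \"" + port + "\";\n"
             + "}\n";
    }

    public static void main(String[] args) {
        System.out.print(backend("origin0", "origin.example.com", 80));
    }
}
```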

Skills:

  • Proficiency in Go is required
  • A basic knowledge of HTTP and caching is preferred, but not required for this project.
Difficulty: Major
Potential mentors:
Eric Friedrich, mail: friede (at) apache.org
Project Devs, mail: dev (at) trafficcontrol.apache.org
