[GSoC][Doris]Dictionary Encoding Acceleration

Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
Page: https://doris.apache.org

Github: https://github.com/apache/doris

Background

In Apache Doris, dictionary encoding is performed during data writing and compaction. By default, dictionary encoding is applied to string data types. The dictionary size of a column for one segment is 1M at most. Dictionary encoding accelerates queries on string columns by, for example, converting the strings into INT values so that comparisons operate on integers instead of strings.
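The idea of converting strings into integers can be illustrated with a minimal sketch. This is not Doris's implementation; the function names `dictionary_encode` and `filter_equals` and the sample data are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes): each distinct string gets one integer code."""
    dictionary: List[str] = []
    code_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in code_of:
            # First time this string is seen: assign the next free code.
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def filter_equals(dictionary: List[str], codes: List[int], target: str) -> List[int]:
    """Return row indexes where the column equals target, comparing integers only."""
    if target not in dictionary:
        return []  # value absent from the dictionary: no row can match
    target_code = dictionary.index(target)
    return [row for row, code in enumerate(codes) if code == target_code]


# Hypothetical column data, used only to demonstrate the technique.
cities = ["Beijing", "Paris", "Beijing", "Tokyo", "Paris", "Beijing"]
dictionary, codes = dictionary_encode(cities)
matching_rows = filter_equals(dictionary, codes, "Paris")
```

The string predicate is translated to an integer code once, and the per-row work is then an integer comparison. A lookup that misses the dictionary can skip the scan entirely.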

Task

Phase One: Get familiar with the implementation of Apache Doris dictionary encoding; learn how Apache Doris dictionary encoding accelerates queries.
Phase Two: Evaluate the effectiveness of full dictionary encoding and figure out how to optimize memory in such a case.

Learning Material

Page: https://doris.apache.org
Github: https://github.com/apache/doris

Mentor

Mentor: Chen Zhang, Apache Doris Committer, zhangchen@apache.org
Mentor: Zhijing Lu, Apache Doris Committer, luzhijing@apache.org
Mailing List: dev@doris.apache.org

Difficulty: Major

Project size: ~350 hours (large)

Potential mentors:

Zhijing Lu, mail: luzhijing (at) apache.org

Project Devs, mail: dev (at) doris.apache.org

[GSoC][Doris] Supports BigQuery/Apache Kudu/Apache Cassandra/Apache Druid in Federated Queries

Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
Page: https://doris.apache.org
Github: https://github.com/apache/doris

Background

Apache Doris supports acceleration of queries on external data sources to meet users' needs for federated queries and analysis.
Currently, Apache Doris supports multiple external catalogs including those from Hive, Iceberg, Hudi, and JDBC. Developers can connect more data sources to Apache Doris based on a unified framework.

Objective

Enable Apache Doris to access one or more of these data sources via the Multi-Catalog feature: BigQuery, Kudu, Cassandra, or Druid.
Compile relevant documentation. See an example here: https://doris.apache.org/docs/dev/lakehouse/multi-catalog/hive

Task
Phase One:

Get familiar with the Multi-Catalog structure of Apache Doris, including the metadata synchronization mechanism in FE and the data reading mechanism of BE.
Investigate how metadata should be acquired and how data access works regarding the picked data source(s); produce the corresponding design documentation.

Phase Two:

Develop connections to the picked data source(s) and implement access to metadata and data.

Learning Material

Page: https://doris.apache.org
Github: https://github.com/apache/doris

Mentor

Mentor: Mingyu Chen, Apache Doris PMC Member & Committer, morningman@apache.org
Mentor: Calvin Kirs, Apache Geode PMC & Committer, Kirs@apache.org
Mailing List: dev@doris.apache.org

Difficulty: Major

Project size: ~350 hours (large)

Potential mentors:

Zhijing Lu, mail: luzhijing (at) apache.org

Project Devs, mail: dev (at) doris.apache.org

...

Dubbo GSoC 2023 - Refactor Connection

Background

At present, Dubbo's client-side abstraction of connections is inconsistent across protocols. For example, the dubbo and triple protocols abstract client connections in very different ways. As a result, enhancing connection-related functionality in the client is complicated, and implementations cannot be reused. The client also has to implement a lot of repetitive code whenever a protocol is extended.

Target

Reduce the complexity of the client part when extending the protocol, and increase the reuse of connection-related modules.
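One way to picture the kind of reuse this target describes is a shared connection abstraction with protocol-specific handshakes. The sketch below is purely illustrative: the class names, the `reconnect` helper, and the addresses are hypothetical and do not reflect Dubbo's actual API or the design the project would adopt. It is written in Python for brevity, although Dubbo itself is not.

```python
from abc import ABC, abstractmethod
from typing import List


class Connection(ABC):
    """Hypothetical protocol-agnostic client connection; not Dubbo's actual API."""

    def __init__(self, address: str) -> None:
        self.address = address
        self.connected = False
        self.events: List[str] = []  # records what happened, for illustration only

    def connect(self) -> None:
        # Shared lifecycle logic is written once; subclasses supply only the handshake.
        if not self.connected:
            self._handshake()
            self.connected = True

    def close(self) -> None:
        self.connected = False

    @abstractmethod
    def _handshake(self) -> None:
        """Protocol-specific connection setup."""


class DubboConnection(Connection):
    def _handshake(self) -> None:
        self.events.append("dubbo handshake")  # placeholder for real setup


class TripleConnection(Connection):
    def _handshake(self) -> None:
        self.events.append("triple handshake")  # placeholder for real setup


def reconnect(connection: Connection) -> None:
    """A connection-related feature written once against the shared abstraction."""
    connection.close()
    connection.connect()


# Hypothetical addresses, used only to exercise the sketch.
connections = [DubboConnection("10.0.0.1:20880"), TripleConnection("10.0.0.2:50051")]
for conn in connections:
    conn.connect()
    reconnect(conn)
```

In this sketch, `reconnect` works for both protocols without modification, and adding a new protocol only requires implementing `_handshake`.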

Difficulty: Major

Project size: ~350 hours (large)

Potential mentors:

Albumen Kevin, mail: albumenj (at) apache.org

Project Devs, mail:

Dubbo GSoC 2023 - IDL management

Background

Dubbo currently supports protobuf as a serialization method. Protobuf relies on proto (IDL) files for code generation, but Dubbo currently lacks tools for managing IDL files. For example, Java users need the proto files for every compilation, which is cumbersome, since most users are accustomed to consuming dependencies as jar packages.

Target

Implement an IDL management and control platform that automatically generates dependency packages in various languages from IDL files and pushes them to the relevant dependency repositories.

Difficulty: Major

Project size: ~350 hours (large)

Potential mentors:

Albumen Kevin, mail: albumenj (at) apache.org

Project Devs, mail:

Dubbo GSoC 2023 - Refactor the http layer

Background

Dubbo currently supports the rest protocol based on HTTP/1 and the triple protocol based on HTTP/2. However, these two HTTP-based protocols are implemented independently, neither can replace its underlying implementation, and each is relatively costly to implement.

Target

To reduce maintenance costs, we hope to abstract the HTTP layer. The goal is for the underlying HTTP implementation to be independent of the protocol, so that different protocols can reuse the related implementations.

Difficulty: Major

Project size: ~350 hours (large)

Potential mentors:

Albumen Kevin, mail: albumenj (at) apache.org

Project Devs, mail:

...
