All proposed and accepted APE proposals should be put under this page. They should also follow the template below:
Status | Proposed |
---|---|
dev@asterixdb.a.o thread | (permalink to discuss thread) |
JIRA issue(s) | |
Release target | (release number) |
Members
- Hari Kishore Chaparala
- Ian Maxon
- Mike Carey
Motivation
Overview:
Apache Iceberg is a popular data lake solution, like Hudi and Delta Lake, that helps process and query structured, semi-structured, and unstructured data. Data lakes offer high scalability and flexibility compared to data warehouses. Additionally, users are not limited to a specific database query language, and popular tools like Spark and Hive provide a wide range of query functionality. Applications of data lakes include business analytics, data discovery, log processing, and machine learning.
In this project, we are going to integrate Iceberg support into AsterixDB, specifically for SELECT queries; i.e., AsterixDB will act as a query engine over data stored in the Iceberg format.
Apache Iceberg:
- A table format specification
- A set of APIs and libraries for engines to interact with tables following that specification
- Iceberg tables are stored in three layers:
- Catalog – A catalog is essentially a collection of tables and provides a way to manage them in a consistent and standardized way. It stores pointers to tables’ metadata file locations and supports atomic operations to update those pointers. Some implementations: Hive, Nessie, and Hadoop.
- Metadata –
- Metadata file: contains table schema, snapshot information, partition spec, table stats, etc.
- Manifest List: list of manifest files that make up a snapshot.
- Manifest files: track the data files and statistics about each file.
- Data – Table data is stored in multiple data files in Parquet, Avro, or ORC format. This is called the data layer.
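For a file system table, these layers map directly onto a directory layout on the storage system. A minimal sketch (all file names are illustrative; the metadata/ and data/ directory split follows Iceberg's HadoopTables convention):

my-table/
  metadata/
    v2.metadata.json        -- metadata file: schema, partition spec, snapshot list
    snap-57890...avro       -- manifest list for one snapshot
    a1b2c3...-m0.avro       -- manifest file: tracks data files and their stats
  data/
    00000-0-a1b2...parquet  -- data file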
Planned Iceberg support:
Feature | AsterixDB support | Iceberg availability | Comments |
---|---|---|---|
Iceberg format version | Version 1 | Iceberg currently has two format versions: 1 and 2. | Format version 2 adds row-level deletes via delete files. Currently, AsterixDB parallelizes external data file reading using various Hyracks workers across all nodes. Delete files can be large, so the ideal flow would be to process the delete files in parallel in a first stage and then, in a second stage, process the data files in parallel with a filter on the collated delete-file data. This two-step process requires changes to the query execution flow and is therefore skipped for the first cut. |
Datafile format | Parquet | The Iceberg spec supports Parquet, Avro, and ORC. | We currently only support Parquet file processing. |
Table types | File system tables | File system tables and metastore (catalog) tables. | Initially, we are planning to support file system tables, but it is possible to add support for metastore tables using the interfaces provided by the Iceberg SDK. Ideally, though, we would want AsterixDB itself to act as the metastore in the long run. |
File systems | S3 and HDFS | Local FS, HDFS, S3, and any file system supported by the Hadoop library. | AsterixDB supports Parquet file processing on S3 and HDFS. |
CRUD operations | Read | Iceberg allows all CRUD operations, including schema and partition evolution. | Complete CRUD support could be achieved using the Iceberg SDK, which lets us take control of updating the Iceberg metadata files. However, in the current use case we already have Iceberg tables managed by other engines, and AsterixDB is used as a (read-only) query engine, so we are restricting ourselves to read support; AsterixDB is currently read-only on all external datasets. Ideally, we would want AsterixDB to act as a metastore for Iceberg tables and perform all CRUD operations using SQL++. This can be picked up as part of the initiative to support updates on external data. |
Time travel | Only the latest snapshot is queried | Iceberg allows time travel queries. | Every time we run a query, we use the latest Iceberg snapshot. To support time travel, we would need new keywords in SQL++, e.g., SELECT * FROM XYZ AS OF 2023-01-01. |
Proposed changes
Interface:
We introduce additional properties to tell the compiler that we are querying an Iceberg table.
- “table-format”
- Specifies the table format of the external dataset.
- Required for all Iceberg tables
- Value: “apache-iceberg”
- “metadata-path”
- Specifies the table metadata location
- Required for the HDFS adapter
- Redundant for the S3 adapter, as the container and definition properties can be used to construct the path.
Example DDL:
CREATE EXTERNAL DATASET IcebergDataset(IcebergTableType) USING S3 (
("accessKeyId"="-----"),
("secretAccessKey"="--------"),
("region"="us-east-1"),
("serviceEndpoint"="https://s3.amazonaws.com/"),
("container"="my-bucket"),
("definition"="my-table"),
("table-format"="apache-iceberg"),
("format"="parquet")
);
CREATE EXTERNAL DATASET IcebergDataset(IcebergTableType) USING hdfs
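Expanded into a fuller sketch for the HDFS case (the "hdfs" endpoint and metadata path values below are illustrative, and "metadata-path" is the new property described above):
CREATE EXTERNAL DATASET IcebergDataset(IcebergTableType) USING hdfs (
    ("hdfs"="hdfs://localhost:9000"),        -- illustrative namenode endpoint
    ("metadata-path"="/warehouse/my-table"), -- illustrative table metadata location
    ("table-format"="apache-iceberg"),
    ("format"="parquet")
);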
Changes
During the compilation phase of queries over Iceberg external datasets, an adapter gets created that is configured with a path parameter (a comma-separated string of data file locations). We modify this path parameter so that it uses the data files from the latest Iceberg table snapshot. This is done in the ExternalDataUtils.prepare call, which is common to all external data adapters.
The data file locations corresponding to the latest snapshot are collected using the HadoopTables interface provided by Iceberg, which can load the table at a given location and provides the current snapshot details, as sketched below.
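A minimal sketch of this collection step, using the iceberg-core APIs (HadoopTables, TableScan.planFiles); the class and method names here are hypothetical, and in the actual change this logic lives inside ExternalDataUtils.prepare:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public final class IcebergPathResolver { // hypothetical helper, for illustration only
    // Returns the comma-separated data file locations of the table's current snapshot.
    public static String resolveDataFiles(Configuration conf, String metadataLocation) throws IOException {
        Table table = new HadoopTables(conf).load(metadataLocation);
        List<String> dataFiles = new ArrayList<>();
        // planFiles() plans scan tasks over the current snapshot's data files
        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
            for (FileScanTask task : tasks) {
                dataFiles.add(task.file().path().toString());
            }
        }
        // the adapter's "path" parameter expects a comma-separated string
        return String.join(",", dataFiles);
    }
}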
AsterixDB uses this updated “path” parameter to split the file processing into several parallel tasks. This flow is not changed.
Dependency changes:
- iceberg-core
Add the dependency below to the asterix-external-data module (compile scope) and the asterix-app module (test scope):
<!-- https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-core -->
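A sketch of the Maven declaration (the version is shown as a property placeholder, not the pinned release):
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-core</artifactId>
    <version>${iceberg.version}</version> <!-- assumed version property; pin to the chosen release -->
</dependency>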
Compatibility/Migration
The proposed changes don’t modify any existing flow or behavior and can be seen as an extension.
Dependency conflicts
Iceberg-core uses a newer version of Apache Avro, which conflicts with kite-sdk, a test-scope dependency that we currently use to infer Avro schemas from JSON data files. Kite-sdk was last updated in 2015, so we are planning to remove it as an AsterixDB dependency and build our own schema inference utility, using kite-sdk as a reference.
Testing
E2E tests on the HDFS and S3 external adapters for:
- Reading Iceberg data
- Reading from the latest snapshot
- Reading when the table is empty
- Reading when there are multiple data files
Negative cases:
- Should throw an appropriate error on Iceberg format version 2
- Should throw an appropriate error on an invalid metadata path
- Should throw an appropriate error if the Iceberg table is a metastore table
- Should throw an appropriate error if the data file format is not Parquet.
Performance:
Measure the performance difference with and without Iceberg when querying large files.