All proposed and accepted APE proposals should be put under this page. They should also follow the template below:


Status: Proposed

dev@asterixdb.a.o thread(permalink to discuss thread)
JIRA issue(s)

Unable to render Jira issues macro, execution error.

Release target(release number)

Members

  • Hari Kishore Chaparala
  • Ian Maxon
  • Mike Carey

Motivation

Overview:

Apache Iceberg is a popular datalake solution like Hudi and Delta Lake that helps process and query unorganized, unstructured, structured, and semi-structured data. Data lakes offer high scalability and flexibility compared to data warehouses. Additionally, users are not limited to a specific database query language and popular tools like Spark and Hive provide a wider range of query functionality. Some applications of data lakes include Business analytics, data discovery, log processing, and machine learning. 

In this project, we are going to integrate Iceberg support into AsterixDB, specifically for select queries i.e., AsterixDB is going to be a query engine that supports data stored in Iceberg format.

Apache Iceberg:


  • A table format specification
  • A set of APIs and libraries for engines to interact with tables following that specification
  • Iceberg tables are stored in 3 layers, 
    • Catalog – A catalog is essentially a collection of tables and provides a way to manage tables in a consistent and standardized way. Stores pointers to tables’ metadata file locations. The catalog also supports atomic operations to update metadata pointers. Some implementations: Hive, Nessie, and Hadoop.
    • Metadata – 
      • Metadata file: contains table schema, snapshot information, partition spec, table stats, etc.
      • Manifest List: list of manifest files that make up a snapshot.
      • Manifest files: Tracks data files and stats about each file. 
    • Data – Table data is stored in multiple data files in Parquet, Avro, and ORC. This is called the data layer.


Planned Iceberg support:


Feature

AsterixDB Support

Iceberg availability

Comments

Iceberg format version

Version 1

Iceberg currently has 2 format versions: 1 & 2.

  • The primary change in version 2 adds delete files to encode the rows that are deleted in existing data files. This version can be used to delete or replace individual rows in immutable data files without rewriting the files.

Currently, AsterixDB parallelizes external data file reading using various Hyracks workers across all nodes. 

Delete files can be large, so the ideal flow should be to process the delete-files parallelly in the first stage and in the second stage, process the data files parallelly with a filter on the collated delete-file data. This 2-step process requires some flow changes in the query execution and hence skipped for the first cut.

Datafile format

Parquet

Iceberg spec supports Parquet, Avro, and ORC

We currently only support Parquet file processing.

Table types

File System Tables

  1. File system tables – Metadata is only stored in file system
  2. Metastore tables – Metadata is also stored in Metastores like Hive, Glue, and Nessie.

Initially we are planning to support file system tables but it is possible to add support for meta store tables sign the interfaces provided by the Iceberg SDK.

 

But ideally, we would want AsterixDB itself to act as the metastore in the long run. 

File Systems

S3 and HDFS

Localfs, HDFS, S3, and any file system supported by the Hadoop library.

AsterixDB supports parquet file processing in S3 and HDFS

CRUD operations

Read

Iceberg allows all CRUD operations including schema and partition evolution.

Complete CRUD support can be achieved using the Iceberg SDK with which we can take control of updating the Iceberg metadata files. 

But the current use case is, we already have iceberg tables managed by other engines and AsterixDB is used as a query (read-only) engine. So we are restricting ourselves to read support. 

Ideally, we would want AsterixDB to act as a metastore for iceberg tables and perform all CRUD operations using SQL++. This can be picked as part of the initiative where we start supporting updates on external data. ASterixDB is currently only read-only on all external datasets.

Time travel

Only the latest snapshot is queried

Iceberg allows time travel queries.

Every time we run a query, we use the latest iceberg snapshot. 

In order to support time travel, we would need new keywords in SQL++.

Ex: Select * from XYZ AS OF 2023-01-01.



Proposed changes


Interface:


We introduce additional properties to tell the compiler that we are querying an Iceberg table.

  1. “table-format”
    1. Specifies the table format of the external dataset.
    2. Required for all Iceberg tables
    3. Value: “apache-iceberg
  2. “metadata-path”
    1. Specifies the table metadata location
    2. Required for HDFS adapter
    3. Redundant for S3 adapter as container and definition can be used to construct the path.


Example DDL:


CREATE EXTERNAL DATASET IcebergDataset(IcebergTableType) USING S3 (
    ("accessKeyId"="-----"),
    ("secretAccessKey"="--------"),
    ("region"="us-east-1"),
    ("serviceEndpoint"="https://s3.amazonaws.com/"),
    ("container"="my-bucket"),
    ("definition"="my-table"),
    ("table-format"="apache-iceberg"),
    ("format"="parquet")
);



CREATE EXTERNAL DATASET IcebergDataset(IcebergTableType) USING hdfs
(
  ("hdfs"="hdfs://127.0.0.1:31888"),
  ("input-format"="parquet-input-format"),
  ("table-format"="apache-iceberg"),
  ("metadata-path"="asterix/my_table/")
);



Changes

During the compilation phase of the iceberg external datasets queries, an adapter gets created which is configured to use a path parameter (a comma-separated string of data file locations). We modify this path parameter so that it uses the data files from the latest Iceberg table snapshot. This is done in ExternalDataUtils.prepare call which is common for all external data adapters.

The data file locations corresponding to the latest snapshot are collected using the HadoopTables interface provided by Iceberg. This interface can load the table at a given location and provides the current snapshot details.


AsterixDB uses this updated “path” parameter to spilt the file processing into several parallel tasks. This flow is not changed.

Dependencies changes:

  • iceberg-core 

Add the below dependency in the module: asterix-external-data (compile scope) and asterix-app ( test scope)

<!-- https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-core -->
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-core</artifactId>
    <version>1.1.0</version>
</dependency>



Compatibility/Migration

The proposed changes don’t modify any existing flow or behavior and can be seen as an extension.

Dependency conflicts


Iceberg-core uses a newer version of apache-avro which is in conflict with kite-sdk, a unit test scope dependency that we are currently using to infer AVRO schema from JSON data files. Kite-sdk is last updated in 2015 and so we are planning to remove it as a dependency in AsterixDB and build our own schema inference utility with kite-sdk as the reference. 




Testing

E2E tests on HDFS and S3 external adapters for 

  • Reading iceberg data

  • Reading from the latest snapshot

  • Reading when the table is empty

  • Reading when there are multiple datafiles


Negative cases:

  • Should throw an appropriate error on iceberg format version 2

  • Should throw an appropriate error on invalid metadata path

  • Should throw an appropriate error if the iceberg table is a Metastore table

  • Should throw an appropriate error if the data file format is not parquet.


Performance:

  • Measure performance difference with and without iceberg when querying large files.



  • No labels