All proposed and accepted APE proposals should be put under this page. They should also follow the template below:
Status | Proposed |
---|---|
dev@asterixdb.a.o thread | (permalink to discuss thread) |
JIRA issue(s) | |
Release target | (release number) |
Members
- Hari Kishore Chaparala
- Ian Maxon
- Mike Carey
Motivation
Overview:
Apache Iceberg is a popular data lake solution, like Hudi and Delta Lake, that helps process and query structured, semi-structured, and unstructured data. Data lakes offer high scalability and flexibility compared to data warehouses. Additionally, users are not limited to a specific database query language, and popular tools like Spark and Hive provide a wide range of query functionality. Applications of data lakes include business analytics, data discovery, log processing, and machine learning.
In this project, we are going to integrate Iceberg support into AsterixDB, specifically for SELECT queries; i.e., AsterixDB will act as a query engine over data stored in the Iceberg format.
Apache Iceberg:
- A table format specification
- A set of APIs and libraries for engines to interact with tables following that specification
- Iceberg tables are stored in three layers:
- Catalog – A catalog is essentially a collection of tables and provides a way to manage them in a consistent and standardized way. It stores pointers to tables’ metadata file locations and supports atomic operations to update those pointers. Some implementations: Hive, Nessie, and Hadoop.
- Metadata –
- Metadata file: contains table schema, snapshot information, partition spec, table stats, etc.
- Manifest List: list of manifest files that make up a snapshot.
- Manifest files: track the data files and statistics about each file.
- Data – Table data is stored in multiple data files in Parquet, Avro, or ORC format. This is called the data layer.
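For a file system table, these layers map directly onto a directory layout on the storage system. A minimal sketch (all file names are illustrative; the metadata/ and data/ directory split follows Iceberg's HadoopTables convention):

my-table/
  metadata/
    v2.metadata.json        -- metadata file: schema, partition spec, snapshot list
    snap-57890...avro       -- manifest list for one snapshot
    a1b2c3...-m0.avro       -- manifest file: tracks data files and their stats
  data/
    00000-0-a1b2...parquet  -- data file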
Planned Iceberg support:
Feature | AsterixDB support | Iceberg availability | Comments |
---|---|---|---|
Iceberg format version | Version 1 | Iceberg currently has two format versions: 1 and 2. | Format version 2 adds row-level deletes via delete files. Currently, AsterixDB parallelizes external data file reading using various Hyracks workers across all nodes. Delete files can be large, so the ideal flow would be to process the delete files in parallel in a first stage and then, in a second stage, process the data files in parallel with a filter on the collated delete-file data. This two-step process requires changes to the query execution flow and is therefore skipped for the first cut. |
Datafile format | Parquet | The Iceberg spec supports Parquet, Avro, and ORC. | We currently only support Parquet file processing. |
Table types | File system tables | File system tables and metastore (catalog) tables. | Initially, we are planning to support file system tables, but it is possible to add support for metastore tables using the interfaces provided by the Iceberg SDK. Ideally, though, we would want AsterixDB itself to act as the metastore in the long run. |
File systems | S3 and HDFS | Local FS, HDFS, S3, and any file system supported by the Hadoop library. | AsterixDB supports Parquet file processing on S3 and HDFS. |
CRUD operations | Read | Iceberg allows all CRUD operations, including schema and partition evolution. | Complete CRUD support could be achieved using the Iceberg SDK, which lets us take control of updating the Iceberg metadata files. However, in the current use case we already have Iceberg tables managed by other engines, and AsterixDB is used as a (read-only) query engine, so we are restricting ourselves to read support; AsterixDB is currently read-only on all external datasets. Ideally, we would want AsterixDB to act as a metastore for Iceberg tables and perform all CRUD operations using SQL++. This can be picked up as part of the initiative to support updates on external data. |
Time travel | Only the latest snapshot is queried | Iceberg allows time travel queries. | Every time we run a query, we use the latest Iceberg snapshot. To support time travel, we would need new keywords in SQL++, e.g., SELECT * FROM XYZ AS OF 2023-01-01. |
Proposed changes
Interface:
We introduce additional properties to tell the compiler that we are querying an Iceberg table.
- “table-format”
- Specifies the table format of the external dataset.
- Required for all Iceberg tables
- Value: “apache-iceberg”
- “metadata-path”
- Specifies the table metadata location
- Required for the HDFS adapter
- Redundant for the S3 adapter, as the container and definition properties can be used to construct the path.
Example DDL:
CREATE EXTERNAL DATASET IcebergDataset(IcebergTableType) USING S3 (
("accessKeyId"="-----"),
("secretAccessKey"="--------"),
("region"="us-east-1"),
("serviceEndpoint"="https://s3.amazonaws.com/"),
("container"="my-bucket"),
("definition"="my-table"),
("table-format"="apache-iceberg"),
("format"="parquet")
);
CREATE EXTERNAL DATASET IcebergDataset(IcebergTableType) USING hdfs
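Expanded into a fuller sketch for the HDFS case (the "hdfs" endpoint and metadata path values below are illustrative, and "metadata-path" is the new property described above):
CREATE EXTERNAL DATASET IcebergDataset(IcebergTableType) USING hdfs (
    ("hdfs"="hdfs://localhost:9000"),        -- illustrative namenode endpoint
    ("metadata-path"="/warehouse/my-table"), -- illustrative table metadata location
    ("table-format"="apache-iceberg"),
    ("format"="parquet")
);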
Changes
During the compilation phase of queries over Iceberg external datasets, an adapter gets created that is configured with a path parameter (a comma-separated string of data file locations). We modify this path parameter so that it uses the data files from the latest Iceberg table snapshot. This is done in the ExternalDataUtils.prepare call, which is common to all external data adapters.
The data file locations corresponding to the latest snapshot are collected using the HadoopTables interface provided by Iceberg, which can load the table at a given location and provides the current snapshot details, as sketched below.
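A minimal sketch of this collection step, using the iceberg-core APIs (HadoopTables, TableScan.planFiles); the class and method names here are hypothetical, and in the actual change this logic lives inside ExternalDataUtils.prepare:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public final class IcebergPathResolver { // hypothetical helper, for illustration only
    // Returns the comma-separated data file locations of the table's current snapshot.
    public static String resolveDataFiles(Configuration conf, String metadataLocation) throws IOException {
        Table table = new HadoopTables(conf).load(metadataLocation);
        List<String> dataFiles = new ArrayList<>();
        // planFiles() plans scan tasks over the current snapshot's data files
        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
            for (FileScanTask task : tasks) {
                dataFiles.add(task.file().path().toString());
            }
        }
        // the adapter's "path" parameter expects a comma-separated string
        return String.join(",", dataFiles);
    }
}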
AsterixDB uses this updated “path” parameter to split the file processing into several parallel tasks. This flow is not changed.
Dependency changes:
- iceberg-core
Add the dependency below to the asterix-external-data module (compile scope) and the asterix-app module (test scope):
<!-- https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-core -->
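A sketch of the Maven declaration (the version is shown as a property placeholder, not the pinned release):
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-core</artifactId>
    <version>${iceberg.version}</version> <!-- assumed version property; pin to the chosen release -->
</dependency>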
Compatibility/Migration
The proposed changes don’t modify any existing flow or behavior and can be seen as an extension.
Dependency conflicts
Iceberg-core uses a newer version of Apache Avro, which conflicts with kite-sdk, a test-scope dependency that we currently use to infer Avro schemas from JSON data files. Kite-sdk was last updated in 2015, so we are planning to remove it as an AsterixDB dependency and build our own schema inference utility, using kite-sdk as a reference.
Testing
E2E tests on the HDFS and S3 external adapters for:
- Reading Iceberg data
- Reading from the latest snapshot
- Reading when the table is empty
- Reading when there are multiple data files
Negative cases:
- Should throw an appropriate error on Iceberg format version 2
- Should throw an appropriate error on an invalid metadata path
- Should throw an appropriate error if the Iceberg table is a metastore table
- Should throw an appropriate error if the data file format is not Parquet.
Performance:
Measure the performance difference with and without Iceberg when querying large files.