Status

dev@asterixdb.a.o thread

(permalink to discuss thread)

JIRA issue(s)

Unable to render Jira issues macro, execution error.

Release target

(release number)

Members

  • Wail Alkowaileet
  • Peeyush Gupta
  • Mike Carey

Motivation

Currently, AsterixDB do not have a clean way to extract query result or dump a dataset to a storage device. The only channel provided currently is running a query and write it somehow at the client side. We need to support a way to write query results (or dump a dataset) in parallel to a storage device.

Proposed changes

To illustrate we want to do the following:

USE CopyToDataverse;

COPY ColumnDataset
TO localfs
PATH("localhost:///media/backup/CopyToResult")
WITH {
    "format" : "json"
};

In this example, the data in ColumnDataset will be written in each node at the provided path localhost:///media/backup/CopyToResult. Simply, each node will write its own partitions of the data of ColumnDataset locally. The written files will be in raw JSON format.

Another example:

USE CopyToDataverse;

COPY (SELECT cd.uid uid, 
             cd.sensor_info.name name, 
             to_bigint(cd.sensor_info.battery_status) battery_status
      FROM ColumnDataset cd
) toWrite
TO s3 
PATH("CopyToResult/" || to_string(b))
OVER (
   PARTITION BY toWrite.battery_status b
   ORDER BY toWrite.name
)
WITH {
    "format" : "json",
    "compression": "gzip",
    "max-objects-per-file": 100,
    "container": "myBucket",
    "accessKeyId": "<access-key>",
    "secretAccessKey": "<secret-key>",
    "region": "us-west-2"
};

The second example shows how to write the result of a query and also partition the result so that each partition will be written to a certain path. In this example, we partition by the battery_status (say an integer value from 0 to 100). The final result will be written to myBucke in Amazon S3. 

Each partition will have the path CopyToResult/<battery_status>. For example CopyToResult/0, CopyToResult/1 ..., CopyToResult/99, CopyToResult/100). This partitioning scheme can be useful if a user wants to exploit dynamic prefixes (external data filters) (see ASTERIXDB-3073)

Additionally, the records in each partition will be ordered by the sensor_name (toWrite.name). Note that this ordering isn't global but per partition.

Also, the written files will be compressed using gzip and each file should have at most 100 records max (max-objects-per-file).

Compatibility/Migration

No changes are required for existing Hyracks users, and the code path for existing Hyracks / AsterixDB users remains unchanged

  • No labels