Motivation

As data sizes increase, I/O becomes the main bottleneck for many analytical queries. To address this issue, we introduce block compression at the storage level.

User Model

There are two ways to enable compression:

  1. DDL:
    When creating a dataset, the user can specify the compression scheme as in the following example:
    CREATE DATASET Compressed(CompressedType)
    PRIMARY KEY id
    with {"storage-block-compression": {"scheme": "snappy"}};

  2. Configuration File:
    Under the [nc] section of the configuration file, the user can specify a compression scheme that is applied to all datasets by default. The user can override it per dataset by specifying a different value in the DDL, as in (1).
    Example:
    [nc]
    storage.compression.block=snappy


Supported compression schemes: 

Configuration Value | Description
none                | No compression
snappy              | Snappy compression

Design

This section details the components added to AsterixDB.

Compressor/Decompressor API:

Two interfaces are introduced at the Hyracks level: ICompressorDecompressorFactory and ICompressorDecompressor. Hyracks offers three built-in compression schemes: Snappy, LZ4, and LZ4HC (LZ4 with High Compression). The compressor/decompressor API is designed to compress/decompress arbitrary byte arrays; it is not limited to storage and could also be adopted elsewhere (such as in network exchange).

Contract: ICompressorDecompressor implementations must be stateless so that they are thread-safe.
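
For illustration only, an ICompressorDecompressor-style API could be shaped roughly as in the following sketch. The type and method names below (other than computeCompressBufferSize, which is referenced later in this document) are assumptions, not the actual Hyracks signatures.

    // Illustrative sketch; the real Hyracks interfaces may differ in signatures.
    public interface ICompressorDecompressorSketch {
        // Upper bound on the buffer size needed to compress uLength bytes
        // (referred to later as computeCompressBufferSize).
        int computeCompressBufferSize(int uLength);

        // Compress uLength bytes of uBuffer into cBuffer; return the compressed
        // length. Implementations must be stateless, hence thread-safe.
        int compress(byte[] uBuffer, int uOffset, int uLength, byte[] cBuffer, int cOffset);

        // Reverse of compress(); return the uncompressed length.
        int uncompress(byte[] cBuffer, int cOffset, int cLength, byte[] uBuffer, int uOffset);
    }

    public interface ICompressorDecompressorFactorySketch {
        // One factory per scheme (Snappy, LZ4, LZ4HC); since instances are
        // stateless, a single instance can be shared across threads.
        ICompressorDecompressorSketch createInstance();
    }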

Storage Compression

Currently, AsterixDB compresses only the primary index. This can be extended to cover secondary indexes as well.

Look Aside File (LAF):

In AsterixDB, indexes are stored in a file as on-disk pages of a fixed size. This offers a deterministic way of accessing any arbitrary page i (offset = i * pageSize). However, the compressor produces pages of variable sizes, so we need a way to get the proper offset and size of the required page i. Therefore, each compressed file (index) has a companion file called the Look Aside File (LAF for short). The LAF consists of multiple entries, each of which stores the offset and size of a compressed page (entry i corresponds to page i); see Figure 1. Each entry is a pair of 64-bit integers <offset, size>.

The entry of the compressed page i is:

1- At LAF page: i * ENTRY_LENGTH / pageSize (integer division)

2- Within that LAF page, the entry is at offset: i * ENTRY_LENGTH % pageSize

  • where ENTRY_LENGTH = 16 bytes (one <offset, size> pair of 64-bit integers)

Note that we could infer the size of a compressed page by looking at the next entry's offset. However, for the last entry in a LAF page, this would require reading the next page of the LAF file to determine the size. To keep things simple, we store the compressed page size explicitly as well.


Figure 1: Compressed file of n pages
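
As a worked example of the entry arithmetic above, the sketch below computes which LAF page holds the entry of page i and reads its <offset, size> pair. The class and helper names are illustrative, not actual AsterixDB code.

    import java.nio.ByteBuffer;

    final class LafEntryLocatorSketch {
        private static final int ENTRY_LENGTH = 16; // 64-bit offset + 64-bit size

        // Index of the LAF page that holds the entry of compressed page i.
        static long lafPageIndex(long i, int pageSize) {
            return i * ENTRY_LENGTH / pageSize;
        }

        // Byte offset of that entry inside its LAF page.
        static int entryOffsetInLafPage(long i, int pageSize) {
            return (int) (i * ENTRY_LENGTH % pageSize);
        }

        // Given the LAF page buffer, read the <offset, size> pair of page i.
        static long[] readEntry(ByteBuffer lafPage, long i, int pageSize) {
            int entryOffset = entryOffsetInLafPage(i, pageSize);
            long compressedPageOffset = lafPage.getLong(entryOffset);
            long compressedPageSize = lafPage.getLong(entryOffset + 8);
            return new long[] { compressedPageOffset, compressedPageSize };
        }
    }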

Ingestion Workflow:

Before a page is queued for write, the "page owner" is responsible for preparing the CompressedFileWriter first. This is needed to confiscate the required page(s) of the LAF file before the write can happen (see Figure 2). The CompressedFileWriter can be obtained by calling IBufferCache.getCompressedFileWriter(fileId). In addition to the bulkloader, the MetadataPageManager is also required to prepare the CompressedFileWriter before writing the metadata page (when marking the index as valid). The BufferCache syncs the LAF file before the index file to make sure everything is persisted to disk.

While writing a page, the BufferCache may find that compression did not save any space. In that case, it writes the page as is, without compression. Currently, we implement a naïve policy that requires at least 1 byte of savings.

Figure 2: Ingestion workflow for compressed indexes
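
The naive "at least 1 byte of savings" policy can be sketched as follows. This is a minimal illustration; the PageCompressor stand-in and the class name are assumptions, not the actual BufferCache code.

    import java.nio.ByteBuffer;

    // Minimal sketch: keep the compressed form only if it saves at least one byte.
    final class NaiveCompressionPolicySketch {
        interface PageCompressor { // stand-in for the configured compression scheme
            int compress(ByteBuffer uPage, ByteBuffer cBuffer);
        }

        // Returns the buffer whose contents should go to disk; the caller records
        // the chosen <offset, length> in this page's LAF entry.
        static ByteBuffer choosePayload(PageCompressor compressor, ByteBuffer page, ByteBuffer cBuffer) {
            int pageSize = page.remaining();
            int cLength = compressor.compress(page.duplicate(), cBuffer);
            if (cLength < pageSize) { // compression saved at least 1 byte
                cBuffer.position(0);
                cBuffer.limit(cLength);
                return cBuffer;       // LAF entry: <offset, cLength>
            }
            return page;              // no savings, write as is: <offset, pageSize>
        }
    }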


Query Workflow:

During read operations, the whole workflow is managed by the buffer cache (Figure 3). Reading a compressed page may incur two I/O operations (reads): one for the LAF page and one for the compressed page. However, the LAF file is usually small (a few pages) and will mostly stay cached during sequential reads.

Figure 3: Read compressed page
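
To make the two reads concrete, the following sketch outlines the read path for compressed page i. The PageSource and PageDecompressor stand-ins are illustrative, not the actual BufferCache internals.

    import java.nio.ByteBuffer;

    final class CompressedPageReaderSketch {
        private static final int ENTRY_LENGTH = 16;

        interface PageSource {       // stand-in for file I/O (or the buffer cache)
            ByteBuffer read(long offset, int length);
        }
        interface PageDecompressor { // stand-in for the decompressor
            void uncompress(ByteBuffer cPage, ByteBuffer uPage);
        }

        static ByteBuffer readPage(PageSource laf, PageSource index, PageDecompressor decompressor,
                                   long i, int pageSize) {
            // 1st read: the LAF page holding the entry of page i (often already cached).
            long lafPageIndex = i * ENTRY_LENGTH / pageSize;
            ByteBuffer lafPage = laf.read(lafPageIndex * pageSize, pageSize);
            int entryOffset = (int) (i * ENTRY_LENGTH % pageSize);
            long offset = lafPage.getLong(entryOffset);
            int size = (int) lafPage.getLong(entryOffset + 8);

            // 2nd read: the compressed page itself.
            ByteBuffer compressed = index.read(offset, size);
            if (size == pageSize) {
                return compressed;   // stored uncompressed (no savings at write time)
            }
            ByteBuffer uncompressed = ByteBuffer.allocate(pageSize);
            decompressor.uncompress(compressed, uncompressed);
            return uncompressed;
        }
    }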

Additional Compress/Uncompress Buffers:

For compress/decompress operations, the buffer cache reuses the buffer of BufferCacheHeaderHelper. When needed, it grows this header buffer to the size required by the compressor (which can be determined by calling ICompressorDecompressor.computeCompressBufferSize(uncompressedBufferSize)). As a result, the compressor/decompressor does not allocate any additional buffers for each read/write operation.
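
A minimal sketch of this buffer-reuse idea, assuming a computeCompressBufferSize-style worst-case bound is available (ReusableCompressBufferSketch is hypothetical, not AsterixDB code):

    import java.nio.ByteBuffer;

    // Reuse one buffer across operations, growing it only when the compressor's
    // worst-case bound exceeds the current capacity.
    final class ReusableCompressBufferSketch {
        private ByteBuffer buffer = ByteBuffer.allocate(0);

        // requiredSize would come from computeCompressBufferSize(uncompressedSize).
        ByteBuffer ensureCapacity(int requiredSize) {
            if (buffer.capacity() < requiredSize) {
                buffer = ByteBuffer.allocate(requiredSize);
            }
            buffer.clear();
            return buffer;
        }
    }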

Large Pages:

When reading or writing large pages, the BufferCache decomposes the large page into multiple pages and processes each one separately, in order to cap the size of the compressor/decompressor buffers.
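
A rough sketch of that decomposition (LargePageSplitterSketch is hypothetical; the point is only the chunking into pageSize-sized slices, each of which is then compressed or decompressed independently):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    // Split a large page into pageSize-sized chunks so each chunk can be
    // processed with a bounded compressor/decompressor buffer.
    final class LargePageSplitterSketch {
        static List<ByteBuffer> split(ByteBuffer largePage, int pageSize) {
            List<ByteBuffer> chunks = new ArrayList<>();
            for (int pos = 0; pos < largePage.limit(); pos += pageSize) {
                ByteBuffer chunk = largePage.duplicate();
                chunk.position(pos);
                chunk.limit(Math.min(pos + pageSize, largePage.limit()));
                chunks.add(chunk.slice());
            }
            return chunks;
        }
    }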

Experiments

This section details the experiments on compressed storage. The experiments were run on two datasets: 1) a SocialGen dataset and 2) synthetic tweets, and were conducted on two types of drives (an external HDD over USB 3.0 and an SSD).

Configuration Setup:

  • OS: OSX 10.11.6 (El Capitan)
  • Memory: 16GB
  • Hard drive read/write peaks:
    • SSD (Read: ~715MB/s, Write: ~640MB/s)
    • HDD (Read: ~100MB/s, Write: ~100MB/s)

AsterixDB Configuration:

  • Buffer cache: 7GB
  • Buffer cache page size: 256KB
  • Memory component budget: 2GB
  • Memory component page size: 64KB
  • Max writable datasets: 2

Social Gen (Data Scan):

  • GleambookMessages raw size: 46GB 
  • Comparing: Uncompressed vs. compressed with Snappy, LZ4, and LZ4HC
  • Indexes: authorId (B-Tree)
  • Load: Bulkload
  • # of IODevices: 2

Data Loading Time:

Time taken for bulk load (lower is better)

On-disk size:

Execution time:

Query: SELECT COUNT(*) FROM GleambookMessages

The query is executed 7 times, and the first two runs are dropped.

  • SSD Result (lower is better)

  • HDD Result (lower is better)

Twitter (Secondary index queries)

  • Raw size: 50GB 
  • Comparing: Uncompressed and compressed with Snappy (referred to as Compressed in the charts below)
  • Indexes: timestamp (B-Tree)
  • Load: Socket feed
  • # of IODevices: 1 (ONLY SSD)

This experiment is intended to show any impact of compression on queries with highly selective predicates.

Data Loading Time (lower is better):

On-disk size:

Execution time:

Point Lookup

Query: SELECT COUNT(*) FROM Tweets WHERE timestamp_ms = <TIMESTAMP>

    • Ordered Access:

We run the query with 3000 different timestamps in increasing order (timestamp1 < timestamp2).

      • Each timestamp corresponds to 1000 records.


      • Each timestamp corresponds to one record.

    • Random Access:

We run the query with 500 different timestamps in random order (the timestamps are randomly shuffled).

      • Each timestamp corresponds to 1000 records.


      • Each timestamp corresponds to one record.


