Design Proposal of Kite Connector

  • Author: Qian Xu
  • Revision: 2
  • Date: 2015/1/16

Background

The Kite SDK is an open-source set of libraries for building data-oriented systems and applications. With the Kite dataset API, you can perform tasks such as reading a dataset, defining and reading views of a dataset, and using MapReduce to process a dataset.

Sqoop 1 recently added support for the Parquet file format for HDFS/Hive using the Kite SDK (SQOOP-1366). SQOOP-1529 proposes creating a Kite dataset connector for Sqoop 2 that is able to access HDFS and Hive datasets. The behavior is expected to be similar to what the Kite CLI does.

Requirements

  1. Ability to write to a new HDFS/Hive dataset via the Kite connector, in diverse file storage formats (Avro, Parquet and experimentally CSV) and compression codecs (Uncompressed, Snappy, Deflate, etc.).
  2. Ability to read an entire HDFS/Hive dataset via the Kite connector.
  3. Ability to specify the partition strategy.
  4. Ability to support delta writes to an HDFS/Hive dataset.
  5. Ability to read partially from an HDFS/Hive dataset with constraints.

Design

  1. Config objects:
    • ToJobConfig includes the arguments that the Kite CLI provides for import.
      1. Dataset URI is mandatory.
      2. Output storage format (Enum: Avro, Parquet or experimentally CSV) is mandatory.
      3. Compression codec (Enum: Default, Avro or Deflate) is optional. (No JIRA yet.)
      4. Path to a JSON file which defines the partition strategy is optional. (No JIRA yet.)
      5. User input validation will happen in place.
    • FromJobConfig includes the arguments that the Kite CLI provides for export.
      1. Dataset URI is mandatory.
      2. User input validation will happen in place.
    • LinkConfig is intended to store credential properties.
      1. For example, the host and port of the namenode, and the host and port of the Hive metastore. Imagine we build role-based access control: a user is able to access a particular ToJobConfig and FromJobConfig, but only an admin is able to access the LinkConfig. The admin does not want users to know or change the address of the namenode, so LinkConfig is the right place for credential properties.
      2. SQOOP-1751 has some discussion about this. (A config-class sketch follows this item.)
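The sketch below shows one way the three config objects could look, assuming Sqoop 2's @ConfigClass/@Input annotation style; every field name and the FileFormat/CodecType enums are illustrative assumptions rather than the final connector API.

```java
// Illustrative sketch only; field names, enums and annotation usage are assumptions.
import org.apache.sqoop.model.ConfigClass;
import org.apache.sqoop.model.Input;

@ConfigClass
class ToJobConfig {
  @Input String uri;                    // dataset URI (mandatory)
  @Input FileFormat fileFormat;         // AVRO, PARQUET or (experimentally) CSV (mandatory)
  @Input CodecType compressionCodec;    // optional
  @Input String partitionStrategyFile;  // path to a partition strategy JSON file (optional)
}

@ConfigClass
class FromJobConfig {
  @Input String uri;                    // dataset URI (mandatory)
}

@ConfigClass
class LinkConfig {
  // Credential-style properties that only an admin should see or change.
  @Input String namenodeHost;
  @Input Integer namenodePort;
  @Input String hiveMetastoreHost;
  @Input Integer hiveMetastorePort;
}

// Hypothetical enums backing the two choices above.
enum FileFormat { AVRO, PARQUET, CSV }
enum CodecType  { DEFAULT, AVRO, DEFLATE }
```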
  2. Write data into a new dataset:
    • The job will fail if the target dataset already exists.
    • Every KiteDatasetLoader will create a temporary dataset and write data into it. The name of each temporary dataset is expected to be unique and new.
    • If the job finishes successfully, all temporary datasets will be merged into one.
    • If the job fails, all temporary datasets will be removed.
    • As Kite uses Avro, data records will be converted from Sqoop objects (FixedPoint, Text, etc.) to Avro objects. (See also Future Work #3 and the loader sketch below.)
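A minimal loader-side sketch of the flow above, assuming Kite's Datasets / DatasetDescriptor / DatasetWriter APIs; the temporary-dataset URI, the schema handling and the merge/cleanup steps (left as comments) are assumptions, not the final implementation.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.Formats;

class KiteLoaderSketch {
  // Write one loader's share of records into a uniquely named temporary dataset.
  void load(String tempDatasetUri, Schema avroSchema, Iterable<Object[]> rows) {
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schema(avroSchema)          // Avro schema derived from the Sqoop schema
        .format(Formats.PARQUET)     // or Formats.AVRO / Formats.CSV, per ToJobConfig
        // .partitionStrategy(...)   // optional, parsed from the partition strategy JSON
        .build();

    Dataset<GenericRecord> temp =
        Datasets.create(tempDatasetUri, descriptor, GenericRecord.class);
    DatasetWriter<GenericRecord> writer = temp.newWriter();
    try {
      for (Object[] row : rows) {
        // Map each Sqoop column value (FixedPoint, Text, ...) onto the matching
        // Avro field; the actual type conversion is elided here.
        GenericRecordBuilder builder = new GenericRecordBuilder(avroSchema);
        for (int i = 0; i < row.length; i++) {
          builder.set(avroSchema.getFields().get(i), row[i]);
        }
        writer.write(builder.build());
      }
    } finally {
      writer.close();
    }
    // On job success, the destroyer would merge all temporary datasets into the
    // target dataset and remove them; on failure, it would simply delete them
    // (e.g. via Datasets.delete(tempDatasetUri)).
  }
}
```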
  3. Read data from a dataset:
    • The job will fail if the target dataset does not exist or is not accessible.
    • Every KiteDatasetPartition should contain partition strategy information. If it is not specified, there will be only one partition.
    • Every KiteDatasetExtractor will read data from its partition.
    • If an error occurs during reading, a SqoopException will be thrown.
    • As Kite uses Avro, data records will be converted from Avro objects to Sqoop objects (FixedPoint, Text, etc.). (See also Future Work #3 and the extractor sketch below.)
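Correspondingly, a sketch of the extractor-side read path (an assumption, not the final code): each KiteDatasetExtractor could open a Kite reader over the URI of its partition and convert records back into Sqoop objects.

```java
import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;

class KiteExtractorSketch {
  // Read every record from the dataset (or partition) identified by the given URI.
  void extract(String uri) {
    Dataset<GenericRecord> dataset = Datasets.load(uri, GenericRecord.class);
    DatasetReader<GenericRecord> reader = dataset.newReader();
    try {
      while (reader.hasNext()) {
        GenericRecord record = reader.next();
        // Convert Avro values back to Sqoop objects (FixedPoint, Text, ...) and
        // emit them through Sqoop's data writer; any failure here would be
        // wrapped in a SqoopException by the connector.
      }
    } finally {
      reader.close();
    }
  }
}
```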
  4. Partition strategy handling:
  5. Incremental Import:
    • A ToJobConfig property "bool: AppendMode" is required.
    • If the target dataset does not exist, the job will fail.
    • If the target dataset exists, the implementation will defensively check dataset metadata (e.g. schema, partition strategy), as sketched below.
    • It will only append records to the existing dataset. Failures caused by duplicate records are not handled.
    • Most of the implementation should follow section 2.
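The defensive metadata check for append mode might look roughly like the following; strict schema equality is used here purely for illustration, and a real connector would throw a SqoopException and likely apply Avro schema-compatibility rules instead.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Datasets;

class KiteAppendCheckSketch {
  // Fail fast unless an existing dataset looks compatible with the records to append.
  Dataset<GenericRecord> openForAppend(String uri, Schema expectedSchema) {
    if (!Datasets.exists(uri)) {
      throw new IllegalStateException("Append mode requires an existing dataset: " + uri);
    }
    Dataset<GenericRecord> dataset = Datasets.load(uri, GenericRecord.class);
    DatasetDescriptor descriptor = dataset.getDescriptor();
    if (!descriptor.getSchema().equals(expectedSchema)) {
      // The partition strategy should be compared defensively as well.
      throw new IllegalStateException("Schema mismatch, refusing to append to " + uri);
    }
    return dataset;  // subsequent writes simply append new records
  }
}
```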
  6. Read data from a dataset with constraints:
    • A FromJobConfig property "str: Constraint" is required.
    • Build a view query to read the data (see the sketch below).
    • Most of the implementation should follow section 3.
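The view query could be built with Kite's refinable view API, as in the sketch below; the single equality constraint and the field/value parameters are illustrative assumptions, and parsing the free-form constraint string from FromJobConfig is left open.

```java
import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.RefinableView;

class KiteConstraintReadSketch {
  // Read only records satisfying a simple equality constraint, e.g. country = "US".
  void extractWithConstraint(String uri, String field, Object value) {
    Dataset<GenericRecord> dataset = Datasets.load(uri, GenericRecord.class);
    RefinableView<GenericRecord> view = dataset.with(field, value);  // the view query
    DatasetReader<GenericRecord> reader = view.newReader();
    try {
      while (reader.hasNext()) {
        GenericRecord record = reader.next();
        // convert and emit exactly as in the unconstrained read path
      }
    } finally {
      reader.close();
    }
  }
}
```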

Testing

  1. Unit testing to ensure the correctness of utils.
  2. Integration testing to ensure data can be moved from JDBC to HDFS/Hive.

Future Work

  1. As Sqoop 2 does not allow specifying an InputFormat and OutputFormat, data reading can be inefficient because we cannot create concurrent data readers, especially for an un-partitioned dataset. This still needs some investigation with the Kite team to find a solution.
  2. HBase support (SQOOP-1744) will be an individual improvement to the original design.
  3. The current implementation uses the default IDF class (CSVIDF) for data conversion. Recently we have introduced AvroIDF. As Kite uses Avro internally, it makes sense to use AvroIDF instead of CSVIDF. This will involve two things:
    1. Clean up AvroTypeUtil and KiteDataTypeUtil.
    2. AvroIDF will be responsible for converting every Sqoop data type (FixedPoint, Text, etc.) to the corresponding Avro representation.
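As a rough illustration of that conversion work (the record name and the column mapping below are assumptions, not the final AvroIDF code), Sqoop column types would be mapped onto Avro schema types, for example:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

class SqoopToAvroSchemaSketch {
  // Build an Avro schema for a hypothetical two-column Sqoop schema:
  // a FixedPoint column ("id") and a Text column ("name").
  // A complete AvroIDF would cover every Sqoop data type (Decimal, Date, Binary, ...).
  Schema buildSchema() {
    return SchemaBuilder.record("sqoop_record").fields()
        .requiredLong("id")      // Sqoop FixedPoint -> Avro long
        .requiredString("name")  // Sqoop Text       -> Avro string
        .endRecord();
  }
}
```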