
Design Proposal of Kite Connector

Background

The Kite SDK is an open source set of libraries for building data-oriented systems and applications. With the Kite dataset API, you can perform tasks such as reading a dataset, defining and reading views of a dataset, and processing a dataset with MapReduce.

Sqoop 1 recently added support for the Parquet file format on HDFS/Hive using the Kite SDK (SQOOP-1366). This proposal is to create a Kite dataset connector for Sqoop 2 that is able to access HDFS and Hive datasets.

Requirements

  1. Ability to write to a new HDFS/Hive dataset via the Kite connector, in various file storage formats (Avro, Parquet and CSV) and compression codecs (Uncompressed, Snappy, Deflate, etc.).
  2. Ability to read an entire HDFS/Hive dataset via the Kite connector.
  3. Ability to specify a partition strategy.
  4. Ability to support delta writes to an HDFS/Hive dataset.
  5. Ability to read a subset of an HDFS/Hive dataset using constraints.

Design

  1. Config objects:
    • LinkConfig stores arguments that override environment settings, e.g. the host and port of the NameNode (see also SQOOP-1751).
    • ToJobConfig includes the arguments that the Kite CLI provides for import. The dataset URI is mandatory. User input is validated in place. (A config sketch follows the Design list.)
    • FromJobConfig includes the arguments that the Kite CLI provides for export. The dataset URI is mandatory. User input is validated in place.
  2. Write data into a new dataset (SQOOP-1588), as sketched after this list:
    • The job will fail if the target dataset already exists.
    • Every KiteDatasetLoader will create a temporary dataset and write data into it. The temporary dataset's name must be unique and not already in use.
    • If the job completes successfully, all temporary datasets will be merged into one.
    • If the job fails, all temporary datasets will be removed.
    • As Kite uses Avro, data records will be converted from Sqoop objects to Avro objects.
  3. Read data from a dataset (SQOOP-1647), as sketched after this list:
    • The job will fail if the target dataset does not exist or is not accessible.
    • Every KiteDatasetPartition should contain partition strategy information. If no partition strategy is specified, there will be only one partition.
    • Every KiteDatasetExtractor will read data from its partition.
    • Fault handling is not the responsibility of the Kite connector in read mode.
    • As Kite uses Avro, data records will be converted from Avro objects to Sqoop objects.
  4. Partition strategy handling (SQOOP-1942).
  5. Delta writing:
    • If the target dataset does not exist, fall back to item 2.
    • If the target dataset exists, its schema must be consistent with the incoming schema; otherwise the job fails.
    • Should handle overwrite/ignore behavior for duplicate records.
    • Most of the implementation should follow item 2.
  6. Read data from a dataset with constraints:
    • Build a view query to read the data (a view query sketch follows the Design list).
    • Most of the implementation should follow item 3.
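
The following is a minimal, illustrative sketch of one of the proposed config objects (item 1), assuming the Sqoop 2 connector SDK annotations (org.apache.sqoop.model.ConfigClass and Input); the field names and example URIs are not final.

    import org.apache.sqoop.model.ConfigClass;
    import org.apache.sqoop.model.Input;

    @ConfigClass
    public class ToJobConfig {

      // Mandatory dataset URI, e.g. "dataset:hdfs://<namenode>:8020/path/to/dataset"
      // or a Hive dataset URI (URI forms shown here are illustrative).
      @Input(size = 255)
      public String uri;

      // Optional storage format; expected values would be AVRO, PARQUET or CSV.
      @Input
      public String fileFormat;
    }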
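
The next sketch illustrates the write path of item 2, assuming the Kite SDK public API (org.kitesdk.data). The temporary-URI naming scheme and the merge step are placeholders for what the loader and destroyer would actually do.

    import java.util.UUID;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetDescriptor;
    import org.kitesdk.data.DatasetWriter;
    import org.kitesdk.data.Datasets;
    import org.kitesdk.data.Formats;

    public class KiteWriteSketch {

      // Each KiteDatasetLoader writes into its own uniquely named temporary dataset.
      public static void writeTemporary(String targetUri, Schema schema,
          Iterable<GenericRecord> records) {
        String tempUri = targetUri + "_temp_" + UUID.randomUUID().toString().replace("-", "");

        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schema(schema)            // records have already been converted from Sqoop to Avro
            .format(Formats.PARQUET)   // or Formats.AVRO / Formats.CSV, per ToJobConfig
            .build();

        Dataset<GenericRecord> temp =
            Datasets.create(tempUri, descriptor, GenericRecord.class);
        DatasetWriter<GenericRecord> writer = temp.newWriter();
        try {
          for (GenericRecord record : records) {
            writer.write(record);
          }
        } finally {
          writer.close();
        }

        // On successful job completion the temporary datasets are merged into the
        // target dataset; on failure each one is dropped, e.g. Datasets.delete(tempUri).
      }
    }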
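
A corresponding sketch of the read path of item 3, again assuming the Kite SDK public API; it shows the single-partition fallback where one extractor reads the whole dataset.

    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetReader;
    import org.kitesdk.data.Datasets;

    public class KiteReadSketch {

      public static void readAll(String uri) {
        // Fails fast if the dataset does not exist or is not accessible.
        Dataset<GenericRecord> dataset = Datasets.load(uri, GenericRecord.class);

        DatasetReader<GenericRecord> reader = dataset.newReader();
        try {
          while (reader.hasNext()) {
            GenericRecord record = reader.next();
            // Here each Avro record would be converted back into a Sqoop record
            // and handed to the data writer.
          }
        } finally {
          reader.close();
        }
      }
    }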
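
For item 6, Kite's RefinableView API can express the constraints as a view query before a reader is opened; the field name and value below are purely illustrative.

    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.DatasetReader;
    import org.kitesdk.data.Datasets;
    import org.kitesdk.data.RefinableView;
    import org.kitesdk.data.View;

    public class KiteViewSketch {

      public static DatasetReader<GenericRecord> openConstrainedReader(String uri) {
        // A Dataset is itself a RefinableView, so constraints can be layered onto it.
        RefinableView<GenericRecord> view = Datasets.load(uri, GenericRecord.class);

        // Read only the slice matching the constraint, e.g. a hypothetical
        // "country" field equal to "US", instead of the entire dataset.
        View<GenericRecord> constrained = view.with("country", "US");
        return constrained.newReader();
      }
    }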

Testing

  1. Unit testing to ensure the correctness of utilities.
  2. Integration testing to ensure data can be moved from JDBC to HDFS/Hive.
  3. Performance testing is expected to compare the HDFS connector and the Kite connector.

Known Limitations

  1. As Sqoop 2 does not allow specifying an InputFormat and OutputFormat, data reading can be inefficient because we cannot create concurrent data readers, especially for an unpartitioned dataset.
  2. The implementation of HBase support (SQOOP-1744) will be different.

 
