Design Proposal for the Kite Connector
Background
Kite SDK is an open source set of libraries for building data-oriented systems and applications. With the Kite dataset API, you can perform tasks such as reading a dataset, defining and reading views of a dataset, and processing a dataset with MapReduce.
Recently, Sqoop 1 added support for the Parquet file format on HDFS/Hive using the Kite SDK (SQOOP-1366). This JIRA proposes to create a Kite dataset connector for Sqoop 2 that is able to access HDFS and Hive datasets.
Requirements
- Ability to write to a new HDFS/Hive dataset via the Kite connector, with a choice of file storage formats (Avro, Parquet, and CSV) and compression codecs (Uncompressed, Snappy, Deflate, etc.).
- Ability to read an entire HDFS/Hive dataset via the Kite connector.
- Ability to specify a partition strategy.
- Ability to support delta writes to an existing HDFS/Hive dataset.
- Ability to read part of an HDFS/Hive dataset using constraints.
Design
- Config objects:
- LinkConfig stores arguments that override environment variables, e.g. the host and port of the namenode (see also SQOOP-1751).
- ToJobConfig includes the arguments that the Kite CLI provides for import. The dataset URI is mandatory. User input validation happens in place.
- FromJobConfig includes the arguments that the Kite CLI provides for export. The dataset URI is mandatory. User input validation happens in place. (A sketch of these config objects follows this list.)
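The sketch below illustrates how these config objects might be declared, assuming Sqoop 2's @ConfigClass/@Input annotation model; the field names (authority, uri, fileFormat) are illustrative, not final:

    // Hypothetical sketch of the Kite connector config objects.
    // Field names are illustrative; validators are omitted.
    import org.apache.sqoop.model.ConfigClass;
    import org.apache.sqoop.model.Input;

    @ConfigClass
    public class LinkConfig {
      // Overrides environment defaults, e.g. "namenode.example.com:8020".
      @Input(size = 255)
      public String authority;
    }

    @ConfigClass
    public class ToJobConfig {
      // Mandatory Kite dataset URI, e.g. "dataset:hdfs:/path/to/ds"
      // or "dataset:hive:default/my_table".
      @Input(size = 255)
      public String uri;

      // Optional storage format: avro (default), parquet or csv.
      @Input(size = 10)
      public String fileFormat;
    }

FromJobConfig would mirror ToJobConfig, with a mandatory dataset URI field.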
- Write data into a new dataset (SQOOP-1588):
- The job will fail if the target dataset already exists.
- Every KiteDatasetLoader will create a temporary dataset and write data into it. The name of each temporary dataset is expected to be unique and new.
- If the job completes successfully, all temporary datasets will be merged into one.
- If the job fails, all temporary datasets will be removed.
- As Kite uses Avro internally, data records will be converted from Sqoop objects to Avro objects (see the sketch below).
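A minimal sketch of the write path, assuming the Kite 0.17 data API (Datasets, DatasetDescriptor, DatasetWriter); the temporary dataset URI, schema, and record fields are illustrative:

    // Sketch of the write path (not the actual connector code).
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.CompressionType;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetDescriptor;
    import org.kitesdk.data.DatasetWriter;
    import org.kitesdk.data.Datasets;
    import org.kitesdk.data.Formats;

    public class KiteWriteSketch {
      public static void main(String[] args) {
        // In the connector, the schema is derived from the Sqoop job schema.
        Schema schema = SchemaBuilder.record("example").fields()
            .requiredLong("id").requiredString("name").endRecord();

        // Each KiteDatasetLoader writes into its own uniquely named
        // temporary dataset; "temp_loader_0" is a placeholder name.
        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schema(schema)
            .format(Formats.PARQUET)
            .compressionType(CompressionType.Snappy)
            .build();
        Dataset<GenericRecord> temp =
            Datasets.create("dataset:hdfs:/tmp/temp_loader_0", descriptor);

        // Sqoop records are converted to Avro GenericRecords and written.
        DatasetWriter<GenericRecord> writer = temp.newWriter();
        try {
          GenericRecord record = new GenericData.Record(schema);
          record.put("id", 1L);
          record.put("name", "alice");
          writer.write(record);
        } finally {
          writer.close();
        }

        // On success, temporary datasets are merged into the target;
        // on failure, each can be dropped with Datasets.delete(uri).
      }
    }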
- Read data from a dataset (SQOOP-1647):
- The job will fail if the target dataset does not exist or is not accessible.
- Every KiteDatasetPartition should contain partition strategy information. If no partition strategy is specified, there will be only one partition.
- Every KiteDatasetExtractor will read data from its partition.
- Fault handling is not an obligation of the Kite connector in read mode.
- As Kite uses Avro internally, data records will be converted from Avro objects to Sqoop objects (see the sketch below).
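A minimal sketch of the read path under the same Kite 0.17 API assumptions; the dataset URI is illustrative:

    // Sketch of the read path (not the actual connector code).
    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetReader;
    import org.kitesdk.data.Datasets;

    public class KiteReadSketch {
      public static void main(String[] args) {
        // Throws DatasetNotFoundException if the URI does not resolve.
        Dataset<GenericRecord> dataset =
            Datasets.load("dataset:hive:default/users", GenericRecord.class);

        // In the connector, each extractor reads only its own partition;
        // here the whole dataset is read for simplicity.
        DatasetReader<GenericRecord> reader = dataset.newReader();
        try {
          for (GenericRecord record : reader) {
            // Avro record -> Sqoop object conversion would happen here.
            System.out.println(record);
          }
        } finally {
          reader.close();
        }
      }
    }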
- Partition strategy handling (SQOOP-1942):
- For writing data, if no partition strategy is specified, the dataset will be unpartitioned.
- For reading data, if the given dataset has a partition strategy, it should be used (see the sketch below).
- Reference: http://kitesdk.org/docs/0.17.1/Partitioned-Datasets.html
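For illustration, a partition strategy could be attached to the dataset descriptor as below, assuming Kite's PartitionStrategy builder; the record fields, bucket count, and time-based partitioners are illustrative:

    // Sketch of building a partitioned dataset descriptor.
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.kitesdk.data.DatasetDescriptor;
    import org.kitesdk.data.PartitionStrategy;

    public class PartitionStrategySketch {
      public static DatasetDescriptor descriptor() {
        Schema schema = SchemaBuilder.record("event").fields()
            .requiredLong("id").requiredLong("created_at").endRecord();

        // Hash-partition by "id" into 16 buckets, then sub-partition
        // by year/month derived from the "created_at" timestamp.
        PartitionStrategy strategy = new PartitionStrategy.Builder()
            .hash("id", 16)
            .year("created_at")
            .month("created_at")
            .build();

        return new DatasetDescriptor.Builder()
            .schema(schema)
            .partitionStrategy(strategy)
            .build();
      }
    }

When reading, the existing strategy can be recovered from the loaded dataset via its descriptor (getDescriptor().getPartitionStrategy()).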
- Delta writing:
- If the target dataset does not exist, fall back to section 2 (write data into a new dataset).
- If the target dataset exists, the schemas must be consistent; otherwise the job fails.
- Duplicate records should be handled according to an overwrite/ignore option.
- Most of the implementation should follow section 2 (a sketch of the pre-checks follows this list).
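A sketch of the delta-write pre-checks, assuming Kite's Datasets.exists/Datasets.load API; the error handling is simplified:

    // Sketch of the delta-write pre-checks (not the actual connector code).
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.Datasets;

    public class DeltaWriteCheckSketch {
      public static void checkTarget(String uri, Schema jobSchema) {
        if (!Datasets.exists(uri)) {
          // Fall back to section 2: create a new dataset and write.
          return;
        }
        Dataset<GenericRecord> target = Datasets.load(uri);
        Schema existing = target.getDescriptor().getSchema();
        // Schemas must be consistent; otherwise the job fails fast.
        if (!existing.equals(jobSchema)) {
          throw new IllegalArgumentException(
              "Schema of " + uri + " does not match the job schema");
        }
      }
    }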
- Read data from a dataset with constraints:
- Build a view query to read the data (a sketch follows this list).
- Most of the implementation should follow section 3.
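A sketch of a constrained read built on Kite's view API (with/from/toBefore); the dataset URI, field names, and bounds are illustrative:

    // Sketch of a constrained read via a Kite view.
    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetReader;
    import org.kitesdk.data.Datasets;
    import org.kitesdk.data.View;

    public class KiteViewReadSketch {
      public static void main(String[] args) {
        Dataset<GenericRecord> dataset =
            Datasets.load("dataset:hive:default/users", GenericRecord.class);

        // Restrict to records where country == "US" and 100 <= id < 200.
        View<GenericRecord> view = dataset
            .with("country", "US")
            .from("id", 100L)
            .toBefore("id", 200L);

        DatasetReader<GenericRecord> reader = view.newReader();
        try {
          for (GenericRecord record : reader) {
            System.out.println(record);
          }
        } finally {
          reader.close();
        }
      }
    }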
Testing
- Unit testing to ensure the correctness of utilities.
- Integration testing to ensure data can be moved from JDBC to HDFS/Hive.
- Performance testing is expected to compare the HDFS connector and the Kite connector.
Known Limitations
- As Sqoop 2 does not allow specifying an InputFormat and OutputFormat, data reading can be inefficient because we cannot create concurrent data readers, especially for an unpartitioned dataset.
- The implementation of HBase support (SQOOP-1744) will be different.