...

  1. Modify input record: This may involve dropping fields from the input data if they don’t have corresponding table columns, adding nulls for columns that are missing from the input, and changing the order of the incoming fields to match the order of the columns in the table. This task requires an understanding of the incoming data format (see the sketch after this list).
  2. Encode modified record: The encoding often involves serialization using an appropriate Hive SerDe. This task is agnostic of the incoming data format.
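
To make the first task concrete, the following is a minimal sketch of field reordering. The helper and every name in it are invented for illustration; nothing here is part of the streaming API.

    import java.util.List;

    // Hypothetical helper: reorder incoming fields to match the table's column
    // order, dropping extra input fields and filling missing columns with null.
    static String[] reorderFields(String[] values, List<String> incomingFieldNames,
                                  List<String> tableColumnNames) {
        String[] out = new String[tableColumnNames.size()];
        for (int i = 0; i < tableColumnNames.size(); i++) {
            int src = incomingFieldNames.indexOf(tableColumnNames.get(i));
            out[i] = (src >= 0 && src < values.length) ? values[src] : null;  // null if field is missing
        }
        return out;
    }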

DelimitedInputWriter

Class DelimitedInputWriter accepts input records in delimited formats (such as CSV) and writes them to Hive. It implements only the input format specific task of modifying the input record, reordering the incoming fields if needed to match the order of the columns in the table; the input format agnostic task of encoding the modified record is delegated to its abstract base class AbstractLazySimpleRecordWriter.  See the Javadoc for details.
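
As a usage illustration, a writer for a comma-separated stream can be set up as follows. This is a minimal sketch: the metastore URI, database, table, partition values, and field names are invented for the example.

    import java.util.Arrays;
    import org.apache.hive.hcatalog.streaming.*;

    public class DelimitedWriteExample {
        public static void main(String[] args) throws Exception {
            // Endpoint details below are placeholders for the example
            HiveEndPoint endPt = new HiveEndPoint("thrift://metastore-host:9083",
                    "testing", "alerts", Arrays.asList("Asia", "India"));
            StreamingConnection connection = endPt.newConnection(true);

            // Order of fields as they appear in the incoming delimited records
            String[] fieldNames = {"id", "msg"};
            DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", endPt);

            TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
            txnBatch.beginNextTransaction();
            txnBatch.write("1,Hello streaming".getBytes());
            txnBatch.commit();
            txnBatch.close();
            connection.close();
        }
    }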

AbstractLazySimpleRecordWriter

This abstract base class provides the bulk of the RecordWriter implementation. It internally uses a LazySimpleSerDe to transform the modified byte[] record into the Object required by the underlying writer APIs; that Object is then passed on to the AcidOutputFormat's record updater for the appropriate bucket.  See the Javadoc.
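
To show the encoding step in isolation, the sketch below configures a LazySimpleSerDe by hand for a hypothetical (id int, msg string) schema and deserializes a Ctrl-A delimited byte[] into an Object; the schema and input are invented for the example.

    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde.serdeConstants;
    import org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe;
    import org.apache.hadoop.io.BytesWritable;

    public class SerDeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical two-column schema for the example
            Properties props = new Properties();
            props.setProperty(serdeConstants.LIST_COLUMNS, "id,msg");
            props.setProperty(serdeConstants.LIST_COLUMN_TYPES, "int:string");

            LazySimpleSerDe serde = new LazySimpleSerDe();
            serde.initialize(new Configuration(), props);

            // "\001" is LazySimpleSerDe's default field delimiter
            Object row = serde.deserialize(new BytesWritable("1\001Hello".getBytes()));
            System.out.println(row);
        }
    }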

StrictJsonWriter

StrictJsonWriter accepts input records in strict JSON format and writes them to Hive. It serializes the JSON record directly into an Object using JsonSerDe, which is then passed on to the underlying AcidOutputFormat's record updater for the appropriate bucket.  See the Javadoc.
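
Usage mirrors the delimited case. A minimal sketch, reusing the endpoint and connection from the earlier example:

    StrictJsonWriter jsonWriter = new StrictJsonWriter(endPt);
    TransactionBatch txnBatch = connection.fetchTransactionBatch(10, jsonWriter);
    txnBatch.beginNextTransaction();
    // The JSON record is written as-is; no field reordering step is needed
    txnBatch.write("{\"id\":1,\"msg\":\"Hello streaming\"}".getBytes());
    txnBatch.commit();
    txnBatch.close();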

AbstractRecordWriter

This is a base class that handles some of the common code needed by RecordWriter objects, such as schema lookup and computing the bucket into which a record should belong.
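
For intuition, bucket computation boils down to a non-negative hash of the record's bucketing columns taken modulo the table's bucket count. The helper below is a conceptual sketch only, not the actual Hive code.

    // Conceptual only: given a hash of the record's bucketing columns,
    // pick the bucket by non-negative modulo.
    static int bucketFor(int hashOfBucketColumns, int numBuckets) {
        return (hashOfBucketColumns & Integer.MAX_VALUE) % numBuckets;
    }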

Error Handling

It's imperative for the proper functioning of the system that the client of this API handle errors correctly.  Once a TransactionBatch has been obtained, any exception thrown from the TransactionBatch (except SerializationError) should cause the client to call TransactionBatch.abort() to abort the current transaction, then call TransactionBatch.close(), and start a new batch to write more data and/or redo the work of the transaction during which the failure occurred.  Not following this may, in rare cases, cause file corruption.  Furthermore, a StreamingException should ideally cause the client to perform exponential back-off before starting a new batch.  This will help the cluster stabilize, since the most likely reason for these failures is HDFS overload.
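
Putting these rules together, a client loop might look like the following sketch, assuming the classes described above; the record source, batch size, and back-off constants are invented for the example.

    import java.util.List;
    import org.apache.hive.hcatalog.streaming.*;

    // Write all records in one transaction, following the abort/close/retry
    // discipline described above.
    static void writeWithRetry(StreamingConnection connection, RecordWriter writer,
                               List<byte[]> records) throws Exception {
        int attempt = 0;
        while (true) {
            TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
            try {
                txnBatch.beginNextTransaction();
                for (byte[] rec : records) {
                    txnBatch.write(rec);
                }
                txnBatch.commit();
                return;                              // success
            } catch (SerializationError e) {
                throw e;                             // data problem; retrying won't help
            } catch (StreamingException e) {
                txnBatch.abort();                    // abort the failed transaction
                // Exponential back-off (capped at 60s) before opening a new batch,
                // giving an overloaded cluster time to stabilize.
                Thread.sleep(Math.min(60_000L, 1000L << Math.min(attempt++, 6)));
            } finally {
                txnBatch.close();                    // always close the batch
            }
        }
    }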

...