...

RecordWriter is the base interface implemented by all writers. A writer is responsible for taking a record in the form of a byte[] containing data in a known format (such as CSV) and writing it out in the format supported by Hive streaming. A RecordWriter may reorder or drop fields from the incoming record if necessary to map them to the corresponding columns in the Hive table. A streaming client instantiates an appropriate RecordWriter type and passes it to the TransactionBatch; the client does not interact with the RecordWriter directly thereafter. The TransactionBatch uses and manages the RecordWriter instance to perform I/O. See the Javadoc for details.
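
For orientation, here is a minimal sketch of that client-side flow, assuming the hive-hcatalog-streaming API and using placeholder metastore URI, database, table, and column names:

    import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
    import org.apache.hive.hcatalog.streaming.HiveEndPoint;
    import org.apache.hive.hcatalog.streaming.StreamingConnection;
    import org.apache.hive.hcatalog.streaming.TransactionBatch;

    public class StreamingClientSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint: metastore URI, database, table (unpartitioned, so no partition values).
            HiveEndPoint endPt = new HiveEndPoint("thrift://metastore:9083", "testDb", "alerts", null);
            StreamingConnection conn = endPt.newConnection(true);

            // The client instantiates the RecordWriter and hands it to the TransactionBatch;
            // after this point the batch, not the client, drives the writer.
            String[] fieldNames = {"id", "msg"};                 // assumed order of fields in the input
            DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", endPt);
            TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);

            txnBatch.beginNextTransaction();
            txnBatch.write("1,Hello streaming".getBytes());      // raw delimited bytes; the writer does the rest
            txnBatch.commit();
            txnBatch.close();
            conn.close();
        }
    }

The writer instance is handed over once, when the transaction batch is fetched; all subsequent writes go through the TransactionBatch.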

A RecordWriter's primary functions are:

  1. Modify input record: This may involve dropping fields from the input data if they don't have corresponding table columns, adding nulls for fields that are missing for certain columns, and changing the order of incoming fields to match the order of the table's columns. This task requires an understanding of the incoming data format; not all formats need this step (for example JSON, which includes field names in the data). A standalone sketch of this step appears after this list.
  2. Encode modified record: The encoding involves serialization using an appropriate Hive SerDe.
  3. Identify the bucket to which the record belongs.
  4. Write encoded record to Hive using the AcidOutputFormat's record updater for the appropriate bucket.
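
As an illustration of step 1 only (plain Java, independent of the Hive classes; the table column order and input field names are assumptions), here is how a writer might reorder delimited input fields to match the table's column order and insert nulls for missing columns:

    import java.util.Arrays;
    import java.util.List;

    public class FieldReorderSketch {
        public static void main(String[] args) {
            // Assumed table column order and incoming field order (hypothetical names).
            List<String> tableColumns = Arrays.asList("id", "msg", "severity");
            List<String> inputFields  = Arrays.asList("msg", "id");   // order differs; "severity" is absent

            String[] record = "Hello streaming,1".split(",", -1);

            // Reorder input values into table column order; missing columns become null.
            String[] reordered = new String[tableColumns.size()];
            for (int col = 0; col < tableColumns.size(); col++) {
                int idx = inputFields.indexOf(tableColumns.get(col));
                reordered[col] = idx >= 0 ? record[idx] : null;
            }
            System.out.println(Arrays.toString(reordered));   // prints [1, Hello streaming, null]
        }
    }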

DelimitedInputWriter

Class DelimitedInputWriter accepts input records in delimited formats (such as CSV) and writes them to Hive. It reorders the fields if needed and serializes the record into an Object using LazySimpleSerDe, which is then passed on to the underlying AcidOutputFormat's record updater for the appropriate bucket. See the Javadoc.
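
Continuing the earlier sketch (same endPt, hypothetical field names), constructing the writer with an input field order that differs from the table's column order shows the reordering in action; the three-argument constructor takes the input field names, the delimiter, and the endpoint:

    // Incoming records arrive as "msg,id" while the table columns are (id, msg);
    // DelimitedInputWriter maps each named input field to the matching table column.
    String[] inputFieldOrder = {"msg", "id"};
    DelimitedInputWriter writer = new DelimitedInputWriter(inputFieldOrder, ",", endPt);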

...

This is a base class that contains some of the common code needed by RecordWriter objects, such as schema lookup and computing the bucket to which a record belongs.
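
As a conceptual illustration of the bucket computation only (simplified: Hive derives the hash from the bucketing columns via its ObjectInspectorUtils rather than String.hashCode, and the real logic lives in the base class):

    public class BucketSketch {
        // Hive-style bucket selection: clear the sign bit, then take the remainder
        // over the table's bucket count.
        static int bucketFor(int hashCode, int numBuckets) {
            return (hashCode & Integer.MAX_VALUE) % numBuckets;
        }

        public static void main(String[] args) {
            int numBuckets = 4;                          // assumed bucket count of the target table
            int hash = "someBucketingKey".hashCode();    // stand-in for the record's bucketing-column hash
            System.out.println("bucket = " + bucketFor(hash, numBuckets));
        }
    }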

...