Target release
Epic
Document status: DRAFT
Document owner: Joe Witt
Designer: Bryan Bende
Developers
QA

Goals

  • Provide a good user experience and feature set for dataflows involving Avro-formatted data, including the ability to easily view, edit, split, combine, and route such data.

Background and strategic fit

Avro is increasingly common in and around Big Data projects. We should build a content viewer for Avro data that lets a user inspect the content of a given Avro message based on its schema. We should also provide a mechanism to manipulate the content of Avro messages, both to insert or update values and to perform schema evolution or transformation. Avro data tends to arrive in bundles, so splitting them is useful for handling and routing individual messages. The reverse is also true: it is useful to be able to merge Avro messages that share a like schema. Finally, being able to run queries against Avro data to make routing decisions is also valuable and, given the JSON-based schema design, quite doable.

Assumptions

Requirements

# | Title | User Story | Importance | Notes

1 | Convert to Avro | Convert common data formats to Avro, such as CSV and JSON. | N/A
  • Existing functionality in kite-bundle.

2 | Convert from Avro | Convert from Avro to common data formats, such as CSV, XML, and JSON. | Medium
  • ConvertAvroToJSON ready for 0.3.0
  • Need to consider that some Avro schemas may not be possible to represent as CSV.
    For example, Avro supports nested lists and maps that have no good representation in CSV, so we'll have to be careful about that conversion.

3 | Convert Between Avro Schemas | Convert Avro records from an original schema to a destination schema, allowing for user-defined field mappings. | N/A
  • Existing functionality in kite-bundle.
  • Consider addition of transformation expressions. For example, there might be a timestamp
    in seconds that needs to be converted to the Avro timestamp-millis type by multiplying the value by 1000.

4 | Merge Avro Files | Merge Avro records with compatible schemas into a single file so that appropriately sized files can be delivered to downstream systems such as HDFS. Support similar semantics to the existing MergeContent processor, such as merging based on size, time, number of entries, etc. | High
  • NIFI-821 ready for 0.3.0
  • Consider merging between files and bare records with schema header, not done in NIFI-821

5 | Split Avro Files | Split an Avro file with multiple records into individual files so that each record can be processed independently by downstream processors. An example of downstream processing would be routing based on the value of a field in a given record. | High
  • Consider splitting Avro files into different sizes
  • Consider splitting to Avro files or bare records
  • NIFI-919

6 | Extract Schema Fingerprint | Extract the schema fingerprint of a given Avro file so that downstream processors can make decisions based on the schema, such as when merging together records of compatible schemas (i.e., as the correlation attribute). | Medium
  • Information on obtaining a Schema Fingerprint
  • Could be more general and populate a few fields from the Avro header:
      - Schema definition (full, not fingerprint)
      - Schema fingerprint
      - Schema root record name (if schema is a record)
      - Key/value metadata, like compression codec
  • NIFI-912

7 | Evaluate Avro Paths | Evaluate a set of Avro paths against an incoming file, and extract the results to FlowFile attributes or to the content of the FlowFile, similar to EvaluateJson. This would allow downstream processors, such as RouteOnAttribute, to easily make decisions based on values in an Avro record. | High

8 | Update Avro Records | Modify Avro records by inserting, updating, or removing fields. | Medium
  • This is really similar to the processor to convert between Avro schemas, #3.
  • Suggest merging the two and making it easy to work with either a file or a record via record-level callback.
  • Maybe tell the difference between file and record by checking for the filename attribute?

9 | Avro Content Viewer | Provide the ability to view an Avro record based on its schema when clicking to view the content from a provenance event. | Medium
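
Requirements 4 and 6 together suggest using a schema fingerprint as the correlation attribute when binning records for a merge. The sketch below illustrates that idea in plain Python, with no Avro library. The `fingerprint64` function follows the 64-bit CRC-64-AVRO algorithm described in the Avro specification. Everything else is an assumption made for illustration: the helper names, the dict-based stand-in for a FlowFile, and in particular the canonicalisation step, which simply serialises the schema as compact JSON. The specification's Parsing Canonical Form does more than that (it strips attributes such as doc and fixes key order), so a real processor would not rely on this shortcut.

```python
import json
from collections import defaultdict

# Seed value for CRC-64-AVRO, as given in the Avro specification.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the 256-entry lookup table used by CRC-64-AVRO."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # Shift right; fold in the seed when the low bit was set.
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


_FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """Return the 64-bit CRC-64-AVRO fingerprint of a byte string."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ _FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


def schema_fingerprint(schema) -> str:
    """Hypothetical helper: fingerprint a schema given as a Python dict.

    ASSUMPTION: compact JSON is used as a simplified stand-in for the
    Avro Parsing Canonical Form.
    """
    compact = json.dumps(schema, separators=(",", ":"))
    return format(fingerprint64(compact.encode("utf-8")), "016x")


def group_by_fingerprint(flowfiles):
    """Bin records by schema fingerprint, as a merge step might.

    Each flowfile is a hypothetical dict with "schema" and "records" keys.
    """
    bins = defaultdict(list)
    for flowfile in flowfiles:
        bins[schema_fingerprint(flowfile["schema"])].extend(flowfile["records"])
    return dict(bins)
```

In this sketch, two inputs land in the same bin only when their serialised schemas are byte-for-byte identical, which is stricter than the "compatible schemas" wording in requirement 4.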

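The note on requirement 3 mentions transformation expressions, with the example of a timestamp in seconds that must become an Avro timestamp-millis value. One possible shape for such a mapping is sketched below, using plain Python dicts in place of Avro records. The function name, the mapping structure, and the field names are all hypothetical and are not taken from kite-bundle or any existing processor.

```python
def convert_record(record, mappings):
    """Build a destination record from user-defined field mappings.

    `mappings` maps each destination field name to a pair of
    (source field name, optional transform callable). This structure is
    an assumption made for illustration only.
    """
    converted = {}
    for dest_field, (source_field, transform) in mappings.items():
        value = record[source_field]
        # Apply the transformation expression when one is configured.
        converted[dest_field] = transform(value) if transform else value
    return converted


# Hypothetical mapping: rename "id" and convert seconds to milliseconds,
# the unit used by the Avro timestamp-millis logical type.
MAPPINGS = {
    "user_id": ("id", None),
    "created_at": ("created_seconds", lambda seconds: seconds * 1000),
}

source = {"id": 7, "created_seconds": 1440000000}
print(convert_record(source, MAPPINGS))
# prints {'user_id': 7, 'created_at': 1440000000000}
```

The same per-record callback could also serve requirement 8, since inserting, updating, or removing a field amounts to a different set of mappings.
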
User interaction and design

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question | Outcome

Not Doing