...

NiFi supports files of all sizes and formats. For bulk ingest and handling of large files, however, the current model can lead to suboptimal ingest patterns. Consider a multi-gigabyte file that NiFi transports or routes to a system such as HDFS with minimal inspection or modification of the source data. In such scenarios, there is little value in copying the content into NiFi's content repository purely to make routing decisions. What is proposed is an extended pass-by-reference mechanism that points to the source of the data itself, avoiding duplication of data and, possibly, IO unless it is needed. This approach would deprecate or alter the way the List/Fetch, Get, and related source processors behave. Effectively, one could locate a series of files to be delivered to some consumer, perform the intermediary routing, and then stream the content to its destination(s) as needed without introducing a duplicate copy into NiFi's configured content repository. This would be an opt-in, configurable mechanism for dataflow paths that deal with simple movement of large files, while still benefiting from many of NiFi's core values such as provenance and event-level processing.

Assumptions

  • Modifying the content of an external file would write the changes to a new content claim in NiFi's internal repository
  • Source processors (those that introduce/create FlowFiles) are the key point at which this feature is incorporated into NiFi; they would work in tandem with the framework to provide an appropriate URI for accessing the data
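
The two assumptions above can be illustrated with a minimal Java sketch. Everything in it is hypothetical: `ContentReference`, `ExternalReference`, `InternalClaim`, `route`, and `modify` are illustrative names, not part of NiFi's API, and the in-memory claim only stands in for the content repository. The sketch also assumes that a modification leaves the source file untouched, which the requirements do not state explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these types are part of NiFi's API.
final class ExternalContentSketch {

    /** Where a FlowFile's bytes live: an external source or an internal claim. */
    interface ContentReference {
        boolean isExternal();
        InputStream open() throws IOException;
    }

    /** Points at the source by URI; nothing is copied until open() is called. */
    static final class ExternalReference implements ContentReference {
        private final URI source;
        ExternalReference(URI source) { this.source = source; }
        @Override public boolean isExternal() { return true; }
        @Override public InputStream open() throws IOException {
            return Files.newInputStream(Path.of(source)); // local "file:" URIs only
        }
    }

    /** In-memory stand-in for a content claim in the internal repository. */
    static final class InternalClaim implements ContentReference {
        private final byte[] bytes;
        InternalClaim(byte[] bytes) { this.bytes = bytes.clone(); }
        @Override public boolean isExternal() { return false; }
        @Override public InputStream open() { return new ByteArrayInputStream(bytes); }
    }

    /** Routing only needs metadata, so the reference passes through as-is. */
    static ContentReference route(ContentReference ref) {
        return ref;
    }

    /** Modification reads the source and writes the result to a new internal claim. */
    static ContentReference modify(ContentReference ref, UnaryOperator<byte[]> change) {
        try (InputStream in = ref.open()) {
            // A real implementation would stream rather than buffer a large file.
            return new InternalClaim(change.apply(in.readAllBytes()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a reference fully as UTF-8 text (demo helper). */
    static String read(ContentReference ref) {
        try (InputStream in = ref.open()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Walks one external file through routing and then modification. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("external", ".txt");
            Files.writeString(tmp, "payload");
            ContentReference routed = route(new ExternalReference(tmp.toUri()));
            ContentReference changed = modify(routed,
                    b -> new String(b, StandardCharsets.UTF_8).toUpperCase()
                            .getBytes(StandardCharsets.UTF_8));
            String result = routed.isExternal() + " " + changed.isExternal()
                    + " " + read(changed) + " " + Files.readString(tmp);
            Files.delete(tmp);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true false PAYLOAD payload
    }
}
```

In this sketch the routed reference is still external, while the modified one is internal and the source file keeps its original bytes. Buffering with `readAllBytes` is only for brevity; for multi-gigabyte files a streaming copy would be the more plausible choice.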

...

# | Title | User Story | Importance | Notes
1 | Transparency of External File Approach | Users need to be able to manage external files in flow in a homogeneous manner with "classic" FlowFiles/content | |
2 | Extensible to support varied protocols | The largest files where simple replication and transport occurs could include varied sources such as local file, HDFS, S3, and SFTP. | |
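
One possible shape for requirement 2 is a registry that picks a content resolver by URI scheme, so that new protocols can be plugged in without touching the routing logic. The Java sketch below is only an illustration under that assumption: `ExternalContentResolver`, `ContentResolverSketch`, and their methods are hypothetical names, and only a local `file:` resolver is wired up. Resolvers for HDFS, S3, or SFTP would each need their own client library and are deliberately left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a scheme-keyed registry, not an existing NiFi extension point.
final class ContentResolverSketch {

    /** One implementation per protocol (file, hdfs, s3, sftp, ...). */
    @FunctionalInterface
    interface ExternalContentResolver {
        InputStream open(URI source) throws IOException;
    }

    private final Map<String, ExternalContentResolver> byScheme = new ConcurrentHashMap<>();

    /** Registers a resolver; schemes are matched case-insensitively. */
    void register(String scheme, ExternalContentResolver resolver) {
        byScheme.put(scheme.toLowerCase(Locale.ROOT), resolver);
    }

    /** Returns true when a resolver is registered for the URI's scheme. */
    boolean supports(URI source) {
        String scheme = source.getScheme();
        return scheme != null && byScheme.containsKey(scheme.toLowerCase(Locale.ROOT));
    }

    /** Opens the content through the resolver registered for the URI's scheme. */
    InputStream open(URI source) throws IOException {
        if (!supports(source)) {
            throw new IllegalArgumentException("No resolver registered for " + source);
        }
        return byScheme.get(source.getScheme().toLowerCase(Locale.ROOT)).open(source);
    }

    /** Registry with only the local-file resolver wired up. */
    static ContentResolverSketch withLocalFiles() {
        ContentResolverSketch registry = new ContentResolverSketch();
        registry.register("file", uri -> Files.newInputStream(Path.of(uri)));
        return registry;
    }
}
```

With this layout, a source processor would only need to hand the framework a URI such as `file:///data/big.bin`, and the registry would decide how the bytes are streamed when a destination finally asks for them.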

User interaction and design

...