Target release:
Epic:
Document status: DRAFT
Document owner: Aldrin Piri
Designer:
Developers:
QA:

Goals

  • Be more resource-efficient when handling very large (multi-gigabyte) files by avoiding unnecessary duplication
  • Provide a framework-level mapping from NiFi FlowFiles to external content
  • Establish an API through which source processors (those that introduce content/FlowFiles into a dataflow) provide a dereferenceable URI to the content, creating a pass-by-reference that holds for the entirety of the dataflow
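
As a rough illustration of the third goal, the following sketch shows what a reference-carrying record might look like. It is not an existing NiFi API: `ExternalContentReference` and `ReferencingSource` are hypothetical names, and only `file:` URIs are handled here. The point is that the source step records a URI and a size, and the bytes are read only when a consumer opens the reference.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical: a FlowFile-like record that carries a URI instead of a content claim. */
final class ExternalContentReference {
    private final URI location; // dereferenceable pointer to the source content
    private final long size;    // size in bytes, recorded when the file was listed

    ExternalContentReference(URI location, long size) {
        this.location = location;
        this.size = size;
    }

    URI getLocation() { return location; }

    long getSize() { return size; }

    /** Dereferences the URI only when a consumer actually needs the bytes. */
    InputStream open() throws IOException {
        // Assumption: this sketch only understands file: URIs.
        return Files.newInputStream(Path.of(location));
    }
}

/** Hypothetical source-processor step: emit a reference without copying any content. */
final class ReferencingSource {
    static ExternalContentReference reference(Path file) throws IOException {
        return new ExternalContentReference(file.toUri(), Files.size(file));
    }
}

class ExternalReferenceDemo {
    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("large-input", ".bin");
        Files.write(source, "payload".getBytes(StandardCharsets.UTF_8));

        ExternalContentReference ref = ReferencingSource.reference(source);
        // Routing decisions can use the metadata without touching the content.
        System.out.println(ref.getLocation().getScheme() + " " + ref.getSize()); // prints "file 7"

        // The content is streamed from its original location at delivery time.
        try (InputStream in = ref.open()) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // prints "payload"
        }
        Files.delete(source);
    }
}
```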

Background and strategic fit

NiFi supports files of all sizes and formats.  For bulk ingest of large files, however, copying content into NiFi can be a suboptimal pattern.  Consider a multi-gigabyte file that NiFi transports or routes to a destination such as HDFS with minimal inspection or modification of the source data.  In such scenarios, there is little value in bringing the content into NiFi's content repository purely to make routing decisions.  What is proposed is an extended pass-by-reference to the source of the data, which avoids duplicating the data and, where the content is never needed, avoids the IO as well.  This approach would deprecate or alter the way the List/Fetch, Get and related source processors behave.  Effectively, one could locate a series of files to be delivered to some consumer, perform the intermediary routing, and then stream the content to its destination(s) as needed, without introducing a duplicate copy into NiFi's configured content repository.  This would be an opt-in, configurable mechanism for dataflow paths that simply move large files, while still benefiting from NiFi's core values such as provenance and event-level processing.

Assumptions

  • Modifying the content of an external file would write the changed content to a new content claim in NiFi's internal repository
  • Source processors (those that introduce/create FlowFiles) are the key point of this feature's incorporation into NiFi and would work in tandem with the framework to provide an appropriate URI for accessing the data
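
The first assumption amounts to copy-on-write semantics. The sketch below illustrates it under stated assumptions: `ContentHandle` and `SketchRepository` are hypothetical names, a temporary directory stands in for NiFi's content repository, and external content is reached through `file:` URIs only. A modification reads from the current handle and returns a new, internal handle.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical content handle: an external URI or a claim inside the repository. */
final class ContentHandle {
    final URI location;
    final boolean external;

    ContentHandle(URI location, boolean external) {
        this.location = location;
        this.external = external;
    }
}

/** Hypothetical stand-in for the content repository; a directory plays that role. */
final class SketchRepository {
    private final Path root;
    private int nextClaim = 0;

    SketchRepository(Path root) {
        this.root = root;
    }

    /** Applies a change and stores the result as a new internal claim. */
    ContentHandle modify(ContentHandle current, UnaryOperator<byte[]> change) throws IOException {
        byte[] input = Files.readAllBytes(Path.of(current.location));
        Path claim = root.resolve("claim-" + nextClaim++);
        Files.write(claim, change.apply(input));
        return new ContentHandle(claim.toUri(), false);
    }
}

class CopyOnWriteDemo {
    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("external", ".txt");
        Files.write(source, "original".getBytes(StandardCharsets.UTF_8));
        SketchRepository repo = new SketchRepository(Files.createTempDirectory("content-repo"));

        ContentHandle before = new ContentHandle(source.toUri(), true);
        ContentHandle after = repo.modify(before, bytes ->
                new String(bytes, StandardCharsets.UTF_8)
                        .toUpperCase(Locale.ROOT)
                        .getBytes(StandardCharsets.UTF_8));

        System.out.println(after.external);                            // prints "false"
        System.out.println(Files.readString(source));                  // prints "original"
        System.out.println(Files.readString(Path.of(after.location))); // prints "ORIGINAL"
    }
}
```

In this sketch the external file is only ever read, so the original remains available to any other FlowFile that still references it.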

Requirements

# | Title | User Story | Importance | Notes
1 | Transparency of External File Approach | Users need to be able to manage external files in flow in a homogeneous manner with "classic" FlowFiles/content | |
2 | Extensible to support varied protocols | The largest files where simple replication and transport occurs could include varied sources such as local file, HDFS, S3, and SFTP. | |
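
One possible shape for requirement 2 is a registry of resolvers keyed by URI scheme, sketched below. `ContentResolver` and `ResolverRegistry` are hypothetical names, not existing NiFi interfaces, and only a `file` resolver is implemented. Resolvers for HDFS, S3, or SFTP would be registered the same way using their respective client libraries, which are not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical extension point: one resolver per URI scheme. */
interface ContentResolver {
    InputStream open(URI location) throws IOException;
}

/** Hypothetical framework-side lookup from URI scheme to resolver. */
final class ResolverRegistry {
    private final Map<String, ContentResolver> byScheme = new HashMap<>();

    void register(String scheme, ContentResolver resolver) {
        byScheme.put(scheme.toLowerCase(Locale.ROOT), resolver);
    }

    InputStream open(URI location) throws IOException {
        String scheme = location.getScheme();
        ContentResolver resolver =
                scheme == null ? null : byScheme.get(scheme.toLowerCase(Locale.ROOT));
        if (resolver == null) {
            throw new IOException("No resolver registered for " + location);
        }
        return resolver.open(location);
    }
}

class ResolverRegistryDemo {
    public static void main(String[] args) throws IOException {
        ResolverRegistry registry = new ResolverRegistry();
        // Local files are the only protocol implemented in this sketch.
        registry.register("file", uri -> Files.newInputStream(Path.of(uri)));

        Path source = Files.createTempFile("local", ".txt");
        Files.write(source, "hello".getBytes(StandardCharsets.UTF_8));
        try (InputStream in = registry.open(source.toUri())) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // prints "hello"
        }

        // A scheme with no registered resolver fails before any IO is attempted.
        try {
            registry.open(URI.create("sftp://example.invalid/big.bin"));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
        Files.delete(source);
    }
}
```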

User interaction and design

  • Extension configuration will need an additional property to enable the treatment of fetched/received files as external
  • After the point where a FlowFile with a reference to an external repository enters a flow, its handling should fall under the same mechanisms as traditional FlowFiles and should be relatively transparent

Other enhancements enabled with this feature

  • Depending on the source and sink relationships in a dataflow, it might be possible to leverage or integrate other tools.  As an example, consider a bulk HDFS-to-HDFS transfer.  Given this kind of arrangement, it might be possible to delegate the movement of the data to a tool like DistCp, with NiFi orchestrating the transfer.  NiFi could manage the totality of the dataflow while relying on an optimized transport mechanism.

Prior Art and Related Work

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question | Outcome
If content is to be delivered to multiple endpoints, and we can determine this, are there optimizations available to avoid the issues of |
How do we detect and handle changes to an external file? |
How are reads handled (something like ExtractText)? Does this content then get introduced into a NiFi content repository at this time? |
Can we apply our offset & length mechanism for splitting parts of a component (SplitText)? |
What, if any, are the commonalities of this with the ideas of the High Availability Processing? |

Not Doing

1 Comment

  1. This would solve a user story I currently have where very large files need to be copied from one repository to another, neither of which is local to the Nifi installation.