
Overview

New contributors to NiFi often have the same question: Where can I start?

 

When you look at a project like NiFi, you can imagine that there are quite a few moving parts, so understanding how

all of these parts fit together can in itself become pretty daunting.

 

This documentation is intended to be a very high level description of the components that make up NiFi and how those

components fit together. It is not intended to go into great detail about any of these components. There are other documents

that exist (and more to come) that delve a bit deeper into the design of some of these individual components. The goal of

this document is to help developers who are new to NiFi come up to speed on some of the terminology and understand how

the different components that make up the platform interact with one another.

 

FlowFile

We will begin the discussion with the FlowFile. This is the abstraction that NiFi provides around a single piece of data.

A FlowFile may represent structured data, such as a JSON or XML message, or may represent unstructured data, such as

an image. A FlowFile is made up of two parts: content and attributes. The content is simply a stream of bytes, which can

represent any type of data. The content itself is not stored within the FlowFile, however. Rather, the FlowFile simply stores a

reference to the content, which lives within the Content Repository. Content is accessed only by means of the Process Session,

which is capable of communicating with the Content Repository itself. The attributes are key-value pairs that are associated

with the data. These attributes provide context along with the data, which allows the data to be efficiently routed and reasoned

about without parsing the content.
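
The two-part structure can be sketched as follows. This is a minimal Python model for illustration only; NiFi itself is implemented in Java, and the class names here are invented stand-ins, not the real API:

```python
# Illustrative model: a FlowFile holds attributes plus a *reference* to its
# content, never the content bytes themselves.
class ContentClaim:
    """Points at a region of data held in the Content Repository."""
    def __init__(self, container, offset, length):
        self.container = container  # e.g. a file in the repository
        self.offset = offset
        self.length = length

class FlowFile:
    def __init__(self, claim, attributes=None):
        self.claim = claim                        # reference into the Content Repository
        self.attributes = dict(attributes or {})  # key/value context about the data

# Attributes can be inspected and routed on without ever reading the content.
ff = FlowFile(ContentClaim("container-1", offset=0, length=1024),
              {"filename": "data.json", "mime.type": "application/json"})
```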

 

Processor

This is the most commonly used component in NiFi and tends to be the easiest place for newcomers to jump in.

A Processor is a component that is responsible for bringing data into the system, pushing data out to other systems,

or performing some sort of enrichment, extraction, transformation, splitting, merging, or routing logic for a particular piece of data. 

Common Design Patterns for Processors are discussed in the Developer Guide.
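
The general shape of a Processor's work can be sketched like this. It is a Python illustration of the Java onTrigger(context, session) pattern; FakeSession and the attribute names are hypothetical stand-ins, not NiFi's actual classes:

```python
# Illustrative stand-in for a Process Session, just enough to show the flow.
class FakeSession:
    def __init__(self, queued):
        self.queued = list(queued)   # FlowFiles waiting for this Processor
        self.transferred = []        # (relationship, flowfile) pairs

    def get(self):
        return self.queued.pop(0) if self.queued else None

    def put_attribute(self, flowfile, key, value):
        flowfile = dict(flowfile)    # FlowFiles are treated as immutable
        flowfile[key] = value
        return flowfile

    def transfer(self, flowfile, relationship):
        self.transferred.append((relationship, flowfile))

# The shape of a Processor invocation: pull a FlowFile, do some work,
# route the result to a relationship.
def on_trigger(context, session):
    flowfile = session.get()
    if flowfile is None:
        return
    flowfile = session.put_attribute(flowfile, "processed.by", "ExampleProcessor")
    session.transfer(flowfile, "success")
```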

 

The Processor is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new

major release of NiFi.

 

Processor Node

The Processor Node is essentially a wrapper around a Processor and maintains state about the Processor itself. The Processor

Node is responsible for maintaining, among other things, state about a Processor's positioning on the graph, the configured

properties and settings of the Processor, its scheduled state, and the annotations that are used to describe the Processor.

 

By abstracting these things away from the Processor itself, we are able to ensure that the Processor is unable to change things

that it should not, such as the configured values for properties, as allowing a Processor to change this information can lead to

confusion. Additionally, it allows us to simplify the code required to create a Processor, as this state information is automatically

managed by the framework.

 

Content Repository

The Content Repository is responsible for storing the content of FlowFiles and providing mechanisms for reading the contents

of a FlowFile. This abstraction allows the contents of FlowFiles to be stored independently and efficiently based on the underlying

storage mechanism. The default implementation is the FileSystemRepository, which persists all data to the underlying file system.

 

Note: While the Content Repository is pluggable, it is considered a 'private API' and its interface could potentially be changed between

minor versions of NiFi. It is, therefore, not recommended that implementations be developed outside of the NiFi codebase.

 

FlowFile Repository

The FlowFile Repository is responsible for storing the FlowFiles' attributes and state, such as creation time and which FlowFile Queue

the FlowFile belongs in. The default implementation is the WriteAheadFlowFileRepository, which persists the information to a write-ahead

log that is periodically "checkpointed". This allows extremely high transaction rates, as the files that it writes to are "append-only," so the

OutputStreams are able to be kept open. Periodically, the repository will checkpoint, meaning that it will begin writing to new write-ahead logs,

write out the state of all FlowFiles at that point in time, and delete the old write-ahead logs. This prevents the write-ahead logs from growing

indefinitely.
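
The checkpoint cycle described above can be sketched as follows. This is an illustrative Python model, not the actual WriteAheadFlowFileRepository implementation:

```python
# Illustrative model of a checkpointed write-ahead log: updates are appended
# to the current log; a checkpoint folds all logs into a snapshot and starts
# a fresh log so the old ones can be deleted.
class WriteAheadLog:
    def __init__(self):
        self.logs = [[]]      # append-only logs; the last one is current
        self.snapshot = {}    # state of all FlowFiles as of the last checkpoint

    def update(self, flowfile_id, state):
        self.logs[-1].append((flowfile_id, state))  # append-only write

    def checkpoint(self):
        # Replay every log over the previous snapshot to get the live state,
        # then begin a new log; the old logs are no longer needed.
        for log in self.logs:
            for ff_id, state in log:
                self.snapshot[ff_id] = state
        self.logs = [[]]
```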

 

Note: While the FlowFile Repository is pluggable, it is considered a 'private API' and its interface could potentially be changed between

minor versions of NiFi. It is, therefore, not recommended that implementations be developed outside of the NiFi codebase.

 

Provenance Repository

The Provenance Repository is responsible for storing, retrieving, and querying all Data Provenance Events. Each time that a FlowFile is

received, routed, cloned, forked, modified, sent, or dropped, a Provenance Event is generated that details this information. The event contains

information about what the Event Type was, which FlowFile(s) were involved, the FlowFile's attributes at the time of the event, details about the

event, and a pointer to the Content of the FlowFile before and after the event occurred (which allows a user to understand how that particular

event modified the FlowFile).
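
As a rough illustration, the information carried by a Provenance Event, as described above, might be modeled like this. The real events are Java objects with a richer API; the field names here are stand-ins:

```python
from dataclasses import dataclass

# Illustrative model of the data a Provenance Event records.
@dataclass
class ProvenanceEvent:
    event_type: str       # e.g. RECEIVE, ROUTE, CLONE, FORK, SEND, DROP
    flowfile_id: str      # which FlowFile was involved
    attributes: dict      # the FlowFile's attributes at the time of the event
    details: str          # human-readable details about the event
    content_before: str   # pointer to the content before the event
    content_after: str    # pointer to the content after the event
```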

 

The Provenance Repository allows this information to be stored about each FlowFile as it traverses through the system and provides a mechanism

for assembling a "Lineage view" of a FlowFile, so that a graphical representation can be shown of exactly how the FlowFile was handled. In order

to determine which lineages to view, the repository exposes a mechanism whereby a user is able to search the events and associated FlowFile

attributes.

 

The default implementation is the PersistentProvenanceRepository. This repository stores all data immediately to a disk-backed write-ahead log and

periodically "rolls over" the data, indexing and compressing it. The search capabilities are provided by an embedded Lucene engine.

 

Note: While the Provenance Repository is pluggable, it is considered a 'private API' and its interface could potentially be changed between

minor versions of NiFi. It is, therefore, not recommended that implementations be developed outside of the NiFi codebase.

 

Process Session

The Process Session (often referred to simply as a "session") provides Processors access to FlowFiles and provides transactional behavior across

the tasks that are performed by a Processor. The session provides get() methods for obtaining access to FlowFiles that are queued up for a Processor,

methods to read from and write to the contents of a FlowFile, to add and remove FlowFiles from the flow, to add and remove attributes from a FlowFile,

and to route a FlowFile to a particular relationship. Additionally, the session provides access to the ProvenanceReporter that is used by Processors to

emit Provenance Events.

 

Once a Processor is finished performing its task, the Processor has the ability to either commit or roll back the session. If a Processor rolls back the session,

the FlowFiles that were accessed during that session will all be reverted to their previous states. Any FlowFile that was added to the flow will be destroyed.

Any FlowFile that was removed from the flow will be re-queued in the same queue that it was pulled from. Any FlowFile that was modified will have both its

contents and attributes reverted to their previous values, and the FlowFiles will all be re-queued into the FlowFile Queue that they were pulled from. Additionally,

any Provenance Events will be discarded.
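
The rollback semantics can be sketched as a snapshot-and-restore. The following is an illustrative Python model, not the real ProcessSession implementation:

```python
import copy

# Illustrative model: the session snapshots each FlowFile's attributes when
# it is first accessed; rollback restores the snapshot and re-queues it.
class Session:
    def __init__(self, queue):
        self.queue = queue       # list of (flowfile_id, attributes) pairs
        self.snapshots = {}      # id -> attributes as they were at get()
        self.working = {}        # id -> attributes as modified this session

    def get(self):
        ff_id, attrs = self.queue.pop(0)
        self.snapshots[ff_id] = copy.deepcopy(attrs)
        self.working[ff_id] = attrs
        return ff_id

    def put_attribute(self, ff_id, key, value):
        self.working[ff_id][key] = value

    def rollback(self):
        # Revert every accessed FlowFile to its snapshot and re-queue it.
        for ff_id, original in self.snapshots.items():
            self.queue.append((ff_id, original))
        self.snapshots.clear()
        self.working.clear()
```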

 

If a Processor instead chooses to commit the session, the session is responsible for updating the FlowFile Repository and Provenance Repository with

the relevant information. The session will then add the FlowFiles to the Processor's outbound queues (cloning as necessary, if the FlowFile was transferred to

a relationship for which multiple connections have been established). 

 

Process Context

The Process Context provides a bridge between a Processor and its associated Processor Node. It provides information about the Processor's current

configuration, as well as the ability to "yield," or signal to the framework that the Processor is unable to perform any work for a short period of time, so the framework should

not waste resources scheduling the Processor to run. The Process Context also provides mechanisms for accessing the Controller Services that are available,

so that Processors are able to take advantage of shared logic or shared resources.

 

Reporting Task

A Reporting Task is a NiFi extension point that is capable of reporting and analyzing NiFi's internal metrics in order to provide the information to external

resources or report status information as bulletins that appear directly in the NiFi User Interface. Unlike a Processor, a Reporting Task does not have access

to individual FlowFiles. Rather, a Reporting Task has access to all Provenance Events, bulletins, and the metrics shown for components on the graph, such

as FlowFiles In, Bytes Read, and Bytes Written.

 

The Reporting Task is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new

major release of NiFi.

 

Controller Service

The Controller Service is a mechanism that allows state or resources to be shared across multiple components in the flow. The SSLContextService, for instance,

allows a user to configure SSL information only once and then configure any number of components to use that configuration. Other Controller Services are used to

share resources. For example, if a very large dataset needs to be loaded, it will generally make sense to use a Controller Service to load the dataset. This allows

multiple Processors to make use of this dataset without having to load the dataset multiple times.

 

The Controller Service is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new

major release of NiFi.

 

Process Scheduler

In order for a Processor or a Reporting Task to be invoked, it needs to be scheduled to do so. This responsibility belongs to the Process Scheduler. In addition to

scheduling Processors and Reporting Tasks, the scheduler is also responsible for scheduling framework tasks to run at periodic intervals and maintaining the

scheduled state of each component, as well as the current number of active threads. The Process Scheduler is able to inspect a particular component in order to

determine which Scheduling Strategy to use (Cron Driven, Timer Driven, or Event Driven), as well as the scheduling frequency.
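
As a rough illustration, the difference between the strategies might be sketched as follows. This is a hedged simplification; the real scheduler is far richer, and only the strategy names come from the text above:

```python
# Illustrative sketch of how the next run time depends on the Scheduling
# Strategy. Times are plain numbers (seconds) for simplicity.
def next_run_time(strategy, last_run, period_seconds=None, cron_next=None):
    if strategy == "TIMER_DRIVEN":
        # Run again one period after the last run.
        return last_run + period_seconds
    if strategy == "CRON_DRIVEN":
        # The next fire time comes from evaluating the cron expression
        # (passed in here, since cron parsing is out of scope).
        return cron_next
    if strategy == "EVENT_DRIVEN":
        # Run whenever data arrives; there is no fixed next time.
        return None
    raise ValueError("unknown strategy: " + strategy)
```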

 

FlowFile Queue

Though it may sound simple, the FlowFile Queue is responsible for implementing quite a bit of logic. In addition to queuing the FlowFiles for another component

to pull from, the FlowFile Queue must also be able to prioritize the data following the user's prioritization rules. The queue keeps state about the number of FlowFiles

as well as the data size of those FlowFiles and must keep state about the number of "in-flight" FlowFiles - those that have been pulled from the queue but have not yet

been removed from the system or transferred to another queue.

 

When an instance of NiFi has a very large number of active FlowFiles, the attributes associated with those FlowFiles can be quite a burden on the JVM's heap. To alleviate

this problem, when a queue grows very large, the framework may choose to "swap out" some of its FlowFiles, writing their attributes to disk and removing them from the

JVM's heap, and later "swap in" those FlowFiles. During this process, the FlowFile Queue is also responsible for keeping track of the number of FlowFiles and the size of the FlowFiles'

contents so that accurate numbers can be reported to users.

 

Finally, the FlowFile Queue is also responsible for maintaining state about backpressure and FlowFile Expiration. Backpressure is the mechanism

by which a user is able to configure the flow to temporarily stop scheduling a given component to run when its output queue is too full. By doing this,

we are able to cause the flow to stop accepting incoming data for a short period, or to route data differently. This provides us the ability to prevent

resource exhaustion. In a clustered environment, this also allows a particular node that is falling behind for one reason or another to avoid ingesting data so

that other nodes in the cluster that are more capable can handle the workload.
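
The backpressure check can be sketched as follows: an illustrative Python model in which a queue reports itself full once either its object-count or data-size threshold is reached, at which point the framework would stop scheduling the upstream component:

```python
# Illustrative model of backpressure thresholds on a FlowFile Queue.
class FlowFileQueue:
    def __init__(self, max_objects, max_bytes):
        self.max_objects = max_objects  # object-count threshold
        self.max_bytes = max_bytes      # data-size threshold
        self.count = 0
        self.bytes = 0

    def enqueue(self, size_bytes):
        self.count += 1
        self.bytes += size_bytes

    def is_full(self):
        # Either threshold being reached triggers backpressure upstream.
        return self.count >= self.max_objects or self.bytes >= self.max_bytes
```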


FlowFile Expiration is the mechanism by which data is eventually purged from the system because it is no longer of value. It can be thought of as a

flow's pressure release valve. This is used, for instance, in an environment where there is not enough bandwidth to send all of the data to its desired

destination. In such a case, the node will eventually run out of resources. In order to avoid this, the user can configure a queue to expire data that

reaches a certain age. For example, the user can indicate that data that is one hour old should automatically be purged. This capability is then coupled

with the ability to prioritize the data so that the most important data is always sent first and the less important data eventually expires.
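
The expiration rule can be sketched like this. It is illustrative only; the real queue tracks each FlowFile's age in the flow, while this model simply tags each entry with a timestamp:

```python
# Illustrative sketch of FlowFile Expiration: purge anything older than the
# configured maximum age.
def expire(queue, max_age_seconds, now):
    """Split (entry_time, flowfile) pairs into kept and expired lists."""
    kept, expired = [], []
    for entry_time, flowfile in queue:
        if now - entry_time > max_age_seconds:
            expired.append((entry_time, flowfile))
        else:
            kept.append((entry_time, flowfile))
    return kept, expired
```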

 

FlowFile Prioritizer

A core tenet of NiFi is that of data prioritization. The user should have the ability to prioritize the data in whatever order makes sense for

a particular dataflow at a particular point in time. This is especially important for time-sensitive data and becomes even more important when

processing data in an environment where the rate of data acquisition exceeds the rate at which the data can be egressed. In such an environment,

it is important to be able to egress the data in such a way that the most important data is sent first, leaving the less important data to be sent

when the bandwidth allows for it, or eventually be aged off.

 

This tenet is realized through the use of FlowFile Prioritizers. The FlowFile Queue is responsible for ensuring the data is ordered in the way that

the user has chosen. This is accomplished by applying FlowFile Prioritizers to the queue. A FlowFile Prioritizer has access to all FlowFile information

but does not have access to the data that it points to. The Prioritizer can then compare two FlowFiles in order to determine which should be made

available first.
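
A Prioritizer is essentially a comparator over FlowFile attributes, never the content. A minimal sketch, using a hypothetical "priority" attribute (not a built-in NiFi attribute):

```python
import functools

# Illustrative Prioritizer: compare two FlowFiles by a "priority" attribute;
# lower values are offered to the next component first. FlowFiles without
# the attribute sort last.
def compare(ff_a, ff_b):
    pa = int(ff_a.get("priority", 999))
    pb = int(ff_b.get("priority", 999))
    return (pa > pb) - (pa < pb)

def ordered(queue):
    # The queue applies the prioritizer to decide which FlowFile is
    # made available first.
    return sorted(queue, key=functools.cmp_to_key(compare))
```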

 

The Prioritizer is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new

major release of NiFi.

 

Flow Controller

In order for NiFi's User Interface to display the wealth of information that it renders, it must have some place to gather that information.

The Flow Controller can be thought of as the bridge between the Web Tier and the back end components of a NiFi node. It is largely

responsible for the initialization of the flow, the creation of components in the flow, and the coordination between those components. In

a more general sense, the Flow Controller is responsible for maintaining system-wide state. This includes the state of the flow itself, and the Flow Controller

is also responsible for coordinating the node's connection to and participation in a NiFi cluster.

 

Cluster Manager

Whereas the Flow Controller is responsible for maintaining system-wide state about a particular node, the Cluster Manager is responsible

for maintaining system-wide state about an entire cluster. This includes information such as which nodes are in the cluster and the states of

those nodes. Additionally, the Cluster Manager is responsible for acting as the bridge between the Web Tier and the cluster state. This includes

replicating requests from the user to all nodes in the cluster and merging their responses.

 

An instance of NiFi will create either a Cluster Manager or a Flow Controller but never both. Which component is created is driven by the

configuration found in the nifi.properties file.

 

Authority Provider

NiFi can be configured to run in secure mode, using SSL to access the web endpoints, or in non-secure mode,

where all endpoints are accessed anonymously. If running in secure mode, the Authority Provider is responsible for determining which users

have access to perform which operations. It accomplishes this by providing a list of Authorities, or Roles, that a given user is allowed to

possess. Note that it does not indicate which Roles a user actually has, only which Roles a user is allowed to have. A NiFi user who has the Role

ADMIN is then able to grant each of those Roles to the given user. In this way, a central authority service can be used to dictate which Roles

a given user is allowed to have, but just because a user is allowed to be granted the ADMIN role, for example, does not mean that the user

should have the ADMIN role for a particular instance.

 

Each endpoint that is defined must also define which Roles are authorized to access the endpoint. A user is then able to access an

endpoint if and only if the user has one of the specified Roles. 
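
That access rule reduces to a simple set intersection, sketched here for illustration:

```python
# Illustrative check of the rule above: a user may access an endpoint
# if and only if the user holds at least one of the endpoint's Roles.
def can_access(user_roles, endpoint_roles):
    return bool(set(user_roles) & set(endpoint_roles))
```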

 

The Authority Provider is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new

major release of NiFi.

 

*-Resources

NiFi's User Interface shows only information that is available via the NiFi RESTful API. This is accomplished by accessing the different endpoints

that are defined in the *-Resource classes. For example, the ProcessorResource class is responsible for defining the endpoints that are used to

interact with the different Processors in the flow, including the addition and removal of Processors. The ProvenanceResource class is responsible

for defining the endpoints that are used to interact with the Provenance Repository, such as executing queries.

 

All of the Resource components are found in the nifi-web-api module, in the org.apache.nifi.web.api package.

 

Bootstrap

 

 

NarClassLoader

 

 

 
