Work In Progress

 

This page describes current design of Sqoop2 entity structures.

Top level Entities

There are currently four top level entities:

Connector

Connectors drive the entire data transfer in Sqoop2. There is usually one connector per data storage, for example OracleConnector for Oracle and HBaseConnector for HBase. A data storage may also have multiple connectors for different purposes: for example, there might be a MySQL JDBC connector that uses Java's JDBC interface and a MySQL Fastpath connector that uses MySQL's native utilities (mysqldump and mysqlimport).

A connector needs to specify the config parameters that it requires for the data transfer from the source to the destination. In Sqoop, the source and destination are abstracted as "FROM" and "TO" objects. Hence a connector has 

Driver

Just as the connector structure contains the metadata required to perform specific actions, the Sqoop 2 framework also requires some extra configuration for each connection and job. The main difference between the connector and framework structures is that there can be multiple connectors, whereas there is always exactly one framework structure.

Link

A link (connection) object contains the metadata needed by both the connector and the framework to manage a connection to a remote data storage. It is not related in any way to the java.sql.Connection object. Because each connector might have different needs, each connection depends directly on the connector for which it was created. Connection objects are created by administrators and saved in the Sqoop 2 metastore. They are later reused by operators and by job objects (explained below).

Job

A job object depends directly on a connection and holds the configuration for a specific job from both the connector and the framework perspective. For example, it records whether to perform an import or an export, or where on HDFS to store the data. Job objects are filled in by operators. A job itself is executable.
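The dependency chain described above (a job is based on one link/connection, which in turn is created for one connector) can be sketched as follows. This is a minimal illustration in Python, not Sqoop's actual Java classes; the class names and the sample names such as "generic-jdbc" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    name: str


@dataclass(frozen=True)
class Link:
    name: str
    connector: Connector  # a link is created for exactly one connector


@dataclass(frozen=True)
class Job:
    name: str
    link: Link  # a job is based on exactly one link

    @property
    def connector(self) -> Connector:
        # The connector is never stored on the job; it is reached via the link.
        return self.link.connector


# Hypothetical sample data: one link reused by two jobs.
jdbc = Connector("generic-jdbc")
dev = Link("development", jdbc)
jobs = [Job("import traffic_details", dev), Job("import log", dev)]

for job in jobs:
    print(job.name, "->", job.link.name, "->", job.connector.name)
```

Because a job only references its link, several jobs can share the connection settings an administrator saved once.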

Corresponding Sqoop Object

 

MInput

Corresponds to one configuration entity that is requested by a connector (for example "JDBC Url" or "Username").

MConfig

Related MInputs are gathered together to create a set of connected options (for example, the MForm "Connection to database" would consist of the MInputs "JDBC URL", "username" and "password").
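The grouping of inputs can be pictured with a small sketch. The Python classes below only mirror the names used on this page; the fields (`name`, `value`, `inputs`) and the helper `input_names` are assumptions for illustration and are not taken from the Sqoop code base.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MInput:
    """One configuration entry requested by a connector, e.g. "JDBC URL"."""
    name: str
    value: Optional[str] = None  # None means the input has not been filled in


@dataclass
class MConfig:
    """A named group of related inputs."""
    name: str
    inputs: List[MInput] = field(default_factory=list)

    def input_names(self) -> List[str]:
        return [i.name for i in self.inputs]


# The example from the text: three inputs gathered under one group.
connection_config = MConfig(
    "Connection to database",
    [MInput("JDBC URL"), MInput("username"), MInput("password")],
)
print(connection_config.input_names())
```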

MFromConfig

Represents the list of forms that are required for a job.

MToConfig

MLinkConfig

Represents the list of forms that are required for a connection.

MConnector

Top level structure that contains one instance of MJobForms and one instance of MConnectionForms, specifying which metadata a particular connector needs. All MForms are blank and do not contain any configuration. They serve only as a template that the connector supplies to the framework in order to get the required configuration options.
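The template idea can be sketched as follows: the connector supplies blank forms, and filled copies are produced from them while the template itself stays untouched. This is an illustrative Python sketch only; the `fill` helper and the field names are hypothetical and do not describe how Sqoop itself populates forms.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MInput:
    name: str
    value: Optional[str] = None  # blank until filled


@dataclass
class MForm:
    name: str
    inputs: List[MInput] = field(default_factory=list)


def fill(template: List[MForm], values: Dict[str, str]) -> List[MForm]:
    """Return a filled copy of the template; the template stays blank."""
    filled = copy.deepcopy(template)
    for form in filled:
        for inp in form.inputs:
            # Inputs without a supplied value simply remain None.
            inp.value = values.get(inp.name)
    return filled


# Blank template as a connector might supply it (names are examples).
template = [MForm("Connection", [MInput("JDBC URL"), MInput("Username")])]

filled = fill(template, {
    "JDBC URL": "jdbc:mysql://development/test",
    "Username": "letest",
})
print(filled[0].inputs[0].value)
print(template[0].inputs[0].value)
```

Copying rather than mutating keeps one reusable template per connector, from which any number of filled connection or job objects can be derived.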

MDriver

Top level structure that contains one instance of MJobForms and one instance of MConnectionForms, specifying which metadata is required from the framework perspective. There is a single instance of this class across the entire Sqoop 2.

MLink

Top level structure that contains one instance of MConnectionForms for one corresponding connector. Forms will contain filled values.

MJob

Top level structure that contains one instance of MJobForms for one corresponding connector. Forms will contain filled values.

Object relationship

 

Example

Let's assume a JDBC-based connector.

MConnector
  • For connection: Contains list of one MForm called "Connection" containing three inputs "JDBC URL", "Username" and "Password".
  • For job: Contains list of one MForm called "Source" containing single input "Table".
MDriver
  • For connection: Contains an empty list, i.e. no values are required.
  • For job: Contains one MForm called "Target" containing single input "HDFS Directory".

Administrators might create two different connections based on this example connector:

Link 1:
  • Connector part: Contains values for the connector: "JDBC URL" contains "jdbc:mysql://development/test", "Username" contains "letest" and "Password" contains "letest".
  • Framework part: Contains values for the framework. As the framework did not specify any MForms, this will be empty.
Link 2:
  • Connector part: Contains values for the connector: "JDBC URL" contains "jdbc:mysql://production/test", "Username" contains "production-user" and "Password" contains "aosdf792r7asfhas8sd-9a7(&(&@#&$(Vosfs9fya9d7(&SD(F*&S(*F&SDF&VChsdfhsdf" (Damn good password).
  • Framework part: Contains values for the framework. As the framework did not specify any MForms, this will be empty.

The operator has not yet had time to use connection 2, as they are still experimenting with connection 1. They have, however, already created a couple of jobs based on connection 1:

Job 1:
  • Is based on Connection 1
  • Connector part: Contains values for the connector: "Table" contains "traffic_details".
  • Framework part: Contains values for the framework: "HDFS Directory" contains "/storage/traffic_details".
Job 2:
  • Is based on Connection 1
  • Connector part: Contains values for the connector: "Table" contains "log".
  • Framework part: Contains values for the framework: "HDFS Directory" contains "/storage/log".
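The example above can also be written down as data. The following Python sketch models Link 1 and the two jobs with a connector part and a framework part each; the class layout and the `describe` helper are assumptions made for illustration and are not part of Sqoop.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Link:
    name: str
    connector_values: Dict[str, str]
    # Empty in this example: the framework requires no connection-level values.
    framework_values: Dict[str, str] = field(default_factory=dict)


@dataclass
class Job:
    name: str
    link: Link
    connector_values: Dict[str, str]
    framework_values: Dict[str, str]


link1 = Link("Link 1", {
    "JDBC URL": "jdbc:mysql://development/test",
    "Username": "letest",
    "Password": "letest",
})

job1 = Job("Job 1", link1,
           {"Table": "traffic_details"},
           {"HDFS Directory": "/storage/traffic_details"})
job2 = Job("Job 2", link1,
           {"Table": "log"},
           {"HDFS Directory": "/storage/log"})


def describe(job: Job) -> str:
    """Hypothetical helper: summarise where a job reads from and writes to."""
    return (f"{job.name}: table {job.connector_values['Table']} "
            f"from {job.link.connector_values['JDBC URL']} "
            f"to {job.framework_values['HDFS Directory']}")


print(describe(job1))
print(describe(job2))
```

Both jobs point at the same link object, so the database credentials are stored once and reused.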