
DRAFT

 

Table of Contents

 

Title: DFM support in Sqoop

JIRA: SQOOP-1168 and its sub-tickets

The discussion also happened on SQOOP-1168, SQOOP-1804, and SQOOP-2025.

 


 

Summary

Provide a way for Sqoop jobs to read and write a subset of records, i.e., transfer only the new records added since the last transfer in the case of immutable data sources, and/or transfer the delta modifications across the entire data set in the case of mutable data sources.
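To make the two modes concrete, here is a minimal, hedged sketch of the kind of read a connector would issue in each case; the table and column names (orders, id, last_modified) and the parameters are hypothetical placeholders, not anything defined by this design.

```java
// Minimal illustrative sketch, not Sqoop code: it only shows the two DFM modes
// described above. The table/column names and parameters are hypothetical.
public class DfmModesExample {

  // Immutable source: fetch only the rows appended since the last run.
  static String appendOnlyQuery(long lastValue) {
    return "SELECT * FROM orders WHERE id > " + lastValue;
  }

  // Mutable source: fetch the rows modified since the last run so they can be
  // reconciled/merged with (or deleted from) the previously transferred data set.
  static String deltaQuery(java.sql.Timestamp lastModified) {
    return "SELECT * FROM orders WHERE last_modified > '" + lastModified + "'";
  }
}
```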

...

Design Features Revision 2 

(After some discussions on https://issues.apache.org/jira/browse/SQOOP-1168, adding more details to the resolved open questions.) UPDATE: The discussion happened on the SQOOP-1168, SQOOP-1804, and SQOOP-2025 JIRA items. The design doc will soon be updated with the final details.

NOTE: We will use the term DFM broadly to mean reading a subset of records and writing them either by appending, reconciling/overwriting existing values, or deleting. It can mean any or all of the above, as the connector implementation defines it.

...

  • The FromJobConfiguration and ToJobConfiguration will expose the config objects in the JOB to the connectors. Connectors have to be aware of the configs they expose. In case of more than one DF or DM config, they may have to check which one exists (like a switch case) and then use the one that is present; see the sketch after this list. Sqoop will be smart enough to insert only the one config that the user populates. For this we need the conditions on the config annotation as stated above: https://issues.apache.org/jira/browse/SQOOP-1919
  • In addition to the configs, we will insert the info that the connector needs from the previous run, such as the last_value (not limited to the last value alone), into the ExtractorContext and LoaderContext; we may have one field such as EXTRACTOR_JOB_INFO that will hold all the data from the previous run. Again, how do we get these values from the repo to the Extractor context? Nothing fancy: just abuse job.getCredentials().addSecretKey(..), or use the distributed cache or some similar way to write the output back on the context in the MR case. Spark has other ways to get data into the Spark job.
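The sketch below illustrates the switch-case style check from the first bullet and the previous-run value from the second. It is a hedged, self-contained illustration: the class names (DeltaFetchConfig, DeltaMergeConfig, FromJobConfiguration), the buildPredicate helper, and the key name are stand-ins, not the actual connector SDK types.

```java
// Hypothetical, self-contained sketch of the "check which incremental config is
// present" pattern, plus the previous-run value passed via the extractor context.
// All class, field, and key names here are illustrative, not Sqoop SDK types.
public class IncrementalConfigSelection {

  // Stand-ins for the connector-exposed incremental config objects.
  static class DeltaFetchConfig { String checkColumn; String lastValue; }
  static class DeltaMergeConfig { String mergeKeyColumn; String lastModifiedColumn; }

  // Stand-in for the FROM job configuration handed to the connector; Sqoop is
  // expected to populate at most one of the incremental config objects.
  static class FromJobConfiguration {
    DeltaFetchConfig deltaFetch; // set for append-style incremental reads
    DeltaMergeConfig deltaMerge; // set for modify-and-merge incremental reads
  }

  // previousRunLastValue stands for the value the design proposes to hand in via
  // the extractor context, e.g. under a key such as EXTRACTOR_JOB_INFO.
  static String buildPredicate(FromJobConfiguration job, String previousRunLastValue) {
    if (job.deltaFetch != null) {
      String since = previousRunLastValue != null ? previousRunLastValue : job.deltaFetch.lastValue;
      return job.deltaFetch.checkColumn + " > " + since;
    }
    if (job.deltaMerge != null) {
      return job.deltaMerge.lastModifiedColumn + " > " + previousRunLastValue;
    }
    return null; // no incremental config supplied, fall back to a full transfer
  }
}
```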

...

We can store the values from the Extractor context and Loader context that are passed back to Sqoop in one of the following ways:

 

  1. SQ_JOB_INPUT (with a new MInputType, INTERNAL or OUTPUT, or another relevant name, to indicate that this is not what the connectors exposed but what resulted from the job execution for FROM and TO); the real advantage is that we have separation per CONFIG per DIRECTION.

...

      1. It is less effort and needs no new table, but it sounds a bit odd to store the output in the input table.
  2. We can dump everything in SQ_COUNTER_SUBMISSION, which is related to the submission table, but what we may want to give back may not always be a counter or a long value; it can be a boolean saying whether it was a delta fetch or a merge, and it won't have the distinction per direction. If we are to go this route, SQ_COUNTER_SUBMISSION needs

...

      1. to generalize to support any form of job output values to be stored in it, and

...

      2. to be exposed via the submission API. A simple way to think about it is to mirror what we have in MInputType with an MOutputType.
  3. Store the output in a BLOB inside SQ_SUBMISSION itself. Cons include that we need to store this as JSON, and allowing one field in the JSON to be edited might need some extra work (still doable). The other major con is that if we ever intend to migrate away from a single BLOB to a more typed way of storing the output in the DB, such as an INTEGER, FLOAT, or BOOLEAN like in SQ_JOB_INPUT, we need to do the extra work of data migration, and since we may not have types in the JSON, that might be really hard. We can mandate the output to be in Avro to have the types preserved via a schema.
  4. Possibly mirror option #1 and create a SQ_JOB_OUTPUT table to store the key/values and their types. Not too much effort, cleaner, and easy to expose an edit API for any field in the output.

Choosing option 4 at this point, since it has more pros than the other three options.
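As a rough illustration of option 4 and of "mirroring MInputType with MOutputType", a hedged sketch of a possible typed model behind SQ_JOB_OUTPUT follows; the MOutputType/MOutput names and fields are assumptions for illustration, not existing Sqoop classes.

```java
// Hypothetical sketch of a typed model object that could back the proposed
// SQ_JOB_OUTPUT table (option 4). Names and fields are illustrative only.
public class JobOutputModelSketch {

  // Typed output values, so SQ_JOB_OUTPUT can stay strongly typed instead of one BLOB.
  enum MOutputType { STRING, INTEGER, BOOLEAN, FLOAT }

  // One row of the proposed SQ_JOB_OUTPUT table: keyed per job and per direction.
  static class MOutput {
    final long jobId;
    final String direction;  // FROM or TO
    final String name;       // e.g. "last_value" or "was_delta_fetch"
    final MOutputType type;
    final String value;      // rendered according to its declared type

    MOutput(long jobId, String direction, String name, MOutputType type, String value) {
      this.jobId = jobId;
      this.direction = direction;
      this.name = name;
      this.type = type;
      this.value = value;
    }
  }
}
```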

 

 

Design Features Revision 1

...

Backwards Compat & Upgrade

  • The configs upgrade will follow the existing solution for connectors.
  • Repository changes, if done, will be accompanied by the repo upgrade

...

  • code.
  • APIs have only had methods added; existing methods have not been modified.

...

Testing strategy

Unit tests for the internal Sqoop and API changes, and integration tests for the POC connectors.
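As a rough illustration of the unit-test side, a hedged JUnit sketch follows; it exercises the hypothetical IncrementalConfigSelection sketch from the Revision 2 section above rather than real Sqoop code, and the test name is illustrative.

```java
// Illustrative unit-test shape only: it targets the hypothetical
// IncrementalConfigSelection sketch, not real Sqoop classes or the eventual SDK API.
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class IncrementalConfigSelectionTest {

  @Test
  public void picksDeltaFetchConfigWhenPresent() {
    IncrementalConfigSelection.FromJobConfiguration job =
        new IncrementalConfigSelection.FromJobConfiguration();
    job.deltaFetch = new IncrementalConfigSelection.DeltaFetchConfig();
    job.deltaFetch.checkColumn = "id";
    job.deltaFetch.lastValue = "0";

    // The last value carried over from the previous run wins over the configured default.
    assertEquals("id > 100", IncrementalConfigSelection.buildPredicate(job, "100"));
  }
}
```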

...