
...

  • Focus on copying data only – Focus on reliable, scalable data copying; leave transformation, enrichment, and other modification of the data to frameworks that focus solely on that process. Because some very simple transformations are broadly applicable (e.g. dropping a field or obfuscating sensitive data), it may make sense to include a very minimal hook or a few settings to perform these modifications; a sketch of such a hook follows this list. However, any such hook must be added carefully and must not come at the cost of the other key design goals.

  • Copy broadly by default – Endless configuration tweaking can quickly destroy the effectiveness of a tool like Copycat. Jobs should be quick to define and be able to copy vast quantities of data between systems. For example, the default unit of work should be an entire database, even if it is possible to define jobs that copy individual tables.

  • Parallel – Parallelism should be included in the core abstractions, providing a clear avenue for the framework to provide automatic scalability. Although some sources or sinks naturally have no parallelism (e.g. a database change log), many others are highly parallel (e.g. metrics, logs), and the framework should be able to take advantage of that parallelism and should encourage connectors to expose it (see the partitioning sketch after this list).

  • Strong semantics – When possible, exactly-once delivery is better than weaker semantics, and at-least-once or at-most-once delivery is preferable to best-effort. The framework should make it easy to provide stronger semantics when the connected system supports them (see the offset-tracking sketch after this list).

  • Capture metadata – Many systems provide data with a well-defined structure and types. The framework should be able to capture this metadata and preserve it through an entire data pipeline, as long as connectors also preserve it. However, the framework must also handle systems that do not define or do not expose this metadata (see the schema sketch after this list).

  • Accessible connector API – It must be easy to develop new connectors. The API and runtime model for implementing new connectors should make it simple to use the best library for the job, to get data flowing between systems quickly, and still to get all the benefits of the framework. Where the framework requires support from the connector, e.g. for recovering from faults, all the required tools should be included in the Copycat APIs; a minimal source-task sketch follows this list.

  • Streaming and batch – Copycat must integrate well with both streaming and batch-oriented systems. Kafka’s ability to interact efficiently with both is one of its unique features, and one that Copycat can take advantage of to make integrating these two kinds of systems seamless and easy.

  • Scale to the application – Although Copycat should support copying large scale data, it should also scale easily to the application or environment. It should be easy to run a single process with a single copy job in development, testing or a small production environment, but also scale up to an organization-wide service for copying data between a wide variety of large scale systems.
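
The "Focus on copying data only" goal above mentions a very minimal transformation hook. The sketch below is only an illustration of what such a hook might look like; the Transformation interface, the MaskField example, and the map-based record shape are assumptions for this document, not part of any proposed Copycat API.

    import java.util.Map;

    // Hypothetical "minimal transformation hook": implementations should stay
    // trivial (drop a field, mask a sensitive value) and anything beyond that
    // belongs in a dedicated transformation framework.
    interface Transformation {
        // Configure the hook from the job's settings, e.g. which field to modify.
        void configure(Map<String, String> settings);

        // Return the (possibly modified) record.
        Map<String, Object> apply(Map<String, Object> record);
    }

    // Example: obfuscate a sensitive field when it is present.
    class MaskField implements Transformation {
        private String fieldName;

        @Override
        public void configure(Map<String, String> settings) {
            fieldName = settings.get("field");
        }

        @Override
        public Map<String, Object> apply(Map<String, Object> record) {
            if (record.containsKey(fieldName))
                record.put(fieldName, "****");
            return record;
        }
    }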
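
For the "Parallel" goal, the sketch below illustrates parallelism as a first-class notion: a connector enumerates independent input streams (tables, log files, metric shards) and the framework groups them into a bounded number of tasks that can run concurrently. The TablePartitioner class is an illustrative assumption, not a Copycat API.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical illustration of parallelism in the core abstractions: the
    // connector lists its independent input streams and the framework assigns
    // them to at most maxTasks units of work that can run in parallel.
    class TablePartitioner {
        // Round-robin the streams into groups; a source with no natural
        // parallelism (e.g. a database change log) simply yields one group.
        static List<List<String>> assign(List<String> streams, int maxTasks) {
            int groups = Math.max(1, Math.min(maxTasks, streams.size()));
            List<List<String>> assignments = new ArrayList<>();
            for (int i = 0; i < groups; i++)
                assignments.add(new ArrayList<>());
            for (int i = 0; i < streams.size(); i++)
                assignments.get(i % groups).add(streams.get(i));
            return assignments;
        }
    }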
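
For the "Strong semantics" goal, delivery guarantees generally hinge on when source offsets are committed relative to the copied data. The sketch below assumes a hypothetical OffsetLog that records an offset only after the corresponding data is known to be durable, which yields at-least-once delivery on restart; exactly-once would additionally require storing data and offsets atomically.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical offset bookkeeping: offsets are committed only after the
    // copied data has been flushed, so a restart may re-deliver records but
    // never silently skip them (at-least-once).
    class OffsetLog {
        private final Map<String, Long> committed = new HashMap<>();

        // Called after the framework has durably written records up to 'offset'
        // for the given input stream.
        void commit(String stream, long offset) {
            committed.put(stream, offset);
        }

        // On recovery, resume each stream from its last committed offset;
        // anything after that point may be read and delivered again.
        long resumeFrom(String stream) {
            return committed.getOrDefault(stream, 0L);
        }
    }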
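
For the "Capture metadata" goal, the sketch below shows one way a record could carry optional schema information: a database source attaches column names and types, while a schemaless source leaves them null. The SchemaAndValue class and its field layout are assumptions for illustration only.

    import java.util.List;
    import java.util.Map;

    // Hypothetical schema-carrying record: a null schema means the source
    // offered no structural metadata (e.g. raw log lines), and connectors that
    // do have it can preserve it through the pipeline.
    class SchemaAndValue {
        final List<String> fieldNames;   // null when the source has no schema
        final List<String> fieldTypes;   // parallel to fieldNames
        final Object value;

        SchemaAndValue(List<String> fieldNames, List<String> fieldTypes, Object value) {
            this.fieldNames = fieldNames;
            this.fieldTypes = fieldTypes;
            this.value = value;
        }

        // A database source knows its column names and types...
        static SchemaAndValue fromDatabaseRow(Map<String, Object> row) {
            return new SchemaAndValue(List.of("id", "email"),
                                      List.of("int64", "string"), row);
        }

        // ...while a log source may have no metadata at all.
        static SchemaAndValue fromLogLine(String line) {
            return new SchemaAndValue(null, null, line);
        }
    }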
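
For the "Accessible connector API" goal, the sketch below suggests how small a connector's task could be if the framework owns delivery, offset tracking, and scaling: the developer writes little more than a poll loop around whatever library suits the source. SimpleSourceTask and FileLineSourceTask are hypothetical names, not the actual Copycat API.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical minimal task contract: the framework calls poll() repeatedly
    // and handles delivering the returned records, committing progress, and
    // recovering from faults.
    abstract class SimpleSourceTask {
        abstract List<String> poll() throws InterruptedException;
    }

    // Example task that reads lines from a file-like source using plain Java
    // I/O, i.e. whatever library fits the source best.
    class FileLineSourceTask extends SimpleSourceTask {
        private final BufferedReader reader;

        FileLineSourceTask(Reader input) {
            this.reader = new BufferedReader(input);
        }

        @Override
        List<String> poll() {
            List<String> batch = new ArrayList<>();
            try {
                String line = reader.readLine();
                if (line != null)
                    batch.add(line);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            return batch;
        }
    }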

...