...

  1. A DStream, the abstraction of a data stream in Spark Streaming, is a sequence of RDDs; this is called a "micro-batch" architecture. New batches are created at regular time intervals, determined by the batch interval parameter.
  2. Recovery from a fault relies on recomputing the lost RDDs from their lineage.
  3. The API supports a comprehensive set of collection functions (map, flatMap, ...) as well as windowed operations over a sliding window.
  4. Stateless transformations include map, filter, reduceByKey, joins, etc. They are ordinary RDD transformations applied independently to each batch.
  5. Stateful transformations include sliding-window-based transformations and state tracking across time (updateStateByKey).
  6. A sliding window has two parameters, the window duration and the sliding duration; both must be multiples of the batch interval (see the sketch after this list).
    1. The window duration defines the size of the window.
    2. The sliding duration defines how frequently the computation happens.
  7. Checkpointing saves the application state to a reliable storage system, such as HDFS or S3.
  8. Spark Streaming provides exactly-once semantics for its transformations; output to external systems is at-least-once unless the writes are idempotent or transactional.
  9. Spark has a local mode for quick experiments (the Spark shell).
  10. Spark has its own cluster manager (standalone mode) and also works with external cluster managers (YARN, Mesos).
  11. Spark doesn't seem to have an embedded mode.
  12. A user submits an application (a JAR file or a Python script) to Spark using the provided spark-submit script.
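
A minimal Scala sketch tying the points above together: a StreamingContext with a 2-second batch interval, a stateless per-batch word count, a windowed count with a 30-second window sliding every 10 seconds, running totals via updateStateByKey, and checkpointing enabled. The socket source on localhost:9999, the checkpoint path, and the object and variable names are illustrative assumptions, not part of the notes above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch")
    // Batch interval: a new micro-batch (one RDD per DStream) is created every 2 seconds.
    val ssc = new StreamingContext(conf, Seconds(2))
    // Checkpointing writes state and metadata to reliable storage; the HDFS path is hypothetical.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    // Assumed input source for illustration: a text stream from a socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Stateless transformation: reduceByKey applied independently to each batch.
    val perBatchCounts = pairs.reduceByKey(_ + _)

    // Stateful, windowed transformation: 30-second window sliding every 10 seconds,
    // both multiples of the 2-second batch interval.
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    // State tracking across time: running total per word via updateStateByKey
    // (requires checkpointing to be enabled, as above).
    val runningTotals = perBatchCounts.updateStateByKey[Int] {
      (newCounts: Seq[Int], state: Option[Int]) => Some(newCounts.sum + state.getOrElse(0))
    }

    windowedCounts.print()
    runningTotals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Packaged as a JAR, an application like this would be handed to a cluster with spark-submit (item 12), with local, standalone, YARN, or Mesos chosen as the master at submission time.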


Photon from Google

Paper: http://research.google.com/pubs/pub41318.html

...