...

  1. DStream, the abstraction of a data stream in Spark Streaming, is a sequence of RDDs. This is called a "micro-batch" architecture. New batches are created at regular time intervals (determined by a batch interval parameter).
  2. Recovery from a fault uses recomputation of RDDs.
  3. The API supports a comprehensive set of collection functions (map, flatMap, ...) as well as windowed operations over a sliding window.
  4. Stateless transformations include map, filter, reduceByKey, joins, etc. They are RDD transformations applied to each batch independently.
  5. Stateful transformations include sliding-window-based transformations and state tracking across time (updateStateByKey).
  6. A sliding window has two parameters, window duration and sliding duration. Both must be multiples of the batch interval.
    1. The window duration defines the size of the window.
    2. The sliding duration defines how frequently the computation happens.
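  7. Checkpointing saves the application state to a reliable storage system, such as HDFS or S3.
  8. Spark Streaming guarantees exactly-once semantics for its transformations. Output operations are at-least-once by default, so end-to-end exactly-once also requires idempotent or transactional output.
  9. Spark has a local mode for quick experiments (e.g., in the Spark shell).
  10. Spark has its own cluster manager (standalone mode) and also works with other cluster managers (YARN, Mesos).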
  11. Spark doesn't seem to have an embedded mode.
  12. A user submits an application (a jar file or a Python script) to Spark using the provided spark-submit script.
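
The micro-batch, sliding-window, and state-tracking notes above can be illustrated with a small plain-Python simulation. This is only a sketch of the semantics and does not use the Spark API: the function names (windowed_counts, running_totals) and the sample data are made up for illustration, and durations are expressed as a number of batches rather than as time.

```python
from collections import Counter

def windowed_counts(batches, window_len, slide_len):
    """Count keys over a sliding window of micro-batches.

    batches: one list of keys per batch interval.
    window_len, slide_len: window and sliding duration, in batches
    (hypothetical units standing in for multiples of the batch interval).
    """
    results = []
    # A result is produced once every slide_len batches.
    for end in range(slide_len, len(batches) + 1, slide_len):
        start = max(0, end - window_len)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results.append((end, dict(counts)))
    return results

def running_totals(batches):
    """Track a per-key total across all batches (in the spirit of updateStateByKey)."""
    state = {}
    history = []
    for batch in batches:
        for key, n in Counter(batch).items():
            state[key] = state.get(key, 0) + n
        history.append(dict(state))
    return history

batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]
windows = windowed_counts(batches, window_len=3, slide_len=2)
totals = running_totals(batches)
```

With a window of 3 batches and a slide of 2, a result is computed after the 2nd and 4th batch, each covering at most the 3 most recent batches. The running totals, by contrast, carry state across every batch, which is why that kind of operation relies on checkpointing.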

 

Photon from Google

Paper: http://research.google.com/pubs/pub41318.html

...

2. Supports cross-datacenter joins, using Paxos to detect duplicates, etc.
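
The duplicate-detection idea can be sketched in plain Python. This is an assumption-laden illustration, not Photon's implementation: the IdRegistry class here is an in-memory set standing in for a Paxos-replicated registry, and the event fields (click_id, query_id) and function names are hypothetical.

```python
class IdRegistry:
    """In-memory stand-in for a replicated registry of already-joined event IDs."""

    def __init__(self):
        self._seen = set()

    def try_register(self, event_id):
        """Return True only the first time event_id is registered."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True

def join_once(clicks, queries, registry):
    """Join each click to its query, emitting every click_id at most once."""
    joined = []
    for click in clicks:
        query = queries.get(click["query_id"])
        if query is None:
            continue  # query not seen yet; a real pipeline would retry later
        if registry.try_register(click["click_id"]):
            joined.append((click["click_id"], query))
    return joined

registry = IdRegistry()
queries = {"q1": "shoes"}
clicks = [{"click_id": "c1", "query_id": "q1"}]

# Two pipelines (e.g., in different datacenters) process the same click.
first_dc = join_once(clicks, queries, registry)
second_dc = join_once(clicks, queries, registry)
```

Because both pipelines consult the same registry, only the first one emits the joined record. In a real deployment the registry itself would have to be replicated consistently across datacenters, which is the role Paxos plays.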

 

Apache Apex (incubating)

Website: http://apex.incubator.apache.org/

  1. Pipeline processing architecture that can be used for both real-time and batch processing in a unified architecture.
  2. Architected for scalability, low-latency processing, high availability, and operability.
  3. Stateful fault tolerance (checkpoints operator state without the user having to write code for it).
  4. Runs natively on YARN and HDFS, with a local mode for development.
  5. Rich library of pre-built operators (Malhar) with many adapters for message buses, databases, file systems, etc.
  6. Supports Kafka as a source and sink (at any point in the topology), with a connector that manages offsets for exactly-once semantics / idempotency.
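
How offset management can give idempotent output is sketched below in plain Python. This is a generic illustration of the technique, not the Apex or Malhar API: the IdempotentSink class and replay function are hypothetical, and the sketch assumes a single partition with monotonically increasing offsets.

```python
class IdempotentSink:
    """Sink that remembers the last offset it wrote and ignores replays."""

    def __init__(self):
        self.rows = []
        self.last_offset = -1

    def write(self, offset, value):
        """Write value unless this offset was already written; return True if written."""
        if offset <= self.last_offset:
            return False  # duplicate delivery after a restart, skip it
        # A real sink would persist the row and the offset atomically.
        self.rows.append(value)
        self.last_offset = offset
        return True

def replay(records, sink, start_offset):
    """Deliver records from start_offset onward; return how many were newly written."""
    written = 0
    for offset, value in records:
        if offset >= start_offset and sink.write(offset, value):
            written += 1
    return written

records = [(0, "a"), (1, "b"), (2, "c")]
sink = IdempotentSink()
first_run = replay(records[:2], sink, start_offset=0)
# Simulate a failure and a restart that re-reads the stream from offset 0.
second_run = replay(records, sink, start_offset=0)
```

After the simulated restart, the two records that were already written are skipped and only the new one is appended, so the sink ends up with each record exactly once even though the source delivered some of them twice.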

Resource Manager

...

Frameworks

Mesos / Marathon

Website: https://mesosphere.github.io/marathon/

...