...
- DStream, the abstraction of a data stream in Spark Streaming, is a sequence of RDDs. This is called a "micro-batch" architecture. New batches are created at regular time intervals (determined by a batch interval parameter).
- Recovery from a fault uses recomputation of RDDs.
- The API supports a comprehensive set of collection functions (map, flatMap, ...) as well as windowed operations over a sliding window.
- Stateless transformations include map, filter, reduceByKey, joins etc. They are RDD transformations applied to a batch.
- Stateful transformations include sliding-window-based transformations and state tracking across time (updateStateByKey).
- A sliding window has two parameters, window duration and sliding duration. Both must be multiples of the batch interval.
- The window duration defines the size of the window.
- The sliding duration defines how often the windowed computation is performed.
- Checkpointing saves the application state to a reliable storage system, such as HDFS or S3.
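- Spark Streaming guarantees exactly-once semantics for the transformations applied to received data. End-to-end exactly-once additionally depends on the input source and on the output operations being idempotent or transactional.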
- Spark has a local mode for quick experiments (e.g. the Spark shell).
- Spark has its own cluster manager (standalone mode) and also works with other cluster managers (YARN, Mesos).
- Spark doesn't seem to have an embedded mode.
- A user submits an application (a jar file or a Python script) to Spark using a provided script (spark-submit).
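The micro-batch, sliding-window and state-tracking notes above can be made concrete with a small simulation. This is a plain-Python sketch, not Spark code: it assumes each inner list is one micro-batch and expresses the window and sliding durations as a number of batches. The function names (`windowed_counts`, `update_running_count`, `run_stateful`) are hypothetical and exist only for illustration. The update function follows the same shape as the one passed to updateStateByKey (new values plus previous state in, new state out).

```python
from collections import Counter

def windowed_counts(batches, window_len, slide_len):
    """Simulate a sliding window over micro-batches.

    window_len and slide_len are given in numbers of batches, which
    mirrors the rule that both durations are multiples of the batch
    interval. Returns a list of (end_batch, Counter) pairs.
    """
    results = []
    # One evaluation every slide_len batches.
    for end in range(slide_len, len(batches) + 1, slide_len):
        # Each evaluation covers the last window_len batches.
        start = max(0, end - window_len)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results.append((end, counts))
    return results

def update_running_count(new_values, previous):
    """Update function: new values for a key plus its previous state."""
    return sum(new_values) + (previous or 0)

def run_stateful(batches, update):
    """Simulate state tracking across batches (illustration only)."""
    state = {}
    for batch in batches:
        grouped = {}
        for key in batch:
            grouped.setdefault(key, []).append(1)
        # Keys with existing state are updated even without new values.
        for key in set(state) | set(grouped):
            state[key] = update(grouped.get(key, []), state.get(key))
    return state

batches = [["a", "b"], ["a"], ["b", "c"], ["a", "a"]]
window_results = windowed_counts(batches, window_len=3, slide_len=2)
running_state = run_stateful(batches, update_running_count)
```

With a window of 3 batches and a slide of 2, the window is evaluated after batch 2 and after batch 4, and the second evaluation covers batches 2 to 4. The running state, by contrast, accumulates over all batches seen so far.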
Photon from Google
Paper: http://research.google.com/pubs/pub41318.html
...
2. Supports cross-DC joins by using Paxos to detect duplicates, etc.
Apache Apex (incubating)
Website: http://apex.incubator.apache.org/
- Pipeline processing architecture, can be used for real-time and batch processing in a unified architecture.
- Architected for scalability, low-latency processing, high availability, operability.
- Stateful fault tolerance (checkpoints operator state without the user having to write code for it).
- Runs natively on YARN and HDFS, with a local mode for development.
- Rich library of pre-built operators (Malhar) with many adapters for message buses, databases, file systems etc.
- Supports Kafka as a source and sink (at any point in the topology), with a connector that manages offsets for exactly-once semantics / idempotency.
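The link between offset management and idempotency mentioned above can be sketched generically. The following is a minimal plain-Python illustration, not the Apex or Malhar API: the class `IdempotentSink` and its fields are hypothetical. It assumes records arrive in offset order within a partition and that the sink remembers the last offset it applied, so records replayed after a failure are skipped.

```python
class IdempotentSink:
    """Hypothetical sink that ignores records it has already applied."""

    def __init__(self):
        self.rows = []        # stand-in for the external store
        self.committed = {}   # partition -> last offset applied

    def write(self, partition, offset, value):
        # A replayed record has an offset at or below the last one applied.
        if offset <= self.committed.get(partition, -1):
            return False
        self.rows.append(value)
        self.committed[partition] = offset
        return True

sink = IdempotentSink()
sink.write(0, 0, "x")
sink.write(0, 1, "y")
# Simulated replay after a failure: the same records are delivered again.
replayed = [sink.write(0, 0, "x"), sink.write(0, 1, "y")]
sink.write(0, 2, "z")
```

In a real deployment the output and the offset would have to be stored atomically (for example in the same transaction), otherwise a crash between the two writes could still produce a duplicate or a lost record. The in-memory version here sidesteps that problem.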
Resource Manager
...
Frameworks
Mesos / Marathon
Website: https://mesosphere.github.io/marathon/
...