
This is just a rough outline until I complete this page. - Josh

Overview

PySpark is built on top of Spark's Java API. Data is processed in Python and cached/shuffled in the JVM.

In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
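
To make that first hop concrete, here is a minimal sketch of the Py4J pattern the driver relies on, written against the standalone py4j package. It is illustrative only, not PySpark's actual launcher code, and it assumes a local Java installation.

# Launch a JVM running the Py4J GatewayServer, then call into Java objects
# over a local socket. Method calls are small control messages, not bulk data.
from py4j.java_gateway import JavaGateway, GatewayParameters, launch_gateway

port = launch_gateway()                       # start the JVM, get its listening port
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))

jvm_list = gateway.jvm.java.util.ArrayList()  # a Java object, proxied in Python
jvm_list.add("hello from Python")
print(jvm_list.size())                        # -> 1

gateway.shutdown()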

RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
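
The sketch below shows the general shape of that worker interaction, not the real framed protocol: a parent process (standing in for the JVM-side PythonRDD) pipes a pickled function and its input to a Python subprocess over stdin and reads the pickled results back over stdout. A builtin function is used here so that the stdlib pickle can ship it by reference; shipping arbitrary user functions is covered in the Serialization section below.

import pickle
import subprocess
import sys

# The "worker": reads (function, data) from its stdin pipe, applies the
# function, and writes the results back to its stdout pipe.
worker_source = r"""
import pickle, sys
func, items = pickle.load(sys.stdin.buffer)
pickle.dump([func(x) for x in items], sys.stdout.buffer)
sys.stdout.buffer.flush()
"""

proc = subprocess.Popen([sys.executable, "-c", worker_source],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# abs is a builtin, so plain pickle can serialize it by reference.
out, _ = proc.communicate(pickle.dumps((abs, [-1, -2, 3])))
print(pickle.loads(out))    # -> [1, 2, 3]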

Serialization

User-defined functions (e.g. lambdas or functions passed to map, flatMap) are serialized using PiCloud's cloudpickle library and shipped to remote Python workers. Serializing arbitrary functions / closures is tricky, but this library handles most common cases (including referencing objects in enclosing scopes).
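
A small example shows why a closure-aware serializer is needed. It uses the standalone cloudpickle package (PySpark bundles its own copy of this library); the exact exception raised by the stdlib pickler may vary by Python version.

import pickle
import cloudpickle

def make_adder(n):
    return lambda x: x + n           # closes over n in the enclosing scope

add_five = make_adder(5)

payload = cloudpickle.dumps(add_five)   # code object + captured closure are serialized by value
restored = pickle.loads(payload)        # plain pickle can deserialize the result
print(restored(10))                     # -> 15

try:
    pickle.dumps(add_five)              # stdlib pickle serializes functions by reference and fails here
except (pickle.PicklingError, AttributeError) as exc:
    print("stdlib pickle cannot ship this function:", exc)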

Data is currently serialized using the Python cPickle serializer. As an optimization, we store and serialize objects in small batches.
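
The batching idea can be illustrated with a short sketch: group the record stream into small lists and pickle each list, amortizing the per-call serialization overhead. The batch size below is just an example value, not PySpark's actual setting.

import io
import pickle
from itertools import islice

def batched_dump(iterator, stream, batch_size=1024):
    """Pickle objects from `iterator` to `stream` in lists of up to `batch_size`."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        pickle.dump(batch, stream)

def batched_load(stream):
    """Yield objects back out of the pickled batches, one at a time."""
    while True:
        try:
            yield from pickle.load(stream)
        except EOFError:
            return

# Round trip through an in-memory buffer.
buf = io.BytesIO()
batched_dump(range(3000), buf)
buf.seek(0)
assert list(batched_load(buf)) == list(range(3000))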

TODO: link to the early performance tests using different batch sizes, and link to the experimental custom serializers branch.

Process Re-Use

TODO: describe the motivation for using a separate "fork-server" process to launch the Python workers.
