
...

Interestingly, there are cases where a set of pickles can be combined so that they decode faster, even though this requires manipulating the pickle bytecode. Although this is a fun result, this bulk de-pickling technique isn't used in PySpark.
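
As a rough illustration of the idea (and emphatically not something PySpark does), a batch of protocol-0 pickles can be spliced into a single pickle that loads as a list, by dropping each pickle's STOP opcode and wrapping the whole sequence in MARK/LIST opcodes. The helper below is a hypothetical sketch; it ignores complications such as memo (PUT/GET) collisions between the individual pickles.

{code:python}
import pickle

def combine_pickles(pickles):
    """Combine individually pickled objects into one pickle that loads as a
    list of those objects.  Sketch only: assumes protocol-0 pickles and
    ignores cross-pickle memo (PUT/GET) interactions."""
    out = bytearray(b"(")                 # MARK: start of the list's items
    for p in pickles:
        assert p.endswith(b"."), "expected a trailing STOP opcode"
        out += p[:-1]                     # append the pickle minus its STOP
    out += b"l."                          # LIST (pop items back to MARK) + STOP
    return bytes(out)

parts = [pickle.dumps(x, protocol=0) for x in (1, "two", [3.0])]
print(pickle.loads(combine_pickles(parts)))   # [1, 'two', [3.0]]
{code}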

Custom serializers

Custom serializers could improve performance for many workloads. This functionality hasn't been implemented yet, but pull requests are welcome!

The first prototype of custom serializers allowed serializers to be chosen on a per-RDD basis. A later prototype allowed only one serializer to be used for all RDDs, chosen when constructing the SparkContext.
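
To make the two prototypes concrete, the sketch below shows what a minimal pluggable serializer interface and a per-SparkContext setting might look like. All of the names here (Serializer, PickleSerializer, MarshalSerializer, the serializer constructor argument) are hypothetical, since none of this exists in PySpark yet.

{code:python}
import marshal
import pickle


class Serializer(object):
    """Minimal interface a pluggable serializer would need to expose."""
    def dumps(self, obj):
        raise NotImplementedError

    def loads(self, data):
        raise NotImplementedError


class PickleSerializer(Serializer):
    """General-purpose default: handles arbitrary picklable objects."""
    def dumps(self, obj):
        return pickle.dumps(obj, protocol=2)

    def loads(self, data):
        return pickle.loads(data)


class MarshalSerializer(Serializer):
    """Faster, but limited to the built-in types that marshal supports."""
    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)


# Per-context configuration (the later prototype): one serializer shared by
# all RDDs, chosen when the SparkContext is constructed, e.g.
#   sc = SparkContext("local", "app", serializer=MarshalSerializer())
# The earlier prototype instead let each RDD carry its own serializer.
{code}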

Even with only one serializer, there are still some subtleties here due to how PySpark handles text files. PySpark implements SparkContext.textFile() by directly calling its Java equivalent. This produces a JavaRDD[String] instead of a JavaRDD[byte[]]. When sending data to the worker processes, PySpark currently sends these strings as pickled UTF-8 strings by prepending the appropriate opcodes. From the worker's point of view, all of its incoming data is in the same pickled format, reducing complexity. If we support custom serializers, then we need extra bookkeeping to handle RDDs of strings.
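
For illustration, the "prepend the appropriate opcodes" trick amounts to wrapping the already-encoded UTF-8 bytes in a BINUNICODE opcode plus a STOP, with no re-encoding of the payload. The helper name below is hypothetical; this is a sketch of the idea rather than PySpark's actual code.

{code:python}
import pickle
import struct

def pickle_utf8(raw_bytes):
    """Wrap already UTF-8-encoded bytes in pickle opcodes so that, from the
    worker's point of view, they look like any other pickled value.
    Hypothetical sketch, not PySpark's implementation."""
    # 'X' is the BINUNICODE opcode: a 4-byte little-endian length followed by
    # the UTF-8 payload; '.' is STOP.  The payload passes through untouched.
    return b"X" + struct.pack("<I", len(raw_bytes)) + raw_bytes + b"."

assert pickle.loads(pickle_utf8(u"héllo".encode("utf-8"))) == u"héllo"
{code}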

For simple cases like sc.textFile(..).map(lambda x: ...), we can allow a stage's input and output serialization/deserialization functions to come from different serializers, so the first stage of this pipeline would use an InputOutputSerializer(UTF8Deserializer, CustomSerializer). For more complicated cases involving joins, we can use the lineage graph to generate appropriately composed serializers for each stage.
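
As a sketch of that composition (again with hypothetical names), an InputOutputSerializer could simply delegate reads to the stage's input deserializer and writes to its output serializer:

{code:python}
import pickle


class UTF8Deserializer(object):
    """Decodes the raw UTF-8 strings produced by textFile()."""
    def loads(self, data):
        return data.decode("utf-8")


class CustomSerializer(object):
    """Stand-in for whatever serializer the user has chosen (pickle here)."""
    def dumps(self, obj):
        return pickle.dumps(obj, protocol=2)

    def loads(self, data):
        return pickle.loads(data)


class InputOutputSerializer(object):
    """Pairs a deserializer for a stage's input with a serializer for its
    output, so a stage can read one format and emit another."""
    def __init__(self, input_deserializer, output_serializer):
        self.input_deserializer = input_deserializer
        self.output_serializer = output_serializer

    def loads(self, data):      # used when reading the stage's input
        return self.input_deserializer.loads(data)

    def dumps(self, obj):       # used when writing the stage's output
        return self.output_serializer.dumps(obj)


# First stage of sc.textFile(..).map(lambda x: ...): read plain text in,
# write custom-serialized records out.
stage_serializer = InputOutputSerializer(UTF8Deserializer(), CustomSerializer())
{code}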

This doesn't support efficient unions of RDDs storing different types of serialized data, although that's probably an uncommon case.

TODO: JoshRosen's GitHub contains some old drafts of a pluggable custom serialization layer. Discuss this design and whether a simplified version of it can be added to PySpark.

Daemon for launching worker processes

...