...

Even with only one serializer, there are still some subtleties here due to how PySpark handles text files.  PySpark implements {{SparkContext.textFile()}} by directly calling its Java equivalent.  This produces a {{JavaRDD\[String\]}} instead of a {{JavaRDD\[byte\[\]\]}}.  JavaRDD transfers these strings to Python workers using Java's modified UTF-8 (MUTF-8) encoding.  (Prior to this pull request, JavaRDD would send strings to Python as pickled UTF-8 strings by prepending the appropriate pickle opcodes.  From the worker's point of view, all of its incoming data was in the same pickled format.  The pull request removed all Python-pickle-specific code from JavaRDD.)  To handle these cases, PySpark allows a stage's input deserialization and output serialization functions to come from different serializers.  For example, in {{sc.textFile(..).map(lambda x: ...).groupByKey()}} the first pipeline stage would use a {{MUTF8Deserializer}} and a {{PickleSerializer}}, and subsequent stages would use {{PickleSerializer}}s for their inputs and outputs.  PySpark uses the lineage graph to perform the bookkeeping needed to select the appropriate deserializers.
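For reference, the framing produced by Java's {{DataOutputStream.writeUTF()}} is simple: a 2-byte big-endian length followed by the encoded bytes.  Below is a minimal reader sketch; the {{read_mutf8}} helper name is hypothetical, and it approximates MUTF-8 by decoding as standard UTF-8, which differs only for null characters and supplementary code points.

{code:language=python}
import struct

def read_mutf8(stream):
    # writeUTF() prefixes each string with an unsigned 2-byte
    # big-endian length, followed by that many encoded bytes.
    (length,) = struct.unpack(">H", stream.read(2))
    data = stream.read(length)
    # Approximation: decode as standard UTF-8.  True MUTF-8 encodes
    # U+0000 as 0xC0 0x80 and supplementary characters as CESU-8-style
    # surrogate pairs, which this sketch does not handle.
    return data.decode("utf-8")
{code}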

At the moment, {{union()}} requires that its inputs were serialized with the same serializer.  When unioning an untransformed RDD created with {{sc.textFile()}} against a transformed RDD, a {{cartesian()}} product, or an RDD created with {{parallelize()}}, PySpark will force some of the RDDs to be re-serialized using the default serializer.  We might be able to add code to avoid this re-serialization, but it would add extra complexity and these particular {{union()}} usages seem uncommon.
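As a concrete illustration of when this re-serialization kicks in (a minimal sketch: the SparkContext setup and file path are placeholders, and the comments reflect the behavior described above rather than a guaranteed API contract):

{code:language=python}
from pyspark import SparkContext

sc = SparkContext("local", "union-example")

# textFile() yields strings framed with the MUTF-8 deserializer...
text = sc.textFile("input.txt")  # placeholder path

# ...while parallelize() yields objects framed with the default
# (pickle) serializer.
words = sc.parallelize(["a", "b", "c"])

# The input serializers differ, so PySpark re-serializes the
# mismatched side(s) with the default serializer before unioning.
combined = text.union(words)
{code}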

...