...
Even with only one serializer, there are still some subtleties here due to how PySpark handles text files. PySpark implements {{SparkContext.textFile()}} by directly calling its Java equivalent. This produces a {{JavaRDD\[String\]}} instead of a {{JavaRDD\[byte\[\]\]}}. JavaRDD transfers these strings to Python workers using Java's MUTF-8 encoding. <ac:structured-macro ac:name="footnote" ac:schema-version="1" ac:macro-id="c9040569c7e79428-d708e5a3-43ba4dc7-846eb9c8-f9a3ac2863ad633abb430e88"><ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter><ac:rich-text-body><p>Prior to this pull request, JavaRDD would send strings to Python as pickled UTF-8 strings by prepending the appropriate pickle opcodes. From the worker's point of view, all of its incoming data was in the same pickled format. The pull request removed all Python-pickle-specific code from JavaRDD.</p></ac:rich-text-body></ac:structured-macro> To handle such cases, PySpark allows a stage's input deserialization and output serialization functions to come from different serializers. For example, in {{sc.textFile(..).map(lambda x: ...).groupByKey()}} the first pipeline stage would use a MUTF8Deserializer to read its input and a PickleSerializer to write its output, while subsequent stages would use PickleSerializers for both their inputs and outputs. PySpark walks the lineage graph to do the bookkeeping needed to select the appropriate deserializer for each stage.
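As a rough illustration of such a mixed-format stage (these helpers are a sketch, not PySpark's actual worker code; the 4-byte pickle framing is an assumption made for this example), a stage could read Java-framed strings on one side and emit pickled records on the other:

```python
import pickle
import struct


def read_mutf8_strings(stream):
    """Deserialize strings framed the way Java's DataOutputStream.writeUTF
    frames them: a 2-byte big-endian length followed by that many bytes of
    (modified) UTF-8."""
    while True:
        header = stream.read(2)
        if len(header) < 2:
            return
        (length,) = struct.unpack(">H", header)
        # Plain UTF-8 decoding; Java's modified UTF-8 differs only for
        # embedded NULs and characters outside the BMP, ignored here.
        yield stream.read(length).decode("utf-8")


def write_pickled(records, stream):
    """Serialize the stage's output as length-prefixed pickles (the 4-byte
    framing here is an assumption for illustration)."""
    for record in records:
        payload = pickle.dumps(record)
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)


def run_stage(func, in_stream, out_stream):
    # A pipelined stage whose input and output formats differ:
    # MUTF-8 strings in, pickled records out.
    write_pickled(
        (func(line) for line in read_mutf8_strings(in_stream)), out_stream
    )
```

A subsequent stage would then pair a pickle deserializer on its input with a pickle serializer on its output, matching the all-pickle case described above.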
At the moment, union() requires that its inputs were serialized with the same serializer. When unioning an untransformed RDD created with sc.textFile() with a transformed RDD, a cartesian() product, or an RDD created with parallelize(), PySpark will force some of the RDDs to be re-serialized using the default serializer. We might be able to add code to avoid this re-serialization later, but it would add extra complexity and these particular union() usages seem uncommon.
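The fallback can be sketched as follows (RDDStub and the serializer classes are hypothetical stand-ins for this example, not PySpark's internals):

```python
import pickle


class PickleSerializer:
    """Stand-in for the default serializer."""

    def dumps(self, obj):
        return pickle.dumps(obj)

    def loads(self, data):
        return pickle.loads(data)


class UTF8Serializer:
    """Stand-in for the text format produced by sc.textFile()."""

    def dumps(self, obj):
        return obj.encode("utf-8")

    def loads(self, data):
        return data.decode("utf-8")


class RDDStub:
    """Hypothetical RDD that remembers which serializer produced its data."""

    def __init__(self, records, serializer):
        self.serializer = serializer
        self.data = [serializer.dumps(r) for r in records]

    def collect(self):
        return [self.serializer.loads(b) for b in self.data]


def union(left, right, default_serializer=PickleSerializer()):
    # union() needs both inputs in the same format; when they disagree,
    # force a re-serialization of both sides with the default serializer.
    if type(left.serializer) is not type(right.serializer):
        left = RDDStub(left.collect(), default_serializer)
        right = RDDStub(right.collect(), default_serializer)
    merged = RDDStub([], left.serializer)
    merged.data = left.data + right.data
    return merged
```

Unioning a text RDD with a pickled RDD in this sketch would re-encode the text side into the default pickle format before concatenating the two.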
...