...

Even with only one serializer, there are still some subtleties here due to how PySpark handles text files. PySpark implements SparkContext.textFile() by directly calling its Java equivalent, which produces a JavaRDD[String] rather than a JavaRDD[byte[]]. JavaRDD transfers these strings to Python workers using Java's modified UTF-8 (MUTF-8) encoding.

{footnote}Prior to this pull request, JavaRDD would send strings to Python as pickled UTF-8 strings by prepending the appropriate pickle opcodes.  From the worker's point of view, all of its incoming data was in the same pickled format.  The pull request removed all Python-pickle-specific code from JavaRDD.{footnote}
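
To make the MUTF-8 transfer described above concrete, the following is a minimal sketch of how a Python worker could decode such a string. It is not PySpark's actual deserializer: the function names decode_mutf8 and read_java_utf are hypothetical, and the framing (a 2-byte big-endian length prefix) assumes the format written by java.io.DataOutputStream.writeUTF, which may differ from what PySpark puts on the wire.

```python
import io
import struct


def decode_mutf8(data: bytes) -> str:
    """Decode Java modified UTF-8 bytes into a Python str (illustrative helper)."""
    # Modified UTF-8 writes U+0000 as the two bytes C0 80, so a raw NUL byte
    # never appears in the payload. 0xC0 is not used for anything else.
    data = data.replace(b"\xc0\x80", b"\x00")
    # Supplementary characters arrive as two 3-byte encoded surrogates.
    # "surrogatepass" lets them through the UTF-8 decoder as lone surrogates.
    text = data.decode("utf-8", "surrogatepass")
    # A UTF-16 round trip joins each surrogate pair into one code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")


def read_java_utf(stream) -> str:
    """Read one string framed as writeUTF does (assumed framing, see above)."""
    # 2-byte unsigned big-endian length, counting encoded bytes, not characters.
    (length,) = struct.unpack(">H", stream.read(2))
    return decode_mutf8(stream.read(length))


if __name__ == "__main__":
    # "a", then U+0000, then U+1F600 encoded as a surrogate pair.
    payload = b"a\xc0\x80\xed\xa0\xbd\xed\xb8\x80"
    framed = io.BytesIO(struct.pack(">H", len(payload)) + payload)
    print(repr(read_java_utf(framed)))
```

A plain UTF-8 decoder agrees with MUTF-8 for most text, so the two differences handled here (the embedded NUL and characters outside the Basic Multilingual Plane) are the cases where a worker that treats the bytes as ordinary UTF-8 would fail or return the wrong string.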

...