Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Update broken links

...

In a few cases, PySpark's internal code needs take care to avoid including unserializable objects in function closures. For example, consider this excerpt from rdd.py that wraps a user-defined function to batch its output:

...

Even with only one serializer, there are still some subtleties here due to how PySpark handles text files. PySpark implements SparkContext.textFile() by directly calling its Java equivalent. This produces a JavaRDD[String] instead of a JavaRDD[byte[]]. JavaRDD transfers these strings to Python workers using UTF-8 encoding (For 0.9.0 only, this used MUTF-8 encoding, but that was fixed for 0.9.1; see https://spark-project.atlassian.net/browse/SPARK-1043).

Prior to the custom serializers pull request, JavaRDD would send strings to Python as pickled UTF-8 strings by prepending the appropriate pickle opcodes. From the worker's point of view, all of its incoming data was in the same pickled format. The pull request removed all Python-pickle-specific code from JavaRDD.

...

PySpark pipelines transformations by composing their functions. When using PySpark, there's a one-to-one correspondence between PySpark stages and Spark scheduler stages. Each PySpark stage corresponds to a PipelinedRDD instance.

TODO: describe PythonRDD

...

Many JVMs use fork/exec to launch child processes in ProcessBuilder or Runtime.exec. These child processes initially have the same memory footprint as their parent. When the Spark worker JVM has a large heap and spawns many child processes, this can cause memory exhaustion, leading to "Cannot allocate memory" errors. In Spark, this affects both pipe() and PySpark.

Other developers have run into the same problem and discovered a variety of workarounds, including adding extra swap space or telling the kernel to overcommit memory. We can't use the java_posix_spawn library to solve this problem because it's too difficult to package due to its use of platform-specific native binaries.

...