...
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
Tips for contributing to PySpark
Unit Testing
PySpark's tests are a mixture of doctests and unittests. The doctests serve as simple usage examples and are a lightweight way to test new RDD transformations and actions. The unittests are used for more involved testing, such as testing job cancellation.
To run the entire PySpark test suite, run ./python/run-tests
. When adding a new file that contains doctests or unittests, make sure to update run-tests
so that the new tests are automatically executed by Jenkins.
To run individual test suites:
- For
unittest
suites runSPARK_TESTING=1 ./bin/pyspark python/pyspark/my_file.py
. Some of ourdoctest
suites (such as the tests inrdd.py
) have a custom__main__
method that sets up a global SparkContext that's shared among the tests. These suites should be also be run with this command. - For pure-
doctest
suites (without special__main__
setup code), runSPARK_TESTING=1 ./bin/pyspark python/pyspark/my_file.py
.
Shipping code across the cluster
...