Authors:  Wei Zhong, Dian Fu

Page properties

Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-Python-UDF-Environment-and-Dependency-Management-td33514.html
Vote thread:
JIRA: FLINK-14019
Release: 1.10

Status

Current state: "Under Discussion"

Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-Python-UDF-Environment-and-Dependency-Management-td33514.html

JIRA: FLINK-14019

Released: 

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

It is necessary to support managing dependencies and the environment through the command line so that python jobs with additional dependencies can be submitted via "flink run" and the web UI or other approaches in the future. The PythonDriver class will support several new options as follows:

Short Name | Full Name | Syntax | Description
-pyfs | --pyFiles | -pyfs <filePaths> | This option already exists but currently it only appends the file to the client-side PYTHONPATH. Now it will also upload the file to the cluster and append it to the python worker's PYTHONPATH, which is equivalent to "add_python_file".
-pyexec | --pyExecutable | -pyexec <pythonInterpreterPath> | This option is equivalent to `TableEnvironment#get_config().set_python_executable()`.
-pyreq | --pyRequirements | -pyreq <requirementsFile>#<requirementsCachedDir> | This option is equivalent to "set_python_requirements". "#" can be used as the separator if "requirementsCachedDir" exists.
-pyarch | --pyArchives | -pyarch <archiveFile1>#<extractName>,<archiveFile2>#<extractName> | This option is equivalent to "add_python_archive". "," can be used as the separator for multiple archives and "#" can be used as the separator if "extractName" exists.

Implementation

Implementation of SDK API

Flink has provided a distributed cache mechanism and allows users to upload their files using the "registerCachedFile" method of ExecutionEnvironment/StreamExecutionEnvironment. The python files users specify through "add_python_file", "set_python_requirements" and "add_python_archive" are also uploaded through this method eventually.
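
For illustration, a minimal sketch of the underlying distributed cache registration (the file path and cache name below are hypothetical, not the actual names generated by PyFlink):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Files added via add_python_file / set_python_requirements / add_python_archive
// eventually become distributed cache entries similar to this:
env.registerCachedFile("/path/to/my_udf_utils.py", "python_file_0");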

...

The PythonDriver will parse those new parameters and store them in a map. When the PythonDependencyManager object (see previous section) is created, it will access the map and register its contents into itself. The sequence diagram is as follows:

Implementation

Data Structures used in Operator

Two new roles will be introduced in flink-python, named PythonDependencyManager and PythonEnvironmentManager respectively.

PythonDependencyManager is used to parse the Python dependencies uploaded from the client and provide that information to PythonEnvironmentManager.

The structure of PythonDependencyManager is as follows:

public class PythonDependencyManager {

  // create PythonDependencyManager from ExecutionConfig.getGlobalJobParameters().toMap() and
  // distributedCaches.
  public static PythonDependencyManager create(
      Map<String, String> dependencyMetaData,
      DistributedCache distributedCache) {...}

  // key is the absolute path of the files to append to PYTHONPATH, value is the original file name
  public Map<String, String> getPythonFiles() {...}

  // absolute path of requirements.txt
  public String getRequirementsFilePath() {...}

  // absolute path of the cached directory which contains user provided python packages
  public String getRequirementsDirPath() {...}

  // path of the python executable file
  public String getPythonExec() {...}

  // key is the name of the environment variable, value is the value of the environment variable
  public Map<String, String> getEnvironmentVariable() {...}

  // key is the absolute path of the zip file, value is the target directory name to be extracted to
  public Map<String, String> getArchives() {...}
}
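
A hypothetical call site inside the Python operator, based on the create(...) comment above (how the operator obtains the job parameters and distributed cache handles is an assumption of this sketch):

// Build the dependency manager from the global job parameters and the distributed cache:
PythonDependencyManager dependencyManager = PythonDependencyManager.create(
    getExecutionConfig().getGlobalJobParameters().toMap(),
    getRuntimeContext().getDistributedCache());
Map<String, String> pythonFiles = dependencyManager.getPythonFiles();  // absolute path -> original file name
String pythonExec = dependencyManager.getPythonExec();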


PythonEnvironmentManager is used to manage the execution environment of the python worker. The structure of PythonEnvironmentManager is as follows:

public interface PythonEnvironmentManager {

  /**
   * Create the Apache Beam Environment object of the python worker.
   */
  RunnerApi.Environment createEnvironment();

  /**
   * Create the RetrievalToken file which records all the files that need to be transferred via Apache Beam's
   * ArtifactService.
   */
  String createRetrievalToken();

  /**
   * Delete the files generated during the above actions.
   */
  void cleanup();
}


Flink Python UDF is implemented based on the Apache Beam Portability Framework, which uses a RetrievalToken file to record the information of the users' files. We will leverage the power of Apache Beam's artifact staging for dependency management in docker mode.

PythonEnvironmentManager has two implementations, ProcessEnvironmentManager for process mode and DockerEnvironmentManager for docker mode.

Implementation of PythonEnvironmentManager

ProcessEnvironmentManager

The structure of ProcessEnvironmentManager is as follows:

public class ProcessEnvironmentManager implements PythonEnvironmentManager {

  public static ProcessEnvironmentManager create(
      PythonDependencyManager dependencyManager,
      String tmpDirectoryBase,
      Map<String, String> systemEnv) {...}

  public ProcessEnvironmentManager(...) {
    prepareEnvironment();
  }

  @Override
  public void cleanup() {
    // perform the clean up work
    removeShutdownHook();
  }

  @Override
  public RunnerApi.Environment createEnvironment() {
    // command = path of udf runner
    return Environments.createProcessEnvironment("", "", command, generateEnvironmentVariable());
  }

  @Override
  public String createRetrievalToken() {
    // File transfer is unnecessary in process mode,
    // just create an empty RetrievalToken.
    return emptyRetrievalToken;
  }

  private Map<String, String> generateEnvironmentVariable() {
    // construct the environment variables such as PYTHONPATH, etc
  }

  private void prepareEnvironment() {
    registerShutdownHook();
    prepareWorkingDir();
  }

  private void prepareWorkingDir() {...}

  private Thread registerShutdownHook() {
    Thread thread = new Thread(new DeleteTemporaryFilesHook(pythonTmpDirectory));
    Runtime.getRuntime().addShutdownHook(thread);
    return thread;
  }
}


This class is used to prepare and clean up the working directory and other temporary directories of the python worker. It needs the information provided by PythonDependencyManager and a temporary directory as the root of the python working directory. The configured temporary directories of the current task manager can be obtained using "getContainingTask().getEnvironment().getTaskManagerInfo().getTmpDirectories()". In the current design, 3 kinds of directories need to be prepared:

  1. The directories storing the files to append to PYTHONPATH

The Flink distributed cache wipes the original file name, including the file format suffix, of the uploaded file. But different file formats require different logic when appending files to PYTHONPATH:

  • If the target file is a .py file, we must restore its original file name and append its parent directory to PYTHONPATH.
  • If the target file is an egg file or another packaging format that can be imported directly, just append the file itself to PYTHONPATH.

So it is necessary to restore the original file names of the uploaded python files. To avoid naming conflicts we should store them in separate directories. Symbolic links can be used here to save copy time and disk space.
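
A minimal sketch of this restore-by-symlink step (the paths, cache entry name and per-file directory layout are illustrative assumptions, not the actual implementation):

// The distributed cache hands over an opaque path; re-expose it under its original name
// in a per-file directory that will be appended to the python worker's PYTHONPATH.
Path cachedFile = Paths.get("/tmp/flink-dist-cache/python_file_0");               // hypothetical cache path
Path targetDir = Files.createDirectories(Paths.get("/tmp/python-dist/python_files/0"));
Files.createSymbolicLink(targetDir.resolve("my_udf_utils.py"), cachedFile);       // restore the original file name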

  2. The directory storing the pip install results of the packages listed in the uploaded requirements.txt

Obviously we should not install the users' packages into the system python environment. A feasible approach is to use the "--prefix" parameter of pip to redirect the install location to a temporary directory, then append the "bin" directory under that location to the PATH variable and the "site-packages" directory to the PYTHONPATH variable.

  3. The directory storing the extracted contents of the uploaded archives

This directory is used as the working directory of the python workers. The contents of the uploaded python archives, including the users' python environment, will be extracted to the specified sub-directory and can be accessed via relative paths in the python worker and its launcher script.

This class should create these directories and remove them when the task is closed. It is also responsible for adding a shutdown hook to ensure the created directories are deleted if the JVM exits unexpectedly, and for removing the shutdown hook when the task is closed normally to prevent memory leaks.
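
A hypothetical call site that combines the pieces above, i.e. the tmp-directory lookup quoted earlier and the create(...) signature from the class sketch (variable names are illustrative):

// Performed when the Python operator opens:
String[] tmpDirectories =
    getContainingTask().getEnvironment().getTaskManagerInfo().getTmpDirectories();
ProcessEnvironmentManager environmentManager = ProcessEnvironmentManager.create(
    dependencyManager, tmpDirectories[0], System.getenv());
// ... later, when the task is closed:
environmentManager.cleanup();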

After the above directories are ready, the shell script that launches the python workers will be executed. The installation of the required packages and the change of working directory are done in this script. For each line of the requirements.txt file, the following command will be executed:

# just indicates the intention of appending the site-packages directory to PYTHONPATH
# (the actual code is more complicated)
PYTHONPATH=${install_directory}/lib/pythonXY/site-packages:${PYTHONPATH}
export PYTHONPATH
PATH=${install_directory}/bin:${PATH}
${python} -m pip install ${every_line_content} --prefix ${install_directory} --ignore-installed --no-index --find-links ${cached_dir}

If users did not specify the cached dir, the parameters "--no-index --find-links ${cached_dir}" will not be added.
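
In that case the packages are resolved from the default package index instead, and the command roughly degenerates to:

${python} -m pip install ${every_line_content} --prefix ${install_directory} --ignore-installed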

The sequence diagram of runtime environment and dependency management is as follows:

[Image: sequence diagram of runtime environment and dependency management]

DockerEnvironmentManager

Apache Beam Portability Framework already supports artifact staging that works out of the box with the Docker environment. We can use the artifact staging service defined in Apache Beam to transfer the dependencies from the operator to Python SDK harness running in the docker container. 

In general, to support running in docker mode, the following work will be done:

  1. Build a docker image that integrates the Apache Beam Python SDK harness and Flink Python, using boot.py in Flink Python as the entrypoint of the container instead of boot.go in Apache Beam, so that the operations and coders defined in Flink can be plugged in.
  2. Build the RetrievalToken file according to the user uploaded files. The RetrievalToken is constructed by creating a ProxyManifest object and serializing it into a JSON-format string (see the sketch after this list). The definition of ProxyManifest can be found in beam_artifact_api.proto.
  3. Improve the boot.py defined in Flink Python to download files using Beam's ArtifactService and deploy them inside the docker container.
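
The following is a minimal sketch of step 2, assuming Beam's generated ArtifactApi classes and the protobuf JsonFormat utility are available on the classpath; the input map and its contents are illustrative:

// fileName -> URI reachable by the ArtifactService (hypothetical input)
Map<String, String> uploadedFiles = ...;
ArtifactApi.ProxyManifest.Builder proxyManifest = ArtifactApi.ProxyManifest.newBuilder();
for (Map.Entry<String, String> file : uploadedFiles.entrySet()) {
  proxyManifest.getManifestBuilder().addArtifact(
      ArtifactApi.ArtifactMetadata.newBuilder().setName(file.getKey()));
  proxyManifest.addLocation(
      ArtifactApi.ProxyManifest.Location.newBuilder()
          .setName(file.getKey())
          .setUri(file.getValue()));
}
// The JSON string is what gets written into the RetrievalToken file.
String retrievalToken = JsonFormat.printer().print(proxyManifest.build());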

The structure of DockerEnvironmentManager is as follows:

public class DockerEnvironmentManager implements PythonEnvironmentManager {

  public static DockerEnvironmentManager create(
      PythonDependencyManager dependencyManager,
      String tmpDirectoryBase,
      String dockerImageUrl) {...}

  public DockerEnvironmentManager(...) {
    registerShutdownHook();
  }

  @Override
  public void cleanup() {
    // perform the clean up work
    removeShutdownHook();
  }

  @Override
  public RunnerApi.Environment createEnvironment() {
    return Environments.createDockerEnvironment(dockerImageUrl);
  }

  @Override
  public String createRetrievalToken() {
    // construct the RetrievalToken according to user uploaded files
  }

  private Thread registerShutdownHook() {
    Thread thread = new Thread(new DeleteTemporaryFilesHook(pythonTmpDirectory));
    Runtime.getRuntime().addShutdownHook(thread);
    return thread;
  }
}


...

Use Cases

  1. UDF relies on numpy:

...