Discussion thread: here (<- link to https://lists.apache.org/list.html?dev@flink.apache.org)
Vote thread: here (<- link to https://lists.apache.org/list.html?dev@flink.apache.org)
JIRA

Release: <Flink Version>


Motivation

Analyzing system logs is a common approach to routine debugging and troubleshooting.

Comprehensive, detailed logs improve visibility into the system's internal execution and make debugging and troubleshooting more efficient. However, verbose log settings can cause the following issues:

  1. A sharp increase in log volume, which fills disks faster.
  2. A risk of performance degradation caused by the large volume of log output.
  3. The need to simplify the log configuration again afterwards.

Therefore, a mechanism for dynamically adjusting the log output level of a running cluster would be useful when diagnosing online issues or debugging programs.

This mechanism should ideally provide the following two basic capabilities:

  1. Dynamically adjust log levels.
  2. Query the current log levels of the JM/TM in the cluster.


Pre-research on logging frameworks

These examples document how log levels can be adjusted dynamically in each logging framework and demonstrate that doing so is feasible.
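
As a dependency-free illustration of the idea, the sketch below applies a map of logger names to levels at runtime using java.util.logging from the JDK. It is a stand-in only, not one of the frameworks listed in this proposal. The names RuntimeLogLevelSketch, applyLevels, toJulLevel and effectiveLevel are hypothetical, and the mapping from SLF4J-style level names to JUL levels is an assumption made for the sketch. Log4j and Logback would use their own APIs for the same step.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

final class RuntimeLogLevelSketch {

    // JUL only holds loggers weakly; keep strong references so configured levels survive GC.
    private static final Map<String, Logger> PINNED = new ConcurrentHashMap<>();

    /** Assumed mapping from SLF4J-style level names to JUL levels. */
    static Level toJulLevel(String levelName) {
        switch (levelName.toUpperCase(Locale.ROOT)) {
            case "TRACE": return Level.FINEST;
            case "DEBUG": return Level.FINE;
            case "INFO":  return Level.INFO;
            case "WARN":  return Level.WARNING;
            case "ERROR": return Level.SEVERE;
            default: throw new IllegalArgumentException("Unknown level: " + levelName);
        }
    }

    /** JUL names its root logger with the empty string. */
    private static String julName(String loggerName) {
        return "root".equals(loggerName) ? "" : loggerName;
    }

    /** Applies every (logger name -> level name) entry to the running JVM. */
    static void applyLevels(Map<String, String> loggerLevel) {
        for (Map.Entry<String, String> entry : loggerLevel.entrySet()) {
            Logger logger = PINNED.computeIfAbsent(julName(entry.getKey()), n -> Logger.getLogger(n));
            logger.setLevel(toJulLevel(entry.getValue()));
        }
    }

    /** Returns the level in effect for a logger, walking up to its parents if none is set. */
    static String effectiveLevel(String loggerName) {
        Logger logger = Logger.getLogger(julName(loggerName));
        while (logger.getLevel() == null && logger.getParent() != null) {
            logger = logger.getParent();
        }
        return logger.getLevel() == null ? "UNSET" : logger.getLevel().getName();
    }
}
```

Calling applyLevels with a "root" entry changes the level inherited by every logger that has no explicit level of its own, which is the same effect the proposed REST API aims for on the JM and TM processes.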

Public Interfaces

Introduce the following REST APIs:

  • /logger/level   (PUT)
- METHOD: PUT
- Response code: 200 OK
- Request:
	{
		"loggerLevel": {
			"root": "DEBUG",
			"akka.xxx": "INFO",
			...
		}
	}
- Response:
{}
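
A client call to the PUT endpoint could look like the following sketch, which only builds the request and does not send it. The class and method names are hypothetical, the base URL is supplied by the caller, and the hand-written JSON rendering assumes logger names and levels contain no characters that need escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;
import java.util.stream.Collectors;

final class LogLevelPutRequestSketch {

    /** Renders {"loggerLevel": {...}} from a logger name -> level map (no JSON escaping). */
    static String toJsonBody(Map<String, String> loggerLevel) {
        String entries = loggerLevel.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return "{\"loggerLevel\":{" + entries + "}}";
    }

    /** Builds (but does not send) the PUT request for the proposed /logger/level endpoint. */
    static HttpRequest buildPut(String restBaseUrl, Map<String, String> loggerLevel) {
        return HttpRequest.newBuilder(URI.create(restBaseUrl + "/logger/level"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(toJsonBody(loggerLevel)))
                .build();
    }
}
```

The resulting request can be passed to java.net.http.HttpClient#send. Per the proposal, a successful call returns 200 OK with an empty JSON object.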




  • /logger/level    (GET)
- METHOD: GET
- Response code: 200 OK
- Request:
{}
- Response:
{
	"JobManager": {
		"jm-1@xx.xx.xx.xx": {
			"rootLogger": "INFO",
			...
		}
	},
	"TaskManager": {
		"tm-1@xx.xx.xx.xx": {
			"rootLogger": "INFO",
			...
		},
		...
	}
}


Items to note

  • Why only SLF4J (SLF4J with log4j1/log4j2/logback)?

Flink logs internally through the SLF4J facade, so only SLF4J and the backends bound to it are considered.

  • Re-registration of TM

If the RM has already applied a dynamic log level change, a newly registered TM applies the same change.

  • In HA mode, the change and query interfaces do not take effect on JM instances in the standby role
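
The re-registration behaviour noted above can be sketched as follows. All classes here are hypothetical in-memory stand-ins, not Flink classes. The sketch also assumes the RM keeps a merged record of every change it has applied, which the proposal does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReRegistrationReplaySketch {

    /** Hypothetical stand-in for a TaskManager that can apply log level changes. */
    static final class FakeTaskManager {
        final Map<String, String> levels = new HashMap<>();

        void changeLogLevel(Map<String, String> loggerLevel) {
            levels.putAll(loggerLevel);
        }
    }

    /** Hypothetical stand-in for the ResourceManager side. */
    static final class FakeResourceManager {
        // Assumption: the RM remembers the merged result of all changes applied so far.
        private final Map<String, String> appliedLevels = new HashMap<>();
        private final List<FakeTaskManager> registered = new ArrayList<>();

        /** Records the change and forwards it to every TM registered so far. */
        void changeLogLevel(Map<String, String> loggerLevel) {
            appliedLevels.putAll(loggerLevel);
            for (FakeTaskManager tm : registered) {
                tm.changeLogLevel(loggerLevel);
            }
        }

        /** Registers a TM and replays earlier changes so it matches the rest of the cluster. */
        void register(FakeTaskManager tm) {
            registered.add(tm);
            if (!appliedLevels.isEmpty()) {
                tm.changeLogLevel(new HashMap<>(appliedLevels));
            }
        }
    }
}
```

With this shape, a TM that registers after a change ends up with the same levels as one that was registered when the change was made.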

Proposed Changes

Changes for ‘/logger/level’ (PUT)

  • Add an RPC method to ResourceManagerGateway
CompletableFuture<List<Acknowledge>> changeLogLevel(@Nonnull ChangeLogLevelRequest changeLogLevelRequest);


  • Add an RPC method to TaskExecutorGateway
CompletableFuture<Acknowledge> changeLogLevel(@Nonnull ChangeLogLevelRequest request);


  • Introduce a class named ChangeLogLevelRequest
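import java.io.Serializable;
import java.util.Map;

class ChangeLogLevelRequest implements Serializable {

	// Maps a logger name (e.g. "root") to the target level (e.g. "DEBUG").
	Map<String, String> loggerLevel;

	// other placeholders…

}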
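
One way the RM side of changeLogLevel could work is to forward the request to every registered TaskExecutor and complete once all of them have acknowledged. The sketch below illustrates this under stated assumptions: Ack is a local stand-in for Flink's Acknowledge, TaskExecutorStub is a hypothetical reduced view of TaskExecutorGateway, and a plain map replaces ChangeLogLevelRequest. Error handling and timeouts are left out.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class ChangeLogLevelFanOutSketch {

    /** Local stand-in for Flink's Acknowledge message. */
    enum Ack { INSTANCE }

    /** Hypothetical, reduced view of the proposed TaskExecutorGateway method. */
    interface TaskExecutorStub {
        CompletableFuture<Ack> changeLogLevel(Map<String, String> loggerLevel);
    }

    /** Sends the change to every TaskExecutor and completes when all have acknowledged. */
    static CompletableFuture<List<Ack>> changeLogLevel(
            List<TaskExecutorStub> taskExecutors, Map<String, String> loggerLevel) {
        List<CompletableFuture<Ack>> pending = taskExecutors.stream()
                .map(te -> te.changeLogLevel(loggerLevel))
                .collect(Collectors.toList());
        // allOf completes only after every future has; join() then returns without blocking.
        return CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> pending.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }
}
```

If any TaskExecutor call fails, the combined future completes exceptionally. Whether partial success should be reported instead is a design decision the proposal leaves open.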

Changes for ‘/logger/level’ (GET)

  • Add an RPC method to ResourceManagerGateway
CompletableFuture<Map<String, Map<String, String>>> getLogLevel();


  • Add an RPC method to TaskExecutorGateway
CompletableFuture<Map<String, String>> getLogLevel();
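
To show how these return types could line up with the GET response layout, the sketch below combines the JM's own levels with the map returned by the RM. It assumes the RM's outer map is keyed by TM identifier, which the proposal does not state explicitly. GetLogLevelResponseSketch and buildResponse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

final class GetLogLevelResponseSketch {

    /** Builds {"JobManager": {jmId: levels}, "TaskManager": {tmId: levels, ...}}. */
    static CompletableFuture<Map<String, Map<String, Map<String, String>>>> buildResponse(
            String jobManagerId,
            Map<String, String> jobManagerLevels,
            CompletableFuture<Map<String, Map<String, String>>> taskManagerLevels) {
        return taskManagerLevels.thenApply(perTaskManager -> {
            Map<String, Map<String, String>> jobManager = new LinkedHashMap<>();
            jobManager.put(jobManagerId, jobManagerLevels);

            Map<String, Map<String, Map<String, String>>> response = new LinkedHashMap<>();
            response.put("JobManager", jobManager);
            response.put("TaskManager", perTaskManager);
            return response;
        });
    }
}
```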

Compatibility, Deprecation, and Migration Plan

N/A

Test Plan

Test it with Flink's existing REST framework test suites.

Rejected Alternatives

N/A

Acknowledgements

Thanks to Rui for the inspiration.
