Here are some cases of performance issues:

Connections/threads not reused, extra overhead of resource operations:
- Frequently create/destroy new HA (ZK, K8S) connections for leader retrieval
- Frequently open/close Netty channel for each request
- Frequently create/destroy ThreadPool in RestClusterClient and RestClient
Instances not reused, extra GC overhead:
- For each operation, Flink Client creates many new instances, such as ClusterDescriptor, RestClusterClient, ClientHighAvailabilityServices and RestClient
Concurrency bottlenecks:
- One global ObjectMapper instance handles data serialization/deserialization for all HTTP requests and responses
Unnecessary workload:
- For example: a fixed collect retry interval (100 ms) in CollectResultFetcher to fetch results from the Flink cluster. This retry operation can be very resource-consuming under high concurrency.
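
One possible way to reduce the cost of a fixed retry interval is a capped exponential backoff. The sketch below is illustrative only: `RetryBackoff` and its parameters are hypothetical names, not part of Flink or of CollectResultFetcher, and it is not necessarily the change proposed here. The 100 ms starting value in the usage note simply mirrors the interval mentioned above.

```java
import java.time.Duration;

/** Hypothetical helper (not a Flink class): capped exponential backoff for retries. */
final class RetryBackoff {
    private final long initialMillis;
    private final long maxMillis;

    RetryBackoff(Duration initial, Duration max) {
        this.initialMillis = initial.toMillis();
        this.maxMillis = max.toMillis();
        if (initialMillis <= 0 || maxMillis < initialMillis) {
            throw new IllegalArgumentException("require 0 < initial <= max (millisecond precision)");
        }
    }

    /** Returns the wait time before the given retry attempt (0-based). */
    Duration intervalFor(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long interval = initialMillis;
        // Double once per attempt and stop at the cap; the division avoids long overflow.
        for (int i = 0; i < attempt && interval < maxMillis; i++) {
            interval = (interval > maxMillis / 2) ? maxMillis : interval * 2;
        }
        return Duration.ofMillis(interval);
    }
}
```

With an initial interval of 100 ms and a cap of 1 s, successive attempts would wait 100 ms, 200 ms, 400 ms, 800 ms and then 1 s. Whether backoff or a configurable fixed interval fits better depends on the latency budget of short-lived queries.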

...

An agent process uses the Flink JDBC Driver to continuously submit short-lived queries to the SQL Gateway service at different concurrency levels (1, 32, 64 and more) and monitors the end-to-end latency.
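
The measurement loop of such an agent could look roughly like the sketch below. It is an assumption about the harness, not the actual benchmark code: `LatencyBenchmark`, `run` and `percentile` are hypothetical names, and the query is passed in as a plain `Runnable` so the sketch stays independent of any driver.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical harness: runs a query on N concurrent workers and records latencies. */
final class LatencyBenchmark {

    /** Executes the query queriesPerWorker times on each worker; returns latencies in nanoseconds. */
    static List<Long> run(int concurrency, int queriesPerWorker, Runnable query) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (int w = 0; w < concurrency; w++) {
                futures.add(pool.submit(() -> {
                    List<Long> latencies = new ArrayList<>(queriesPerWorker);
                    for (int i = 0; i < queriesPerWorker; i++) {
                        long start = System.nanoTime();
                        query.run(); // one end-to-end query submission
                        latencies.add(System.nanoTime() - start);
                    }
                    return latencies;
                }));
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> f : futures) {
                all.addAll(f.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("query failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    /** Nearest-rank percentile (p in (0, 100]) of the recorded latencies. */
    static long percentile(List<Long> latencies, double p) {
        if (latencies.isEmpty() || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need a non-empty list and 0 < p <= 100");
        }
        List<Long> sorted = new ArrayList<>(latencies);
        Collections.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(0, index));
    }
}
```

In a real run, the `Runnable` would presumably open a connection through the Flink JDBC Driver, execute one of the queries listed below and read the result; the connection URL and statement handling are left out here because they depend on the deployment.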

Queries

Catalog DDL

create temporary table table1 (
    val1 STRING
) WITH (
      'connector' = 'datagen',
      'number-of-rows' = '1'
);

create temporary table table2 (
    val2 STRING
) WITH (
      'connector' = 'datagen',
      'number-of-rows' = '1'
);

create temporary table table3 (
    val3 STRING
) WITH (
      'connector' = 'datagen',
      'number-of-rows' = '1'
);

DQL

Query Type

SQL

JobGraph

source

...

select val1 from table1;

...

[Image: JobGraph of the source query]

wordcount

select val1, count(*) from table1
group by val1;

...

[Image: JobGraph of the wordcount query]

join

select val1, count(*) from table1 
    left join table2 on val1=val2 
    left join table3 on val2=val3 
group by val1;

[Image: JobGraph of the join query]

Benchmark

Notice: To isolate the performance bottleneck of Flink Client in interactive scenarios, we used as the baseline a version of the Flink cluster with more scheduling optimizations than the community version (e.g. the HA improvement mentioned in FLIP-403).

...

Test Plan

Both unit tests and integration tests will be introduced to verify this change. There will also be performance benchmarks to confirm that these optimizations do not cause any performance regression for Flink Client.

...
