...

Here are some cases of performance issues:

  • Connections/threads not reused, extra overhead of resource operations:

    • Frequently create/destroy new HA (ZK, K8S) connections for leader retrieval

    • Frequently open/close Netty channel for each request

    • Frequently create/destroy ThreadPool in RestClusterClient and RestClient

  • Instances not reused, extra GC overhead:

    • For each operation, the Flink Client creates many new instances, such as ClusterDescriptor, RestClusterClient, ClientHighAvailabilityServices and RestClient

  • Concurrency bottlenecks:

    • One global ObjectMapper instance handles data serialization/deserialization for all HTTP requests and responses

  • Unnecessary workload:

    • For example, CollectResultFetcher uses a fixed retry interval (100 ms) to fetch results from the Flink cluster. This retry operation can be very resource-consuming under high concurrency.
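To illustrate the last point, here is a minimal sketch of one possible alternative to a fixed retry interval: a capped exponential backoff. The source does not say this is the approach taken. The class `RetryBackoff`, its parameters and the 2000 ms cap are hypothetical names and values chosen for illustration, not part of Flink's API.

```java
/**
 * Hypothetical helper (not a Flink class): computes a capped, exponentially
 * growing retry interval instead of a fixed one.
 */
final class RetryBackoff {
    private final long initialMs;
    private final long maxMs;

    RetryBackoff(long initialMs, long maxMs) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("require 0 < initialMs <= maxMs");
        }
        this.initialMs = initialMs;
        this.maxMs = maxMs;
    }

    /** Interval before retry number {@code attempt} (0-based): initialMs * 2^attempt, capped at maxMs. */
    long intervalMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long interval = initialMs;
        for (int i = 0; i < attempt && interval < maxMs; i++) {
            // Double the interval, guarding against overflow and the cap.
            interval = (interval > maxMs / 2) ? maxMs : interval * 2;
        }
        return interval;
    }

    public static void main(String[] args) {
        // Assumed values: start at the 100 ms mentioned above, cap at 2000 ms.
        RetryBackoff backoff = new RetryBackoff(100, 2000);
        for (int attempt = 0; attempt < 7; attempt++) {
            System.out.println("attempt " + attempt + ": wait " + backoff.intervalMs(attempt) + " ms");
        }
    }
}
```

With such a scheme, a client polling for a result that is not yet ready issues fewer requests the longer it waits, which may reduce load when many fetchers run concurrently. The trade-off is slightly higher latency for results that become ready between polls.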

...

An agent process uses the Flink JDBC Driver to continuously submit short-lived queries to the SQL Gateway service at different concurrency levels (1, 32, 64 and more) and monitors the end-to-end latency.
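The agent described above could be sketched roughly as follows. This is an illustrative harness only, not the benchmark code used by the authors: `LatencyAgent` and its methods are hypothetical names, and the `Runnable query` parameter stands in for a real call through the Flink JDBC Driver, which is omitted so the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical load agent: runs a query at a given concurrency and records latencies. */
final class LatencyAgent {

    /** Each of {@code concurrency} workers runs {@code query} {@code queriesPerWorker} times; returns latencies in ns. */
    static List<Long> run(int concurrency, int queriesPerWorker, Runnable query) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (int w = 0; w < concurrency; w++) {
                futures.add(pool.submit(() -> {
                    List<Long> latencies = new ArrayList<>(queriesPerWorker);
                    for (int i = 0; i < queriesPerWorker; i++) {
                        long start = System.nanoTime();
                        query.run(); // placeholder for submitting one short-lived query
                        latencies.add(System.nanoTime() - start);
                    }
                    return latencies;
                }));
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> f : futures) {
                all.addAll(f.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    /** Nearest-rank percentile (0 < p <= 100) over a non-empty list of latencies. */
    static long percentile(List<Long> latencies, double p) {
        if (latencies.isEmpty() || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need non-empty data and 0 < p <= 100");
        }
        List<Long> sorted = new ArrayList<>(latencies);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }

    public static void main(String[] args) {
        // Concurrency levels taken from the text; the no-op query is a stand-in.
        for (int concurrency : new int[] {1, 32, 64}) {
            List<Long> latencies = run(concurrency, 10, () -> { });
            System.out.println(concurrency + " workers, p99 = " + percentile(latencies, 99) + " ns");
        }
    }
}
```

In a real run, the `Runnable` would open a statement against the SQL Gateway and consume the result set, so the measured time covers the full end-to-end path.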

Queries

Catalog DDL

create temporary table table1 (
    val1 STRING
) WITH (
      'connector' = 'datagen',
      'number-of-rows' = '1'
);

create temporary table table2 (
    val2 STRING
) WITH (
      'connector' = 'datagen',
      'number-of-rows' = '1'
);

create temporary table table3 (
    val3 STRING
) WITH (
      'connector' = 'datagen',
      'number-of-rows' = '1'
);

DQL

Query Type

SQL

JobGraph

source



...

select val1 from table1;

...


[Image: JobGraph of the source query]

wordcount


select val1, count(*) from table1
group by val1;

...



[Image: JobGraph of the wordcount query]

join


select val1, count(*) from table1 
    left join table2 on val1=val2 
    left join table3 on val2=val3 
group by val1;



[Image: JobGraph of the join query]

Benchmark

Note: To isolate the performance bottlenecks of the Flink Client in interactive scenarios, the baseline is a version of the Flink cluster running with more scheduling optimizations than the community version (e.g. the HA improvement mentioned in FLIP-403).

...

Test Plan

Both unit tests and integration tests will be introduced to verify this change. There will also be performance benchmarks to show that these optimizations do not cause any performance regression for the Flink Client.

...