...

Page properties

Discussion thread: https://lists.apache.org/thread/wsz4jgdpnlyw1x781f9qpk7y416b45dj
Vote thread: https://lists.apache.org/thread/m2cqp4mc1rt4hz0sxj2n50lxjs7m9wnl
JIRA: FLINK-31854 (ASF JIRA)
Release:


...

Background and Motivation


Brief Introduction to Amazon Redshift

Redshift is a cloud-based data warehousing service provided by Amazon Web Services (AWS). It is designed to handle large-scale data analytics workloads, processing petabytes of structured and semi-structured data with high performance and scalability. Redshift allows businesses to store and analyze vast amounts of data cost-effectively, using a columnar storage format and massively parallel processing, and it supports SQL queries, making it easy for users to extract insights from their data (see the Redshift documentation for details). One of the key benefits of Redshift is its ability to scale elastically, automatically adding or removing compute nodes as needed to handle changes in workload. It also integrates with a wide range of other AWS services, such as S3, Kinesis, and Lambda, enabling users to build sophisticated data processing pipelines. Overall, Amazon Redshift is a powerful tool for businesses looking to store, process, and analyze large volumes of data in the cloud with high performance, scalability, and ease of use.


Why is a Flink connector for Redshift useful?

...

  1. Real-time data analysis: With Flink, businesses can analyze data as it arrives, enabling them to respond quickly to changes and make data-driven decisions in real time.
  2. Scalability: The Flink Redshift Connector allows businesses to scale their data processing and analysis up or down as needed, depending on changes in workload.
  3. Easy integration: The Flink Redshift Connector will provide easy integration with existing Flink and Redshift workflows, enabling users to get up and running quickly.
  4. Community support: As an open-source solution, the Flink Redshift Connector benefits from a vibrant community of developers and users who contribute to its development and provide support.

...

  • Integrate with Flink Sink API (FLIP-171)
  • Integrate with Table API (FLIP-95) for Sink. Build upon Flink's new DynamicTableSink and DynamicTableSinkFactory interfaces.

Phase 2

  • Integrate with Flink's new Source API (FLIP-27)

...

  • Integrate with Table API (FLIP-95) for Source and Sink
    1. Build upon Flink's new DynamicTableSource and DynamicTableSourceFactory interfaces.
    2. Build upon Flink's new DynamicTableSink and DynamicTableSinkFactory interfaces.

Overall Design


I) Transactional Guarantees

For general information on Redshift's transactional guarantees, please refer to the documentation on managing concurrent write operations and serializable isolation. According to the Redshift documentation, although any of the four transaction isolation levels can be requested, Amazon Redshift processes all isolation levels as serializable. Amazon Redshift also uses a default automatic commit behavior in which each separately executed SQL command commits individually. As a result, individual commands such as COPY and UNLOAD are atomic and transactional, while explicit BEGIN and END should only be necessary to enforce the atomicity of multiple commands or queries.

When reading from or writing to Redshift, this connector reads and writes data in S3. Both Flink and Redshift produce partitioned output, which is stored in multiple files in S3. According to the Amazon S3 Data Consistency Model documentation, S3 bucket listing operations are eventually consistent. Therefore, flink-connector-redshift needs to take special measures to avoid missing or incomplete data due to this source of eventual consistency.
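For example, when several statements must commit together, they can be wrapped in an explicit transaction. The sketch below only illustrates that pattern under assumed names (a hypothetical staging table, S3 prefix, and IAM role); it is not the exact sequence the connector will issue.

Code Block
-- Hypothetical multi-statement load that must commit atomically; all names are placeholders
BEGIN;

COPY staging_orders
FROM 's3://my-bucket/tmp/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftS3Role'
FORMAT AS CSV;

DELETE FROM orders USING staging_orders
WHERE orders.order_id = staging_orders.order_id;

INSERT INTO orders SELECT * FROM staging_orders;

-- without the explicit BEGIN/END, each statement above would auto-commit on its own
END;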

...

To customize compression encoding for specific columns when creating a table, the connector provides a way to set the
encoding column metadata field. Please refer to the available encodings in the Amazon Redshift documentation.
Redshift allows descriptions to be attached to columns, which can be viewed in most query tools. You can specify a description for individual columns by setting the description column metadata field. This can be done using the COMMENT command.
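As a brief, hedged illustration (the table and column names are hypothetical), setting a per-column compression encoding and a column description in Redshift looks like the following:

Code Block
-- Hypothetical table: choose a compression encoding per column at creation time
CREATE TABLE page_views (
    view_id   BIGINT       ENCODE az64,
    url       VARCHAR(512) ENCODE lzo,
    viewed_at TIMESTAMP    ENCODE az64
);

-- Attach a description to a column; most query tools can display it
COMMENT ON COLUMN page_views.url IS 'Fully qualified URL of the viewed page';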


IV)

...

The Flink connector for Redshift will offer two modes to read from the source:

    1. Directly reading from Redshift using the JDBC driver.
    2. Using the UNLOAD command to execute a query, save its result to S3, and read from S3.

In flink-connector-redshift, the connector source allows users to configure different modes for data retrieval. This functionality is achieved by utilizing the `read.mode` option available in the source configuration. This flexible configuration empowers users to adapt the connector's behavior based on their specific requirements.

The flink-connector-redshift module will leverage existing connectors. When a user specifies the read mode as "jdbc" in the source configuration, the flink-connector-redshift module internally integrates with and uses the capabilities of flink-connector-jdbc. This integration enables seamless interaction with Redshift using the JDBC protocol. Leveraging this connector, flink-connector-redshift can efficiently retrieve data from the Redshift database, providing a reliable and performant data source for further processing within the Flink framework.

Alternatively, when the selected read mode is `unload`, the flink-connector-redshift module takes a different approach. In this case, flink-connector-redshift first performs preprocessing, collecting the required information from Redshift such as the schema, the S3 path to unload to, the size of the data, and other metadata; it then dynamically employs the functionality of the flink-filesystem module, specifically the S3 connector. By doing so, flink-connector-redshift is able to read data from the unloaded path in an S3 bucket. This approach leverages the scalability and durability of the S3 storage system, making it an efficient and reliable intermediary for data transfer between Redshift and the Flink framework.

In summary, the flink-connector-redshift module acts as an intelligent mediator that dynamically selects and integrates with the appropriate connectors based on the specified read mode. This flexibility allows users to choose between a direct JDBC-based interaction with Redshift or an optimized data unload mechanism using the S3 connector. By leveraging the capabilities of these connectors, the flink-connector-redshift module effectively achieves the goal of seamless and efficient data transfer between Redshift and the Flink framework, providing users with a comprehensive and adaptable solution for their data processing needs.

...

ii) Source configurations (optional)

...

Code Block
SELECT * FROM <Table_Name> LIMIT <record_size>;

...

Datatype Mapping

Flink Type          | Redshift Type
--------------------|---------------
CHAR                | VARCHAR
VARCHAR             | VARCHAR
STRING              | VARCHAR
BOOLEAN             | Boolean
BYTES               | Not supported
DECIMAL             | Decimal
TINYINT             | Int8
SMALLINT            | Int16
INTEGER             | Int32
BIGINT              | Int64
FLOAT               | Float32
DOUBLE              | Float64
DATE                | Date
TIME                | Timestamp
TIMESTAMP           | Timestamp
TIMESTAMP_LTZ       | Timestamp
INTERVAL_YEAR_MONTH | Int32
INTERVAL_DAY_TIME   | Int64
ARRAY               | Not supported
MAP                 | Not supported
ROW                 | Not supported
MULTISET            | Not supported
RAW                 | Not supported
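To make the mapping concrete, the hedged sketch below shows a hypothetical Flink SQL table whose columns exercise several of the rows above; the comments note the Redshift type each column maps to, and connection options are omitted:

Code Block
-- Hypothetical Flink table; comments note the Redshift type each column maps to
CREATE TABLE orders (
    `order_id`   BIGINT,          -- Redshift Int64
    `amount`     DECIMAL(10, 2),  -- Redshift Decimal
    `note`       STRING,          -- Redshift VARCHAR
    `is_paid`    BOOLEAN,         -- Redshift Boolean
    `created_at` TIMESTAMP        -- Redshift Timestamp
) WITH (
    'connector' = 'redshift'
    -- remaining connection options omitted; see the Connector Options tables below
);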


V) Source Design

The Flink connector for Redshift will offer two modes to read from the source:

  1. Directly reading from Redshift using the JDBC driver.
  2. Using the UNLOAD command to execute a query, save its result to S3, and read from S3.


In flink-connector-redshift, the connector source allows users to configure different modes for data retrieval. This functionality is achieved by utilizing the `scan.read.mode` option available in the source configuration. This flexible configuration empowers users to adapt the connector's behavior based on their specific requirements.

The flink-connector-redshift module will leverage existing components. When a user specifies the read mode as "jdbc" in the source configuration, the flink-connector-redshift module internally uses the Redshift JDBC driver. This integration enables seamless interaction with Redshift using the JDBC protocol. Leveraging this driver, flink-connector-redshift can efficiently retrieve data from the Redshift database, providing a reliable and performant data source for further processing within the Flink framework.

Alternatively, when the selected read mode is `unload`, the flink-connector-redshift module takes a different approach. In this case, flink-connector-redshift first performs preprocessing, collecting the required information from Redshift such as the schema, the S3 path to unload the data to, and other metadata; it then dynamically employs the functionality of the flink-filesystem module, specifically the S3 connector. By doing so, flink-connector-redshift is able to read data from the unloaded path in an S3 bucket. This approach leverages the scalability and durability of the S3 storage system, making it an efficient and reliable intermediary for data transfer between Redshift and the Flink framework. A rough sketch of the UNLOAD statement the connector would issue is shown below.
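The following is only an illustrative sketch of such an UNLOAD statement; the query, S3 prefix, and IAM role ARN are placeholders, not values the connector mandates:

Code Block
-- Hypothetical UNLOAD issued in unload mode; all identifiers are placeholders
UNLOAD ('SELECT * FROM users')
TO 's3://bucket-name/key/temp/users_'
IAM_ROLE 'arn:aws:iam::xxxxxxxx:role/xxxxxRedshiftS3Rolexxxxx'
FORMAT AS CSV;

The connector would then read the files written under that S3 prefix through the flink-filesystem S3 connector.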

In summary, the flink-connector-redshift module acts as an intelligent mediator that dynamically selects and integrates with the appropriate connectors based on the specified read mode. This flexibility allows users to choose between a direct JDBC-based interaction with Redshift or an optimized data unload mechanism using the S3 connector. By leveraging the capabilities of these connectors, the flink-connector-redshift module effectively achieves the goal of seamless and efficient data transfer between Redshift and the Flink framework, providing users with a comprehensive and adaptable solution for their data processing needs.

  1. Configuration:

    Source Connector Options

    Option                | Required               | Default | Type     | Description
    ----------------------|------------------------|---------|----------|------------
    hostname              | required               | none    | String   | Redshift connection hostname
    port                  | required               | 5439    | Integer  | Redshift connection port
    username              | required               | none    | String   | Redshift username
    password              | required               | none    | String   | Redshift password
    database.name         | required               | dev     | String   | Redshift database to connect to
    table.name            | required               | none    | String   | Redshift table name
    source.batch.size     | optional               | 1000    | Integer  | The max flush size; data is flushed once this is exceeded.
    source.flush.interval | optional               | 1s      | Duration | Asynchronous threads flush data once this interval elapses.
    source.max.retries    | optional               | 3       | Integer  | The max number of retries when a database operation fails.
    scan.read.mode        | required               | JDBC    | Enum     | Read mode; initial support for JDBC and UNLOAD.
    unload.temp.s3.path   | conditionally required | none    | String   | If scan.read.mode = UNLOAD, the Redshift UNLOAD command requires an S3 URI.
    iam-role-arn          | conditionally required | none    | String   | If scan.read.mode = UNLOAD, the Redshift UNLOAD command requires an IAM role with the necessary privileges, attached to the Redshift cluster.
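A hedged sketch of a source table registered in Flink SQL with these options is shown below; the hostname, credentials, S3 path, and IAM role ARN are placeholders, and the exact option keys follow the table above and may still evolve:

Code Block
-- Hypothetical source table in UNLOAD read mode; all connection values are placeholders
CREATE TABLE t_user_src (
    `user_id` BIGINT,
    `country` STRING,
    `score`   DOUBLE
) WITH (
    'connector' = 'redshift',
    'hostname' = 'xxxx.redshift.amazonaws.com',
    'port' = '5439',
    'username' = 'awsuser',
    'password' = 'passwordxxxx',
    'database.name' = 'tutorial',
    'table.name' = 'users',
    'scan.read.mode' = 'unload',
    'unload.temp.s3.path' = 's3://bucket-name/key/temp',
    'iam-role-arn' = 'arn:aws:iam::xxxxxxxx:role/xxxxxRedshiftS3Rolexxxxx'
);

-- Query the registered table from Flink SQL
SELECT `user_id`, `country`, `score` FROM t_user_src;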



VI) Sink Design

      The Flink connector for Redshift will offer two modes to write to Redshift:

  1. Directly writing to Redshift using the JDBC driver.
  2. Writing the Flink stream to a specified S3 path in the format and schema accepted by Redshift, then using the COPY command to load the data into the Redshift table.

The Flink connector for Redshift will provide users with the ability to configure different modes for the connector sink by utilizing the sink.write.mode option in the sink configuration. This flexibility allows users to tailor the behavior of the connector sink to their specific needs.

Internally, the flink-connector-redshift module will intelligently select and integrate with the appropriate components based on the chosen write mode. If the user specifies the write mode as "jdbc", flink-connector-redshift will use a custom Redshift JDBC driver. This integration will enable seamless interaction with Redshift using the JDBC protocol, ensuring efficient data transfer from Flink to the Redshift database. Similarly, when the write mode is a file-based operation, the flink-connector-redshift module will utilize flink-connector-filesystem. This connector will facilitate writing data to an S3 bucket, adhering to the specific format and schema requirements outlined by Redshift's COPY command, ensuring that data is written in a manner compatible with Redshift's expectations.

To provide a streamlined sink solution for Flink and Redshift integration, the flink-connector-redshift module will orchestrate and use flink-filesystem behind the scenes. It preprocesses the data and seamlessly wraps these connectors to offer a unified sink interface for transferring data from Flink to Redshift.
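As a rough, hedged sketch, the COPY statement issued in COPY mode after the files have been staged in S3 would look like the following; the table name, S3 prefix, and IAM role ARN are placeholders:

Code Block
-- Hypothetical COPY issued by the connector in COPY mode; all identifiers are placeholders
COPY users
FROM 's3://bucket-name/key/temp/'
IAM_ROLE 'arn:aws:iam::xxxxxxxx:role/xxxxxRedshiftS3Rolexxxxx'
FORMAT AS CSV;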

Sink Connector Options

Option                 | Required               | Default | Type     | Description
-----------------------|------------------------|---------|----------|------------
hostname               | required               | none    | String   | Redshift connection hostname
port                   | required               | 5439    | Integer  | Redshift connection port
username               | required               | none    | String   | Redshift username
password               | required               | none    | String   | Redshift password
database-name          | required               | dev     | String   | Redshift database to connect to
table-name             | required               | none    | String   | Redshift table name
sink.batch-size        | optional               | 1000    | Integer  | The max flush size; data is flushed once this is exceeded.
sink.flush-interval    | optional               | 1s      | Duration | Asynchronous threads flush data once this interval elapses.
sink.max-retries       | optional               | 3       | Integer  | The max number of retries when writing records to the database fails.
sink.write.mode        | required               | JDBC    | Enum     | Initial support for Redshift COPY and JDBC.
sink.write.temp.s3-uri | conditionally required | none    | String   | If sink.write.mode = COPY, the Redshift COPY command requires an S3 URI.
aws.iam-role           | conditionally required | none    | String   | If sink.write.mode = COPY, the Redshift COPY command requires an IAM role with the necessary privileges, attached to the Redshift cluster.

Update/Delete Data Considerations: The data is updated and deleted by the primary key.
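Because Redshift does not enforce primary key constraints, one common way to apply primary-key updates and deletes is a staged merge executed inside a single transaction. The sketch below is only one possible approach, not the connector's confirmed implementation; the staging table and key column are placeholders:

Code Block
-- Hypothetical staged upsert keyed on the primary key; table and column names are placeholders
BEGIN;

-- remove existing rows that are being replaced or deleted
DELETE FROM t_user USING t_user_staging
WHERE t_user.user_id = t_user_staging.user_id;

-- insert the latest version of each remaining row
INSERT INTO t_user SELECT * FROM t_user_staging;

END;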

   
               iii) Sample Source Code:   
        • Sample DataStream API Code:
Code Block
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream = .....

val prop = new Properties()
prop.setProperty("url","
jdbc:redshift://redshifthost:5439/database?user=<username>&password=<password>")
prop.setProperty("database", "test_db")
prop.setProperty("table-name", "test_sink_table")
prop.setProperty("sink.write.mode", "jdbc")


stream.addSink(new FlinkRedshiftSink(prop))
env.execute()
        • Sample Streaming Table Write:
Code Block
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

val prop = new Properties()
prop.setProperty("url","
jdbc:redshift://redshifthost:5439/database?user=<username>&password=<password>")
prop.setProperty("database", "test_db")
prop.setProperty("querytable-name", "test_sink_table")
prop.setProperty("sink.write.mode", "jdbc")

tEnv
.connect(new Redshift().properties(prop))
.inAppendMode()
.registerTableSource("sink-table")

val sql = "INSERT INTO sink-table ....."
tEnv.sqlUpdate(sql)
env.execute()

Create and sink a table in COPY mode


-- register a Redshift table `t_user` in flink sql.
CREATE TABLE t_user (
    `user_id` BIGINT,
    `user_type` INTEGER,
    `language` STRING,
    `country` STRING,
    `gender` STRING,
    `score` DOUBLE,
    PRIMARY KEY (`user_id`) NOT ENFORCED
) WITH (
    'connector' = 'redshift',
    'hostname' = 'xxxx.redshift.amazonaws.com',
    'port' = '5439',
    'username' = 'awsuser',
    'password' = 'passwordxxxx',
    'database.name' = 'tutorial',
    'table.name' = 'users',
    'sink.batch.size' = '500',
    'sink.flush.interval' = '1000',
    'sink.max.retries' = '3',
    'write.mode' = 'copy',
    'copy.temp.s3.uri' = 's3://bucket-name/key/temp',
    'iam-role-arn' = 'arn:aws:iam::xxxxxxxx:role/xxxxxRedshiftS3Rolexxxxx'
);

Create and sink a table in pure JDBC mode

-- register a Redshift table `t_user` in flink sql.
CREATE TABLE t_user (
    `user_id` BIGINT,
    `user_type` INTEGER,
    `language` STRING,
    `country` STRING,
    `gender` STRING,
    `score` DOUBLE,
    PRIMARY KEY (`user_id`) NOT ENFORCED
) WITH (
    'connector' = 'redshift',
    'hostname' = 'xxxx.redshift.amazonaws.com',
    'port' = '5439',
    'username' = 'awsuser',
    'password' = 'passwordxxxx',
    'database-name' = 'tutorial',
    'table.name' = 'users',
    'sink.batch.size' = '500',
    'sink.flush.interval' = '1000',
    'sink.max.retries' = '3'
);

-- write data into the Redshift table from the table `T`
INSERT INTO t_user
SELECT CAST(`user_id` AS BIGINT), `user_type`, `lang`, `country`, `gender`, `score` FROM T;


NOTE:

Additional Features that the Flink connector for AWS Redshift can provide on top of using JDBC:

1. Integration with AWS Redshift Workload Management (WLM): AWS Redshift allows you to configure WLM to manage query prioritization and resource allocation. The Flink connector for Redshift will be agnostic to the configured WLM while utilizing it to scale the source/sink in and out. This means that the connector can leverage the WLM capabilities of Redshift to optimize the execution of queries and allocate resources efficiently based on your defined workload priorities.

2. Abstraction of AWS Redshift Quotas and Limits: AWS Redshift imposes certain quotas and limits on various aspects such as the number of clusters, concurrent connections, queries per second, etc. The Flink connector for Redshift will provide an abstraction layer for users, allowing them to work with Redshift without having to worry about these specific limits. The connector will handle the management of connections and queries within the defined quotas and limits, abstracting away the complexity and ensuring compliance with Redshift's restrictions.

         VII

...

V) Sink Design

...

The Flink connector for Redshift will provide users with the ability to configure different modes for the connector sink by utilizing the write.mode option in the sink configuration. This configuration flexibility allows users to tailor the behavior of the connector sink to their specific needs.

Internally, the flink-connector-redshift module will intelligently select and integrate with the appropriate connectors based on the chosen write mode. If the user specifies the write mode as "jdbc" in the sink configuration, the flink-connector-redshift module will internally leverage the capabilities of flink-connector-jdbc. This integration will enable seamless interaction with Redshift using the JDBC protocol, ensuring efficient data transfer from Flink to the Redshift database.

Similarly, when the write mode is selected as a file-based operation, the flink-connector-redshift module will utilize the flink-connector-filesystem. This connector will facilitate writing data to an S3 bucket, adhering to the specific format and schema requirements outlined by Redshift's COPY command. By utilizing this connector, the flink-connector-redshift module will ensure that data is written in a manner compatible with Redshift's expectations.

To provide a streamlined sink solution for Flink and Redshift integration, the flink-connector-redshift module will orchestrate and utilize these two connectors, flink-connector-jdbc and flink-filesystem, behind the scenes. It preprocesses the data and seamlessly wraps these connectors to offer a unified sink interface for transferring data from Flink to Redshift.

In summary, the flink-connector-redshift module offers users the flexibility to configure different modes for the connector sink. By intelligently utilizing the appropriate connectors, such as flink-connector-jdbc for direct JDBC-based interaction and flink-connector-filesystem for file-based operations, the module ensures efficient data transfer to Redshift. Through its preprocessing capabilities and seamless integration, the flink-connector-redshift module acts as a comprehensive sink solution, facilitating the smooth transfer of data from Flink to Redshift.

 i) Sink configurations (mandatory)

...

    ii) Sink configurations (optional)

...

          
               iii) Sample Source Code:   
        • Sample DataStream API Code:
Code Block
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream = .....

val prop = new Properties()
prop.setProperty("url","
jdbc:redshift://redshifthost:5439/database?user=<username>&password=<password>")
prop.setProperty("database", "test_db")
prop.setProperty("table", "test_sink_table")
prop.setProperty("write.mode", "jdbc")


stream.addSink(new FlinkRedshiftSink(prop))
env.execute()
        • Sample streaming Table Write:
Code Block
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

val prop = new Properties()
prop.setProperty("url","
jdbc:redshift://redshifthost:5439/database?user=<username>&password=<password>")
prop.setProperty("database", "test_db")
prop.setProperty("table", "test_sink_table")
prop.setProperty("write.mode", "jdbc")

tEnv
.connect(new Redshift().properties(prop))
.inAppendMode()
.registerTableSource("sink-table")

val sql = "INSERT INTO sink-table ....."
tEnv.sqlUpdate(sql)
env.execute()

...

Authentication to S3 and Redshift

The use of this library involves several connections which must be authenticated/secured, all of which are illustrated in the following diagram:

References from the above diagram:
→ Redshift configured connections: Configuring a connection for JDBC driver version 2.1 for Amazon Redshi...
→ Enabling SSL in Redshift: Configuring authentication and SSL - Amazon Redshift

This library reads and writes data to S3 when transferring data to/from Redshift. As a result, it requires AWS credentials with read and write access to an S3 bucket (specified using the temp.s3.path configuration parameter).


The following describes how each connection can be authenticated:

      1. Flink to Redshift: Flink connects to Redshift via JDBC using a username and password.
        Redshift does not support the use of IAM roles to authenticate this connection.
        This connection can be secured using SSL; for more details, see the Encryption section below.
      2. Flink to S3: S3 acts as a middleman to store bulk data when reading from or writing to Redshift.
        Flink connects to S3 using both the Hadoop FileSystem interfaces and directly using the Amazon Java SDK's S3 client.

        This connection can be authenticated using either AWS keys or IAM roles.
      3. Default Authentication: flink-connector-aws provides different modes of authentication; the Redshift connector will use its CredentialProvider.
        AWS credentials will automatically be retrieved through the
        DefaultAWSCredentialsProviderChain.
        Users can use IAM instance roles to authenticate to S3 (e.g. on EMR or EC2).

POC : Redshift Connector (TABLE API Implementation)

Limitations

  1. Parallelism in flink-connector-redshift will be limited by the quotas configured for the job's Redshift connections (see Quotas and limits in Amazon Redshift).
  2. Source and sink speed and latency will be bounded by the latency and data transfer rate of Redshift's UNLOAD and COPY operations.
  3. The Flink Redshift sink will only support append-only tables (no changelog mode support).

...

The flink-connector-redshift will be compatible with the Flink source and sink interfaces.
Since this is a new connector (feature), no compatibility, deprecation, or migration plan is expected.

Test Plan

Rejected Alternatives

...

  1. Initially, the idea was to use flink-connector-jdbc to support JDBC mode in the Redshift connector, but this was rejected because it would add a dependency of flink-connector-aws on flink-connector-jdbc, creating a dependency on the release cycles of different externalized connectors.