...

To customize compression encoding for specific columns when creating a table, the connector provides the encoding column metadata field. Refer to the Amazon Redshift documentation for the available encodings.
Redshift also allows descriptions to be attached to columns, which can be viewed in most query tools. You can specify a description for an individual column by setting the description column metadata field; this maps to the COMMENT command.
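As an illustration (table and column names are hypothetical), an encoding can be declared in the table DDL and a description attached with COMMENT:

```sql
-- Hypothetical example: per-column compression encoding plus a column description.
CREATE TABLE sales (
    sale_id BIGINT,
    region  VARCHAR(32)    ENCODE lzo,   -- compression encoding set per column
    amount  DECIMAL(18, 2) ENCODE az64
);

COMMENT ON COLUMN sales.region IS 'Sales region code';
```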


IV)

...

Datatype Mapping

Flink

...


...

scan.read.mode

...

ii) Source configurations (optional)

...

Code Block
SELECT * FROM <Table_Name> LIMIT <record_size>;

...

Flink Type            | Redshift Type
CHAR                  | VARCHAR
VARCHAR               | VARCHAR
STRING                | VARCHAR
BOOLEAN               | Boolean
BYTES                 | Not supported
DECIMAL               | Decimal
TINYINT               | Int8
SMALLINT              | Int16
INTEGER               | Int32
BIGINT                | Int64
FLOAT                 | Float32
DOUBLE                | Float64
DATE                  | Date
TIME                  | Timestamp
TIMESTAMP             | Timestamp
TIMESTAMP_LTZ         | Timestamp
INTERVAL_YEAR_MONTH   | Int32
INTERVAL_DAY_TIME     | Int64
ARRAY                 | Not supported
MAP                   | Not supported
ROW                   | Not supported
MULTISET              | Not supported
RAW                   | Not supported


V) Source Design

Flink connector redshift will offer two modes to read from the source:

  1. Directly reading from Redshift using the JDBC driver.
  2. Using the UNLOAD command to execute a query, save its result to S3, and read it from S3.


In flink-connector-redshift, the connector source lets users configure different modes for data retrieval via the `scan.read.mode` option in the source configuration. This flexible configuration allows users to adapt the connector's behaviour to their specific requirements.

The flink-connector-redshift module will leverage already existing connectors. When a user sets the read mode to "jdbc" in the source configuration, the module internally uses the Redshift JDBC driver. This enables seamless interaction with Redshift over the JDBC protocol, allowing flink-connector-redshift to efficiently retrieve data from the Redshift database and provide a reliable, performant data source for further processing within the Flink framework.

Alternatively, when the selected read mode is "unload", the flink-connector-redshift module takes a different approach. It first executes preprocessing to collect the required information from Redshift, such as the schema, the path to unload the data to, and other metadata. It then dynamically employs the flink-filesystem module, specifically the S3 connector, to read data from the unloaded path in an S3 bucket. This approach leverages the scalability and durability of S3, making it an efficient and reliable intermediary for data transfer between Redshift and the Flink framework.
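As a sketch of what the connector could issue in unload mode (the query, S3 path, role ARN, and output format are placeholders, not values fixed by this design):

```sql
-- Hypothetical UNLOAD issued during preprocessing in 'unload' read mode.
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/tmp/flink-unload/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
```

The S3 connector of flink-filesystem then reads the files written under that prefix.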

In summary, the flink-connector-redshift module acts as an intelligent mediator that dynamically selects and integrates with the appropriate connectors based on the specified read mode. This flexibility allows users to choose between a direct JDBC-based interaction with Redshift or an optimized data unload mechanism using the S3 connector. By leveraging the capabilities of these connectors, the flink-connector-redshift module effectively achieves the goal of seamless and efficient data transfer between Redshift and the Flink framework, providing users with a comprehensive and adaptable solution for their data processing needs.

  1. Configuration:

    Source Connector Options

Option                | Required             | Default | Type     | Description
hostname              | required             | none    | String   | Redshift connection hostname
port                  | required             | 5439    | Integer  | Redshift connection port
username              | required             | none    | String   | Redshift username
password              | required             | none    | String   | Redshift user password
database.name         | required             | dev     | String   | Redshift database to connect to
table.name            | required             | none    | String   | Redshift table name
source.batch.size     | optional             | 1000    | Integer  | The max flush size; data is flushed once this many records accumulate.
source.flush.interval | optional             | 1s      | Duration | Asynchronous threads flush data once this interval elapses.
source.max.retries    | optional             | 3       | Integer  | The max number of retries when reading records from the database fails.
scan.read.mode        | required             | jdbc    | String   | Read mode: "jdbc" or "unload".
unload.temp.s3.path   | conditional required | none    | String   | If scan.read.mode = unload, the Redshift UNLOAD command requires an S3 URI.
iam-role-arn          | conditional required | none    | String   | If scan.read.mode = unload, the Redshift UNLOAD command requires an IAM role; the role must have the required privileges and be attached to the Redshift cluster.
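A source table definition using the options above might look as follows (the connector identifier 'redshift' and all values are illustrative placeholders, not finalized):

```sql
-- Sketch of a Flink SQL source table wired to the proposed connector options.
CREATE TABLE redshift_source (
    sale_id BIGINT,
    region  STRING
) WITH (
    'connector'           = 'redshift',
    'hostname'            = 'cluster.example.redshift.amazonaws.com',
    'port'                = '5439',
    'username'            = 'awsuser',
    'password'            = '****',
    'database.name'       = 'dev',
    'table.name'          = 'sales',
    'scan.read.mode'      = 'unload',
    'unload.temp.s3.path' = 's3://my-bucket/tmp/unload/',
    'iam-role-arn'        = 'arn:aws:iam::123456789012:role/RedshiftRole'
);
```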



VI) Sink Design

      Flink connector Redshift will offer two modes to write to Redshift:

  1. Directly writing to Redshift using the JDBC driver.
  2. Flink streams write to a specified S3 path in the format and schema accepted by Redshift, then the COPY command loads the data into the Redshift table.

    Flink connector redshift will let users configure different modes for the connector sink via the sink.write.mode option in the sink configuration. This flexibility allows users to tailor the behaviour of the sink to their specific needs. Internally, the flink-connector-redshift module will select and integrate with the appropriate connectors based on the chosen write mode.

    If the user specifies the write mode as "jdbc", flink-connector-redshift will use a custom Redshift JDBC driver. This enables seamless interaction with Redshift over the JDBC protocol, ensuring efficient data transfer from Flink to the Redshift database.

    When the write mode is file-based, the flink-connector-redshift module will instead use flink-connector-filesystem to write data to an S3 bucket, adhering to the format and schema requirements of Redshift's COPY command. This ensures that data is written in a manner compatible with Redshift's expectations.

    To provide a streamlined sink for Flink-to-Redshift integration, the flink-connector-redshift module will orchestrate flink-filesystem behind the scenes: it preprocesses the data and wraps these connectors to offer a unified sink interface for transferring data from Flink to Redshift.
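In the file-based mode, the statement the sink would issue after staging files in S3 could look like this (table name, S3 prefix, role ARN, and format are placeholders):

```sql
-- Hypothetical COPY issued by the sink once a batch of files is staged in S3.
COPY sales
FROM 's3://my-bucket/tmp/flink-copy/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
```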

Sink Connector Options

Option              | Required             | Default | Type     | Description
hostname            | required             | none    | String   | Redshift connection hostname
port                | required             | 5439    | Integer  | Redshift connection port
username            | required             | none    | String   | Redshift username
password            | required             | none    | String   | Redshift user password
database-name       | required             | dev     | String   | Redshift database to connect to
table-name          | required             | none    | String   | Redshift table name
sink.batch-size     | optional             | 1000    | Integer  | The max flush size; data is flushed once this many records accumulate.
sink.flush-interval | optional             | 1s      | Duration | Asynchronous threads flush data once this interval elapses.
sink.max-retries    | optional             | 3       | Integer  | The max number of retries when writing records to the database fails.
copy-mode           | required             | false   | Boolean  | Whether to use the Redshift COPY command for insert/upsert.
copy-temp-s3-uri    | conditional required | none    | String   | If copy-mode = true, the Redshift COPY command requires an S3 URI.
iam-role-arn        | conditional required | none    | String   | If copy-mode = true, the Redshift COPY command requires an IAM role; the role must have the required privileges and be attached to the Redshift cluster.

Update/Delete Data Considerations: The data is updated and deleted by the primary key.
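Since updates and deletes key on the primary key, a sink table sketch might declare it explicitly (connector name and option spellings follow the table above; all values are illustrative placeholders):

```sql
-- Sketch of a Flink SQL sink table with a primary key for update/delete semantics.
CREATE TABLE redshift_sink (
    sale_id BIGINT,
    amount  DECIMAL(18, 2),
    PRIMARY KEY (sale_id) NOT ENFORCED
) WITH (
    'connector'        = 'redshift',
    'hostname'         = 'cluster.example.redshift.amazonaws.com',
    'port'             = '5439',
    'username'         = 'awsuser',
    'password'         = '****',
    'database-name'    = 'dev',
    'table-name'       = 'sales',
    'copy-mode'        = 'true',
    'copy-temp-s3-uri' = 's3://my-bucket/tmp/copy/',
    'iam-role-arn'     = 'arn:aws:iam::123456789012:role/RedshiftRole'
);
```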

   

...


...

 i) Sink configurations(mandatory)

...

    ii) Sink configurations (optional)

...

               iii) Sample Source Code:   

...

Additional Features that the Flink connector for AWS Redshift can provide on top of using JDBC:

1. Integration with AWS Redshift Workload Management (WLM): AWS Redshift allows you to configure WLM to manage query prioritization and resource allocation. The Flink connector for Redshift will work with whatever WLM configuration is in place and utilise it when scaling the source/sink in and out. This means the connector can leverage Redshift's WLM capabilities to optimize query execution and allocate resources efficiently based on your defined workload priorities.

2. Abstraction of AWS Redshift Quotas and Limits: AWS Redshift imposes certain quotas and limits on various aspects such as the number of clusters, concurrent connections, queries per second, etc. The Flink connector for Redshift will provide an abstraction layer for users, allowing them to work with Redshift without having to worry about these specific limits. The connector will handle the management of connections and queries within the defined quotas and limits, abstracting away the complexity and ensuring compliance with Redshift's restrictions.

       

...

 VII) Authentication to S3 and Redshift

The use of this library involves several connections which must be authenticated/secured, all of which are illustrated in the following diagram:

...