Discussion thread: https://lists.apache.org/thread/7442c2357pk3tf2vlgh4h7bmjzwtkh4c
Vote thread:
ISSUE:
Release: TBD

Motivation

Load Action is a synchronous import method: users send an HTTP request to import local files or data streams into Paimon. The import runs synchronously and its result is returned in the response, so users can determine whether the import succeeded directly from the response body.

Load Action is primarily suitable for importing local files.

SCENARIO

Analysts and business development staff often need to work with small, ad hoc data sets, typically one or more small CSV files scattered across various systems. The traditional process is to upload the files to OSS storage, create a table and import the data with Flink or Hive SQL, and only then begin analysis. This process is complex and tedious, and it requires prior knowledge of those tools. The 'Load Action' service simplifies it: it imports small-scale data quickly and creates the Paimon table directly, without depending on any other data processing engine.

Data is submitted over HTTP. The following curl command shows how to submit data for import.

Code Block
curl -u root -T date -H "label:123" http://paimon.com:8030/api/test/date/_load

Overview

Users can also submit the request with any other HTTP client.
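As an illustration of using another HTTP client, the following is a minimal sketch that builds the same PUT request with the JDK's `java.net.http` API (Java 11+). The class name `LoadRequestSketch` and its method are hypothetical and not part of the proposal. It assumes the `/api/{db}/{table}/_load` path and the `label` header from the curl command, and HTTP Basic authentication as implied by `curl -u`.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, for illustration only.
final class LoadRequestSketch {

    /** Builds (but does not send) a PUT request comparable to the curl command. */
    static HttpRequest build(String baseUrl, String db, String table,
                             String user, String password, String label, byte[] data) {
        // curl -u sends HTTP Basic credentials; encode them the same way here.
        String credentials = user + ":" + password;
        String basicAuth = "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));

        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/" + db + "/" + table + "/_load"))
                .header("Authorization", basicAuth)
                .header("label", label)
                // curl -T uploads the file content with PUT.
                .PUT(HttpRequest.BodyPublishers.ofByteArray(data))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, and the response body inspected to determine the outcome of the import.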

Load Action is a synchronous import method: users send an HTTP request to import local files into Paimon. The import runs synchronously and its result is returned in the response. It is primarily suitable for importing local files.

How to Load

Users only need to send one HTTP request containing the data set and its table creation information. The request then completes the following steps:

  • Verification of user import permissions.
  • Creation of a Paimon table according to the table creation information.
  • Reading of the uploaded file, data segmentation according to specified separators, followed by batch data import.
  • Execution of filtering and adjustment of column positions.

Signature parameters

Code Block
curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://load_host:http_port/api/{db}/{table}/_load


  • user/passwd: Used to verify the user's identity and import permissions.
  • label: Identifier of the import task.
  • column_separator: Column separator used in the import file. Default: \t.
  • line_delimiter: Line delimiter used in the import file. Default: \n.
  • where: Filter condition for the import task.
  • columns: Names of the columns of the data to be imported.
  • enclose: Enclosure character. When a CSV field contains line delimiters or column separators, a single-byte character can be specified as the enclosure character to prevent accidental truncation. For example, if the column separator is "," and the enclosure character is "'", then in the data "a,'b,c'" the value "b,c" is interpreted as a single field.
  • escape: Escape character. Used to escape characters in a CSV field that are the same as the enclosure character. For example, if the data is "a,'b,'c'", the enclosure character is "'", and "b,'c" should be interpreted as a single field, specify a single-byte escape character such as "\" and change the data to "a,'b,\'c'".
  • format: Format of the data to be imported. Supports csv and json. Default: csv.
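The interaction of column_separator, enclose, and escape described above can be sketched as a small field splitter. This is an illustrative sketch, not the proposed implementation. The class `CsvFieldSplitter` is hypothetical. It assumes a single-character column separator, and it assumes that the escape character applies both inside and outside enclosed sections.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class CsvFieldSplitter {

    /**
     * Splits one line into fields. Separators inside an enclosed section are kept
     * as data, and the character following an escape character is taken literally.
     */
    static List<String> split(String line, char separator, char enclose, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean enclosed = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                current.append(line.charAt(++i)); // escaped character, taken literally
            } else if (c == enclose) {
                enclosed = !enclosed;             // enclosure characters are not part of the value
            } else if (c == separator && !enclosed) {
                fields.add(current.toString());   // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());           // last field
        return fields;
    }
}
```

With "," as the separator, "'" as the enclosure character, and "\" as the escape character, the two examples from the parameter list split into the fields `a` and `b,c`, and `a` and `b,'c`, respectively.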

Using SQL to Express Load Parameters

For convenience, a sql parameter can be added to the request header. It replaces the preceding parameters, such as column_separator, line_delimiter, where, and columns.

Code Block
curl -u user:passwd [-H sql:${load_sql}...] -XPUT http://load_host:http_port/api/http_load


# -- load_sql
# INSERT INTO db.table 
# (
#     col, ...
# ) 
# SELECT 
#     col, ... 
# FROM read_files("file:///data.file") 
# with
# (
#    "column_separator" = ",",
#    "format" = "csv"
# );
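Because HTTP header values cannot contain line breaks, the SQL statement has to be passed on a single line. The following is a minimal sketch of assembling such a statement. The class `LoadSqlSketch` is hypothetical, the statement shape simply mirrors the commented example above, and the inputs are inserted verbatim without any validation or escaping.

```java
import java.util.List;

// Hypothetical helper, for illustration only.
final class LoadSqlSketch {

    /** Builds the load SQL on a single line so that it can be sent as a header value. */
    static String build(String db, String table, List<String> columns,
                        String fileUri, String columnSeparator, String format) {
        String cols = String.join(", ", columns);
        return "INSERT INTO " + db + "." + table + " (" + cols + ") "
                + "SELECT " + cols + " FROM read_files(\"" + fileUri + "\") "
                + "with (\"column_separator\" = \"" + columnSeparator + "\", "
                + "\"format\" = \"" + format + "\");";
    }
}
```

The resulting string would be passed as the value of the sql header of the request.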

Return Results

Because Load Action is a synchronous import method, the import result is returned directly to the user in the response to the request.

Example:

Code Block
{
  "Status": "Success",
  "NumberTotalRows": 1000000,
  "NumberFilteredRows": 1,
  "NumberUnselectedRows": 0,
  "LoadTimeMs": 2144
}

The fields of the Load Action import result are:

  • Status: Import completion status.
  • NumberTotalRows: Total number of rows processed during the import.
  • NumberFilteredRows: Number of rows that do not meet data quality standards.
  • NumberUnselectedRows: Number of rows filtered out by the WHERE condition.
  • LoadTimeMs: Time taken to complete the import, in milliseconds.
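A client can check these fields after the request returns. The following is a minimal sketch that reads fields from the flat JSON object shown in the example using only the JDK. The class `LoadResultSketch` is hypothetical, and a real client would normally use a JSON library instead of regular expressions. Treating "Success" as the only successful status is an assumption based on the example. Other status values are not specified here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class LoadResultSketch {

    /** Returns the string value of a top-level field, or null if it is absent. */
    static String stringField(String body, String field) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(body);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the integer value of a top-level field. */
    static long longField(String body, String field) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*(-?\\d+)")
                .matcher(body);
        if (!m.find()) {
            throw new IllegalArgumentException("Missing numeric field: " + field);
        }
        return Long.parseLong(m.group(1));
    }

    /** Assumption: "Success" is the status of a successful import, as in the example. */
    static boolean succeeded(String body) {
        return "Success".equals(stringField(body, "Status"));
    }
}
```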

Proposed Changes

To support writing multiple data formats, the `WriteStrategy` interface defines the contract for write operations and schema retrieval. Each implementation class provides the logic for one data format. For example, `CsvWriteStrategy` implements `WriteStrategy` for CSV files: it parses the data using the provided column separator and writes it.

Code Block
public interface WriteStrategy extends Serializable {

    // Parses the content using the given column separator and writes it through the batch writer.
    void writer(BatchTableWrite batchTableWrite, String content, String columnSeparator)
            throws Exception;

    // Returns the schema of the table to write to.
    Schema retrieveSchema() throws Exception;
}

Compatibility, Deprecation, and Migration Plan

None. This is a new, additive feature.

Test Plan

Unit tests (UT) and integration tests (IT).

Rejected Alternatives

...