...

As with snapshots, a new directory `/tag` will be created under the table directory to store tags. The qualified path of a tag file is `/path/to/table/tag/<tag-name>`, where the tag name is specified by the user.

New Classes

It is not necessary to introduce a new `Tag` class: a tag is very similar to a snapshot, so we can reuse `Snapshot`. To create a tag from a snapshot, we copy the corresponding snapshot file to the tag directory under the tag name; to read a tag, we deserialize the tag file into a `Snapshot`.
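
The copy-and-read flow described above can be sketched with plain `java.nio.file` calls. This is only an illustration under stated assumptions: `TagFileSketch` and its methods are hypothetical names, the local file system stands in for the table's storage, and the tag content is returned as raw text instead of being deserialized into a `Snapshot`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: a tag is a copy of a snapshot file stored at <table>/tag/<tag-name>. */
final class TagFileSketch {

    /** Qualified path of a tag file: /path/to/table/tag/<tag-name>. */
    static Path tagPath(Path tableDir, String tagName) {
        return tableDir.resolve("tag").resolve(tagName);
    }

    /** Create a tag by copying the snapshot file into the tag directory under the tag name. */
    static Path createTag(Path tableDir, Path snapshotFile, String tagName) throws IOException {
        Path target = tagPath(tableDir, tagName);
        Files.createDirectories(target.getParent());
        // Without REPLACE_EXISTING, Files.copy fails if a tag with this name already exists.
        return Files.copy(snapshotFile, target);
    }

    /** Read a tag: its content is the serialized snapshot, returned here as raw text. */
    static String readTag(Path tableDir, String tagName) throws IOException {
        return new String(Files.readAllBytes(tagPath(tableDir, tagName)), StandardCharsets.UTF_8);
    }
}
```

Because the tag file is a byte-for-byte copy, whatever deserializer reads snapshot files can read tag files unchanged.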

We need a `TagManager` to manage the tags (similar to `SnapshotManager`). 

public class TagManager {
	/** Return the root Directory of tags. */
	public Path tagDirectory(); 
 	
	/** Return the path of a tag. */
	public Path tagPath(String tagName);

	/** Create a tag from given snapshot and save it in the storage. */
	public void commitTag(Snapshot snapshot, String tagName);

	/** Expire a tag and clean unused files in the storage. */
	public void expireTag(String tagName);	

	/** Check if a tag exists. */
	public boolean tagExists(String tagName);

	/** Get the tagged snapshot. */
	public Snapshot snapshot(String tagName);

	/** Get all tagged snapshots in an iterator. */
	public Iterator<Snapshot> taggedSnapshots();

	/** Get the previous tag, i.e. the one with an earlier commit time. */
	public @Nullable String previous(String tagName);

	/** Get the next tag, i.e. the one with a later commit time. */
	public @Nullable String next(String tagName);
}
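
To illustrate the intended semantics of `previous` and `next`, here is a minimal in-memory sketch that orders tags by commit time. `TagOrderSketch` and its `add` method are hypothetical and not part of the proposal; the sketch also assumes that commit times are unique across tags.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of tag ordering by commit time. */
final class TagOrderSketch {
    // Commit time -> tag name, sorted by commit time (assumed unique per tag).
    private final TreeMap<Long, String> tagsByCommitTime = new TreeMap<>();
    // Tag name -> commit time, for looking up the position of a tag.
    private final Map<String, Long> commitTimeByTag = new HashMap<>();

    void add(String tagName, long commitTime) {
        tagsByCommitTime.put(commitTime, tagName);
        commitTimeByTag.put(tagName, commitTime);
    }

    /** Name of the closest tag with an earlier commit time, or null if there is none. */
    String previous(String tagName) {
        Long time = commitTimeByTag.get(tagName);
        if (time == null) {
            return null;
        }
        Map.Entry<Long, String> entry = tagsByCommitTime.lowerEntry(time);
        return entry == null ? null : entry.getValue();
    }

    /** Name of the closest tag with a later commit time, or null if there is none. */
    String next(String tagName) {
        Long time = commitTimeByTag.get(tagName);
        if (time == null) {
            return null;
        }
        Map.Entry<Long, String> entry = tagsByCommitTime.higherEntry(time);
        return entry == null ? null : entry.getValue();
    }
}
```

A real `TagManager` would derive the ordering from the tag files in storage instead of keeping it in memory.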


We need a `TagsTable`, which exposes tag information as the system table `<table>$tags`.

The schema of TagsTable is in section `Public Interfaces`.

DataFileMeta Modification and Compatibility

For the convenience of deleting unused data files when expiring snapshots (see `DataFiles Handling → Expiring Snapshot`), we propose to add a new field `long commitSnapshot` to `DataFileMeta`.

Compatibility

DataFileMeta Ser/De: We will upgrade `ManifestEntrySerializer` to version 3. In version 3, if the `ManifestEntrySerializer` receives a version 2 `InternalRow`, `commitSnapshot` will be set to -1.

Expiring snapshots: If `commitSnapshot` is -1, we fall back to the trivial method (walking through all data files of all tags to check whether the data file is still used).
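
The compatibility rule can be sketched as follows. Everything here is a simplified assumption made for illustration: `DataFileMetaSketch` is a hypothetical stand-in for `DataFileMeta`, a row is modeled as an `Object[]`, and the field layout of each version is invented.

```java
/** Hypothetical, simplified stand-in for DataFileMeta with only the fields needed here. */
final class DataFileMetaSketch {
    /** Marker for files written before the commitSnapshot field existed. */
    static final long UNKNOWN_COMMIT_SNAPSHOT = -1L;

    final String fileName;
    final long commitSnapshot;

    DataFileMetaSketch(String fileName, long commitSnapshot) {
        this.fileName = fileName;
        this.commitSnapshot = commitSnapshot;
    }

    /**
     * Deserialize a row written by serializer version 2 or 3.
     * Assumed layout: version 2 = [fileName], version 3 = [fileName, commitSnapshot].
     */
    static DataFileMetaSketch deserialize(int version, Object[] row) {
        String fileName = (String) row[0];
        if (version == 2) {
            // Old rows carry no commit snapshot, so mark it as unknown.
            return new DataFileMetaSketch(fileName, UNKNOWN_COMMIT_SNAPSHOT);
        }
        if (version == 3) {
            return new DataFileMetaSketch(fileName, (Long) row[1]);
        }
        throw new IllegalArgumentException("Unsupported version: " + version);
    }

    /** True when expiration must fall back to walking all data files of all tags. */
    boolean requiresFullTagScan() {
        return commitSnapshot == UNKNOWN_COMMIT_SNAPSHOT;
    }
}
```

Using a sentinel keeps old manifests readable without rewriting them, at the price of the slower fallback path for files whose commit snapshot is unknown.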

Public Interfaces

SQL Syntax of Time Travel (only for batch read)

...

SELECT * FROM t VERSION AS OF tag-name.<name>

SELECT * FROM t VERSION AS OF tag-id.<id>

SELECT * FROM t /*+ OPTIONS('scan.tag-name'='<name>') */

SELECT * FROM t /*+ OPTIONS('scan.tag-id'='<id>') */

Flink Actions

We propose to provide two Flink actions for users to control the creation and deletion of tags.

...

tag_name STRING,
tagged_snapshot_id BIGINT,
schema_id BIGINT,
commit_time BIGINT,
record_count BIGINT

...

Creating Tag

When creating a tag, the tagged snapshot file is copied to the tag directory. The copied file contains the manifest list that points to the data files.

Deleting Tag

When we delete a tag, all data files used by this tag are deletion candidates. To determine whether a data file can actually be deleted, we must take both snapshots and tags into consideration.

For snapshots, we consider the following scenarios:

  1. No snapshots. Do nothing to the candidates.
  2. Earliest snapshotId <= taggedSnapshotId: the snapshots in [earliest, taggedSnapshotId] may still use data files among the deletion candidates, so we should check:
    Full data files of the earliest snapshot should be removed from the candidates;
    Delta data files of the snapshots in (earliest, taggedSnapshotId] should be removed from the candidates because they may still be read by streaming reads.
  3. Earliest snapshotId > taggedSnapshotId: since each snapshot contains data files based on the previous snapshot, we only need to check the full data files of the earliest snapshot (and remove them from the candidates).
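
The snapshot scenarios above can be sketched as a filter over the candidate set. This is a simplified model, not the proposed implementation: `TagDeletionSketch` is a hypothetical name, data files are represented by their names, and the caller is assumed to supply the full file set of the earliest snapshot and the delta file sets per snapshot id. Tags are not covered here.

```java
import java.util.HashSet;
import java.util.NavigableMap;
import java.util.Set;

/** Hypothetical sketch of filtering deletion candidates against live snapshots. */
final class TagDeletionSketch {

    /**
     * @param tagFiles data files used by the tag being deleted (the initial candidates)
     * @param taggedSnapshotId id of the snapshot the tag was created from
     * @param earliestSnapshotId id of the earliest live snapshot, or null if there are none
     * @param fullFilesOfEarliest full data files of the earliest snapshot
     * @param deltaFilesBySnapshot delta data files keyed by snapshot id
     * @return the candidates that no live snapshot still needs
     */
    static Set<String> filterBySnapshots(
            Set<String> tagFiles,
            long taggedSnapshotId,
            Long earliestSnapshotId,
            Set<String> fullFilesOfEarliest,
            NavigableMap<Long, Set<String>> deltaFilesBySnapshot) {
        Set<String> candidates = new HashSet<>(tagFiles);
        if (earliestSnapshotId == null) {
            // Scenario 1: no snapshots, so nothing is removed from the candidates.
            return candidates;
        }
        // Scenarios 2 and 3: files still visible in the earliest snapshot must be kept.
        candidates.removeAll(fullFilesOfEarliest);
        if (earliestSnapshotId <= taggedSnapshotId) {
            // Scenario 2 only: keep delta files of the snapshots in (earliest, tagged].
            for (Set<String> delta :
                    deltaFilesBySnapshot
                            .subMap(earliestSnapshotId, false, taggedSnapshotId, true)
                            .values()) {
                candidates.removeAll(delta);
            }
        }
        return candidates;
    }
}
```

For example, with tag files {a, b, c, d}, tagged snapshot 5, earliest snapshot 3 whose full files are {a}, and delta files b, c, d committed in snapshots 4, 5 and 6 respectively, only d remains a candidate.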

...

  1. The snapshot id at which the data file is deleted (`deleteId`). This id can be obtained when we iterate over the expiring snapshots.
  2. The snapshot id at which the data file is committed (`commitId`). To get this id, we should record it in `DataFileMeta` (see section `Proposed Changes → DataFileMeta Modification and Compatibility`).
  3. The list of tagged snapshot ids (`taggedSnapshots`). This can be obtained from the tag files in storage.
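
One way these three inputs could be combined is sketched below. It rests on an assumption that is not spelled out in this excerpt: a data file is visible in every snapshot from `commitId` (inclusive) up to `deleteId` (exclusive), so a tag still uses the file exactly when some tagged snapshot id falls in that range. `ExpireCheckSketch` and `usedByTag` are hypothetical names.

```java
import java.util.TreeSet;

/** Hypothetical check of whether an expired data file is still referenced by a tag. */
final class ExpireCheckSketch {

    /**
     * Assumes the file is visible in the snapshots [commitId, deleteId).
     *
     * @return true if some tagged snapshot id lies in that range, i.e. the file must be kept
     */
    static boolean usedByTag(long commitId, long deleteId, TreeSet<Long> taggedSnapshots) {
        // Smallest tagged snapshot id that is >= commitId, or null if there is none.
        Long firstTagAtOrAfterCommit = taggedSnapshots.ceiling(commitId);
        return firstTagAtOrAfterCommit != null && firstTagAtOrAfterCommit < deleteId;
    }
}
```

Keeping the tagged snapshot ids in a sorted set makes each check a single lookup instead of a scan over all tags.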

...

  1. time travel to tag
  2. expiration of snapshots won't delete data files used by tags
  3. deleting tags deletes unused data files correctly

Compatibility tests:

  1. version 3 ManifestEntry can read old style Paimon table
  2. create tag in

Rejected Alternatives

Use name `Savepoint`

...