Motivation

A Table in Samza is an abstraction for a data source that supports random access by key. A table could be a remote data store (Espresso, for example) or a local in-memory or RocksDB-backed store. The Samza Table API [1] currently supports gets, puts and deletes. Partial updates to existing records are a commonly requested feature and are supported by many stores, so users of the Table API would benefit from having partial updates added to it. This document describes the proposed approach, and the alternatives considered, for supporting partial updates in the Table API.

Current State

The key interfaces of the Table API design are:

  • Table: At its core, Table interface represents a dataset that is accessible by a key. Table access can be asynchronous or synchronous. There are three broad categories of tables: local, remote and hybrid.
  • ReadWriteTable: Interface that represents a read-write table. It extends Table.
  • RemoteTable: Provides a unified abstraction for Samza applications to access any remote data store through stream-table join.
  • TableReadFunction & TableWriteFunction: Remote table implementations access new types of stores through pluggable I/O read and write functions (the TableReadFunction and TableWriteFunction interfaces). TableWriteFunction typically supports put and delete operations only. Update is not currently supported.


The following snippet shows a sample write to a remote table using the Samza high-level API.

Table<KV<Integer, Profile>> table = appDesc.getTable(desc);
appDesc.getInputStream("PageView", new NoOpSerde<PageView>())
       .map(new MyMapFunc())
       .join(table, new MyJoinFunc())
       .sendTo(anotherTable);

MessageStream’s sendTo method sends the messages in a MessageStream to a Table. Under the hood, sendTo creates a SendToTableOperatorSpec in the OperatorGraph, which is in turn translated to a SendToTableOperatorImpl. SendToTableOperatorImpl is the implementation of the send-to-table operator; it writes to a table by calling ReadWriteTable’s putAsync, which in turn delegates to TableWriteFunction’s putAsync method.

Classes implementing TableWriteFunction typically have distinct generic type parameters K and V specific to the table, where V is the type of the record stored in the remote data store. A partial update record is not always of the same type as the write record. Because of this type constraint, putAsync in the Table API cannot be changed to also support updates.

Proposed Solution

To support partial updates, we need to add an update API to ReadWriteTable and related interfaces. An update is a variant of a write, but it sometimes works with a record type different from the write record. AsyncReadWriteTable works with generic types K and V, where K is the key type and V is the value type of the data in the table. Adding another generic type parameter, say U, for the update type is a backward-incompatible change that results in changes all across the Table API.

Samza Table API Changes with Partial Update

The following changes have to be made:

  • Add new update methods to the Table API interfaces AsyncReadWriteTable and TableWriteFunction.
  • Add a sendUpdateTo method to the MessageStream API. This will be used to send updates to a table.
  • Create a new operator spec and implementation for a “send update to table” operation on a MessageStream:
    • SendUpdateToTableOperatorSpec
    • SendUpdateToTableOperatorImpl: Attempts to send updates using the table’s updateAsync method, in the same way that SendToTableOperatorImpl performs writes using putAsync.
  • Add an UpdateMessage class that represents an update and default value pair, used instead of KV (discussed in detail below).
AsyncReadWriteTable
public interface AsyncReadWriteTable<K, V, U> extends Table {
   ..
   ..
   /**
   * Asynchronously updates an existing record for a given key with the specified update.
   *
   * @param key the key of the record to update.
   * @param update the update applied to the record associated with a given {@code key}.
   * @param args additional arguments
   * @throws NullPointerException if the specified {@code key} is {@code null}.
   * @return CompletableFuture for the operation
   */
  CompletableFuture<Void> updateAsync(K key, U update, Object... args);
 
  /**
   * Asynchronously updates the existing records for the given keys with their corresponding updates.
   *
   * @param updates the key and update mappings.
   * @param args additional arguments
   * @throws NullPointerException if any of the specified {@code updates} has {@code null} as key.
   * @return CompletableFuture for the operation
   */
  CompletableFuture<Void> updateAllAsync(List<Entry<K, U>> updates, Object... args);
}
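
As a rough illustration of how a store might implement the two new methods, the sketch below uses a toy in-memory table whose record type V (Profile) differs from its update type U (ViewsDelta). PartialUpdateTable, Profile, ViewsDelta and InMemoryProfileTable are hypothetical stand-ins introduced for this example, not Samza classes, and the choice to fail the future when the key is absent is an assumption rather than behavior specified by this proposal.

```java
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the update portion of AsyncReadWriteTable<K, V, U>.
interface PartialUpdateTable<K, V, U> {
  CompletableFuture<Void> updateAsync(K key, U update, Object... args);
  CompletableFuture<Void> updateAllAsync(List<Entry<K, U>> updates, Object... args);
}

// V: the full record stored in the table.
final class Profile {
  final String name;
  final int views;
  Profile(String name, int views) { this.name = name; this.views = views; }
}

// U: a partial update that only carries the change to one field.
final class ViewsDelta {
  final int delta;
  ViewsDelta(int delta) { this.delta = delta; }
}

class InMemoryProfileTable implements PartialUpdateTable<Integer, Profile, ViewsDelta> {
  final Map<Integer, Profile> store = new ConcurrentHashMap<>();

  @Override
  public CompletableFuture<Void> updateAsync(Integer key, ViewsDelta update, Object... args) {
    if (key == null) {
      throw new NullPointerException("key is null");
    }
    CompletableFuture<Void> result = new CompletableFuture<>();
    // Merge the delta into the existing record; returns null when no record exists.
    Profile merged = store.computeIfPresent(
        key, (k, old) -> new Profile(old.name, old.views + update.delta));
    if (merged == null) {
      // Assumption: an update to a missing record fails the future.
      result.completeExceptionally(new IllegalStateException("No record for key " + key));
    } else {
      result.complete(null);
    }
    return result;
  }

  @Override
  public CompletableFuture<Void> updateAllAsync(List<Entry<Integer, ViewsDelta>> updates, Object... args) {
    // Apply each update independently and complete when all of them have completed.
    CompletableFuture<?>[] futures = updates.stream()
        .map(e -> updateAsync(e.getKey(), e.getValue(), args))
        .toArray(CompletableFuture[]::new);
    return CompletableFuture.allOf(futures);
  }
}
```

Because ViewsDelta is not a Profile, the same call could not be routed through putAsync, which is the type constraint that motivates the separate U parameter.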


MessageStream
public interface MessageStream<M> {
/**
  * Allows sending update messages in this {@link MessageStream} to a {@link Table} and then propagates this
  * {@link MessageStream} to the next chained operator. The type of input message is expected to be {@link KV},
  * otherwise a {@link ClassCastException} will be thrown. The value is an UpdateMessage, which holds an update and
  * a default value. Defaults are optional and can be used if the Remote Table integration supports inserting a
  * default through PUT in the event an update fails due to an existing record being absent.
  * <p>
  * Note: The update will be written but may not be flushed to the underlying table before it is propagated to the
  * chained operators. Whether the message can be read back from the Table in the chained operator depends on whether
  * it was flushed and whether the Table offers read-after-write consistency. Messages retain the original
  * partitioning scheme when propagated to the next operator.
  *
  * @param table the table to write messages to
  * @param args additional arguments passed to the table
  * @param <K> the type of key in the table
  * @param <V> the type of record value in the table
  * @param <U> the type of update value for the table
  * @return this {@link MessageStream}
  */
 <K, V, U> MessageStream<KV<K, UpdateMessage<U, V>>> sendUpdateTo(Table<KV<K, V>> table, Object ... args);
}

Handling First Time Updates

While partial updates are intended to update existing records, some cases require support for a first-time partial update, i.e., an update to a record that does not yet exist. To account for such cases, the design needs a provision to optionally supply a default record that can be PUT in the absence of an existing record. The update can then be applied on top of the default record. The approach introduces an UpdateMessage class, which captures the update and an optional default. sendUpdateTo operator which sends updates to a table is designed


UpdateMessage
/**
 * Represents an update and an optional default record to be inserted for a key,
 * if the update is applied to a non-existent record.
 *
 * @param <U> type of the update record
 * @param <V> type of the default record
 */
public final class UpdateMessage<U, V> {
  private final U update;
  @Nullable private final V defaultValue;

  public static <U, V> UpdateMessage<U, V> of(U update, @Nullable V defaultValue) {
    return new UpdateMessage<>(update, defaultValue);
  }

  public static <U, V> UpdateMessage<U, V> of(U update) {
    return new UpdateMessage<>(update, null);
  }

  private UpdateMessage(U update, V defaultValue) {
    this.update = update;
    this.defaultValue = defaultValue;
  }

  public U getUpdate() {
    return update;
  }

  public V getDefault() {
    return defaultValue;
  }
}
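
The first-time update handling described above can be sketched as follows: try the update, and if it fails and a default was supplied, PUT the default and apply the update on top of it. This is only one possible flow under stated assumptions. UpdateMsg (a trimmed copy of UpdateMessage), WritableTable, UpdateWithDefault and CounterTable are hypothetical names for this example, and treating any update failure as "record absent" is a simplification; a real integration would need to distinguish that case from other errors.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Trimmed copy of UpdateMessage so the sketch compiles on its own.
final class UpdateMsg<U, V> {
  final U update;
  final V defaultValue; // may be null when no default is supplied
  UpdateMsg(U update, V defaultValue) { this.update = update; this.defaultValue = defaultValue; }
}

// Hypothetical slice of the table API: only what the fallback flow needs.
interface WritableTable<K, V, U> {
  CompletableFuture<Void> putAsync(K key, V record);
  CompletableFuture<Void> updateAsync(K key, U update);
}

final class UpdateWithDefault {
  static <K, V, U> CompletableFuture<Void> apply(WritableTable<K, V, U> table, K key, UpdateMsg<U, V> msg) {
    return table.updateAsync(key, msg.update)
        .handle((ok, error) -> {
          if (error == null) {
            // The record existed and the update was applied.
            return CompletableFuture.<Void>completedFuture(null);
          }
          if (msg.defaultValue == null) {
            // No default to fall back on: surface the original failure.
            CompletableFuture<Void> failed = new CompletableFuture<>();
            failed.completeExceptionally(error);
            return failed;
          }
          // Assumption: the failure means the record is absent, so PUT the default and retry once.
          return table.putAsync(key, msg.defaultValue)
              .thenCompose(v -> table.updateAsync(key, msg.update));
        })
        .thenCompose(f -> f); // flatten the nested future
  }
}

// Toy store: records are counters, updates are increments; update fails if the key is absent.
final class CounterTable implements WritableTable<String, Integer, Integer> {
  final Map<String, Integer> store = new ConcurrentHashMap<>();

  @Override
  public CompletableFuture<Void> putAsync(String key, Integer record) {
    store.put(key, record);
    return CompletableFuture.completedFuture(null);
  }

  @Override
  public CompletableFuture<Void> updateAsync(String key, Integer delta) {
    CompletableFuture<Void> result = new CompletableFuture<>();
    if (store.computeIfPresent(key, (k, old) -> old + delta) == null) {
      result.completeExceptionally(new IllegalStateException("missing record: " + key));
    } else {
      result.complete(null);
    }
    return result;
  }
}
```

With an empty CounterTable, applying an update of 3 with a default of 10 to a new key first inserts 10 and then increments it, while the same update without a default leaves the table unchanged and fails the returned future.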



