Comment: New proposal! Going to send this to the list.

...

As an optimization to avoid sending field names with every message, allow clients (and servers) to cache metadata for data they are about to send. This is done by registering an ID for that datatype; the ID can then be used in future messages to refer to the metadata without retransmitting it. This encoding will not actually be smaller for a single value of a type, but if multiple values of the same type are sent the savings can be significant.
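To make the trade-off concrete, here is a rough byte count of the metadata alone (typename, field names, type ID), ignoring the field values and any enclosing message framing. It is a sketch under stated assumptions: the `User` typename with fields `id`, `name` and `email` is hypothetical, the type ID is assumed to be non-zero, and every field number is assumed to fit in a one-byte tag.

```python
def varint_len(n: int) -> int:
    """Bytes needed to varint-encode a non-negative integer."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def string_field_len(s: str) -> int:
    """One tag byte + length prefix + UTF-8 payload."""
    payload = len(s.encode("utf-8"))
    return 1 + varint_len(payload) + payload


def id_field_len(type_id: int) -> int:
    """One tag byte + varint payload (assumes a non-zero ID)."""
    return 1 + varint_len(type_id)


def names_len(typename, field_names) -> int:
    return string_field_len(typename) + sum(string_field_len(f) for f in field_names)


def metadata_without_ids(count, typename, field_names) -> int:
    """Metadata bytes if names were resent with every value (no ID at all)."""
    return count * names_len(typename, field_names)


def metadata_with_ids(count, typename, type_id, field_names) -> int:
    """Metadata bytes with names sent once, then only the ID."""
    if count == 0:
        return 0
    first = names_len(typename, field_names) + id_field_len(type_id)
    return first + (count - 1) * id_field_len(type_id)


# Hypothetical type used only for illustration.
USER_FIELDS = ["id", "name", "email"]
```

Under these assumptions a single value costs 25 metadata bytes with registration versus 23 without, while 100 values cost 223 versus 2300.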

Type registration will be per-connection (meaning IDs cannot be cached between connections). This removes the need for synchronization on the server and decouples type registrations from the internal details of PDX. It also means that the drivers only have to keep track of a relatively small amount of data.

The outline of type registration for the client is this:

...

The first time a client (or server) sends a type, it will send it with the NewStruct message, along with a unique ID number, which will lead the other side to cache it. After that, it should reuse that ID number and send StructByID messages.
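The sending side of this flow can be sketched as a small per-connection registry. This is an illustration rather than a driver implementation: the `OutboundTypeRegistry` class, its `encode` method, and the choice to key types by typename plus field names are all assumptions, and the two dataclasses only mirror the shape of the Protobuf messages.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NewStruct:
    typename: str
    type_id: int
    field_names: List[str]
    fields: List[object]


@dataclass
class StructByID:
    type_id: int
    fields: List[object]


class OutboundTypeRegistry:
    """Tracks which types this side has already sent on ONE connection."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, Tuple[str, ...]], int] = {}
        self._next_id = 1

    def encode(self, typename: str, field_names: List[str], values: List[object]):
        key = (typename, tuple(field_names))
        type_id = self._ids.get(key)
        if type_id is None:
            # First use on this connection: assign an ID and send full metadata.
            type_id = self._next_id
            self._next_id += 1
            self._ids[key] = type_id
            return NewStruct(typename, type_id, list(field_names), list(values))
        # Already registered: send only the ID and the values.
        return StructByID(type_id, list(values))


# Hypothetical usage; the field names are made up for the example.
registry = OutboundTypeRegistry()
first = registry.encode("User", ["id", "name"], [1, "alice"])
second = registry.encode("User", ["id", "name"], [2, "bob"])
```

Because registration is per-connection, a driver would create a fresh registry for each new connection, and the first value sent on it would be a NewStruct again.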

So for example, using the same User from above:

...

Protobuf type registration:
syntax = "proto3";

// Provides google.protobuf.NullValue.
import "google/protobuf/struct.proto";

message Entry {
    EncodedValue key = 1;
    EncodedValue value = 2;
}

message EncodedValue {
    oneof value {
        // primitives
        int32 intResult = 1;
        int64 longResult = 2;
        int32 shortResult = 3;
        int32 byteResult = 4;
        bool booleanResult = 5;
        double doubleResult = 6;
        float floatResult = 7;
        bytes binaryResult = 8;
        string stringResult = 9;
        google.protobuf.NullValue nullResult = 11;
        NewStruct newStruct = 12;
        StructByID structById = 13;

        // Result serialized using a custom serialization format. This can only be used if
        // A HandshakeRequest is sent with valueFormat set to a valid format.
        //
        // See HandshakeRequest.valueFormat.
        bytes customObjectResult = 14;

        // Collections
        List listResult = 15;
        Map mapResult = 16;

        // Primitive arrays
        NumericArray intArray = 17;
        NumericArray longArray = 18;
        NumericArray shortArray = 19;
        NumericArray booleanArray = 20;
        ByteArrayArray byteArrayArray = 21;
        ObjectArray  objectArray = 22;

        // Used in NewStruct messages for defining fields that can be of multiple types.
        // This encoded value will contain the actual type of the field but the type
        // definition will have Object for the field type.
        // This is kind of a hack, sorry.
        EncodedValue objectField = 23;

        // if we decide to add builtin support for additional types, they can go here.
    }
}

message NewStruct {
    string typename = 1;
    int32 typeID = 2;
    repeated string fieldNames = 3;
    repeated EncodedValue fields = 4;
}

message StructByID {
    int32 typeID = 1;
    repeated EncodedValue fields = 2;
}

message List {
    repeated EncodedValue elements = 1;
}

message Map {
    repeated Entry entries = 1;
}

// All integral values and booleans in Protobuf use the same varint encoding,
// so this encodes identically for ints, longs, shorts and booleans.
message NumericArray {
    repeated int64 elements = 1;
}

message ByteArrayArray {
    repeated bytes arrays = 1;
}

message ObjectArray {
    repeated EncodedValue objects = 1;
}

Under this EncodedValue scheme, types defined by the server and types defined by the client will use different sets of IDs (though these can refer to the same cached values if they are the same). This is because we intend to add support for asynchronous messages and/or multiplexing of multiple channels of communication over one socket, and this avoids having the server and client race to assign IDs. If IDs were shared, the server would need to send back new IDs when it sent back types the client had not seen before.
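One way to picture the two ID spaces is for each end of a connection to keep two independent tables: one for IDs it assigned itself, and one for IDs the peer assigned. The sketch below is an assumption about how a driver might organize this; `PeerTypeTables` and its methods are hypothetical names, and `Order` is a made-up second type.

```python
class PeerTypeTables:
    """Type tables held by one end of one connection."""

    def __init__(self):
        self.sent = {}      # (typename, field names) -> ID this side assigned
        self.received = {}  # ID the peer assigned -> (typename, field names)

    def id_for_sending(self, typename, field_names):
        """Return (type_id, is_new) for a type this side wants to send."""
        key = (typename, tuple(field_names))
        if key in self.sent:
            return self.sent[key], False
        type_id = len(self.sent) + 1
        self.sent[key] = type_id
        return type_id, True

    def remember_received(self, type_id, typename, field_names):
        """Cache a definition the peer announced under the peer's own ID."""
        self.received[type_id] = (typename, tuple(field_names))


client = PeerTypeTables()
server = PeerTypeTables()

# The client registers User; it gets ID 1 in the client's space.
user_id, _ = client.id_for_sending("User", ["id", "name"])
server.remember_received(user_id, "User", ["id", "name"])

# The server independently registers Order; it also gets ID 1, in its own space.
order_id, _ = server.id_for_sending("Order", ["sku", "qty"])
client.remember_received(order_id, "Order", ["sku", "qty"])
```

Both sides picked ID 1 without coordinating, and neither table is ambiguous, because an ID is only ever interpreted relative to the side that assigned it.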

Type definitions will encode all values that are not primitives or arrays of primitives as Objects of any type, whereas primitives will be type checked. Clients may do their own validation. This is, in significant part, a leaky abstraction due to the way PDX saves values.

...

Whether a client must send all subsequent values of a registered type by ID, or may instead send the full type definition with each value, should be configurable in the handshake.

Considerations

In order to avoid arbitrary object serialization (which can lead to gadget chain exploits), we will probably need to constrain valid types to those registered as DataSerializable, or possibly even only those registered with the ReflectionBasedAutoSerializer. This may also mean that we need a special class of typenames for those types that are put first by a client.

The way that objects are deserialized on the server is dependent on how PDX behaves now.

A driver developer may wish to provide a way for users to register types before sending values. An earlier version of this document described a protocol where types had to be defined in a separate message before the value in which they are first put. That version had a separate list of types for the registration method. Because using the same list of types as EncodedValue amounts to much the same thing as sending a new value, we opted for the method above.

Driver developers will have to make sure that types they want to use in different language clients can be correlated, so package names may or may not make sense. The naming convention is not entirely decided, nor is whether we can register nameless types. It may be wise to reserve a set of names with special meaning ("JSON" perhaps?) and perhaps a set of names that would correspond to types that have no domain class in Java (leading underscore, or just those with no package name?).

If a server sends back a value of a type a client has not registered, the client can send a TypeDefinitionLookupRequest.
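On the receiving side, a cache miss could be handled roughly as follows. The shape of TypeDefinitionLookupRequest is not specified here, so the sketch stands in for it with a hypothetical `lookup` callback that takes a type ID and returns a typename and field names; `decode_struct_by_id` is likewise an illustrative name.

```python
from typing import Callable, Dict, List, Tuple

TypeDef = Tuple[str, Tuple[str, ...]]  # (typename, field names)


def decode_struct_by_id(
    type_id: int,
    values: List[object],
    cache: Dict[int, TypeDef],
    lookup: Callable[[int], TypeDef],
):
    """Pair values with field names, asking the peer for the definition on a miss."""
    if type_id not in cache:
        # Stand-in for a TypeDefinitionLookupRequest round trip.
        cache[type_id] = lookup(type_id)
    typename, field_names = cache[type_id]
    if len(field_names) != len(values):
        raise ValueError(
            f"type {typename!r} has {len(field_names)} fields, got {len(values)} values"
        )
    return typename, dict(zip(field_names, values))


# Hypothetical peer definitions and a lookup that records how often it is called.
lookups = []


def fake_lookup(type_id: int) -> TypeDef:
    lookups.append(type_id)
    return ("User", ("id", "name"))


cache: Dict[int, TypeDef] = {}
first = decode_struct_by_id(7, [1, "alice"], cache, fake_lookup)
second = decode_struct_by_id(7, [2, "bob"], cache, fake_lookup)
```

The definition is fetched once and reused for later values with the same ID on that connection.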

The use of NumericArray for all the integral types is because they all have the same varint encoding and will encode the same way on the wire. It may be advisable to use more restricted types and separate messages to get better typing in the generated Protobuf code.
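The shared encoding can be checked with a small hand-written varint encoder. This is a standalone sketch of Protobuf's base-128 varint scheme as used for `int32`, `int64` and `bool` fields, not code from any driver; `encode_varint` is an illustrative name.

```python
def encode_varint(value: int) -> bytes:
    """Encode an integer the way Protobuf encodes int32/int64/bool values."""
    if value < 0:
        # Negative int32/int64 values are sign-extended to 64 bits.
        value += 1 << 64
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# The same number encodes to the same bytes regardless of the declared width.
as_short = encode_varint(300)
as_long = encode_varint(300)
as_bool = encode_varint(True)
negative = encode_varint(-1)
```

One consequence worth keeping in mind: with this encoding a negative element always occupies ten bytes, whatever the width of the original Java type.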

Type Mappings

Each of the primitives maps to the corresponding Java primitive. Arrays map to arrays of Java primitives. Other fields will encode to the corresponding objects.