Adding support for serialized values in formats other than JSON

 

We should do three things to allow clients to send objects in more efficient formats: add a new customObjectResult option to EncodedValue, add a way for clients to request the serialization format they want, and add a way to register a custom serialization format on the server.

Add customObjectResult as an option in EncodedValue

We want clients of the new protocol to be able to send and receive data that is serialized using PDX. We could just add a pdxObjectResult as one of the options in our EncodedValue protobuf message. But instead let's add a customObjectResult and provide a way for the user to serialize a value however they want. PDX can just be one of the options with this mechanism.

message EncodedValue {
    oneof value {
        bytes binaryResult = 8;
        string stringResult = 9;
        string jsonObjectResult = 10;
        google.protobuf.NullValue nullResult = 11;
        bytes customObjectResult = 12;
    }
}
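
As a rough sketch, a client using the generated Java bindings might populate this field as follows. The builder API assumes whatever protoc generates for the message above, and the serializer call is a stand-in for the client's own encoding logic.

// Sketch only: fill customObjectResult with bytes produced by the client's own serializer.
byte[] customBytes = mySerializer.toBytes(myDomainObject); // hypothetical client-side serializer

EncodedValue value = EncodedValue.newBuilder()
    .setCustomObjectResult(com.google.protobuf.ByteString.copyFrom(customBytes))
    .build();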

 

Provide a way to ask for a specific serialization format

We currently always send back PDX objects as JSON values. Because we will now have multiple formats, we need to add a way for clients to request what format they want. We will introduce a new message from the client to control the format that values are sent in. Here again, we could just use an enum with JSON and PDX. But in order to support custom formats, let's make it a string so that users can provide their own serialization formats. TODO - should we actually make this a field on some sort of handshake message? Right now we just have an optional AuthenticationRequest.

message SetCustomValueFormat {
    string format = 1; // Acceptable strings are JSON, PDX, or the ID of any CustomValueSerializer that is registered on the server
}
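
For illustration, a Java client might send this right after the connection handshake. The builder and the delimited framing below are only a sketch, assuming generated bindings for the message above and an output stream the client already has open.

// Sketch only: ask the server to send values back in PDX form.
SetCustomValueFormat formatRequest = SetCustomValueFormat.newBuilder()
    .setFormat("PDX") // or "JSON", or the getID() of a CustomValueSerializer registered on the server
    .build();
formatRequest.writeDelimitedTo(socketOutputStream); // socketOutputStream is assumed to exist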

 

Provide a way to register a serialization format on the server

If users want to serialize values their own way, we need a way for them to plug in their own serialization code on the server. We can provide a CustomValueSerializer like the one below. The easiest way for us to support a plugin would be to just use Java's ServiceLoader mechanism, where the user provides an implementation of the interface and puts a reference to that under META-INF/services in their jar. This serializer will be invoked on the server whenever we are receiving or sending an EncodedValue. We will call this serializer even for primitive objects like Integer.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface CustomValueSerializer {
  /** @return the format ID that clients send in SetCustomValueFormat to select this serializer */
  String getID();

  /**
    * @return true if the object was serialized, false if not. If this method returns false the server will
    * fall back on sending the object as one of the primitives in EncodedValue (int, string, jsonObjectResult, ...)
    */
  boolean serialize(Object value, DataOutput out) throws IOException;

  Object deserialize(DataInput in) throws IOException;
}
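
As a sketch of what a user-provided plugin could look like under this proposal, here is a hypothetical implementation plus the ServiceLoader registration that would ship in the user's jar. The class name and the choice of Java serialization as the format are illustrative only.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class JavaSerializationValueSerializer implements CustomValueSerializer {
  @Override
  public String getID() {
    return "JAVA"; // the string a client would send in SetCustomValueFormat
  }

  @Override
  public boolean serialize(Object value, DataOutput out) throws IOException {
    if (!(value instanceof Serializable)) {
      return false; // let the server fall back to the primitives in EncodedValue
    }
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
      oos.writeObject(value);
    }
    byte[] payload = bytes.toByteArray();
    out.writeInt(payload.length);
    out.write(payload);
    return true;
  }

  @Override
  public Object deserialize(DataInput in) throws IOException {
    byte[] payload = new byte[in.readInt()];
    in.readFully(payload);
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(payload))) {
      return ois.readObject();
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
  }
}

The jar would also contain a provider-configuration file under META-INF/services, named after the fully qualified CustomValueSerializer interface, whose single line is the fully qualified name of the implementation class above.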

Registering PDX types

Clients that send PDX values need to be able to request PDX type IDs from the server. We can add new messages to request a PDX type ID. But until PDXType and its serialization format are actually public, maybe we should just make this an internal function call, not a new message in the protocol? Since clients that support PDX need to support DataSerializable, we will just have them serialize a PDXType as a DataSerializable, rather than defining a PdxType in protobuf.

message GetPdxTypeIdRequest {
  bytes serializedPDXType = 1;
}
message GetPdxTypeIdResponse {
  int32 id = 1;
}
message GetPdxTypeRequest {
  int32 id = 1;
}
message GetPdxTypeResponse {
  bytes serializedPDXType = 1;
}
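
A rough sketch of the round trips this implies, from the client's side. The sendAndReceive call and the PDXType encode/decode helpers are assumptions, standing in for the client's transport and its existing DataSerializable support; the message builders assume protoc-generated Java bindings for the messages above.

// Sketch only: register a PDX type, then later fetch a type definition by ID.
GetPdxTypeIdRequest registerRequest = GetPdxTypeIdRequest.newBuilder()
    .setSerializedPDXType(com.google.protobuf.ByteString.copyFrom(encodePdxType(myPdxType)))
    .build();
GetPdxTypeIdResponse registerResponse = sendAndReceive(registerRequest); // hypothetical transport call
int typeId = registerResponse.getId();

// When a received value references a type ID the client has not seen yet:
GetPdxTypeRequest lookupRequest = GetPdxTypeRequest.newBuilder().setId(typeId).build();
GetPdxTypeResponse lookupResponse = sendAndReceive(lookupRequest);
byte[] typeBytes = lookupResponse.getSerializedPDXType().toByteArray();
// decodePdxType(...) would read these bytes using the client's DataSerializable support.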

8 Comments

  1. I see value in having a specific PDX option on EncodedValue.  If I want to use a custom format I could use binaryResult.  It does seem reasonable for the client to supply the supported response types (e.g. JSON vs PDX) during handshake negotiation.

    I'm not quite following why I need to use CustomValueSerializer?  Is that just to convert internally to PDX?

    I'd like to better understand the relationship between DataSerializable and PDX.  I'm wondering if it makes sense to expose a much narrower PDX_V2 subset that doesn't require 4 different string encodings for example.  Of course there are compatibility considerations but maybe that could be dealt with.

    How do you see cache operations working with PDX types?  How does a client obtain a PDX type definition?  Can it be inlined into a get response or do I need to send another message?

     

    1. I'm uneasy about this notion of a PDX_V2.  Isn't one of the requirements to support any and all existing data in the new client driver?  The client may encounter any of those 4 string encodings, for instance.

      PDX is a specific serialization type in DataSerializer.  When deserializing an object DataSerializer runs into the type code for a PDXInstance and deserializes it as a PDXInstanceImpl.  Values stored in the PDXInstance are in DataSerializer serialized form, so you may encounter a String in one of the DataSerializer serialized forms, a DataSerializable object, or even a java-serialized object.  In order to implement PDX you must first implement much of DataSerializer/InternalDataSerializer.

      The new client can omit implementing DataSerializableFixedID, backward-compatibility, DataSerializable (if it's not being used by applications), serialization white-listing, etc.

       

      1. Forcing implementers to implement DataSerializable and PDX is not acceptable. Having spent many hours refactoring it in the C++ client, I can say it is a complicated beast. There are too many primitive types and too many string encodings. One of the primary goals of this protocol is to remove barriers to implementing clients in other languages. If I have to implement all of PDX and DataSerializer in those languages, including the non-standard Java Modified UTF-8 conversions (also ASCII and UTF-16 support), then the barrier to a new client just increased significantly.

        Why not decouple the on-wire representation of PDX from the on-server representation? The two need not be the same. The encoding of the fields for PDX types could be reduced to a handful of primitives (variable length int, float, double, boolean) plus UTF-8 (standard) strings. The server would transcode this form to and from the on-server form of PDX.

        Something as simple as a sequence of variable length int for field id (maps directly to the field id in the PDXType metadata), tag for encoded value type (var len int, float, bool-true, bool-false, null, etc., utf-8, pdx), encoded value.

         

        1. Decoupling is certainly an option and is one reason we started with JSON instead of PDX.  We're converting PDX instances into JSON and vice-versa already.  Performance is less than optimal though, so we want to keep the server from doing any more work than necessary.

        2. The primary reason for exposing PDX at all is performance.   We already support one on-wire format for PDX:  JSON.  It's slow because it has to transcode on the server.  

          Unless we can support an almost zero copy code path from the network into heap I don't think we can hit performance expectations.

          1. It is slow because you are transcoding to and from text with embedded metadata (repeats field names, colons, quotes, braces, commas, etc.). The GC cost of text parsing and concatenation is very high. Binary to binary transcoding without repeated metadata will be much faster than that of JSON. Certainly not as fast as not transcoding. Certainly better than pushing PDX and DataSerializer on clients.

             

    2. Yes, the CustomValueSerializer lets you convert from your custom serialization format into something the server side understands (PDX, Java Objects, etc.). So it would allow you to query that data and manipulate it with functions, as opposed to binaryResult, which is just opaque bytes.

      I was thinking the client would have to send a second message to fetch a PDXType definition.

  2. I do agree with Jake that it's worth trying some other binary format that is easier to implement but better than JSON. That's part of why this proposal is introducing a way to plug in your own format and transcode on the server, so that we (or our users) can experiment with different formats. Personally I would like to try using protobuf structs, which are basically a binary representation of JSON. Given that we're already forcing protobuf on our client writers it seems like a good fit.

     

    Regarding PDX_V2 - if we want to be compatible with old clients and we want to use a new format, we need a transcoding layer somewhere. This proposal is sneaking in a way to transcode between the new client and the servers, leaving the data in the servers in the existing format. We could certainly investigate a different approach where servers keep the data in the new format and do the transcoding for old clients/WAN Sites/Peers/Persistent data. I would suggest that if we do want to go that route, we should keep the goal of keeping the new format pluggable for easier experimentation.