ID: IEP-75
Author:
Sponsor:
Created:
Status: ACTIVE


Motivation

Thin clients need a standardized way to serialize data.

In 2.x we used the same Ignite binary format that was used for server communication and data storage. However, in 3.x there are different formats for data storage and transmission (IEP-74 Data Storage), and those formats are not meant to be used by thin clients.

In Ignite 3.0 we want to avoid the cost of writing and supporting our own serialization mechanism.

Non-goals

The thin client protocol itself (handshake, message format, etc.) will be designed in a separate IEP. Here we only discuss a mechanism to serialize user and system data: primitive and compound values, such as cache entries, configuration objects, and so on.

Description

Use the MsgPack format in the Ignite 3.0 thin client protocol.

MsgPack Example

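SQL request (query text + arguments array):
// Assumes msgpack-java (org.msgpack.core.MessagePack, MessageBufferPacker).
MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
packer
        .packString("select * from cars where year > ? and seats = ?") // query text
        .packArrayHeader(2)                                            // two arguments follow
        .packInt(2005)
        .packInt(2);
byte[] bytes = packer.toByteArray();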

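The result is 54 bytes: a 2-byte str 8 header plus 47 bytes of UTF-8 query text, a 1-byte array header, 3 bytes for 2005 (uint 16), and 1 byte for 2 (positive fixint).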

For comparison, the current Ignite binary protocol encodes the same data in 71 bytes and takes about twice as long to do so (see writeSqlQueryIgnite and writeSqlQueryMsgPack in the benchmarks below).
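
To make the 54-byte figure easy to check, here is a minimal sketch that hand-encodes the same request using only the JDK. It is not the msgpack-java implementation; the class and method names (SqlRequestSizeSketch, encodeSqlRequest) are hypothetical, and only the handful of MsgPack formats needed for this message are covered.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative hand-rolled encoder; names are hypothetical, not part of any library. */
class SqlRequestSizeSketch {
    /** Writes a string using the smallest str format that fits (fixstr, str 8, or str 16). */
    static void packString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < 32) {
            out.write(0xa0 | utf8.length);   // fixstr: length in the low 5 bits
        } else if (utf8.length < 256) {
            out.write(0xd9);                 // str 8: one length byte
            out.write(utf8.length);
        } else if (utf8.length < 65536) {
            out.write(0xda);                 // str 16: two length bytes, big-endian
            out.write(utf8.length >>> 8);
            out.write(utf8.length);
        } else {
            throw new IllegalArgumentException("string too long for this sketch");
        }
        out.write(utf8, 0, utf8.length);
    }

    /** Writes a fixarray header (0 to 15 elements only). */
    static void packArrayHeader(ByteArrayOutputStream out, int size) {
        if (size < 0 || size > 15) {
            throw new IllegalArgumentException("only fixarray is supported in this sketch");
        }
        out.write(0x90 | size);
    }

    /** Writes a non-negative int using the smallest unsigned format that fits. */
    static void packInt(ByteArrayOutputStream out, int v) {
        if (v < 0) {
            throw new IllegalArgumentException("negative values are not covered by this sketch");
        }
        if (v < 128) {
            out.write(v);                    // positive fixint: the value is the byte
        } else if (v < 256) {
            out.write(0xcc);                 // uint 8
            out.write(v);
        } else if (v < 65536) {
            out.write(0xcd);                 // uint 16, big-endian
            out.write(v >>> 8);
            out.write(v);
        } else {
            out.write(0xce);                 // uint 32, big-endian
            out.write(v >>> 24);
            out.write(v >>> 16);
            out.write(v >>> 8);
            out.write(v);
        }
    }

    /** Encodes the query text followed by a two-element argument array. */
    static byte[] encodeSqlRequest() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        packString(out, "select * from cars where year > ? and seats = ?");
        packArrayHeader(out, 2);
        packInt(out, 2005);
        packInt(out, 2);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeSqlRequest().length); // prints 54
    }
}
```

Most of the size is the query text itself; the framing around it costs only 7 bytes in this case.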

Why MsgPack

The goal is to find an existing serialization format that satisfies the following requirements:

  • Binary (as opposed to text, like JSON, for performance reasons)
  • Supports nested object graphs
  • Supports primitives, not only objects (for example, an integer or GUID value can be serialized independently)
  • Supports streaming: multiple values one after another in the same buffer / stream
  • Schemaless: any object of any type can be written without prior setup
    • For Table APIs (when the schema is present), we can use field IDs instead of names for performance reasons
  • Can work without classes ("binary mode" in terms of 2.x): servers should be able to inspect the structure in serialized form
  • Extensible (can add custom types)
  • Well-supported implementations in all languages of interest (Java, C#, C++, Python, JavaScript, PHP)
    • With a compatible license
  • Fast and compact
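
As an illustration of the streaming and "binary mode" requirements, the sketch below walks a buffer that holds several MsgPack values back to back and identifies each one from its leading byte, with no schema or user classes involved. It uses only the JDK, handles just a few formats (positive fixint, fixstr, nil, booleans), and the class name StreamInspectSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative reader for a few MsgPack formats; not a complete decoder. */
class StreamInspectSketch {
    /** Describes every value in the buffer, reading them one after another. */
    static List<String> describeAll(byte[] buf) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < buf.length) {
            int b = buf[pos++] & 0xff;       // the first byte identifies the format
            if (b <= 0x7f) {
                out.add("int:" + b);         // positive fixint
            } else if (b >= 0xa0 && b <= 0xbf) {
                int len = b & 0x1f;          // fixstr: length in the low 5 bits
                out.add("str:" + new String(buf, pos, len, StandardCharsets.UTF_8));
                pos += len;
            } else if (b == 0xc0) {
                out.add("nil");
            } else if (b == 0xc2) {
                out.add("bool:false");
            } else if (b == 0xc3) {
                out.add("bool:true");
            } else {
                throw new IllegalArgumentException("format not covered by this sketch: " + b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four independent values in one buffer: 42, "abc", true, nil.
        byte[] stream = {0x2a, (byte) 0xa3, 'a', 'b', 'c', (byte) 0xc3, (byte) 0xc0};
        System.out.println(describeAll(stream)); // prints [int:42, str:abc, bool:true, nil]
    }
}
```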

Comparison

MessagePack
  • Schemaless binary format.
  • Compatible with JSON (can be converted directly to and from JSON, which is an important use case).
  • The most popular of the candidates. High-performance, well-maintained implementations exist for many languages.
  • Battle-tested: used by Tarantool and Redis.
  License: Java: Apache 2.0, C#: MIT, C++: MIT (nlohmann/json), Python: Apache 2.0, JavaScript: MIT, PHP: MIT

CBOR
  • Based on MessagePack.
  • Less popular than MessagePack, fewer implementations, outdated PHP implementation.
  • Standardized (RFC 7049), but MessagePack is simpler.
  • Included in the .NET 5 standard library.
  • "Use MsgPack instead of CBOR": https://diziet.dreamwidth.org/6568.html
  License: Java: Apache 2.0, C#: CC0, C++: MIT, Python: MIT, JavaScript: MIT, PHP: PHP License

FlexBuffers
  • "Schemaless cousin of Google's FlatBuffers". Can be accessed without parsing, copying, or allocation.
  • Can't serialize arbitrary objects at this point (in Java and C#).
  • Relatively new, has not gained traction.

BSON
  • Designed for MongoDB storage and in-memory manipulation, not for network usage, so it is more verbose than MessagePack/CBOR.

UBJSON
  • Seems to be abandoned; implementations (e.g. C#) are not maintained.

Popular formats such as Avro, Thrift, ProtoBuf, and FlatBuffers are not listed because they do not satisfy one or more of the requirements above (for example, being schemaless).

Conclusion

  • MessagePack and CBOR satisfy all requirements (and they are very similar, though not compatible).
  • There seem to be no other contenders.

MessagePack is more widely used and has more mature and well-maintained implementations in all languages of interest.

Benchmarks

  • Code is linked below
  • MsgPack is faster on primitive values in both the write and read benchmarks
  • MsgPack is more compact because integers are always written in the smallest format that fits (variable-width encoding)
  • Ignite is faster on POJOs because the MsgPack benchmark handles objects through the Jackson integration, which is highly configurable but comes at a performance cost.
    • We can develop our own implementation if needed.
    • In C# benchmarks (not included here) MsgPack is 4x faster than Ignite on a similar model class, which shows that a more efficient implementation is possible (see also .NET Serialization Benchmark 2019 Roundup)


 Benchmark                                                       Mode  Cnt         Score        Error  Units
 JmhBinaryMarshallerMsgPackBenchmark.writePrimitivesMsgPackRaw  thrpt   10  16834154.556 ± 85624.143  ops/s
 JmhBinaryMarshallerMsgPackBenchmark.writePrimitivesIgnite      thrpt   10  12702562.838 ± 248094.068  ops/s

 JmhBinaryMarshallerMsgPackBenchmark.writePojoIgnite            thrpt   10  11590924.790 ±  42061.734  ops/s // Full footers
 JmhBinaryMarshallerMsgPackBenchmark.writePojoMsgPack           thrpt   10   5386377.535 ±  33835.097  ops/s // Fields with names
 JmhBinaryMarshallerMsgPackBenchmark.writePojoMsgPack2          thrpt   10   8505961.494 ± 465369.449  ops/s // Fields without names

 JmhBinaryMarshallerMsgPackBenchmark.readPrimitivesIgnite       thrpt   10  19873521.096 ± 545779.558  ops/s
 JmhBinaryMarshallerMsgPackBenchmark.readPrimitivesMsgPack      thrpt   10  29235107.372 ±  85371.004  ops/s

 JmhBinaryMarshallerMsgPackBenchmark.readPojoIgnite             thrpt   10  8437054.066 ± 104476.415  ops/s
 JmhBinaryMarshallerMsgPackBenchmark.readPojoMsgPack            thrpt   10  6292876.474 ±  73356.915  ops/s

 JmhBinaryMarshallerMsgPackBenchmark.writeSqlQueryIgnite        thrpt   10   5756908.336 ±  42079.083  ops/s
 JmhBinaryMarshallerMsgPackBenchmark.writeSqlQueryMsgPack       thrpt   10  12380076.956 ± 150712.634  ops/s

(Ubuntu 20.04, OpenJDK 1.8.0_292, i7-9700K)


Risks and Assumptions

  1. There is no true random access to fields by name in MsgPack: offsets are not stored and values are written sequentially. It is possible, however, to skip values without reading them.
  2. Some types, like UUID and date/time, will require custom handling (e.g. UUID is written as a string by default, which is not optimal). MsgPack allows up to 128 custom types to be defined.
  3. To be able to read user objects separately and efficiently without deserializing them (e.g. key and value in a put operation), we'll have to wrap them in one of the following ways:
    1. As a byte array (MsgPack bin format): "MsgPack within MsgPack".
    2. As a custom MsgPack type with the size in the header.
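
Risks 2 and 3 can be sketched with the JDK alone. The example below writes a UUID as a fixext 16 value (18 bytes, versus 38 bytes for the 36-character string form with a str 8 header) and wraps an already-serialized payload in a bin 8 value so that a reader can step over it using only the length byte. The extension type code 3 is an arbitrary assumption, not something this proposal defines, and the class and method names are hypothetical. Payloads longer than 255 bytes would need bin 16 or bin 32, which the sketch omits.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Illustrative encoders for a custom UUID type and for "MsgPack within MsgPack". */
class RisksSketch {
    /** Hypothetical application-defined extension type code for UUID (valid range is 0..127). */
    static final byte UUID_EXT_TYPE = 3;

    /** Encodes a UUID as fixext 16: marker 0xd8, type code, 16 bytes of big-endian data. */
    static byte[] packUuid(UUID id) {
        return ByteBuffer.allocate(18)
                .put((byte) 0xd8)
                .put(UUID_EXT_TYPE)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    /** Wraps serialized bytes in a bin 8 value: marker 0xc4, one length byte, payload. */
    static byte[] wrapBin8(byte[] payload) {
        if (payload.length > 255) {
            throw new IllegalArgumentException("bin 16 / bin 32 are not covered by this sketch");
        }
        return ByteBuffer.allocate(2 + payload.length)
                .put((byte) 0xc4)
                .put((byte) payload.length)
                .put(payload)
                .array();
    }

    /** Returns the position just past the bin 8 value at pos, without touching its payload. */
    static int skipBin8(byte[] buf, int pos) {
        if ((buf[pos] & 0xff) != 0xc4) {
            throw new IllegalArgumentException("expected a bin 8 value at position " + pos);
        }
        return pos + 2 + (buf[pos + 1] & 0xff);
    }

    public static void main(String[] args) {
        byte[] key = wrapBin8(packUuid(UUID.randomUUID()));
        System.out.println(key.length);           // prints 20: 2-byte bin header + 18-byte ext value
        System.out.println(skipBin8(key, 0));     // prints 20: position right after the wrapped key
    }
}
```

The same skipping idea applies to option 2 (a custom type with the size in the header); the only difference is which marker and type code precede the length.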

Discussion Links

Reference Links

Tickets
