Jerónimo López

At Clarity AI, we generate batches of data that our application has to load and process to show our clients the social impact information of many companies. In terms of volume it is not Big Data, but it is enough to be a problem to read and load efficiently in processes that serve online users.

The information can be stored in a database or as files, serialized in a standard format and with a schema agreed with your Data Engineering team. Depending on your information and requirements, it can be something as simple as CSV, XML, or JSON, a Big Data format such as Parquet, Avro, ORC, or Arrow, or a message serialization format like Protocol Buffers, FlatBuffers, MessagePack, Thrift, or Cap’n Proto.

My idea is to analyze some of these systems over a series of articles, using JSON as the reference, so we can compare them all with something everyone knows.

To allow anyone to reach their own conclusions according to their requirements, I will analyze different technical aspects: creation time, size of the resulting file, size of the compressed file (gzip), memory needed to create the file, size of the libraries used, deserialization time, and memory needed to parse and access the data.

In my case, because my access pattern is Write Once Read Many, in my final selection read factors will take preference over write factors.

Data model

Given this data model of organizations and their attributes, using new Java records:

record Org(String name, String category, String country, Type type,
           List<Attr> attributes) { }

record Attr(String id, byte quantity, byte amount, boolean active, 
            double percent, short size) { }

enum Type { FOO, BAR, BAZ }

I will simulate a heavy scenario, where each Organization will randomly have between 40 and 70 different Attributes, and in total, we will have about 400K organizations. The values of the attributes are also randomized.
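The dataFactory used in the snippets below is not shown in the article; a minimal sketch of what it could look like, assuming a fixed pool of categories and countries and purely random attribute values, might be:

// Hypothetical data factory (uses java.util.ArrayList, List, Random, UUID):
// it only sketches the shape described above, 40 to 70 attributes per organization
class DataFactory {

    private static final String[] CATEGORIES = {"ENERGY", "HEALTH", "TECH", "FINANCE", "RETAIL"};
    private static final String[] COUNTRIES = {"ES", "US", "FR", "DE", "UK"};

    private final Random random = new Random();

    List<Org> getOrganizations(int count) {
        List<Org> organizations = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int numAttributes = 40 + random.nextInt(31); // between 40 and 70
            List<Attr> attributes = new ArrayList<>(numAttributes);
            for (int j = 0; j < numAttributes; j++) {
                attributes.add(new Attr(UUID.randomUUID().toString(),
                        (byte) random.nextInt(100), (byte) random.nextInt(100),
                        random.nextBoolean(), random.nextDouble(),
                        (short) random.nextInt(1000)));
            }
            organizations.add(new Org("Organization " + i,
                    CATEGORIES[random.nextInt(CATEGORIES.length)],
                    COUNTRIES[random.nextInt(COUNTRIES.length)],
                    Type.values()[random.nextInt(Type.values().length)],
                    attributes));
        }
        return organizations;
    }
}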


JSON

It is the data interchange format par excellence and the de facto standard in web services communication. Although it was defined as a format in the early 2000s, it was not until 2013 that Ecma published the first version of the standard, which became an international standard in 2017.

With a simple syntax, it does not need to predefine the schema to be able to parse it. Because it is a plain text-based format, it is human readable and there are a lot of libraries to process it in all languages.

Serialization

Using a library like Jackson and without special annotations, the code is very simple:

var organizations = dataFactory.getOrganizations(400_000);

ObjectMapper mapper = new ObjectMapper();
try (var os = new FileOutputStream("/tmp/organizations.json")) {
  mapper.writeValue(os, organizations);
}

Metrics:

  • Serialization time: 11,718 ms
  • File size: 2,457 MB
  • Compressed file size: 525 MB
  • Memory required: because we are serializing directly to OutputStream, it consumes nothing apart from the required internal IO buffers.
  • Library size (jackson-xxx.jar): 1,956,679 bytes
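
A minimal sketch of how the timing and the gzip figure above could be taken (this is not the actual benchmark harness, which is not shown in the article):

// Measure elapsed time around the same Jackson call as above
long start = System.nanoTime();
try (var os = new FileOutputStream("/tmp/organizations.json")) {
    mapper.writeValue(os, organizations);
}
System.out.println("Serialization time: " + (System.nanoTime() - start) / 1_000_000 + " ms");

// Re-compress the resulting file (java.util.zip.GZIPOutputStream) to obtain the compressed size
try (var is = new FileInputStream("/tmp/organizations.json");
     var gz = new GZIPOutputStream(new FileOutputStream("/tmp/organizations.json.gz"))) {
    is.transferTo(gz);
}
System.out.println("File size: " + Files.size(Path.of("/tmp/organizations.json")) + " bytes");
System.out.println("GZ size:   " + Files.size(Path.of("/tmp/organizations.json.gz")) + " bytes");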

Deserialization

It is difficult to define how to measure deserialization in this analysis. The objective is to recover the state of the entities from their binary representation, but which entities? The original class, or the one provided by the tool with a similar interface?

To exploit the capabilities and strengths of each tool, we will keep the representation of the entity generated by each library. In the case of JSON it will be the original class, but in other libraries it will be a different class.

Jackson makes it very easy again and solves the problem with only 3 lines of code:

try (InputStream is = new FileInputStream("/tmp/organizations.json")) {
  ObjectMapper mapper = new ObjectMapper();
  List<Org> organizations = mapper.readValue(is, new TypeReference<List<Org>>() {});
  ....
}

  • Deserialization time: 20,410 ms
  • Memory required: because it reconstructs the original object structures, they consume 2,193 MB

Protocol Buffers

Protocol Buffers is the system that Google developed internally to serialize data in a way that is both CPU and storage efficient, especially compared with how it was done in those days with XML. In 2008 it was released under a BSD license.

It is based on predefining the format the data will have through an IDL (Interface Definition Language) and, from it, generating the source code that is able to both write and read files with that data. The producer and the consumer must somehow share the format defined in the IDL.

The format is flexible enough to support adding new fields and deprecating existing fields without breaking compatibility.

The information generated after serialization is an array of bytes, unreadable to humans.

Support for different programming languages is provided by the existence of a code generator for each language. If a language is not officially supported you can always find an implementation from the community.

Google has standardized it and made it the basis of its server-to-server communication mechanism, gRPC, instead of the usual REST with JSON.

Although the documentation discourages its use with large datasets and proposes splitting large collections into a “concatenation” of individually serialized objects, I am going to evaluate it anyway.
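For reference, that “concatenation” approach maps to the delimited read/write methods of the generated classes (introduced in the next sections). A minimal sketch, where toProtoOrganization is a hypothetical helper doing the same copy as the builder code shown later:

try (var os = new FileOutputStream("/tmp/organizations.delimited")) {
    for (Org org : organizations) {
        // toProtoOrganization is a hypothetical helper that copies an Org into
        // the generated Organization class, like the builder code below
        toProtoOrganization(org).writeDelimitedTo(os);
    }
}

try (var is = new FileInputStream("/tmp/organizations.delimited")) {
    Organization org;
    // parseDelimitedFrom returns null when the end of the stream is reached
    while ((org = Organization.parseDelimitedFrom(is)) != null) {
        // process each Organization as soon as it is read, without loading the whole file
    }
}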

IDL and code generation

The file with the schema could be this:

syntax = "proto3";

package com.jerolba.xbuffers.protocol;

option java_multiple_files = true;
option java_package = "com.jerolba.xbuffers.protocol";
option java_outer_classname = "OrganizationsCollection";

message Organization {
  string name = 1;
  string category = 2;
  OrganizationType type = 3;
  string country = 4;
  repeated Attribute attributes = 5;

  enum OrganizationType {
    FOO = 0;
    BAR = 1;
    BAZ = 2;
  }

  message Attribute {
    string id = 1;
    int32 quantity = 2;
    int32 amount = 3;
    int32 size = 4;
    double percent = 5;
    bool active = 6;
  }

}

message Organizations {
  repeated Organization organizations = 1;
}

To generate all Java classes you need to install the compiler protoc and execute it with some parameters referencing where the IDL file is located and the target path of the generated files:

protoc --java_out=./src/main/java -I=./src/main/resources ./src/main/resources/organizations.proto

You have all the instructions here and you can download the compiler utility from here, but I prefer to use Docker directly, with an image ready to execute the command:

 docker run --rm -v $(pwd):$(pwd) -w $(pwd) znly/protoc --java_out=./src/main/java -I=./src/main/resources ./src/main/resources/organizations.proto

Serialization

Protocol Buffers does not serialize your POJOs directly; instead, you need to copy the information into the objects generated by the schema compiler.

The code needed to serialize the information from the POJOs would look like this:

var organizations = dataFactory.getOrganizations(400_000);

var orgsBuilder = Organizations.newBuilder();
for (Org org : organizations) {
    var organizationBuilder = Organization.newBuilder()
        .setName(org.name())
        .setCategory(org.category())
        .setCountry(org.country())
        .setType(OrganizationType.forNumber(org.type().ordinal()));
    for (Attr attr : org.attributes()) {
        var attribute = Attribute.newBuilder()
            .setId(attr.id())
            .setQuantity(attr.quantity())
            .setAmount(attr.amount())
            .setActive(attr.active())
            .setPercent(attr.percent())
            .setSize(attr.size())
            .build();
        organizationBuilder.addAttributes(attribute);
    }
    orgsBuilder.addOrganizations(organizationBuilder.build());
}
Organizations orgsBuffer = orgsBuilder.build();
try (var os = new FileOutputStream("/tmp/organizations.protobuffer")) {
  orgsBuffer.writeTo(os);
}

The code is verbose, but simple. If for some reason you decided to make your business logic work directly with the Protocol Buffers generated classes, all that copy code would be unnecessary.

  • Serialization time: 5,823 ms
  • File size: 1,044 MB
  • Compressed file size: 448 MB
  • Memory required: instantiating all these intermediate objects in memory before serializing them requires 1,315 MB
  • Library size (protobuf-java-3.16.0.jar): 1,675,739 bytes
  • Size of generated classes: 41,229 bytes

Deserialization

It is also quite straightforward; it is enough to pass an InputStream to rebuild the whole object graph in memory:

try (InputStream is = new FileInputStream("/tmp/organizations.protobuffer")) {
    Organizations organizations = Organizations.parseFrom(is);
    .....
}

The objects are instances of the classes generated from the schema, not the original records.
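
If the business logic needs the original records rather than the generated classes, the copy has to be undone by hand. A minimal sketch of that reverse mapping, assuming the schema shown earlier:

// Copy the deserialized Protocol Buffers objects back into the original records
List<Org> orgs = new ArrayList<>();
for (Organization org : organizations.getOrganizationsList()) {
    List<Attr> attrs = new ArrayList<>();
    for (Organization.Attribute attr : org.getAttributesList()) {
        attrs.add(new Attr(attr.getId(), (byte) attr.getQuantity(), (byte) attr.getAmount(),
                attr.getActive(), attr.getPercent(), (short) attr.getSize()));
    }
    orgs.add(new Org(org.getName(), org.getCategory(), org.getCountry(),
            Type.values()[org.getType().getNumber()], attrs));
}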

  • Deserialization time: 4,535 ms
  • Memory required: reconstructing the object structures defined by the schema takes up 2,710 MB

Analysis and impressions

                              JSON         Protocol Buffers
Serialization time            11,718 ms    5,823 ms
File size                     2,457 MB     1,044 MB
GZ file size                  525 MB       448 MB
Memory serializing            N/A          1.29 GB
Deserialization time          20,410 ms    4,535 ms
Memory deserializing          2,193 MB     2,710 MB
JAR library size              1,910 KB     1,636 KB
Size of generated classes     N/A          40 KB

From the data and from what I have been able to see by playing with the formats, we can conclude:

  • JSON is undoubtedly slower and heavier, but it is a very comfortable and convenient system.
  • Protocol Buffers is a good choice: fast serialization and deserialization, and with compact and compressible files. Its API is simple and intuitive, and the format is widely used, becoming a market standard with multiple use cases.
  • Both formats are all or nothing: you need to deserialize all the information to access any part of it, while other tools let you access any element without having to traverse and parse all the information that precedes it.
  • Protocol Buffers uses 32-bit integers for all the smaller integer types (int, short, byte), but when serializing it encodes each value with as few bytes as that value needs, saving space.
  • For the same reason, when Protocol Buffers generates the classes from the IDL, it defines all of them as 32-bit integers and deserializes the values into ints. That is why the memory consumption of the deserialized objects is 23% higher than with JSON: what it gains by compressing integers on disk it loses in memory consumption.
  • For scalar values, Protocol Buffers does not support null. If a value is not present, it takes the default value of the corresponding primitive (0, 0.0, or false), while null Strings are deserialized to “” (illustrated in the snippet after this list). This article proposes different strategies to simulate null values.
  • Although in the original object graph the names of the countries or categories are Strings that are only present in memory once, when serialized to a binary format they will appear as many times as they are referenced, and when deserialized you will have as many instances as references you had. That is why, in both cases, the memory consumed when deserializing is roughly twice the memory occupied by the objects originally serialized¹.
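
A minimal illustration of that default-value behavior, using the Attribute class generated from the schema:

// An Attribute built without setting any field returns primitive defaults and "", never null
var empty = Organization.Attribute.newBuilder().build();
empty.getQuantity();  // 0
empty.getAmount();    // 0
empty.getPercent();   // 0.0
empty.getActive();    // false
empty.getId();        // "" (empty string, not null)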

¹ Since Java 8u20, thanks to JEP 192, G1 has an option to de-duplicate Strings during garbage collection. But it is disabled by default, and when it is enabled we have no control over when it runs, so we cannot rely on that optimization to reduce the memory used after deserialization.