Apache Parquet is a columnar storage format optimized for analytical workloads, although it can also be used to store any type of structured data, covering many different use cases.
One of its most notable features is the ability to efficiently compress data using different compression techniques at two stages of its process. This reduces storage costs and improves reading performance.
This article explains file compression in Parquet for Java, provides usage examples, and analyzes its performance.
Compression Techniques
Unlike traditional row-based storage formats, Parquet uses a columnar approach, which allows it to apply more specific and effective compression techniques that exploit data locality and the redundancy of values of the same type.
Parquet writes information in binary and applies compression at two distinct levels, using different techniques at each:
- While writing the values of a column, it adaptively chooses the encoding type based on the characteristics of the initial values: Dictionary, Run-Length Encoding / Bit-Packing, Delta Encoding, etc.
- Every time a certain number of bytes is reached (1 MB by default), a page is formed and the binary block is compressed with the algorithm configured by the programmer (none, GZip, Snappy, LZ4, etc.).
Although the compression algorithm is configured at the file level, the encoding of each column is automatically selected using an internal heuristic (at least in the parquet-java implementation).
The performance of different compression techniques will depend heavily on your data, so there’s no silver bullet that guarantees the fastest processing time and lowest space consumption. You will need to execute your own tests.
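For illustration, both levels can be influenced from the writer builder. The following sketch uses the Avro binding and the Organization class that appear later in this article; the page size and dictionary settings shown match parquet-java's defaults, so setting them explicitly only makes the two stages visible, while the codec is the algorithm applied to every page.
ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    .withPageSize(1024 * 1024)                          // pages are the unit of block compression (1 MB by default)
    .withDictionaryEncoding(true)                       // allow dictionary encoding; other encodings are still chosen automatically
    .withCompressionCodec(CompressionCodecName.SNAPPY)  // block compression applied to each page
    .build();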
Code
The configuration is straightforward, and it only needs to be explicitly set when writing. When reading a file, Parquet discovers which compression algorithm was used and applies the corresponding decompression algorithm.
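As an illustration (assuming the Organization Avro class used in the writing examples below and an inputFile of type org.apache.parquet.io.InputFile), a reader never mentions a codec; the decompression algorithm is resolved from the file metadata:
try (ParquetReader<Organization> reader = AvroParquetReader.<Organization>builder(inputFile).build()) {
    Organization organization;
    while ((organization = reader.read()) != null) {
        // process each record; no compression configuration is needed when reading
    }
}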
Configuring the Algorithm or Codec
In Carpet, as well as in Parquet with Protocol Buffers or Avro, you just need to call the withCompressionCodec method of the builder to configure the compression algorithm:
Carpet
CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
.withCompressionCodec(CompressionCodecName.ZSTD)
.build();
Avro
ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
.withSchema(new Organization().getSchema())
.withCompressionCodec(CompressionCodecName.ZSTD)
.build();
Protocol Buffers
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
.withMessage(Organization.class)
.withCompressionCodec(CompressionCodecName.ZSTD)
.build();
The value must be one of those available in the CompressionCodecName enum: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, and LZ4_RAW (LZ4 is deprecated; use LZ4_RAW instead).
Compression Level
Some compression algorithms offer a way to fine-tune the compression level. This level is usually related to the effort the algorithm spends finding repetition patterns: the higher the level, the more time and memory the compression process requires.
Although each codec comes with a default level, it can be modified using Parquet's generic configuration mechanism, with each codec using a different key.
Additionally, the value to choose is not standard and depends on each codec, so you must refer to the documentation of each algorithm to understand what each level offers.
ZSTD
To set the level, the ZSTD codec declares the constant ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL.
Possible values range from 1 to 22, and the default value is 3.
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
.withMessage(Organization.class)
.withCompressionCodec(CompressionCodecName.ZSTD)
.config(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6")
.build();
LZO
To set the level, the LZO codec declares the constant LzoCodec.LZO_COMPRESSION_LEVEL_KEY.
Possible values are 1 to 9, 99, and 999, and the default value is 999.
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
.withMessage(Organization.class)
.withCompressionCodec(CompressionCodecName.LZO)
.config(LzoCodec.LZO_COMPRESSION_LEVEL_KEY, "99")
.build();
GZIP
GZIP does not declare any constant; you have to use the string "zlib.compress.level" directly. Possible values range from 0 to 9, and the default value is 6.
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
.withMessage(Organization.class)
.withCompressionCodec(CompressionCodecName.GZIP)
.config("zlib.compress.level", "9")
.build();
Performance Tests
To analyze the performance of different compression algorithms, I will use two public datasets containing different types of data:
- New York City Taxi Trips: with a large number of numeric values and few string values in a few columns. It has 23 columns and contains 19.6 million records.
- Cohesion Projects of the Italian Government: many columns with float values and a large quantity and variety of text strings. It has 91 columns and contains 2 million rows.
I will evaluate some of the compression algorithms enabled in Parquet Java: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD, LZ4_RAW.
I will use Carpet with the default configuration that parquet-java brings and the default compression level of each algorithm.
You can find the source code on GitHub, and the tests were done on a laptop with an AMD Ryzen 7 4800HS CPU and JDK 17.
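The measurements follow a pattern similar to the sketch below (the loadDataset helper and the Organization Avro records are illustrative placeholders, not the actual benchmark code from the repository):
// Write the whole dataset with the codec under test and measure the elapsed time
List<Organization> records = loadDataset();  // hypothetical helper that parses the CSV into records
long start = System.nanoTime();
try (ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
        .withSchema(new Organization().getSchema())
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build()) {
    for (Organization record : records) {
        writer.write(record);
    }
}
double writeSeconds = (System.nanoTime() - start) / 1_000_000_000.0;

// Read back all columns of every record and measure the elapsed time
start = System.nanoTime();
try (ParquetReader<Organization> reader = AvroParquetReader.<Organization>builder(inputFile).build()) {
    while (reader.read() != null) {
        // consume every record
    }
}
double readSeconds = (System.nanoTime() - start) / 1_000_000_000.0;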
File Size
To understand how each compression algorithm performs, we will take the equivalent CSV file as a reference.
Format | gov.it | NYC Taxis |
---|---|---|
CSV | 1761 MB | 2983 MB |
UNCOMPRESSED | 564 MB | 760 MB |
SNAPPY | 220 MB | 542 MB |
GZIP | 146 MB | 448 MB |
ZSTD | 148 MB | 430 MB |
LZ4_RAW | 209 MB | 547 MB |
LZO | 215 MB | 518 MB |
In both tests, compression with GZip and Zstandard stands out as the most efficient.
Using only Parquet encoding techniques, the file size can be reduced to 25-32% of the original CSV size. Applying additional compression reduces it to between 9% and 15% of the CSV size.
Writing
How much overhead does compressing the information bring?
If we write the same information three times and average the seconds, we get:
Algorithm | gov.it (seconds) | NYC Taxis (seconds) |
---|---|---|
UNCOMPRESSED | 25.0 | 57.9 |
SNAPPY | 25.2 | 56.4 |
GZIP | 39.3 | 91.1 |
ZSTD | 27.3 | 64.1 |
LZ4_RAW | 24.9 | 56.5 |
LZO | 26.0 | 56.1 |
SNAPPY, LZ4_RAW, and LZO achieve times similar to not compressing at all, while ZSTD adds a small overhead. GZIP performs the worst, increasing the writing time by more than 50%.
Reading
Reading the files is faster than writing since fewer computations are needed.
Reading all the columns from the file, the times in seconds are:
Algorithm | gov.it (seconds) | NYC Taxis (seconds) |
---|---|---|
UNCOMPRESSED | 11.4 | 37.4 |
SNAPPY | 12.5 | 39.9 |
GZIP | 13.6 | 40.9 |
ZSTD | 13.1 | 41.5 |
LZ4_RAW | 12.8 | 41.6 |
LZO | 13.1 | 41.1 |
Reading times are close to not compressing the information, and the overhead of decompression is between 10% and 20%.
Conclusion
No algorithm stands out significantly over the others in reading and writing times; all are within a similar range. In most cases, the storage (and transmission) savings compensate for the time penalty of compressing the information.
In these two use cases, the deciding factor for choosing one or another would probably be the compression ratio achieved, where ZSTD and GZIP stand out (although GZIP comes with a poor writing time).
Each algorithm has its strengths, so the best option is to test with your data, considering which factor is more important:
- Minimizing storage usage, because you store a lot of data that you rarely use.
- Minimizing file generation time.
- Minimizing reading time, since files are read many times.
As with everything in life, it's a trade-off, and you will have to see which factor pays off the most in your case. By default, if you configure nothing, Carpet compresses with Snappy.
Implementation Details
The value must be one of those available in the CompressionCodecName enum. Associated with each enum value is the name of the class implementing the algorithm:
public enum CompressionCodecName {
UNCOMPRESSED(null, CompressionCodec.UNCOMPRESSED, ""),
SNAPPY("org.apache.parquet.hadoop.codec.SnappyCodec", CompressionCodec.SNAPPY, ".snappy"),
GZIP("org.apache.hadoop.io.compress.GzipCodec", CompressionCodec.GZIP, ".gz"),
LZO("com.hadoop.compression.lzo.LzoCodec", CompressionCodec.LZO, ".lzo"),
BROTLI("org.apache.hadoop.io.compress.BrotliCodec", CompressionCodec.BROTLI, ".br"),
LZ4("org.apache.hadoop.io.compress.Lz4Codec", CompressionCodec.LZ4, ".lz4hadoop"),
ZSTD("org.apache.parquet.hadoop.codec.ZstandardCodec", CompressionCodec.ZSTD, ".zstd"),
LZ4_RAW("org.apache.parquet.hadoop.codec.Lz4RawCodec", CompressionCodec.LZ4_RAW, ".lz4raw");
...
Parquet uses reflection to instantiate the specified class, which must implement the CompressionCodec interface. If you look at its source code, you will see that it is part of the Hadoop project, not Parquet, which shows how coupled the Java implementation of Parquet is with Hadoop.
To use one of the codecs, you must ensure you have added a JAR containing its implementation as a dependency.
Not all implementations are available among the transitive dependencies you get when adding parquet-java, or you may have excluded Hadoop dependencies too aggressively.
The org.apache.parquet:parquet-hadoop dependency includes the implementations of SnappyCodec, ZstandardCodec, and Lz4RawCodec, and transitively imports the snappy-java, zstd-jni, and aircompressor dependencies with the actual implementations of the three algorithms.
The org.apache.hadoop:hadoop-common dependency includes the implementation of GzipCodec.
Where are the implementations of BrotliCodec and LzoCodec? They are not in any of the Parquet or Hadoop dependencies, so if you use them without adding extra dependencies, your application will not work with files compressed in those formats.
- To support LZO, you need to add the org.anarres.lzo:lzo-hadoop dependency to your pom or gradle files.
- Even more complex is the case of Brotli: the dependency is not in Maven Central, and you must also add the JitPack repository.
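As an illustrative, hedged check (not part of the article's code), you can verify at startup whether the implementation classes quoted in the enum above are actually loadable, which surfaces a missing codec dependency before any file is written or read:
// Class names as declared in the CompressionCodecName enum shown above
String[] codecClasses = {
    "com.hadoop.compression.lzo.LzoCodec",           // LZO
    "org.apache.hadoop.io.compress.BrotliCodec"      // BROTLI
};
for (String className : codecClasses) {
    try {
        Class.forName(className);
        System.out.println(className + " is available on the classpath");
    } catch (ClassNotFoundException e) {
        System.out.println(className + " is missing: add the corresponding dependency");
    }
}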