Jerónimo López

Apache Parquet is a columnar storage format optimized for analytical workloads, though it can also be used to store any type of structured data, covering a wide range of use cases.

One of its most notable features is the ability to compress data efficiently using different compression techniques at two stages of the writing process. This reduces storage costs and improves reading performance.

This article explains file compression in Parquet for Java, provides usage examples, and analyzes its performance.

Compression Techniques

Unlike traditional row-based storage formats, Parquet uses a columnar approach, which allows it to apply more specific and effective compression techniques that exploit data locality and the redundancy of values of the same type.

Parquet writes information in binary and applies compression at two distinct levels, using different techniques at each:

  • While writing the values of a column, it adaptively chooses the encoding type based on the characteristics of the initial values: Dictionary, Run-Length Encoding / Bit-Packing, Delta Encoding, etc.
  • Every time a certain number of bytes is reached (1 MB by default), a page is formed, and the binary block is compressed with the algorithm configured by the programmer (none, GZip, Snappy, LZ4, etc.).

Although the compression algorithm is configured at the file level, the encoding of each column is automatically selected using an internal heuristic (at least in the parquet-java implementation).
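
For illustration, the page size that triggers page compression and the use of dictionary encoding can be tuned through the writer builder (shown here with the Avro builder that appears later in the Code section). The values are just an example of the available knobs, not a recommendation:

ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    .withDictionaryEncoding(true)                     // allow dictionary encoding in the first stage
    .withPageSize(1024 * 1024)                        // bytes accumulated before a page is formed and compressed
    .withCompressionCodec(CompressionCodecName.ZSTD)  // algorithm applied to each page in the second stage
    .build();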

The performance of different compression techniques will depend heavily on your data, so there’s no silver bullet that guarantees the fastest processing time and lowest space consumption. You will need to execute your own tests.

Code

The configuration is straightforward, and it only needs to be explicitly set when writing. When reading a file, Parquet discovers which compression algorithm was used and applies the corresponding decompression algorithm.
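
For example, a read needs no compression-related configuration at all. A minimal sketch with the Avro reader, assuming inputFile points to an existing Parquet file:

try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(inputFile).build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        // process the record; the codec is discovered from the file metadata
    }
}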

Configuring the Algorithm or Codec

In both Carpet and Parquet with Protocol Buffers and Avro, to configure the compression algorithm, you just need to call the withCompressionCodec method of the builder:

Carpet

CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

Avro

ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

Protocol Buffers

ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

The value must be one of those available in the CompressionCodecName enum: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, and LZ4_RAW (LZ4 is deprecated, and you should use LZ4_RAW instead).

Compression Level

Some compression algorithms offer a way to fine-tune the compression level. This level is usually related to the effort they put into finding repetition patterns: the higher the compression, the more time and memory the compression process requires.

Although each codec comes with a default level, it can be modified using Parquet's generic configuration mechanism, with each codec using a different key.

Additionally, the value to choose is not standard and depends on each codec, so you must refer to the documentation of each algorithm to understand what each level offers.

ZSTD

To reference the configuration key for the level, the ZSTD codec declares a constant: ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL.

Possible values range from 1 to 22, and the default value is 3.

ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .config(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6")
    .build();

LZO

To reference the configuration key for the level, the LZO codec declares a constant: LzoCodec.LZO_COMPRESSION_LEVEL_KEY.

Possible values are 1 through 9, 99, and 999, and the default value is 999.

ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.LZO)
    .config(LzoCodec.LZO_COMPRESSION_LEVEL_KEY, "99")
    .build();

GZIP

The GZIP codec does not declare any constant, and you have to use the string "zlib.compress.level" directly. Possible values range from 0 to 9, and the default value is 6.

ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.GZIP)
    .config("zlib.compress.level", "9")
    .build();
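
Since these are plain configuration properties, they can also be set on a Hadoop Configuration object and passed to the builder with withConf, if you already work with one. A sketch equivalent to the ZSTD example above:

Configuration conf = new Configuration();
conf.set(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6");

ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .withConf(conf)
    .build();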

Performance Tests

To analyze the performance of different compression algorithms, I will use two public datasets containing different types of data: a dataset from the Italian government (gov.it) and the New York City taxi trip dataset (NYC Taxis).

I will evaluate some of the compression algorithms available in Parquet Java: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD, and LZ4_RAW.

I will use Carpet with the default configuration that parquet-java provides and the default compression level of each algorithm.

You can find the source code on GitHub, and the tests were done on a laptop with an AMD Ryzen 7 4800HS CPU and JDK 17.
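
As a reference for how the measurements can be taken, here is a minimal sketch of timing a single write with Carpet, assuming the records are already loaded in an organizations list (the actual benchmark code is in the repository):

long start = System.currentTimeMillis();
try (CarpetWriter<Organization> writer = new CarpetWriter.Builder<>(outputFile, Organization.class)
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build()) {
    writer.write(organizations);  // assumes a preloaded List<Organization>
}
System.out.println("Write time: " + (System.currentTimeMillis() - start) / 1000.0 + " s");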

File Size

To understand how each compression algorithm performs, we will take the size of the equivalent CSV file as a reference.

Format         gov.it    NYC Taxis
CSV            1761 MB   2983 MB
UNCOMPRESSED    564 MB    760 MB
SNAPPY          220 MB    542 MB
GZIP            146 MB    448 MB
ZSTD            148 MB    430 MB
LZ4_RAW         209 MB    547 MB
LZO             215 MB    518 MB

In both tests, compression with GZip and Zstandard stands out as the most efficient.

Using only Parquet encoding techniques, the file size can be reduced to 25-32% of the original CSV size. Applying additional compression reduces it to between 9% and 15% of the CSV size.

Writing

How much overhead does compressing the information bring?

If we write the same information three times and average the seconds, we get:

Algorithm      gov.it (s)   NYC Taxis (s)
UNCOMPRESSED   25.0         57.9
SNAPPY         25.2         56.4
GZIP           39.3         91.1
ZSTD           27.3         64.1
LZ4_RAW        24.9         56.5
LZO            26.0         56.1

SNAPPY, LZ4_RAW, and LZO achieve times similar to not compressing, while ZSTD adds a small overhead. GZIP performs the worst, increasing the writing time by more than 50%.

Reading

Reading the files is faster than writing since fewer computations are needed.

Reading all the columns from the file, the times in seconds are:

Algorithm      gov.it (s)   NYC Taxis (s)
UNCOMPRESSED   11.4         37.4
SNAPPY         12.5         39.9
GZIP           13.6         40.9
ZSTD           13.1         41.5
LZ4_RAW        12.8         41.6
LZO            13.1         41.1

Reading times are close to those of the uncompressed files, with decompression adding an overhead of roughly 10% to 20%.

Conclusion

No algorithm stands out significantly over the others in reading and writing times; all are within a similar range. In most cases, the savings in storage (and transmission) compensate for the time penalty of compressing the information.

In these two use cases, the determining factor for choosing one or another would probably be the compression ratio achieved, where ZSTD and GZIP stand out (although GZIP's writing time is noticeably worse).

Each algorithm has its strengths, so the best option is to test with your data, considering which factor is more important:

  • Minimizing storage usage, because you store a lot of data that you rarely use.
  • Minimizing file generation time.
  • Minimizing reading time, since files are read many times.

As with everything in life, it's a trade-off, and you will have to see what pays off most in your case. In Carpet, by default, if you configure nothing, it compresses with Snappy.

Implementation Details

The value must be one of those available in the CompressionCodecName enum. Associated with each enum value is the name of the class implementing the algorithm:

public enum CompressionCodecName {
  UNCOMPRESSED(null, CompressionCodec.UNCOMPRESSED, ""),
  SNAPPY("org.apache.parquet.hadoop.codec.SnappyCodec", CompressionCodec.SNAPPY, ".snappy"),
  GZIP("org.apache.hadoop.io.compress.GzipCodec", CompressionCodec.GZIP, ".gz"),
  LZO("com.hadoop.compression.lzo.LzoCodec", CompressionCodec.LZO, ".lzo"),
  BROTLI("org.apache.hadoop.io.compress.BrotliCodec", CompressionCodec.BROTLI, ".br"),
  LZ4("org.apache.hadoop.io.compress.Lz4Codec", CompressionCodec.LZ4, ".lz4hadoop"),
  ZSTD("org.apache.parquet.hadoop.codec.ZstandardCodec", CompressionCodec.ZSTD, ".zstd"),
  LZ4_RAW("org.apache.parquet.hadoop.codec.Lz4RawCodec", CompressionCodec.LZ4_RAW, ".lz4raw");
  ...

Parquet will use reflection to instantiate the specified class, which must implement the CompressionCodec interface. If you look at its source code, you will see that it is within the Hadoop project, not Parquet. This shows how coupled Parquet is with Hadoop in the Java implementation.

To use one of the codecs, you must ensure you have added a JAR containing its implementation as a dependency.

Not all implementations are available through the transitive dependencies you get when adding parquet-java, or you may have excluded Hadoop dependencies too aggressively.
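
An illustrative way to check which codec implementations are actually reachable in your application's classpath is to try to load the class associated with each enum value:

for (CompressionCodecName codec : CompressionCodecName.values()) {
    String codecClass = codec.getHadoopCompressionCodecClassName();
    if (codecClass == null) {
        System.out.println(codec + ": no codec class needed");  // UNCOMPRESSED
        continue;
    }
    try {
        Class.forName(codecClass);
        System.out.println(codec + ": available (" + codecClass + ")");
    } catch (ClassNotFoundException e) {
        System.out.println(codec + ": missing dependency for " + codecClass);
    }
}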

The org.apache.parquet:parquet-hadoop dependency includes the implementations of SnappyCodec, ZstandardCodec, and Lz4RawCodec, and it transitively imports the snappy-java, zstd-jni, and aircompressor libraries with the actual implementations of the three algorithms.

The org.apache.hadoop:hadoop-common dependency includes the implementation of GzipCodec.

Where are the implementations of BrotliCodec and LzoCodec? They are not in any of the Parquet or Hadoop dependencies, so if you use them without adding additional dependencies, your application will not work with files compressed with those formats.

  • To support LZO, you need to add the dependency org.anarres.lzo:lzo-hadoop to your pom.xml or Gradle build file.
  • Even more complex is the case of Brotli: the dependency is not in Maven Central, and you must also add the JitPack repository.