Apache Parquet is a columnar storage format optimized for analytical workloads, although it can also be used to store any type of structured data, covering many different use cases.
One of its most notable features is the ability to efficiently compress data using different compression techniques at two stages of its process. This reduces storage costs and improves reading performance.
This article explains file compression in Parquet for Java, provides usage examples, and analyzes its performance.
Compression Techniques
Unlike traditional row-based storage formats, Parquet uses a columnar approach, which allows it to apply more specific and effective compression techniques that exploit data locality and the redundancy of values of the same type.
Parquet writes information in binary and applies compression at two distinct levels, using different techniques at each:
- While writing the values of a column, it adaptively chooses the encoding type based on the characteristics of the initial values: Dictionary, Run-Length Encoding / Bit-Packing, Delta Encoding, etc.
- Every time a certain number of bytes is reached (1 MB by default), a page is formed and the binary block is compressed with the algorithm configured by the programmer (none, GZip, Snappy, LZ4, etc.).
Although the compression algorithm is configured at the file level, the encoding of each column is automatically selected using an internal heuristic (at least in the parquet-java implementation).
The performance of different compression techniques will depend heavily on your data, so there’s no silver bullet that guarantees the fastest processing time and lowest space consumption. You will need to execute your own tests.
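For illustration, both levels can be influenced from the writer builder. The following sketch uses the Avro binding and the Organization class that appear later in this article; the page size and dictionary settings shown match parquet-java's defaults, so setting them explicitly only makes the two stages visible, while the codec is the algorithm applied to every page.
ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    .withPageSize(1024 * 1024)                          // pages are the unit of block compression (1 MB by default)
    .withDictionaryEncoding(true)                       // allow dictionary encoding; other encodings are still chosen automatically
    .withCompressionCodec(CompressionCodecName.SNAPPY)  // block compression applied to each page
    .build();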
Code
The configuration is straightforward, and it only needs to be explicitly set when writing. When reading a file, Parquet discovers which compression algorithm was used and applies the corresponding decompression algorithm.
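As an illustration (assuming the Organization Avro class used in the writing examples below and an inputFile of type org.apache.parquet.io.InputFile), a reader never mentions a codec; the decompression algorithm is resolved from the file metadata:
try (ParquetReader<Organization> reader = AvroParquetReader.<Organization>builder(inputFile).build()) {
    Organization organization;
    while ((organization = reader.read()) != null) {
        // process each record; no compression configuration is needed when reading
    }
}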
Configuring the Algorithm or Codec
In Carpet, as well as in Parquet with Protocol Buffers or Avro, you just need to call the withCompressionCodec method of the builder to configure the compression algorithm:
Carpet
CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
.withCompressionCodec(CompressionCodecName.ZSTD)
.build();
Avro
ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
.withSchema(new Organization().getSchema())
.withCompressionCodec(CompressionCodecName.ZSTD)
.build();
Protocol Buffers
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
.withMessage(Organization.class)
.withCompressionCodec(CompressionCodecName.ZSTD)
.build();
The value must be one of those available in the CompressionCodecName enum: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, and LZ4_RAW (LZ4 is deprecated; use LZ4_RAW instead).
Compression Level
Some compression algorithms offer a way to fine-tune the compression level. This level is usually related to the effort the algorithm spends finding repetition patterns: the higher the level, the more time and memory the compression process requires.
Although each codec comes with a default level, it can be modified using Parquet's generic configuration mechanism, with each codec using a different key.
Additionally, the value to choose is not standard and depends on each codec, so you must refer to the documentation of each algorithm to understand what each level offers.
ZSTD
To set the level, the ZSTD codec declares the constant ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL.
Possible values range from 1 to 22, and the default value is 3.
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
.withMessage(Organization.class)
.withCompressionCodec(CompressionCodecName.ZSTD)
.config(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6")
.build();
LZO
To set the level, the LZO codec declares the constant LzoCodec.LZO_COMPRESSION_LEVEL_KEY.
Possible values are 1 to 9, 99, and 999, and the default value is 999.
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
.withMessage(Organization.class)
.withCompressionCodec(CompressionCodecName.LZO)
.config(LzoCodec.LZO_COMPRESSION_LEVEL_KEY, "99")
.build();
GZIP
GZIP does not declare any constant; you have to use the string "zlib.compress.level" directly. Possible values range from 0 to 9, and the default value is 6.
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
.withMessage(Organization.class)
.withCompressionCodec(CompressionCodecName.GZIP)
.config("zlib.compress.level", "9")
.build();
Performance Tests
To analyze the performance of different compression algorithms, I will use two public datasets containing different types of data:
- New York City Taxi Trips: with a large number of numeric values and few string values in a few columns. It has 23 columns and contains 19.6 million records.
- Cohesion Projects of the Italian Government: many columns with float values and a large quantity and variety of text strings. It has 91 columns and contains 2 million rows.
I will evaluate some of the compression algorithms enabled in Parquet Java: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD, LZ4_RAW.
I will use Carpet with the default configuration that parquet-java brings and the default compression level of each algorithm.
You can find the source code on GitHub, and the tests were done on a laptop with an AMD Ryzen 7 4800HS CPU and JDK 17.
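The measurements follow a pattern similar to the sketch below (the loadDataset helper and the Organization Avro records are illustrative placeholders, not the actual benchmark code from the repository):
// Write the whole dataset with the codec under test and measure the elapsed time
List<Organization> records = loadDataset();  // hypothetical helper that parses the CSV into records
long start = System.nanoTime();
try (ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
        .withSchema(new Organization().getSchema())
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build()) {
    for (Organization record : records) {
        writer.write(record);
    }
}
double writeSeconds = (System.nanoTime() - start) / 1_000_000_000.0;

// Read back all columns of every record and measure the elapsed time
start = System.nanoTime();
try (ParquetReader<Organization> reader = AvroParquetReader.<Organization>builder(inputFile).build()) {
    while (reader.read() != null) {
        // consume every record
    }
}
double readSeconds = (System.nanoTime() - start) / 1_000_000_000.0;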
File Size
To understand how each compression algorithm performs, we will take the equivalent CSV file as a reference.
Format | gov.it | NYC Taxis |
---|---|---|
CSV | 1761 MB | 2983 MB |
UNCOMPRESSED | 564 MB | 760 MB |
SNAPPY | 220 MB | 542 MB |
GZIP | 146 MB | 448 MB |
ZSTD | 148 MB | 430 MB |
LZ4_RAW | 209 MB | 547 MB |
LZO | 215 MB | 518 MB |
In both tests, compression with GZip and Zstandard stands out as the most efficient.
Using only Parquet encoding techniques, the file size can be reduced to 25-32% of the original CSV size. Applying additional compression reduces it to between 9% and 15% of the CSV size.
Writing
How much overhead does compressing the information bring?
If we write the same information three times and average the seconds, we get:
Algorithm | gov.it (seconds) | NYC Taxis (seconds) |
---|---|---|
UNCOMPRESSED | 25.0 | 57.9 |
SNAPPY | 25.2 | 56.4 |
GZIP | 39.3 | 91.1 |
ZSTD | 27.3 | 64.1 |
LZ4_RAW | 24.9 | 56.5 |
LZO | 26.0 | 56.1 |
SNAPPY, LZ4_RAW, and LZO achieve times similar to not compressing at all, while ZSTD adds a small overhead. GZIP performs the worst, increasing the writing time by more than 50%.
Reading
Reading the files is faster than writing since fewer computations are needed.
Reading all the columns from the file, the times in seconds are:
Algorithm | gov.it (seconds) | NYC Taxis (seconds) |
---|---|---|
UNCOMPRESSED | 11.4 | 37.4 |
SNAPPY | 12.5 | 39.9 |
GZIP | 13.6 | 40.9 |
ZSTD | 13.1 | 41.5 |
LZ4_RAW | 12.8 | 41.6 |
LZO | 13.1 | 41.1 |
Reading times are close to not compressing the information, and the overhead of decompression is between 10% and 20%.
Conclusion
No algorithm stands out significantly over the others in reading and writing times; all are within a similar range. In most cases, the storage (and transmission) savings compensate for the time penalty of compressing the information.
In these two use cases, the deciding factor for choosing one or another would probably be the compression ratio achieved, where ZSTD and GZIP stand out (although GZIP comes with a poor writing time).
Each algorithm has its strengths, so the best option is to test with your data, considering which factor is more important:
- Minimizing storage usage, because you store a lot of data that you rarely use.
- Minimizing file generation time.
- Minimizing reading time, since files are read many times.
As with everything in life, it's a trade-off, and you will have to see which factor pays off the most in your case. By default, if you configure nothing, Carpet compresses with Snappy.
Implementation Details
The value must be one of those available in the CompressionCodecName enum. Associated with each enum value is the name of the class implementing the algorithm:
public enum CompressionCodecName {
UNCOMPRESSED(null, CompressionCodec.UNCOMPRESSED, ""),
SNAPPY("org.apache.parquet.hadoop.codec.SnappyCodec", CompressionCodec.SNAPPY, ".snappy"),
GZIP("org.apache.hadoop.io.compress.GzipCodec", CompressionCodec.GZIP, ".gz"),
LZO("com.hadoop.compression.lzo.LzoCodec", CompressionCodec.LZO, ".lzo"),
BROTLI("org.apache.hadoop.io.compress.BrotliCodec", CompressionCodec.BROTLI, ".br"),
LZ4("org.apache.hadoop.io.compress.Lz4Codec", CompressionCodec.LZ4, ".lz4hadoop"),
ZSTD("org.apache.parquet.hadoop.codec.ZstandardCodec", CompressionCodec.ZSTD, ".zstd"),
LZ4_RAW("org.apache.parquet.hadoop.codec.Lz4RawCodec", CompressionCodec.LZ4_RAW, ".lz4raw");
...
Parquet uses reflection to instantiate the specified class, which must implement the CompressionCodec interface. If you look at its source code, you will see that it is part of the Hadoop project, not Parquet, which shows how coupled the Java implementation of Parquet is with Hadoop.
To use one of the codecs, you must ensure you have added a JAR containing its implementation as a dependency.
Not all implementations are available among the transitive dependencies you get when adding parquet-java, or you may have excluded Hadoop dependencies too aggressively.
The org.apache.parquet:parquet-hadoop dependency includes the implementations of SnappyCodec, ZstandardCodec, and Lz4RawCodec, and transitively imports the snappy-java, zstd-jni, and aircompressor dependencies with the actual implementations of the three algorithms.
The org.apache.hadoop:hadoop-common dependency includes the implementation of GzipCodec.
Where are the implementations of BrotliCodec and LzoCodec? They are not in any of the Parquet or Hadoop dependencies, so if you use them without adding extra dependencies, your application will not work with files compressed in those formats.
- To support LZO, you need to add the org.anarres.lzo:lzo-hadoop dependency to your pom or gradle files.
- Even more complex is the case of Brotli: the dependency is not in Maven Central, and you must also add the JitPack repository.
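As an illustrative, hedged check (not part of the article's code), you can verify at startup whether the implementation classes quoted in the enum above are actually loadable, which surfaces a missing codec dependency before any file is written or read:
// Class names as declared in the CompressionCodecName enum shown above
String[] codecClasses = {
    "com.hadoop.compression.lzo.LzoCodec",           // LZO
    "org.apache.hadoop.io.compress.BrotliCodec"      // BROTLI
};
for (String className : codecClasses) {
    try {
        Class.forName(className);
        System.out.println(className + " is available on the classpath");
    } catch (ClassNotFoundException e) {
        System.out.println(className + " is missing: add the corresponding dependency");
    }
}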