Compression algorithms in Parquet

Apache Parquet is a columnar storage format optimized for analytical workloads, though it can also be used to store any type of structured data solving multiple use cases.

One of its most notable features is the ability to efficiently compress data using different compression techniques at two stages of its process. This reduces storage costs and improves reading performance.

This article explains file compression in Parquet for Java, provides usage examples, and analyzes its performance.

6 min read

Working with Parquet files in Java using Parquet Carpet

After some time working with Parquet files in Java using the Parquet Avro library, and studying how it worked, I concluded that despite being very useful in multiple use cases and having great potential, the documentation and ecosystem needed for adoption in the Java world was very poor.

Many people are using suboptimal solutions (CSV or JSON files), applying more complex solutions (Spark), or using languages they are not familiar with (Python) because they don’t know how to work with Parquet files easily. That’s why I decided to write this series of articles.

Once you understand it and have the examples, everything is easier. But, can it be even easier? Can we avoid the hassle of using strange libraries that serialize other formats? Yes, it should be even easier.

That’s why I decided to implement an Open Source library that makes working with Parquet from Java extremely simple, something that covers it: Carpet.

6 min read

Working with Parquet files in Java using Protocol Buffers

This post continues the series of articles about working with Parquet files in Java. This time, I’ll explain how to do it using the Protocol Buffers (PB) library.

Finding examples and documentation on how to use Parquet with Avro is challenging, but with Protocol Buffers, it’s even more complicated.

7 min read

Working with Parquet files in Java using Avro

In the previous article, I wrote an introduction to using Parquet files in Java, but I did not include any examples. In this article, I will explain how to do this using the Avro library.

Parquet with Avro is one of the most popular ways to work with Parquet files in Java due to its simplicity, flexibility, and because it is the library with the most examples.

11 min read

Working with Parquet files in Java

Parquet is a widely used format in the Data Engineering realm and holds significant potential for traditional Backend applications. This article serves as an introduction to the format, including some of the unique challenges I’ve faced while using it, to spare you from similar experiences.

9 min read

Java Serialization with Protocol Buffers

At Clarity AI, we generate batches of data that our application has to load and process to show our clients the social impact information of many companies. By volume of information, it is not Big Data, but it is enough to be a problem to read and load it efficiently in processes with online users.

8 min read

Bikey

TL;DR: in this post I present Bikey, a Java library to create Maps and Sets whose elements have two keys, consuming from 70%-85% to 99% less memory than usual data structures. It is Open Source, published in https://github.com/jerolba/bikey and available in Maven Central.

Bikey map memory consumption comparison

10 min read

Composite key in HashMaps

In my last post I talked about the problems of using an incorrect hash function when you put an object with a composite key in a Java HashMap, but I was stuck with the question: Which data structure is better to index those objects?

Continuing with the same example I will talk about products and stores, and I will use their identifiers to form the map key. The proposed data structures are:

  • A single map with a key containing its indexes: HashMap<Tuple<Integer, Integer>, MyObject>, which I will call TupleMap.
  • A nested map: HashMap<Integer, HashMap<Integer, MyObject>>, which I will call DoubleMap.