Saturday, August 28, 2021

Understanding Parquet file format - What, Why, How and When - Explained


What is Parquet Format ?

Parquet is a columnar (column-oriented) data format that was initially built to support the Apache Hadoop ecosystem. It is a read-optimized format that allows efficient reads and supports efficient compression algorithms.
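To make this concrete, below is a minimal sketch of writing and reading a Parquet file from Python with pandas; the column names and file name are illustrative assumptions, not part of the original post.

```python
# Minimal sketch: write a DataFrame to Parquet and read it back.
# Assumes pandas with the pyarrow engine installed; names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "country": ["IN", "US", "DE"],
    "amount": [120.5, 89.0, 230.75],
})

# Write the DataFrame out in the columnar Parquet format.
df.to_parquet("orders.parquet", index=False)

# Read it back into a DataFrame.
print(pd.read_parquet("orders.parquet"))
```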

Why Parquet ?

When you have a dataset that you want to query efficiently and benefit from query pruning, Parquet is a file format worth considering. Being a columnar storage format, it greatly improves read performance and minimizes input/output (I/O) operations on the data.

Below are the summarized advantages of Parquet files:

1. Efficient query pruning - With the Parquet format, columns that are not required while reading the dataset are efficiently pruned. This minimizes I/O operations, making reads fast and efficient (see the column-pruning sketch after this list).

2. Handles large volumes of data and has efficient data compression/encoding mechanisms, thereby helping to reduce storage costs as well.

3. The Parquet format, though initially built in the Apache Hadoop ecosystem, works well with most data lake storage systems such as ADLS and S3, and is supported as external tables in almost all new-generation columnar data warehouses.

4. Since Parquet is a read-optimized, column-oriented data format, it is very efficient for data storage in OLAP workloads, where queries mostly fetch only a subset of columns.
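As a rough sketch of how column pruning looks in practice with Spark (the input path and column names here are assumptions for illustration):

```python
# Sketch of column pruning in PySpark; the input path and column names are
# illustrative assumptions, not from the original post.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pruning").getOrCreate()

orders = spark.read.parquet("/data/orders")

# Selecting only two columns means Spark reads just those column chunks from
# the Parquet files and skips the I/O for every other column.
pruned = orders.select("country", "amount")
pruned.explain()   # the physical plan's ReadSchema lists only the selected columns
pruned.show(5)
```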

How is Parquet stored on disk ?

Below is an example of a Parquet file stored in an Azure Data Lake Storage Gen2 account. The source dataset was small, hence you see just one file.


There is, however, another way of storing the data, where you have a root folder for your Parquet dataset, and inside the root folder you have multiple part files that split and store the data, something like below:



The fewer part (partition) files you have, the faster your read performance generally becomes, since every extra file adds open and footer-read overhead. Spark allows setting the number of partitions when writing Parquet (manual partitioning); the ideal is a partition count that is neither too low (limiting parallelism) nor too high (creating many small files). A sketch of controlling this follows below.
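The paths and the partition counts of 8 and 4 below are illustrative choices, not recommendations:

```python
# Sketch: controlling how many part files a Parquet write produces in Spark.
# Paths and the counts 8/4 are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-partitions").getOrCreate()

df = spark.read.parquet("/data/orders_raw")

# repartition(8) shuffles the data into 8 partitions, so the output folder
# ends up with roughly 8 part-*.parquet files.
df.repartition(8).write.mode("overwrite").parquet("/data/orders_repartitioned")

# coalesce(4) reduces the partition count without a full shuffle, useful when
# you only want fewer (not more) output files.
df.coalesce(4).write.mode("overwrite").parquet("/data/orders_coalesced")
```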

What are the compression schemes in Parquet ?

The most widely used compression scheme for Parquet is Snappy, which compresses the data to a smaller size and hence also contributes to I/O efficiency. Snappy-compressed Parquet files read significantly faster than uncompressed Parquet files.
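A sketch of choosing the compression codec when writing Parquet from Spark; note that Snappy is already the default codec in recent Spark versions, and the paths here are illustrative assumptions:

```python
# Sketch: selecting the Parquet compression codec in Spark.
# Snappy is the default in recent Spark versions; paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compression").getOrCreate()

df = spark.read.parquet("/data/orders")

# Explicitly request Snappy: fast to decompress, a good balance for reads.
df.write.mode("overwrite").option("compression", "snappy").parquet("/data/orders_snappy")

# gzip usually gives smaller files at the cost of more CPU on write and read.
df.write.mode("overwrite").option("compression", "gzip").parquet("/data/orders_gzip")
```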

When should I use parquet ?

Parquet should be considered when you have an OLAP use case, ideally where each query touches only a subset of columns. The more columns you read and the more unique values (higher cardinality) a column holds, the lower Parquet's performance, since encoding and compression become less effective. So consider Parquet where operations happen on a subset of columns and where you are doing aggregations.
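For instance, a typical OLAP-style aggregation that touches only two columns looks like the sketch below (the dataset path and column names are assumptions for illustration):

```python
# Sketch: an OLAP-style aggregation that touches only a subset of columns,
# the access pattern Parquet is optimized for. Names/paths are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parquet-olap").getOrCreate()

orders = spark.read.parquet("/data/orders")

# Only 'country' and 'amount' are read from disk; all other columns are pruned.
revenue_by_country = (
    orders.groupBy("country")
          .agg(F.sum("amount").alias("total_amount"))
)
revenue_by_country.show()
```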

References and Further reads

I found some great beginner articles on the Parquet format online; sharing them below if you are interested in learning more.

