Saturday, August 28, 2021

Understanding Parquet file format - What, Why, How and When - Explained


What is Parquet Format ?

Parquet is a columnar (column-oriented) data format that was initially built to support the Apache Hadoop ecosystem. It is a read-optimized format that enables efficient reads and supports efficient compression and encoding schemes.
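As a quick illustration, here is a minimal sketch (assuming pandas with the pyarrow engine installed, and hypothetical column names) of writing and reading a Parquet file:

import pandas as pd

# A small sample dataset (hypothetical columns, for illustration only)
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["alice", "bob", "carol"],
    "amount": [120.5, 80.0, 42.3],
})

# Write to Parquet (pandas delegates to pyarrow when it is installed)
df.to_parquet("orders.parquet", engine="pyarrow")

# Read it back
restored = pd.read_parquet("orders.parquet", engine="pyarrow")
print(restored.head())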

Why Parquet ?

When you have a dataset that you want to query very efficiently with good query pruning, Parquet is one file format you could go for. Being a columnar storage format, it greatly improves read performance and minimizes input/output (IO) operations on the data.

Below is a summary of the advantages of Parquet files:

1. Efficient query pruning - With the Parquet format, columns that are not required while reading the dataset are efficiently pruned. This minimizes IO operations, thereby making reads fast and efficient (see the sketch after this list).

2. Able to process large volumes of data, with efficient data compression/encoding mechanisms, thereby helping to reduce storage costs as well.

3. The Parquet format, though initially built in the Apache Hadoop ecosystem, works well with most data lake storage systems such as ADLS and S3, and is supported as external tables in almost all new-generation columnar data warehouses.

4. Since Parquet is a read-optimized, column-oriented data format, it is very efficient for data storage in OLAP workloads, where queries mostly fetch only a subset of columns.
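To make the column pruning concrete, here is a small sketch (reusing the hypothetical orders.parquet file from the earlier example) that reads only the columns a query needs and aggregates on them:

import pandas as pd

# Only the requested columns are read from disk; the rest are pruned,
# which is what keeps IO low for column-subset queries.
subset = pd.read_parquet("orders.parquet", columns=["customer", "amount"])

# A typical OLAP-style aggregation on the pruned subset
print(subset.groupby("customer")["amount"].sum())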

How is Parquet stored on disk ?

Below is an example of a Parquet file stored in an Azure Data Lake Storage Gen2 account. The source dataset was small, hence you see just one file.


There is, however, another way of storing the data, where you have a root folder for your Parquet dataset and, inside that root folder, multiple part files that split and hold the data, something like below.



The fewer part (partition) files you have, the faster your read performance tends to be. Parquet in Spark allows setting the number of partitions (manual partitioning); the ideal scenario here is a partition count that is neither too low nor too high.
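As a rough sketch of manual partitioning in Spark (the input path and partition count below are placeholders), you can control the number of part files by repartitioning before the write:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-partitioning").getOrCreate()

# Hypothetical source data
df = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

# Control the number of part files by repartitioning before writing.
# Too many tiny files or too few huge files both hurt read performance.
df.repartition(8).write.mode("overwrite").parquet("/data/orders_parquet")

# Reading back picks up all the part files under the root folder
orders = spark.read.parquet("/data/orders_parquet")
print(orders.rdd.getNumPartitions())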

What are the compression schemes in Parquet ?

The most widely used compression scheme for Parquet is Snappy, which compresses the original file into a smaller one and hence also contributes to IO efficiency. Snappy-compressed Parquet files show significantly better read performance than Parquet files without compression.
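Snappy is typically the default codec in both Spark and pyarrow, but it can also be set explicitly. A minimal sketch with pandas, reusing the hypothetical orders.parquet file from above:

import pandas as pd

df = pd.read_parquet("orders.parquet")

# Explicitly choose the compression codec per file
df.to_parquet("orders_snappy.parquet", compression="snappy")
df.to_parquet("orders_uncompressed.parquet", compression=None)

# In Spark the codec can be set in a similar way:
# df.write.option("compression", "snappy").parquet("/data/orders_snappy")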

When should I use parquet ?

Parquet should be considered when you have an OLAP use case, ideally where a problem touches only a subset of columns. The more columns you read, and the more unique values a column holds, the lower the performance of Parquet. So consider Parquet where your operations happen on a subset of columns and where you are doing aggregations.

References and Further reads

I found some great beginner articles on the Parquet format online; they are shared below if you are interested in learning more.


Monday, August 16, 2021

Not authorized to view container in Azure Blob under Storage Account

Error: Below is the error description shown inside a newly created container under the storage account we created.

  • This request is not authorized to perform this operation. RequestId:342dca41-601e-004f-0dc9-9291aa000000 Time:2021-08-16T18:10:08.0322108Z 
  • This storage account's 'Firewalls and virtual networks' settings may be blocking access to storage services. Try adding your client IP address ('xx.xx.xx.xxx') to the firewall exceptions, or by allowing access from 'all networks' instead of 'selected networks'




Cause: The client IP address of your machine is not added as an exception in the firewall settings of the parent storage account.


Solution: 

1. Add your IP address as an exception under 'Networking' > 'Firewalls and virtual networks'.



2. Click on 'Add your client IP address (xx.xx.xxx.xxx)'.




3. Now click on 'Save' in the top left corner under 'Firewalls and virtual networks'.


4. Now navigate back to the container and you should be able to access the container.
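If you prefer to script this instead of using the portal, here is a rough sketch using the azure-mgmt-storage Python SDK (the subscription ID, resource group, account name, and IP below are placeholders, and the exact parameter shapes can vary between SDK versions):

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholders - replace with your own values
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
account_name = "<storage-account>"
client_ip = "<your-client-ip>"

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Add the client IP as a firewall exception while keeping access limited
# to selected networks. Note: this replaces any existing IP rules, so in
# real usage fetch the current rules first and merge your IP into them.
client.storage_accounts.update(
    resource_group,
    account_name,
    {
        "network_rule_set": {
            "default_action": "Deny",
            "ip_rules": [{"ip_address_or_range": client_ip}],
        }
    },
)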




Error 22300 - Azure Data Factory Linked Service cannot connect to Azure SQL Database

 Error:  Error Code 22300 - Cannot connect to SQL Database: 'db-learn-azure.database.windows.net', Database: 'learn-azure-athul', User: 'athul'. Check the linked service configuration is correct, and make sure the SQL Database firewall allows the integration runtime to access. Cannot open server 'db-learn-azure' requested by the login. Client with IP address '20.xx.xx.xxx' is not allowed to access the server.


Error Code 22300



Cause: This happens because the IP address used by the Azure Data Factory instance is not allowed through the Azure SQL Database server's firewall.

Resolution:  

1. Go to the Azure SQL Database instance in the Azure portal and open its firewall settings.

2. Change 'Allow Azure services and resources to access this server' to 'Yes' and save the firewall settings.



3. Now try connecting again.
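If you would rather script this than use the portal, here is a rough sketch with the azure-mgmt-sql Python SDK (all names are placeholders; the special 0.0.0.0 - 0.0.0.0 rule is what the 'Allow Azure services...' toggle creates behind the scenes):

from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

# Placeholders - replace with your own values
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
server_name = "db-learn-azure"

client = SqlManagementClient(DefaultAzureCredential(), subscription_id)

# The 0.0.0.0 - 0.0.0.0 rule corresponds to the
# 'Allow Azure services and resources to access this server' toggle
client.firewall_rules.create_or_update(
    resource_group,
    server_name,
    "AllowAllWindowsAzureIps",
    {"start_ip_address": "0.0.0.0", "end_ip_address": "0.0.0.0"},
)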