Wednesday, December 7, 2016

What are Columnar Databases and it's significance in Analytics/Data warehousing ?

I was recently watching a session on Amazon Redshift from Youtube, when I came to know that Redshift is a columnar database. Which Amazon boasted optimizes performance and storage to a great deal. This ended up in me wanting to know more and hence the post.

So what is a columnar database ?

Typical databases like Oracle, mySQL etc store information as rows, i.e for example the details of a particular students is stored in a row.  In columnar databases the information is stored as columns. Data in a particular column is grouped together unlike other databases where row wise data is grouped. 

Say for example I have the following raw data, 
Name: Athul
Age: 26
Sex: Male 
Location: Trivandrum

Name: Rahul

Age: 43
Sex: Male
Location Bangalore

In typical databases the data would be stored as below, 

Athul26MaleTrivandrum

Rahul43MaleBangalore

However when it comes to a columnar database, data is grouped in Columnar fashion, i.e

AthulRahul|2643|MaleMale|TrivandrumBangalore

Each column data here is stored in a separate blocks. The advantage they say is that data of same datatype are grouped together and hence storage and retrieval is easier. This makes it great for Analytical data storage. However operations like insert and update are costlier. Aggregations again are great because of this columnar structure. Data compression also is greatly increased in columnar architecture.  This being the case it is very advisable to consider columnar databases if you are dealing with huge volumes of data. Another advantage is that your databases need parse whole rows to read data as most queries are limited to a subset of the table's actual number of columns. 

Examples of Columnar databases are Apache HBase, Amazon Redshift etc. Below is a good youtube video explanation for columnar database.

 
 Columnar Databases

courtesy: StackOverflow 

No comments:

Post a Comment