What is columnar database?
It is a type of NoSQL database, also called wide-column store or extensible record stores.
It uses tables, rows, and columns, but names and format of the columns can vary from row to row in the same table.
While a relational database is optimized for storing rows of data, typically for transactional applications, a columnar database is optimized for fast retrieval of columns of data, typically in analytical applications.
How are Columnar databases different from Row oriented databases ?
First understand Row Oriented Databases:
General databases which we have used till now like MySQL, Oracle etc are Row oriented databases. These databases store data on file system row by row, where a single row represent different fields related to record.
Consider this table:
Table rows are stored in sequential order. Hence on filesystem this table may look like:
In columnar database same records will be stored, column by column. Hence on file system above table will look like:
You will notice that first all IDs are stored, then Names, then prices and so on.
What are the benefits of using Columnar databases?
Since you are storing all the column data together, some of the queries will be very fast. Consider the query to execute is: How many items have price greater than 10,000.
- Since all prices are being stored together, you can easily count total prices with value greater than 10,000.
- In row oriented database, since prices are stored row by row, total Disk IO will be much more than column based datastore as the required data is separated by other fields data.
Columnar database have following advantages:
- Aggregation queries, where you have to just work on subset of data, will be very fast.
- Since data type of each column is similar, you can get better compression when running compression algorithms on each column making queries even faster.
To confirm your understanding, lets go little deeper:
Storage systems have to pull data from physical disk drives, which store information magnetically on spinning platters using read/write heads that move around to find the data that users request. The less the heads have to move, the faster the drive performs. If data is kept closer together, minimizing seek time, systems can deliver that data faster.
Well, what if I needed to query multiple values for each item (basically row)?
You are thinking right, you may have many cases where you have to get all data for given item basically row by row. For example you have to display details of all Items in your inventory. In such cases row oriented databases makes more sense, then columnar databases.
So, more the fields you have to read per record, lesses will be the benefit of columnar database.
Hence, generally when you are creating your app you are using row oriented databases (MySQL, Oracle etc) only. Columnar databases have their specific uses (we will be discussing this later in this article)
Are there any performance differences between columnar storage, compared to Row oriented databases when writing data?
Oh my god, you are asking all the right questions 🙂
Yes, writing new data can take more time:
- When you are inserting new record into row-oriented database, you simply write that in one operation. But if you’re inserting a new record to a columnar database, you need to write to each column one by one.
- As a result, loading new data or updating many values in a columnar database could take much more time.
It seems I shouldn’t use Columnar database then?
Well both databases may have their place in your application.
- You would usually want a row-oriented database like MySQL running the back-end of your web app, etc.
- And once your app becomes huge, you would also want to consider having a columnar database like Amazon Redshift to run your BI (business intelligence) analytics queries (which usually consist of aggregation queries).
Columnar databases are good for:
- Queries that involve only a few columns
- Aggregation queries against vast amounts of data
- Column-wise compression
But are not so good at:
- Incremental data loading
- Online Transaction Processing (OLTP) usage
- Queries against only a few rows
What is advantage of columnar database?
It allows for fast retrieval of columns of data and can be scaled using distributed clusters of low-cost hardware to increase throughput.
Where are they generally used?
Mainly used for:
- Analytical applications because it allows for fast retrieval of columns of data.
- Data warehousing and big data processing because they scale using distributed clusters of low-cost hardware.
Can yo give example of columnar databases?
Yes. Here are few:
- Amazon Redshift
- Apache HBase
- Apache Cassandra