ORC vs. Parquet vs. Avro: How do you select the right file format for Hive?


ORC vs. Parquet vs. Avro: which one is the best of the lot? People working with Hive ask this question often, and there have been many interesting discussions around it.
Contributing my two cents, I'll answer it here as well. At Ellicium, we have come across this question many times and have answered it successfully for several of our customers.

Why don’t you try it out yourself first?

You can use a sample dataset and familiarize yourself with the comparison framework before experimenting with actual data.

You can download the 'airlines.dat' and 'airports.dat' files from the links below –
https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat
https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

The performance metrics to capture during this process (keeping it simple for now) are the time taken to load data into Hive tables, the compression factor achieved, and the query execution time.

It is highly recommended to perform the ORC vs. Parquet vs. Avro comparison on large/actual volumes of data to reach a conclusion appropriate to your requirements. The airline data used here serves only as an example.

Follow the steps below to start on this exciting and revealing journey –

Data Loading into Hive

The first step would be to get the data available in Hive.

We need three copies each of the 'airlines' and 'airports' tables in Hive, storing data in ORC, Parquet, and Avro format respectively. Let us call them 'airlines_orc', 'airlines_parquet', and 'airlines_avro', and similarly for the 'airports' table. There are various options for doing this.

One option is to load the CSV data for airlines and airports directly into the Hive tables.

Another option is to create an external table in Hive pointing to the source files on HDFS and then move the data from the external table to the managed tables using the 'insert overwrite table' command, as sketched below.
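A minimal HiveQL sketch of this approach, assuming the airlines CSV has been copied to a hypothetical HDFS path /data/openflights/airlines and simplifying the column list (the real airlines.dat has more columns):

    -- External table over the raw CSV files (column list simplified for illustration)
    CREATE EXTERNAL TABLE airlines_ext (
      airlineid INT,
      name      STRING,
      iata      STRING,
      icao      STRING,
      country   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/openflights/airlines';

    -- Managed ORC table with the same columns; repeat with STORED AS PARQUET and
    -- STORED AS AVRO to create airlines_parquet and airlines_avro
    CREATE TABLE airlines_orc (
      airlineid INT,
      name      STRING,
      iata      STRING,
      icao      STRING,
      country   STRING
    )
    STORED AS ORC;

    -- Copy the data across; the wall-clock time of this statement is the loading metric
    INSERT OVERWRITE TABLE airlines_orc SELECT * FROM airlines_ext;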

Performance Metric – One metric to capture here is the actual time taken to load data into 'airlines_orc', 'airlines_parquet', and 'airlines_avro'.

Check the total underlying file size to determine the compression factor achieved, which becomes a significant factor with huge datasets.

Also, check the table properties to gain more insight into the underlying files created.
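For example, from the Hive shell you can inspect the table metadata and the size of the underlying files along these lines (the warehouse path shown is an assumption and varies by installation):

    -- Table-level metadata, including numFiles, totalSize, and rawDataSize
    DESCRIBE FORMATTED airlines_orc;
    SHOW TBLPROPERTIES airlines_orc;

    -- Size of the files backing the table (warehouse path is installation-specific)
    dfs -du -s -h /user/hive/warehouse/airlines_orc;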

Query Execution

Next, we move on to the most important task of capturing the performance metrics for query execution.

Once the data has been loaded into Hive, we need to execute several queries to capture the execution time.

We can use the following types of queries for execution –

  • Simple select queries with order by.
  • Simple count queries, for example, select count(airlineid) from airlines.
  • Distinct counts, for example, select count(distinct airline) from airlines.
  • Joins, for example, joining the airlines and airports tables.
  • Aggregate functions such as max, min, etc.
  • Group by with various combinations of having and order by clauses.
  • Subqueries.
  • Various combinations of joins and subqueries, joins and group by, subqueries and group by, case statements, etc.
  • Queries using complex logic to answer business questions.
  • Most important and useful of all, the queries used frequently in your project or by your client.
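A few illustrative examples against the tables created above (the join condition and column names are assumptions; adjust them to the actual schema):

    -- Simple ordered select, plain count, and distinct count
    SELECT * FROM airlines_orc ORDER BY name LIMIT 100;
    SELECT count(airlineid) FROM airlines_orc;
    SELECT count(DISTINCT country) FROM airlines_orc;

    -- Join with group by: for each airline, count the airports located in its country
    SELECT a.name, count(p.airportid) AS airport_cnt
    FROM airlines_orc a
    JOIN airports_orc p ON a.country = p.country
    GROUP BY a.name
    ORDER BY airport_cnt DESC
    LIMIT 20;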

Performance Metric – Capture the execution time for these queries, along with other factors such as the number of mappers and reducers used.

Once the above step has been completed, you will have a fair idea of which format suits your type of data and queries. Next, we need to control the Hive environment for query execution. We can do this by setting various Hadoop parameters, such as the split size (which affects the number of mappers) and the number of reducers, and then executing all the above queries again, capturing the same performance metrics: query execution time, number of mappers and reducers used, and so on.
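As a sketch, these session-level settings are commonly used to influence the number of mappers and reducers; the values shown are illustrative, and the right ones depend on your cluster:

    -- Cap the input split size (in bytes) so more, smaller splits and hence more mappers are created
    SET mapreduce.input.fileinputformat.split.maxsize=134217728;

    -- Aim for roughly one reducer per 256 MB of input, or pin an explicit reducer count
    SET hive.exec.reducers.bytes.per.reducer=268435456;
    SET mapreduce.job.reduces=8;

    -- Re-run the benchmark queries under these settings and record the metrics again
    SELECT count(DISTINCT country) FROM airlines_orc;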

Comparison and Conclusion

So, just to recap:

We have captured performance metrics for loading data into the Hive ORC, Parquet, and Avro tables and measured the compression factor achieved.

We have captured query execution times for various queries, ranging from simple to complex, involving joins, aggregations, business requirements, etc.

To confirm the outcome, we have further set up a controlled environment by tuning specific Hadoop parameters that impact Hive queries.

Based on all the results captured, it should be fairly straightforward to conclude which format is best for storing and querying data on Hive, given your requirements.
We have carried out comparison exercises like this, involving more complex steps, for many of our customers. If you ask me which file format is the best, my answer is: it depends on the customer's requirements.

Interestingly, we have found the choice of file format to depend on several parameters: the amount of data a customer needs to manage, the kinds of queries executed most frequently, the available infrastructure, the optimizations performed, and so on. In some situations ORC gave better results based on these parameters; in others Parquet won the bet, and in still others Avro came out ahead.

I hope this provides a starting point for anyone looking to decide on a file format for storing huge amounts of data and achieving faster query retrieval, based on their project requirements.

In my next blog, I will share our findings from one of our projects at Ellicium, in which we compared ORC, Parquet, and Avro files for data storage and query execution on Hive.

Stay tuned for this very interesting piece of information!