Here’s a quickly compiled checklist of basic choices to consider for getting optimum performance from your Cloudera Impala installation (beyond the obvious requirement of using multiple nodes, which is a given).
1. File format:
Impala gives the best performance with the Parquet file format. If raw performance is desired, it’s best to use HDFS (over HBase etc.) with the Parquet columnar format. Snappy is the default compression choice and works best; too much compression can hurt read and processing performance just as a lack of compression does. One side note: since Parquet is a columnar format, it delivers the best results when dealing with a limited number of columns. In other words, “select * from tbl1” will always be a lot slower than “select col1 from tbl1”, and performance degrades quite noticeably as more and more columns are selected. So ensure the queries always select only what is required.
2. Partitioning:
The Dremel paper stressed the performance gains possible with a partitioned data set that lends itself to multiple nodes acting on the data in parallel. Partition pruning is essential to good query performance, and the main fact table needs to be partitioned. So ensure the table is partitioned appropriately, and check the query profiles to make sure the partitions are actually being used. For example, the following snippet from a query profile shows the number of partitions being hit. Ideally, the best performance comes from limiting the scan to the right partition, typically a single one.
table=schema1.fact_tbl #partitions=1 size=72.37GB
Again, care should be taken not to partition too much: overly fine partitioning can result in a large number of small files, which leads to sub-optimal scan (read) performance.
3. Table order:
As of now, Impala does not come with a query planner that can re-arrange the tables in the SQL for the most optimal execution. As a result, the developer should ensure that the larger table, typically the fact table, is on the left side of the join and the smaller table is on the right side.
When talking about joins, it also needs to be noted that Impala implements joins in two ways. One way is to broadcast the right-side table (i.e., copy it in full) to each impalad node. The other way is to send only the relevant parts of the right-side table to each impalad node (shuffle). If the right-side table is small, broadcasting is a good option. Check the query profile again to see what the size of the right-side table is, and consider altering the join strategy with query hints; the query profile will also show whether a broadcast or shuffle join was used.
4. Short circuit reads:
Typically, when reading a file, the bytes are routed through the datanode. Impala exploits a recent HDFS feature to avoid this, streaming data from the disk directly to the socket without the datanode overhead. Make sure this is enabled in the Impala configuration.
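Enabling this is an HDFS-side setting; a sketch of the usual hdfs-site.xml fragment is below (the socket path shown is a site-specific example, not a required value):

```xml
<!-- hdfs-site.xml: enable short-circuit local reads -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- the path is site-specific; this is just a common choice -->
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hdfs-sockets/dn</value>
</property>
```

The datanodes need a restart after this change, and the socket directory must exist with appropriate permissions.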
5. Log level:
Once you use Impala for some time, you will notice the amount of logging happening. By default, the logs are flushed on each write, which can turn out to be costly. Reducing the amount of logging can help improve performance.
6. Network throughput:
Parse the query profiles to make sure your network throughput is good enough; the DataStreamSender section of the query profile captures this. When a query hits an impalad instance, the other impalad instances are requested to process local/remote data. If there are aggregates in the query, a partial aggregate may be performed on each of these impalad instances before the results are streamed to the coordinator node (though I haven’t seen it, there could be multiple such partial aggregate levels). The coordinator then performs the final aggregation before the results are streamed to the client. This underscores the need for at least gigabit connectivity between the impalads, and also to the client.
7. Memory:
I would recommend anything above 48 GB for each impalad instance; this is what I have seen in my tests. Among other things, this also depends on the concurrency you need from Impala. Look at the query profiles to see how much memory is needed to process one query, then extrapolate to the number of concurrent users for a rough figure of how much memory is needed. Don’t take this to the extreme, though: concurrency is typically not a big deal for decision support systems, and the number of concurrent requests will in practice be restricted by the number of cores in your client web servers. It also needs to be noted that creating Parquet files using Impala insert statements is memory bound, which is another thing to keep in mind. Another way in which memory helps is that, by default, most OSes use the available RAM for the file system cache; when the same partitions are hit again and again, the data is most often available in the file system cache without having to read from the disks. This is one reason why Impala queries can be much faster from roughly the fifth run onwards. The flip side is that query performance can vary, especially with mixed workloads (the cluster being used for some batch hive/map-reduce tasks between queries). There is a new HDFS feature to pin files in memory on the data nodes, which can be useful in these situations.
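The extrapolation above is simple arithmetic; here is a minimal sketch (the per-query figure, user count, and headroom factor are illustrative assumptions, not measurements):

```python
def estimate_cluster_memory_gb(per_query_gb, concurrent_users, headroom=1.5):
    """Back-of-the-envelope memory estimate: peak per-query memory
    (read it from the query profiles) times expected concurrency,
    with some headroom for skew and Parquet insert overhead."""
    return per_query_gb * concurrent_users * headroom

# e.g. profiles show ~4 GB peak per query, 8 concurrent users expected
print(estimate_cluster_memory_gb(4, 8))  # 48.0
```

Treat the result as a floor for sizing, not an exact requirement; real usage varies with the query mix.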
8. Table and column statistics:
As of now, it appears the statistics are used only when choosing the right join strategy (broadcast/shuffle), so you could live without this for now as long as you ensure the right strategy is used, either by default or via hints in your queries. As of today, statistics collection only works with a MySQL-backed state store, and it doesn’t help that MySQL is sometimes not preferred in enterprise environments.
9. Round robin impalad use:
Impala does not have a master-slave concept, but the node that receives the query takes the role of coordinator for the execution of that query. If there is aggregation using group bys, or a sort, or even a large select, all the data will be routed to that coordinator from the other worker impalads (after partial aggregation as applicable). So it makes sense to pick impalad instances in a round robin fashion for queries, so that you don’t beat a single impalad to death coordinating for all your concurrent users. AFAIK there is no automatic way to do this; you have to code for it manually, but it’s quite simple.
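The manual round robin can be as small as this sketch (the host list is hypothetical; substitute your own impalad endpoints):

```python
import itertools

# Hypothetical list of impalad endpoints to use as coordinators
IMPALAD_HOSTS = ["impalad1:21000", "impalad2:21000", "impalad3:21000"]

# itertools.cycle yields the hosts in an endless round-robin sequence
_next_host = itertools.cycle(IMPALAD_HOSTS).__next__

def pick_impalad():
    """Return the next impalad to send a query to, round robin."""
    return _next_host()

print([pick_impalad() for _ in range(4)])
# ['impalad1:21000', 'impalad2:21000', 'impalad3:21000', 'impalad1:21000']
```

In a multi-process client tier you would do this per process (or behind a simple TCP load balancer), which is usually close enough to round robin for this purpose.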
10. Block size and number of splits:
The optimal block size for Parquet appears to be 1 GB. Ensure that your files are not too small; you can check the size of your files with hadoop fs -ls <impala_db/table/partitions>. Also check the fragments sections of the query profile to ensure the number of blocks/splits the query has to scan is not a big number (tens or hundreds is not optimal).
Hdfs split stats (<volume id>:<# splits>/<split lengths>):
0:1/550.27 MB 11:1/550.53 MB 2:3/1.59 GB 14:1/338.22 MB 17:1/246.76 MB
18:1/550.53 MB 8:1/338.54 MB 20:2/1.08 GB
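When eyeballing many profiles gets tedious, the split-stats line is easy to parse mechanically; a small sketch, assuming the `<volume id>:<# splits>/<split lengths>` format shown above:

```python
import re

# Matches entries like "2:3/1.59 GB" from the "Hdfs split stats" profile line
ENTRY = re.compile(r"(\d+):(\d+)/([\d.]+)\s+(MB|GB)")

def total_splits(stats_line):
    """Sum the per-volume split counts from an Hdfs split stats line."""
    return sum(int(n) for _vol, n, _size, _unit in ENTRY.findall(stats_line))

line = "0:1/550.27 MB 11:1/550.53 MB 2:3/1.59 GB 14:1/338.22 MB"
print(total_splits(line))  # 6
```

If this total runs into the tens or hundreds for a single query, the files backing the table are probably too small.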
Again, care should be taken not to take this to the extreme: overly fine partitioning can result in a large number of smaller files, which are sub-optimal for scan performance. So too much partitioning isn’t good either.
11. Disk configuration:
The primary benefits of RAID, like redundancy and striping, do not apply to HDFS, which handles these automatically as part of its architecture. So using plain JBOD (just a bunch of disks) works well for HDFS and is in fact more efficient. Also, when configured as JBOD, Impala automatically determines the number of threads to use when scanning the disks. For example, if there are 2 disks, 2 or more threads will be used automatically, thereby boosting scan performance. You should see something like this in your query profiles:
- NumDisksAccessed: 8
- NumScannerThreadsStarted: 11