Wednesday, March 12, 2014

Resource manager HA support in Hadoop 2.3.0

Looks like Hadoop 2.3.0 added ResourceManager HA support using ZooKeeper. For example, YARN-1232, YARN-1222, YARN-1307, and YARN-1325 are resolved in 2.3.0. However, YARN-149 has not been resolved yet, so ResourceManager HA is probably not fully supported yet.

Tuesday, March 11, 2014

Hive skew join issue

If you use skew join in Hive 0.12 or earlier (including 0.11) by setting hive.optimize.skewjoin=true together with hive.auto.convert.join=true, you may get empty, and therefore incorrect, results in some cases.
The reason is that the skew join in Hive relies on a reduce phase to set the skewed keys aside on disk, but hive.auto.convert.join=true turns the MapReduce task into a map join task in some scenarios. A map join has no reduce phase, so no skewed keys are generated and the result is empty.
If you set hive.auto.convert.join=false to disable the automatic conversion to a map join, performance is very bad because the reduce phase takes a very long time to process the skewed keys. As a result, you probably do not want to use skew join in Hive 0.12 and earlier. Inconsistent results are very bad.
Fortunately, HIVE-6041 (https://issues.apache.org/jira/browse/HIVE-6041) resolves this issue in Hive 0.13.

Friday, February 28, 2014

Hadoop 2 shuffle handler

The shuffle handler in Hadoop 2 is implemented as an auxiliary service in the NodeManager. It uses Netty, so there is no need to launch a worker thread for each connection. The number of threads in the shuffle handler is derived from Runtime.getRuntime().availableProcessors() (Netty's default worker count is twice that value). There is a JIRA to make the thread count configurable (https://issues.apache.org/jira/browse/MAPREDUCE-5596), but it has not been included in Hadoop 2.2.0 yet.

Monday, February 17, 2014

JVM reuse is not supported in Hadoop 2 so far

If you migrate to Hadoop 2 (MR2), please be aware that JVM reuse is no longer supported at this moment. There is a JIRA for container reuse, MAPREDUCE-3902, but unfortunately it is not resolved yet. If you have many short-lived MR tasks, make sure to combine them to avoid the container launch overhead.

Tuesday, January 21, 2014

ZooKeeper observers

Amazing, Apache ZooKeeper added support for observers: http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html. The main motivation is to support a huge number of clients while maintaining write throughput, since the time to vote is affected by the number of participants in the quorum. Observers can therefore accept client connections, but they take no part in the voting process.
Other long-awaited features include ZOOKEEPER-107, which allows dynamic cluster membership changes, and ZOOKEEPER-1044, which allows dynamic changes to the role of a peer. It seems these features will be available in 3.5.0. Unfortunately, no release plan has been announced yet.
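The effect of observers on voting can be sketched with a little arithmetic. The snippet below is only an illustration in Python, not ZooKeeper code, and the function name votes_needed is made up for this post. It assumes the usual majority quorum, where a write needs acknowledgements from more than half of the voting participants.

```python
def votes_needed(participants, observers=0):
    """Acks needed to commit a write under a simple majority quorum.

    Observers serve clients but do not vote, so the `observers`
    argument is deliberately ignored in the calculation.
    """
    if participants < 1:
        raise ValueError("need at least one voting participant")
    return participants // 2 + 1


print(votes_needed(5))                 # 3
print(votes_needed(5, observers=20))   # still 3
print(votes_needed(25))                # 13
```

Adding 20 observers to a 5-node ensemble leaves the vote at 3 acknowledgements, while growing to 25 voting participants would raise it to 13.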

Friday, December 13, 2013

Impala on EMR

Finally, Impala can be run on EMR. The released version is 1.2.1. See http://aws.typepad.com/aws/2013/12/analyze-large-data-sets-with-impala.html

Friday, December 6, 2013

page cache

When I ran Impala, I found that a lot of memory was used for the page cache. Because of this, queries over the same data run much faster. To clear the page cache, you can run (as root):
echo 3 > /proc/sys/vm/drop_caches
where 3 frees the page cache, dentries, and inodes. The two other values are 1 (free the page cache only) and 2 (free dentries and inodes). Once the page cache is cleared, you will observe that the query runs slower.
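One way to watch this effect is to compare the "Cached" line of /proc/meminfo before and after dropping the caches. Here is a minimal Python sketch, assuming a Linux host; the helper names cached_kb and read_meminfo are made up for this post.

```python
def cached_kb(meminfo_text):
    """Return the 'Cached' value (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        # Lines look like "Cached:         2048000 kB".
        # "SwapCached:" does not match because of the startswith check.
        if line.startswith("Cached:"):
            return int(line.split()[1])
    raise ValueError("no 'Cached:' line found")


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory summary (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print("page cache size (kB):", cached_kb(read_meminfo()))
```

Run it once after a query, drop the caches, and run it again; the reported number should fall sharply.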