Tuesday, July 22, 2014

Change Git commit message

If your most recent commit has a typo in its message and has not yet been pushed to the remote repository, you can replace the message with the following command.

git commit --amend -m "Updated commit message"

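If you use Review Board, you need to rerun the post-review command with the "-r" option to specify the existing review request (CR) number, so that the amended commit updates the old request rather than opening a new one.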
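To see what the amend actually does, here is a minimal sketch that can be run in a throwaway repository. The temporary directory, file name, user identity and commit messages are all made up for illustration. Note that amending replaces the last commit with a new one rather than adding a second commit, which is why it is only safe before pushing.

```shell
# Scratch repository so no real work is touched (all names here are illustrative).
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.com"

# Create one commit whose message contains typos.
echo "hello" > "$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" commit -q -m "Initail comit"

# Replace the message of the most recent commit.
git -C "$repo" commit -q --amend -m "Initial commit"

# Read back the message and the number of commits on the branch.
message=$(git -C "$repo" log -1 --format=%s)
count=$(git -C "$repo" rev-list --count HEAD)
echo "message: $message, commits: $count"
```

The commit count stays at one because the original commit is rewritten, not followed by a new one.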

Exception running child : java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.InputSplit

Sometimes, people run into the following exception when using Hadoop 2.

2014-07-14 22:06:13,824 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.InputSplit
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:404)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

The reason is that the map task took the old-API code path (note runOldMapper in the stack trace) even though the split it was given was a new-API one, org.apache.hadoop.mapreduce.lib.input.FileSplit. To solve this problem, you can set the parameter "mapred.mapper.new-api" to true.

Friday, July 18, 2014

How does File split work in Hadoop 2

For MapReduce, each Map task will get a FileSplit (or InputSplit) and process the data for this split.

First, when the JobSubmitter in Hadoop 2 submits a job, it calls the InputFormat (FileInputFormat, for example), which gets a list of FileStatus objects (path, length, block size, ownership and so on) from the input FileSystem. Then the InputFormat calculates the split size and creates a list of FileSplits by calling makeSplit(file, start, length, hosts). After that, the JobSubmitter calls JobSplitWriter to create the split metadata files and persist them.
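The split size calculation can be sketched with plain shell arithmetic. This is only an illustration of the logic as I understand it from FileInputFormat: the split size is max(minSize, min(maxSize, blockSize)), and full-size splits are cut while more than 1.1 times the split size remains. The file length, block size and min/max values below are assumed for the example, and the real code also attaches block hosts to each split, which is left out here.

```shell
# Assumed inputs, in bytes (illustrative values, not read from a cluster).
file_length=$((300 * 1024 * 1024))   # a 300 MB file
block_size=$((128 * 1024 * 1024))    # 128 MB HDFS block size
min_size=1                           # assumed minimum split size
max_size=9223372036854775807         # assumed maximum split size (Long.MAX_VALUE)

# splitSize = max(minSize, min(maxSize, blockSize))
split_size=$(( max_size < block_size ? max_size : block_size ))
split_size=$(( min_size > split_size ? min_size : split_size ))

# Cut full-size splits while more than 1.1 x splitSize is left (the "slop"),
# written with integers as remaining * 10 > splitSize * 11.
splits=""
num_splits=0
remaining=$file_length
while [ $(( remaining * 10 )) -gt $(( split_size * 11 )) ]; do
  start=$(( file_length - remaining ))
  splits="$splits$start+$split_size "
  num_splits=$(( num_splits + 1 ))
  remaining=$(( remaining - split_size ))
done

# Whatever is left becomes the last (possibly shorter or slightly longer) split.
if [ "$remaining" -gt 0 ]; then
  start=$(( file_length - remaining ))
  last_split="$start+$remaining"
  splits="$splits$last_split"
  num_splits=$(( num_splits + 1 ))
fi

# Each entry is printed as start+length.
echo "$num_splits splits: $splits"
```

With these assumed numbers the file is cut into two 128 MB splits plus a 44 MB tail. Because of the 1.1 factor, a file only slightly larger than one split size is kept as a single split instead of producing a tiny trailing one.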

When a Job (JobImpl) goes through its state transition, it gets the TaskSplitMetaInfo by reading the split metadata file. Then it creates the map tasks by calling createMapTasks(job, inputLength, taskSplitMetaInfo) and adds them to a hash map in JobImpl. This only creates the map task objects; it does not actually run them.

Once a map task is scheduled to run, it will use the task split meta info to determine what input data it needs to process.

Hive on Tez

Take Tez-0.4.1-incubating and Hive 0.13 as an example. First, configure Tez by adding the following environment variables to .bashrc.
export TEZ_HOME=$HADOOP_HOME/tez
export TEZ_CONF_DIR=$TEZ_HOME/conf
export TEZ_JARS=$TEZ_HOME/*:$TEZ_HOME/lib/*
export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH
Then, set mapreduce.framework.name to yarn-tez in mapred-site.xml.
To make Hive work with Tez, just set hive.aux.jars.path to "file:///${TEZ_CONF_DIR},file:///${TEZ_JARS},file:///${HIVE_HOME}/auxlib" if Tez is installed on local disk. You also need to create the user's home directory in HDFS, for example, /user/hadoop for user hadoop.
To run Hive on Tez, put "set hive.execution.engine=tez;" at the top of your Hive script. Use "set hive.execution.engine=mr;" if you want to switch back to the MR engine.

Change Tez log level

To change the Tez log level, add the tez.am.log.level property to tez-site.xml. For example:
 <property><name>tez.am.log.level</name><value>DEBUG</value></property>

Tuesday, July 15, 2014

error: org.apache.tez.dag.api.TezException: No running dag at present

I tried to configure Hive over Tez, but kept getting the following error.
error: org.apache.tez.dag.api.TezException: No running dag at present
Later on, I found it was a class path issue: I needed to add TEZ_CONF_DIR and TEZ_JARS to the Hive aux jars path (hive.aux.jars.path). Once that was fixed, Hive ran successfully over Tez. The performance is very impressive. In a TPC-DS test, Tez was 3 times faster than MR.

Wednesday, July 9, 2014

HDFS block locations

You can use the following command to show HDFS block locations, where path is the HDFS file or directory to inspect:
hadoop fsck path -files -blocks -racks