Khoa Le

Problems you may run into when running Hadoop

9/04/13

Errors that occur in Hadoop and its sub-projects

1. Oozie job failed:

Error message : ERROR is considered as FAILED for SLA
Cause 1 : Oozie is not able to find the hadoop namenode (master) or jobtracker machine.
Suppose you are running oozie, hadoop-master and the jobtracker on one machine, while the datanode and tasktracker are running on another machine.

Your file contains the following lines:
In the above case, an FS action will work fine because no map-reduce operation is performed in an FS action. But if you run a map-reduce action, the tasktracker will look for hadoop-master on its own localhost machine, because localhost:9000 is used in the file.
Solution : Use the IP of the hadoop-namenode and jobtracker machine in the file instead of localhost.
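As a rough way to catch this before submitting a job, the sketch below scans the text of a properties-style file for values that point at the local machine. The key names nameNode, jobTracker and queueName, the port 9001, and the function name find_localhost_entries are assumptions made for illustration; only localhost:9000 comes from the text above.

```python
def find_localhost_entries(properties_text):
    """Return (key, value) pairs whose value refers to the local machine."""
    suspicious = []
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (Java properties syntax).
        if not line or line.startswith(("#", "!")):
            continue
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if "localhost" in value.lower() or "127.0.0.1" in value:
            suspicious.append((key, value))
    return suspicious


# Hypothetical file content, used only to exercise the function.
sample = """
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
queueName=default
"""

for key, value in find_localhost_entries(sample):
    print(f"{key} points at the local machine: {value}")
```

Any entry this reports is a candidate for replacement with the real IP or hostname of the namenode or jobtracker machine.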
Cause 2 : Oozie is not able to find the MySQL server.
Suppose MySQL is used as the metastore for Hive.
The Hive hive-default.xml file has the following lines :
<description>JDBC connect string for a JDBC metastore</description>
Solution : Use the IP of the MySQL machine instead of localhost.
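The same kind of check can be sketched for a Hadoop-style XML configuration file. The example below assumes the usual configuration/property/name/value layout and flags every property whose value mentions localhost. The property name javax.jdo.option.ConnectionURL is the Hive setting that carries the description quoted above; the JDBC URL value, the port and the function name localhost_properties are hypothetical.

```python
import xml.etree.ElementTree as ET


def localhost_properties(config_xml):
    """Map property name -> value for every value that mentions localhost."""
    root = ET.fromstring(config_xml)
    flagged = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if "localhost" in value.lower():
            flagged[name] = value
    return flagged


# Hypothetical configuration fragment; the value is an assumption.
sample_xml = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
</configuration>"""

print(localhost_properties(sample_xml))
```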

2. ZooKeeper server not running:
Error message: Could not find my address: zk-serevr1 in list of ZooKeeper quorum servers
Cause : HBase tries to start a ZK server on some machine, but that machine isn’t able to find itself in the hbase.zookeeper.quorum configuration. This is a name lookup problem.

Solution : Use the hostname presented in the error message instead of the value you used (zk-server1). If you have a DNS server, you can set hbase.zookeeper.dns.interface and hbase.zookeeper.dns.nameserver in hbase-site.xml to make sure it resolves to the correct FQDN.
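To see what name a machine reports for itself, and whether that name appears in the quorum list, something like the following can help. This is a simplified sketch: it treats hbase.zookeeper.quorum as a comma-separated list of hostnames and does a plain string comparison, which is not necessarily how HBase performs its own lookup. The quorum value and the function name host_in_quorum are hypothetical.

```python
import socket


def host_in_quorum(hostname, quorum):
    """Check whether hostname appears in a comma-separated quorum string."""
    servers = [s.strip().lower() for s in quorum.split(",") if s.strip()]
    return hostname.strip().lower() in servers


# Hypothetical quorum value, as it might appear in hbase-site.xml.
quorum = "zk-server1,zk-server2,zk-server3"

# Name this machine reports for itself; compare it with the quorum entries.
local_name = socket.getfqdn()
print(local_name, "in quorum:", host_in_quorum(local_name, quorum))
```

If the name printed here is not one of the quorum entries, the quorum setting or the name resolution on that machine is the first place to look.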

3. Hadoop-datanode job failed or datanode not running:
Error message: File ../mapred/system/ could only be replicated to 0 nodes, instead of 1
Cause 1: No datanode is running. Make sure at least one datanode is running.

Cause 2: The namespaceIDs of the master and slave machines are not the same.
If you see the error Incompatible namespaceIDs in the logs of a datanode, chances are you are affected by bug HADOOP-1212 (well, I’ve been affected by it at least).
Solution :
If the namespaceIDs of the master and slave machines are not the same, replace the namespaceID on the slave machines with the master’s namespaceID.
- The dfs/name/current/VERSION file contains the namespaceID of the master machine
- The dfs/data/current/VERSION file contains the namespaceID of the slave machine
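The comparison of the two VERSION files can be sketched as below. It assumes the files use the plain key=value layout with a namespaceID entry; the ID values, the other keys in the samples and the function names are made up for illustration.

```python
def read_namespace_id(version_text):
    """Return the namespaceID value from the text of a VERSION file, or None."""
    for line in version_text.splitlines():
        line = line.strip()
        if line.startswith("namespaceID="):
            return line.split("=", 1)[1].strip()
    return None


def namespace_ids_match(name_version_text, data_version_text):
    """True only if both files carry a namespaceID and the two are equal."""
    name_id = read_namespace_id(name_version_text)
    data_id = read_namespace_id(data_version_text)
    return name_id is not None and name_id == data_id


# Hypothetical contents of dfs/name/current/VERSION and dfs/data/current/VERSION.
name_version = "namespaceID=1394543158\ncTime=0\nstorageType=NAME_NODE\n"
data_version = "namespaceID=308967713\ncTime=0\nstorageType=DATA_NODE\n"

print(namespace_ids_match(name_version, data_version))  # False: the IDs differ
```

In practice the two strings would be read from the files on the master and on each slave.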
Cause 3: The datanode instance is running out of disk space.
Solution : Free some space.

Cause 4 : You may also get this message because of permissions, for example when the JobTracker cannot create the file named in the error message on startup.

4. Sqoop export command failed:
Error message:
attempt_201101151840_1006_m_000001_0, Status : FAILED
at java.util.AbstractList$
at impressions_by_zip.__loadFromFields(
at impressions_by_zip.parse(

Cause : The given field separator is not valid.
Solution : Specify the correct field delimiter in the sqoop export command.
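A quick way to see why a wrong delimiter breaks the parse step is to split a few exported lines with the delimiter you plan to pass (in Sqoop this is normally given with --input-fields-terminated-by) and count the resulting fields. The sketch below does that; the sample rows, the two-column layout and the function names are hypothetical and are not taken from the impressions_by_zip table.

```python
def field_counts(lines, delimiter):
    """Number of fields each line yields when split on delimiter."""
    return [len(line.split(delimiter)) for line in lines]


def delimiter_matches(lines, delimiter, expected_columns):
    """True if every line splits into exactly expected_columns fields."""
    return all(n == expected_columns for n in field_counts(lines, delimiter))


# Hypothetical tab-separated rows for a two-column table.
rows = ["94110\t1520", "10001\t987"]

print(delimiter_matches(rows, ",", 2))   # False: a comma never splits these rows
print(delimiter_matches(rows, "\t", 2))  # True: tab is the real separator
```

When the count is too low, the generated record class runs out of fields while loading a row, which is consistent with a failure inside __loadFromFields.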

5. HBase regionserver not running :

Error message: 2012-01-02 13:48:49,973 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Master rejected startup because clock is out of sync
org.apache.hadoop.hbase.ClockOutOfSyncException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server hadoop-datanode2,60020,1325492317440 has been rejected; Reported time is too far out of sync with master.  Time difference of 206141ms > max allowed of 30000ms

Cause: The clocks of the regionservers are not in sync with the master machine.
Solution: Synchronize the clocks of the HBase master and regionserver machines, for example with NTP.
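To read the size of the skew straight from the log, the two numbers can be pulled out of the message shown above. This is a small sketch that relies on the wording of that one message; the function name parse_clock_skew is an assumption.

```python
import re

# Matches the tail of the ClockOutOfSyncException message quoted above.
SKEW_PATTERN = re.compile(r"Time difference of (\d+)ms > max allowed of (\d+)ms")


def parse_clock_skew(log_line):
    """Return (skew_ms, allowed_ms) from the log line, or None if absent."""
    match = SKEW_PATTERN.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("Reported time is too far out of sync with master.  "
        "Time difference of 206141ms > max allowed of 30000ms")

skew_ms, allowed_ms = parse_clock_skew(line)
print(f"skew is {skew_ms / 1000:.1f}s, allowed {allowed_ms / 1000:.1f}s")
```

Here the regionserver is about 206 seconds away from the master while only 30 seconds are allowed, so the clocks need to be brought back in line before the regionserver will be accepted.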

