Hadoop Common Exceptions: Analysis and Solutions

1. org.apache.hadoop.security.AccessControlException: Permission denied: user=FDC2, access=EXECUTE, inode=job_2_0003:heipark:supergroup:rwx-
Solution: disable HDFS permission checking by adding the following to hdfs-site.xml:
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
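Before turning permission checking off for the whole cluster, it can help to confirm who actually owns the inode the client is denied on. The following is a minimal diagnostic sketch, not part of the original notes; the class name is arbitrary and the path argument is a placeholder.

// Sketch: print owner, group and permission bits of an HDFS path, so you can
// see why a user such as FDC2 is denied access before deciding to disable
// dfs.permissions cluster-wide.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhoOwnsThis {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path(args.length > 0 ? args[0] : "/");   // path to inspect (placeholder)
        FileStatus st = fs.getFileStatus(p);
        System.out.println(p + " owner=" + st.getOwner()
                + " group=" + st.getGroup()
                + " perms=" + st.getPermission());
    }
}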

2. localhost: Error: JAVA_HOME is not set.
Set the JAVA_HOME environment variable in conf/hadoop-env.sh, next to the stock HADOOP_JOBTRACKER_OPTS line:
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
export JAVA_HOME=/usr/customize/java

3. Warning: $HADOOP_HOME is deprecated.
Analysis: Hadoop checks HADOOP_HOME in bin/hadoop-config.sh; the warning simply means that you have also defined the HADOOP_HOME variable yourself. The check is set up here:
# the root of the Hadoop installation
export HADOOP_PREFIX=`dirname "$this"`/..
export HADOOP_HOME=${HADOOP_PREFIX}
and the warning is printed here:
if [ "$HADOOP_HOME_WARN_SUPPRESS" = "" ] && [ "$HADOOP_HOME" != "" ]; then
  echo "Warning: \$HADOOP_HOME is deprecated." 1>&2
fi
Leaving the warning alone does no harm; it does not affect normal operation.
Solution: add export HADOOP_HOME_WARN_SUPPRESS=TRUE to hadoop-env.sh. Note that this has to be added on every node in the cluster.

4. ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException java.io.IOException: File ... could only be replicated to 0 nodes, instead of 1
Analysis: this is a firewall problem; the firewall needs to be turned off.
Solution: first stop the Hadoop cluster, then run:
sudo ufw disable

1: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
Answer: the job needs to open many files at once. The system default limit is usually 1024 (see ulimit -a), which is enough for normal use but too low for Hadoop jobs. Fix: edit two files.
vi /etc/security/limits.conf and add (the actual numbers are not preserved in the source; pick limits that suit your cluster):
* soft nofile <number>
* hard nofile <number>
Then cd /etc/pam.d/, sudo vi login, and add:
session required /lib/security/pam_limits.so
A correction to this first answer: the error is actually raised in the shuffle phase of the reduce pre-processing stage, when the number of failed attempts to fetch the output of a completed map exceeds the limit, which is 5 by default. There can be many causes, such as broken network connections, connection timeouts, poor bandwidth, or blocked ports. With a healthy intra-cluster network this error normally does not occur.

2: Too many fetch-failures
Answer: this is mainly caused by incomplete connectivity between nodes.
1) Check /etc/hosts: the local IP must map to the server name, and the file must contain the IP and hostname of every server in the cluster.
2) Check .ssh/authorized_keys: it must contain the public keys of all servers, including the node itself.

3: Processing is extremely slow; maps finish quickly but reduces are very slow, and reduce=0% appears repeatedly
Answer: apply the checks from item 2, then raise the heap size in conf/hadoop-env.sh:
export HADOOP_HEAPSIZE=4000

4: The datanode starts but cannot be reached and cannot be shut down cleanly
When re-formatting a new distributed file system, you must delete the local path configured as dfs.name.dir on the NameNode (where the NameNode persistently stores the namespace and transaction log), and also delete the dfs.data.dir directories on each DataNode (where the DataNodes store block data). In this configuration that means deleting /home/hadoop/NameData on the NameNode, and /home/hadoop/DataNode1 and /home/hadoop/DataNode2 on the DataNodes. The reason is that when Hadoop formats a new distributed file system, each storage namespace is stamped with the version of its creation time (see the VERSION file under /home/hadoop/NameData/current, which records the version information). When re-formatting, it is best to delete the NameData directory first, and you must delete the dfs.data.dir of every DataNode, so that the version information recorded by the namenode and the datanodes matches. Note: deleting data is dangerous. Never delete anything you cannot account for, and back up everything before deleting!

5: java.io.IOException: Could not obtain block: blk_1100 file=/user/hive/warehouse/src_log/src_log
This usually means a node went down or is not connected.

6: java.lang.OutOfMemoryError: Java heap space
This clearly means the JVM does not have enough memory. Increase the JVM heap of all the datanodes, e.g.
java -Xms1024m -Xmx4096m
As a rule of thumb the maximum JVM heap should be about half of the total memory; our machines have 8 GB, so we set 4096m, though this may still not be the optimal value.

How to add a node to Hadoop
The steps I actually followed:
1. Set up the environment on the new slave first: ssh, jdk, and copies of the relevant config, lib and bin directories;
2. Add the new datanode's host entry to the cluster's namenode and to the other datanodes;
3. Add the new datanode's IP to conf/slaves on the master;
4. Restart the cluster and check that the new datanode shows up;
5. Run bin/start-balancer.sh; this can take a long time.
Notes:
1. Without balancing, the cluster will put all new data on the new node, which lowers MapReduce efficiency;
2. bin/start-balancer.sh can also be run with the -threshold parameter, e.g. -threshold 5. The threshold is the balancing threshold, 10% by default; a lower value makes the nodes more evenly balanced but takes longer;
3. The balancer can also run on a cluster with MapReduce jobs in progress. The default dfs.balance.bandwidthPerSec is quite low (1 MB/s); when no MapReduce jobs are running you can raise it to speed up balancing.
Other notes:
1. Make sure the slave's firewall is turned off;
2. Make sure the new slave's IP has been added to /etc/hosts on the master and the other slaves, and conversely add the master's and the other slaves' IPs to /etc/hosts on the new slave.

Number of mappers and reducers
URL: /hadoop/HowManyMapsAndReduces

HowManyMapsAndReduces: partitioning your job into maps and reduces
Picking the appropriate size for the tasks of your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but increases load balancing and lowers the cost of failures. At one extreme is the 1 map / 1 reduce case where nothing is distributed. The other extreme is to have 1,000,000 maps / 1,000,000 reduces where the framework runs out of resources for the overhead.

Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files, although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps per node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
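As a concrete illustration of the paragraph above (not from the original notes; the class name, job name and paths are placeholders), here is a minimal old-API driver sketch that hints at a map count and raises the lower bound on the split size:

// Sketch only: shows where the map-count knobs live in the old (mapred) API.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapCountHint {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapCountHint.class);
        conf.setJobName("map-count-hint");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Only a hint: the InputFormat still decides the real number of maps.
        conf.setNumMapTasks(100);

        // Raise the lower bound on the split size to 256 MB, so fewer, larger
        // maps are created even if the DFS block size is smaller.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

        JobClient.runJob(conf);
    }
}

With no mapper or reducer class set, this driver falls through to the identity classes; in a real job you would set them before runJob().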

Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via the JobConf's conf.setNumReduceTasks(int num).

My own understanding: the number of mappers depends on the input files and on the file splits. The upper bound for a split is dfs.block.size, the lower bound can be set via mapred.min.split.size, and in the end the InputFormat decides. A good guideline: the right number of reduces seems to be 0.95 or 1.75 multiplied by (number of nodes * mapred.tasktracker.reduce.tasks.maximum); increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures. The relevant property:
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>
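The 0.95 rule of thumb above translates directly into a driver setting. A minimal sketch, not from the original notes; NODES and SLOTS_PER_NODE are assumed values you would replace with your cluster's real figures.

// Sketch: pick the reduce count from 0.95 * (nodes * reduce slots per node).
// SLOTS_PER_NODE corresponds to mapred.tasktracker.reduce.tasks.maximum.
import org.apache.hadoop.mapred.JobConf;

public class ReduceCount {
    public static void main(String[] args) {
        final int NODES = 10;           // number of tasktracker nodes (assumed)
        final int SLOTS_PER_NODE = 2;   // reduce slots per tasktracker (assumed)

        JobConf conf = new JobConf(ReduceCount.class);
        int reduces = (int) Math.floor(0.95 * NODES * SLOTS_PER_NODE);
        conf.setNumReduceTasks(reduces);
        System.out.println("using " + reduces + " reduce tasks");
    }
}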

Adding a new disk to a single node
1. On the node that gets the new disk, change dfs.data.dir to list the new and the old directories, separated by commas;
2. Restart dfs.

Syncing the Hadoop code
In hadoop-env.sh:
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

Merging small HDFS files with a single command:
hadoop fs -getmerge <src> <localdst>
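If you prefer to do the same merge from Java rather than from the shell, FileUtil.copyMerge concatenates the files of a directory into one target file. A minimal sketch, not from the original notes; the source and destination paths are placeholders, and keeping the originals (deleteSource=false) is a choice, not something the notes specify.

// Sketch: merge all files under an HDFS directory into a single local file,
// roughly what "hadoop fs -getmerge" does.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem srcFs = FileSystem.get(conf);        // HDFS
        FileSystem dstFs = FileSystem.getLocal(conf);   // local file system

        Path srcDir = new Path("/user/hive/warehouse/src_log");  // placeholder
        Path dstFile = new Path("/tmp/src_log.merged");          // placeholder

        // deleteSource=false keeps the originals; addString=null appends nothing
        // between the concatenated files.
        FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, false, conf, null);
    }
}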

Restarting reduce jobs (job recovery after a JobTracker restart)
Introduced recovery of jobs when JobTracker restarts. This facility is off by default. Introduced config parameters mapred.jobtracker.restart.recover, mapred.jobtracker.job.history.block.size, and mapred.jobtracker.job.history.buffer.size.
I have not verified this yet.

Problems with IO writes
A DataNode log like the following (node addresses and the block id are partly truncated in the source):
...0-98, infoPort=50075, ipcPort=50020): Got exception while serving blk_-..._1292 to /...65:
java.net.SocketTimeoutException: ... millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/...65:50010 remote=/...65:50930]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
        at java.lang.Thread.run(Thread.java:619)
It seems there are many reasons that it can time out; the example given in HADOOP-3831 is a slow reading client.
Solution: try setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml.
"My understanding is that this issue should be fixed in Hadoop 0.19.1 so that we should leave the standard timeout. However until then this can help resolve issues like the one you're seeing."

Decommissioning HDFS nodes
The help text of dfsadmin in the current version does not explain this clearly (a bug has already been filed). The correct procedure is:
1. Point dfs.hosts at the current slaves file, using the full path. Note that the host names in the list must be the full names, i.e. what uname -n returns.
2. Put the full names of the nodes to be decommissioned in another file, e.g. slaves.ex, and point the dfs.hosts.exclude parameter at the full path of that file.
3. Run bin/hadoop dfsadmin -refreshNodes.
4. In the web UI, or with bin/hadoop dfsadmin -report, the nodes being retired show the state "Decommission in progress" until all the data that needs re-replication has been copied.
5. When that is finished, remove the decommissioned nodes from the slaves file (i.e. the file dfs.hosts points to).
Three other uses of -refreshNodes, by the way:
2. Adding allowed nodes to the list (add the host name to dfs.hosts);
3. Removing a node directly, without re-replicating its data (remove the host name from dfs.hosts);
4. The reverse of decommissioning: for a node listed both in the exclude file and in dfs.hosts that is currently decommissioning, this stops the decommission and turns a "Decommission in progress" node back to Normal ("In Service" in the web UI).

Hadoop learning notes
1. Fixing Hadoop OutOfMemoryError: set in hadoop-site.xml:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx800M -server</value>
</property>
"With the right JVM size in your hadoop-site.xml, you will have to copy this to all mapred nodes and restart the cluster." Alternatively, pass it on the command line:
hadoop jar jarfile [main class] -D mapred.child.java.opts=-Xmx800M
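The -D form above only works if the driver passes generic options through to the job configuration; the usual way to do that is ToolRunner. A minimal sketch, not from the original notes; the class and job names are placeholders.

// Sketch: a driver built on ToolRunner, so that generic options such as
// -D mapred.child.java.opts=-Xmx800M given on the command line end up in the
// job configuration.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MemoryTunedDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D key=value options parsed by ToolRunner.
        JobConf conf = new JobConf(getConf(), MemoryTunedDriver.class);
        conf.setJobName("memory-tuned-job");
        System.out.println("child opts: " + conf.get("mapred.child.java.opts"));
        // ... set input/output paths, mapper and reducer here, then run the job:
        // JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MemoryTunedDriver(), args));
    }
}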

2. Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.
When I used nutch 1.0 I got this error: "Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing."
This one is also easy to deal with: delete conf/log4j.properties and you will see a detailed error report. In my case it was an out-of-memory error, and the fix was to give the main class org.apache.nutch.crawl.Crawl the JVM options -Xms64m -Xmx512m. Your problem may be different, but once you can see the detailed error report it becomes much easier to solve.

Using the distributed cache
The distributed cache behaves like a global variable, but because the data is too large to put in the config file, the distributed cache is used instead. Usage (see The Definitive Guide, p. 240):
1. From the command line: pass -files with the files to be looked up (a local file, or an HDFS file via an hdfs://... URI), or -archives for JAR, ZIP, tar, etc.:
% hadoop jar job.jar MaxTemperatureByStationNameUsingDistributedCacheFile -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
2. In the program:
public void configure(JobConf conf) {
  metadata = new NcdcStationMetadata();
  try {
    metadata.initialize(new File("stations-fixed-width.txt"));
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}
There is also a more indirect way: in hadoop-0.19.0 it seems you can skip calling addCacheFile() or addCacheArchive() to add files, and use getLocalCacheFiles() or getLocalCacheArchives() to obtain the files.
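For completeness, here is how the same idea looks through the DistributedCache API rather than the -files option. A minimal sketch, not from the original notes; the cached file path and the mapper's key/value types are placeholders.

// Sketch of the DistributedCache API route: the driver registers a file in the
// cache, and the mapper reads the localized copy in configure().
import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CacheExample {

    // In the driver, before submitting the job:
    public static void addMetadata(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(
                new URI("/metadata/stations-fixed-width.txt"), conf);  // placeholder path
    }

    // In the mapper, read the localized copy:
    public static class MyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private File metadataFile;

        public void configure(JobConf conf) {
            try {
                Path[] cached = DistributedCache.getLocalCacheFiles(conf);
                metadataFile = new File(cached[0].toString());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter) {
            // ... use metadataFile here ...
        }
    }
}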

Hadoop job web UI
There are web-based interfaces to both the JobTracker (MapReduce master) and the NameNode (HDFS master) which display status pages about the state of the entire system. By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.

Hadoop monitoring
(OnlyXP) Use nagios for alerting and ganglia for monitoring graphs.

"status of 255" error
Error: java.io.IOException: Task process exit with nonzero status of 255. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)
Cause: "Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to higher value. By default, their values are 24 hours. These might be the reason for failure, though I'm not sure."

Split size
FileInputFormat input splits (see The Definitive Guide, p. 190):
mapred.min.split.size: default=1, the smallest valid size in bytes for a file split.
mapred.max.split.size: default=Long.MAX_VALUE, the largest valid size.
dfs.block.size: default=64M, set to 128M on our system.
If the minimum split size is set larger than the block size, each split covers more than one block (my guess: data has to be fetched from other nodes and blocks get merged into a single split). If the maximum split size is set smaller than the block size, blocks are split further. The formula is:
split size = max(minimumSize, min(maximumSize, blockSize))
and normally minimumSize < blockSize < maximumSize, so the split size ends up equal to the block size.
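The formula above is easy to check directly. A tiny sketch, not from the original notes; the three sample settings are examples rather than values from the document.

// Sketch: the split-size rule discussed above, computed for a few settings.
public class SplitSizeDemo {
    // Mirrors max(minimumSize, min(maximumSize, blockSize)).
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 128 * mb;                        // dfs.block.size
        // default case: min=1, max=Long.MAX_VALUE -> split == block size
        System.out.println(splitSize(1, Long.MAX_VALUE, blockSize) / mb + " MB");
        // min split size larger than the block -> larger splits (fewer maps)
        System.out.println(splitSize(256 * mb, Long.MAX_VALUE, blockSize) / mb + " MB");
        // max split size smaller than the block -> blocks split further (more maps)
        System.out.println(splitSize(1, 32 * mb, blockSize) / mb + " MB");
    }
}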

Reduce task progress shows > 100% when the total size of map outputs (for a single reducer) is high
Cause: during the reduce-side merge, the progress check is inaccurate, so the status can exceed 100%, and the statistics code then fails with:
java.lang.ArrayIndexOutOfBoundsException: 3
        at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.getReduceAvarageProgresses(StatusHttpServer.java:228)
        at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.doGet(StatusHttpServer.java:159)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
Related JIRA: (the link is not preserved in the source)

Counters
There are three kinds of counters:
1. Built-in counters: Map input bytes, Map output records, and so on.
2. Enum counters, used like this:
enum Temperature { MISSING, MALFORMED }
reporter.incrCounter(Temperature.MISSING, 1);
Shown in the job output as:
09/04/20 06:33:36 INFO mapred.JobClient:   Air Temperature Records
09/04/20 06:33:36 INFO mapred.JobClient:     Malformed=3
09/04/20 06:33:36 INFO mapred.JobClient:     Missing=...
3. Dynamic counters, used like this:
reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);
Shown in the job output as:
09/04/20 06:33:36 INFO mapred.JobClient:   TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient:     2=...
09/04/20 06:33:36 INFO mapred.JobClient:     1=...
09/04/20 06:33:36 INFO mapred.JobClient:     0=1
(Some of the counter values were not preserved in the source and are shown as "...".)
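To make the two counter styles above concrete, here is a minimal old-API mapper sketch, not from the original notes; the "record is empty" check stands in for real parsing logic and is only a placeholder.

// Sketch: incrementing an enum counter and a dynamic (string-named) counter
// from an old-API mapper.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    enum Temperature { MISSING, MALFORMED }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        String record = value.toString();
        if (record.isEmpty()) {
            // enum counter: shows up under the enum's group in the job output
            reporter.incrCounter(Temperature.MISSING, 1);
            return;
        }
        // dynamic counter: group and counter names are plain strings
        reporter.incrCounter("RecordLength", Integer.toString(record.length()), 1);
        output.collect(new Text(record), new LongWritable(1));
    }
}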

7: Namenode in safe mode
Solution: bin/hadoop dfsadmin -safemode leave

8: java.net.NoRouteToHostException: No route to host
Solution: sudo /etc/init.d/iptables stop

9: After changing the namenode, select queries in hive still point to the old namenode address
This is because "when you create a table, hive actually stores the location of the table (e.g. hdfs://ip:port/user/root/...) in the SDS and DBS tables in the metastore. So when I bring up a new cluster the master has a new IP, but hive's metastore is still pointing to the locations within the old cluster. I could modify the metastore to update with the new IP every time I bring up a cluster. But the easier and simpler solution was to just use an elastic IP for the master."
So either replace every occurrence of the old namenode address in the metastore with the current namenode address, or keep the master on a stable (elastic) IP.

10: Your DataNode is started and you can create directories with bin/hadoop dfs -mkdir, but you get an error message when you try to put files into the HDFS (e.g., when you run a command like bin/hadoop dfs -put).
Solution: go to the HDFS info web page (open your web browser and go to http://namenode:dfs_info_port, where namenode is the hostname of your NameNode and dfs_info_port is the port you chose for dfs.info.port; if you followed the QuickStart on your personal computer this URL will be http://localhost:50070). Once at that page, click on the number where it tells you how many DataNodes you have, to look at the list of DataNodes in your cluster. If it says you have used 100% of your space, then you need to free up room on the local disk(s) of the DataNode(s). If you are on Windows this number will not be accurate (there is some kind of bug either in Cygwin's df.exe or in Windows); just free up some more space and you should be fine.
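If you would rather check datanode capacity programmatically than through the web page, the client API exposes the same figures. A small sketch, not from the original notes; it assumes the default file system in your configuration is HDFS.

// Sketch: print capacity and remaining space for every datanode, the same
// numbers the NameNode web page shows.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeSpace {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (!(fs instanceof DistributedFileSystem)) {
            System.err.println("default file system is not HDFS");
            return;
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            long capacity = node.getCapacity();
            long remaining = node.getRemaining();
            System.out.printf("%s: %.1f GB of %.1f GB free%n",
                    node.getName(), remaining / 1e9, capacity / 1e9);
        }
    }
}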

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论