A search-recommendation job with 1,000-plus map tasks and 200 reduce tasks failed when only one reduce (on 10.39.6.130) was left running, with the following exceptions:
2014-12-04 15:49:04,297 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 12 segments left of total size: 11503294914 bytes
2014-12-04 15:49:04,314 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2014-12-04 15:49:04,394 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.lzo_deflate]
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 from datanode 10.39.5.193:50010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 from datanode 10.39.1.90:50010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/user-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 in pipeline 10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010: bad datanode 10.39.5.193:50010
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/exposure-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 in pipeline 10.39.6.130:50010, 10.39.1.89:50010, 10.39.1.90:50010: bad datanode 10.39.1.90:50010
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
2014-12-04 16:05:23,743 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2734)
    at java.util.Vector.ensureCapacityHelper(Vector.java:226)
    at java.util.Vector.add(Vector.java:728)
    at rec.CommonUtil.pack_Treeset(CommonUtil.java:395)
    at rec.ConvertExposure10$MyReducer.collect_exposure(ConvertExposure10.java:259)
    at rec.ConvertExposure10$MyReducer.reduce(ConvertExposure10.java:329)
    at rec.ConvertExposure10$MyReducer.reduce(ConvertExposure10.java:234)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
The error-recovery messages name the write pipeline for block blk_1394149732_320469086 as 10.39.6.130 → 10.39.5.185 → 10.39.5.193. Checking the DataNode log on the tail node, 10.39.5.193:
2014-12-04 16:00:57,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.193:50010 remote=/10.39.5.185:58225]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
    at java.lang.Thread.run(Thread.java:662)
2014-12-04 16:00:57,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
The key line: 10.39.5.193 sat for 60 seconds waiting to read the next packet from its upstream peer 10.39.5.185 before timing out:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.193:50010 remote=/10.39.5.185:58225]
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1083)
    at java.lang.Thread.run(Thread.java:662)
One hop up the pipeline, the DataNode log on 10.39.5.185 shows the same pattern, this time waiting on 10.39.6.130:
2014-12-04 16:00:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.185:50010 remote=/10.39.6.130:59083]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
    at java.lang.Thread.run(Thread.java:662)
2014-12-04 16:00:58,009 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
The key line: 10.39.5.185 likewise timed out after 60 seconds waiting to read from its upstream peer 10.39.6.130:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.185:50010 remote=/10.39.6.130:59083]
Finally, at the head of the pipeline, the DataNode log on 10.39.6.130 itself:
2014-12-04 16:00:59,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.6.130:50010 remote=/10.39.6.130:45259]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
    at java.lang.Thread.run(Thread.java:662)
The key line: this time the DataNode was not waiting on another DataNode but on a local connection (remote=/10.39.6.130:45259), i.e., on the DFSClient inside the reduce task running on the same machine:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.6.130:50010 remote=/10.39.6.130:45259]
So every node in the pipeline was stuck waiting on its upstream peer, and the ultimate upstream was the DFSClient in the reduce task itself. On 10.39.6.130, we mapped the task id attempt_1415948652989_195149_r_000158_3 to process id 31050 and checked its memory usage:
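(The capture of that check is not preserved here; below is a minimal sketch of how to reproduce it with standard JDK tools, assuming the YarnChild JVM is visible to jps. The grep pattern and the 1-second sampling interval are illustrative, not what was necessarily run at the time.)
    jps -ml | grep attempt_1415948652989_195149_r_000158_3   # locate the YarnChild pid (31050 in this case)
    jstat -gcutil 31050 1000                                 # sample GC stats every second: watch O (old gen %) and FGC (full GC count)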
Sure enough, the JVM's old generation was full and it was running one full GC after another. Full GC stops the world, so the JVM could not get any other work done: the DFSClient hung and never wrote another Packet to the pipeline nodes, until the sockets timed out.
The last line of the reduce task's log corroborates this:
Error running child : java.lang.OutOfMemoryError: Java heap space
Since an OOM is what killed the job, which objects were filling the heap? Execute:
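(The histogram output from the original investigation is not included here; a sketch of the usual command, assuming pid 31050 was still alive, would be:)
    jmap -histo 31050 | head -n 20        # top heap consumers by instance count and bytes
    jmap -histo:live 31050 | head -n 20   # same, but counts only live objects (forces a full GC first)
Whatever the exact histogram showed, the OOM stack trace above already points at an ever-growing java.util.Vector populated inside rec.CommonUtil.pack_Treeset, called from the reducer's collect_exposure.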
To summarize: the exceptions this job hit generally arise in one of the following situations:
1. A DataNode receiving the block has a problem and cannot take writes. For example, we previously saw a DN whose xceivers (the per-DataNode cap on concurrent data-transfer threads) hit the 4096 limit because of local-read problems, exhausting every thread and leaving the DN unable to respond to newly initiated writes (see the configuration sketch after this list).
2. The network misbehaves: a DN's inbound or outbound bandwidth is saturated, so data cannot get in or out. You can check per-node bandwidth in Ganglia; this case is fairly rare. When this job failed we did suspect bandwidth exhaustion, but Ganglia showed the relevant nodes peaking around 85M/s in/out, which ruled bandwidth out.
3. The DFSClient itself has a problem and stays unresponsive for a long time, so the sockets it has already opened time out. DFSClient failure modes are the most varied; in this article's case, the reduce task overflowed its heap, the JVM fell into continuous full GC, the DFSClient hung, and the sockets eventually timed out.
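For case 1, the xceiver cap is set in hdfs-site.xml; a sketch of raising it (the value is illustrative; dfs.datanode.max.xcievers is the older, deprecated name of the same property):
    <property>
      <!-- per-DataNode cap on concurrent data-transfer threads; illustrative value, requires DN restart -->
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value>
    </property>
For case 3, the 60000 millis in the timeouts above matches the default HDFS socket read timeout (dfs.client.socket-timeout, dfs.socket.timeout in older releases). Raising it would only paper over a hung client; the real fix is on the reduce side, e.g., a larger heap via mapreduce.reduce.java.opts or not buffering an entire key's values in memory.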