

Excessive data cached on the reduce side triggers full GC, so the reduce output cannot be written to HDFS

Source: 程序員人生 | Published: 2014-12-18 09:01:30

One of our search-recommendation jobs has 1000+ map tasks and 200 reduce tasks. When only one reduce task was left running (on 10.39.6.130), the following exception appeared and the job failed:

2014-12-04 15:49:04,297 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 12 segments left of total size: 11503294914 bytes
2014-12-04 15:49:04,314 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2014-12-04 15:49:04,394 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.lzo_deflate]
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 from datanode 10.39.5.193:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 from datanode 10.39.1.90:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/user-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 in pipeline 10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010: bad datanode 10.39.5.193:50010
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/exposure-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 in pipeline 10.39.6.130:50010, 10.39.1.89:50010, 10.39.1.90:50010: bad datanode 10.39.1.90:50010
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)

2014-12-04 16:05:23,743 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2734)
	at java.util.Vector.ensureCapacityHelper(Vector.java:226)
	at java.util.Vector.add(Vector.java:728)
	at rec.CommonUtil.pack_Treeset(CommonUtil.java:395)
	at rec.ConvertExposure10$MyReducer.collect_exposure(ConvertExposure10.java:259)
	at rec.ConvertExposure10$MyReducer.reduce(ConvertExposure10.java:329)
	at rec.ConvertExposure10$MyReducer.reduce(ConvertExposure10.java:234)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
From the exception, the first thing we see is that while the reduce task was writing data to HDFS through the write pipeline, it never received an acknowledgement from the last node in the pipeline:
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 from datanode 10.39.5.193:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 from datanode 10.39.1.90:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/user-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 in pipeline 10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010: bad datanode 10.39.5.193:50010
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/exposure-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 in pipeline 10.39.6.130:50010, 10.39.1.89:50010, 10.39.1.90:50010: bad datanode 10.39.1.90:50010
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
Take the write of block blk_1394149732_320469086 as an example. The last DN in its pipeline [10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010] is 10.39.5.193, so let's check the log for this block on 10.39.5.193:
2014-12-04 16:00:57,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.193:50010 remote=/10.39.5.185:58225]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:662)
2014-12-04 16:00:57,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
The log on 10.39.5.193 shows that it kept waiting for a packet from the upstream pipeline node 10.39.5.185 and never received one, until the socket read timed out (60000 ms):
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.193:50010 remote=/10.39.5.185:58225]
Next, let's look at the second node in the pipeline, 10.39.5.185. Its DataNode log shows:
2014-12-04 16:00:57,988 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1083)
        at java.lang.Thread.run(Thread.java:662)
2014-12-04 16:00:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.185:50010 remote=/10.39.6.130:59083]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:662)
2014-12-04 16:00:58,009 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
Similar to the log on 10.39.5.193, this node also timed out while waiting to read a packet, in this case from the first node in the pipeline, 10.39.6.130:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.185:50010 remote=/10.39.6.130:59083]
So the problem appears to be on 10.39.6.130, the node the reduce task is running on. Its DataNode log shows:
2014-12-04 16:00:59,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.6.130:50010 remote=/10.39.6.130:45259]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:662)
However, according to its log, the DataNode on 10.39.6.130 was likewise just waiting for a packet that never arrived before the timeout fired:

java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.6.130:50010 remote=/10.39.6.130:45259]

So the DataNode on 10.39.6.130 is not the culprit either. If none of the three DataNodes in the pipeline is at fault, the problem must be on the DFSClient side: the reduce task never actually pushed the data out at all; it got stuck inside the DFSClient. Next, let's check how the DFSClient, that is, the reduce task process, was doing:

On 10.39.6.130, we used the task attempt id attempt_1415948652989_195149_r_000158_3 to find the process id, 31050, and checked its memory usage:

jstat -gcutil 31050 1000:

  S0     S1     E      O      P     YGC   YGCT    FGC   FGCT     GCT
 68.95   0.00   0.00  92.98  66.32   111  16.825   10   25.419  42.244
 68.95   0.00   0.00  92.98  66.32   111  16.825   10   25.419  42.244
 68.95   0.00   0.00  92.98  66.32   111  16.825   10   25.419  42.244
 68.95   0.00   0.00  92.98  66.32   111  16.825   10   25.419  42.244
 68.95   0.00   0.00  92.98  66.32   111  16.825   10   25.419  42.244
 68.95   0.00   0.00  92.98  66.32   111  16.825   10   25.419  42.244
 68.95   0.00  26.75 100.00  66.32   111  16.825   10   25.419  42.244
  0.00   0.00  31.85 100.00  68.16   111  16.825   10   44.767  61.591
  0.00   0.00  35.37 100.00  68.16   111  16.825   10   44.767  61.591
  0.00   0.00  40.64 100.00  68.16   111  16.825   10   44.767  61.591
  0.00   0.00  45.35 100.00  68.16   111  16.825   10   44.767  61.591
  0.00   0.00  48.87 100.00  68.16   111  16.825   10   44.767  61.591
  0.00   0.00  54.14 100.00  68.16   111  16.825   10   44.767  61.591
  0.00   0.00  58.85 100.00  68.16   111  16.825   10   44.767  61.591

Sure enough: the JVM's old generation is full and it keeps running full GCs, which are stop-the-world pauses, so the JVM cannot do any useful work. The DFSClient is effectively hung and cannot write packets to the pipeline nodes, until the sockets time out.

The final log line of the reduce task corroborates this:

Error running child : java.lang.OutOfMemoryError: Java heap space

Since the job failed because of this OOM, which objects were responsible for eating the heap?

Run:

jmap -histo:live 31050 > jmap.log
cat jmap.log:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      71824177     2872967080  java.util.TreeMap$Entry
   2:      71822939     1723750536  java.lang.Long
   3:         10684       24777776  [B
   4:         47174        6425152  <methodKlass>
   5:         47174        6408120  <constMethodKlass>
   6:          3712        4429776  <constantPoolKlass>
   7:         66100        3979224  <symbolKlass>
   8:          3712        2938192  <instanceKlassKlass>
   9:          3125        2562728  <constantPoolCacheKlass>
  10:          3477        1267752  [I
  11:         12923        1180224  [C
  12:          1794         772488  <methodDataKlass>
  13:         13379         428128  java.lang.String
  14:          4034         419536  java.lang.Class
  15:          6234         410312  [S
  16:          6409         352576  [[I
  17:          7567         242144  java.util.HashMap$Entry
  18:           293         171112  <objArrayKlassKlass>
  19:          4655         148960  java.util.Hashtable$Entry
  20:          1535         135080  java.lang.reflect.Method
  21:           842         121696  [Ljava.util.HashMap$Entry;
There it is: the reduce code uses a TreeMap and stuffs a huge number of objects into it, which causes the OOM. The TreeMap$Entry objects alone occupy about 2.8 GB, while the reduce task's heap is configured at only about 1.5 GB.
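The actual reducer code (rec.ConvertExposure10$MyReducer) is not shown in the post, but a heap dominated by tens of millions of TreeMap$Entry and boxed Long objects matches the familiar pattern of buffering an entire key group in an in-memory sorted map before emitting anything. A minimal sketch of that anti-pattern, using hypothetical class, key, and value types rather than the real rec.* code, and the old org.apache.hadoop.mapred API that the stack trace shows:

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer showing the buffering anti-pattern behind the heap histogram:
// every value of a key is accumulated into an in-memory TreeMap before anything is
// emitted. A hot key with tens of millions of values creates that many TreeMap$Entry
// and boxed Long objects, overflows the ~1.5 GB reduce heap, and leaves the JVM stuck
// in back-to-back full GCs instead of feeding packets to the HDFS write pipeline.
public class BufferingReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
        // Unbounded buffer: one TreeMap$Entry (plus two boxed Longs) per distinct value.
        TreeMap<Long, Long> counts = new TreeMap<Long, Long>();
        while (values.hasNext()) {
            long v = values.next().get();
            Long seen = counts.get(v);
            counts.put(v, seen == null ? 1L : seen + 1L);
        }
        // Nothing is written out until the whole key group has been buffered.
        for (Map.Entry<Long, Long> e : counts.entrySet()) {
            output.collect(key, new LongWritable(e.getValue()));
        }
    }
}

Safer variants of this pattern emit results incrementally, push the ordering into the shuffle via secondary sort, or cap the in-memory structure and flush when it grows too large.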

To summarize, the kind of exception this job hit generally occurs in the following situations:

1. The DataNode receiving the block has a problem and cannot accept the write. For example, we previously had DataNodes whose xceiver count (the maximum number of threads a DataNode uses for concurrent data transfers) hit the 4096 limit because of local-read issues; with every thread exhausted, they could not respond to newly initiated writes.

2. Network problems: a DataNode's inbound or outbound bandwidth is exhausted, so data cannot be written out or received. You can check node bandwidth usage in Ganglia; this case is fairly rare. When this job failed we did suspect saturated bandwidth, but Ganglia showed the relevant nodes peaking at roughly 85 MB/s in/out, so bandwidth was ruled out.

3. The DFSClient itself has a problem and stops responding for a long time, so sockets it has already opened time out. The DFSClient side is the most varied source of trouble; in this case the reduce task ran out of heap, the JVM kept running full GCs, the DFSClient hung, and the sockets eventually timed out. (A sketch of one possible mitigation follows this list.)
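For this particular failure there are two directions for a fix, both hedged since the original reducer code is not shown: stop buffering whole key groups in memory (see the sketch above), or give the reduce JVM enough headroom for the data it genuinely has to hold. A minimal sketch of the latter for a Hadoop 2.x job driver follows; the property names are the standard ones, but the sizes are illustrative, not values taken from the post:

import org.apache.hadoop.mapred.JobConf;

public class ReduceMemoryConfig {
    // Returns the given JobConf with a larger reduce container and heap.
    // Illustrative sizes: a 3 GB YARN container with a 2.5 GB JVM heap inside it,
    // leaving room for off-heap overhead; tune to the actual data volume.
    public static JobConf withLargerReduceHeap(JobConf conf) {
        conf.setInt("mapreduce.reduce.memory.mb", 3072);
        conf.set("mapreduce.reduce.java.opts", "-Xmx2560m");
        return conf;
    }
}

Raising the heap only buys headroom, though; if a single key group can keep growing, the buffering itself still has to be removed or bounded.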







