hadoop - java.io.IOException: Not a data file
I am processing a bunch of Avro files stored in a nested directory structure in HDFS. The files are stored in a year/month/day/hour directory layout.
I wrote this simple code to process them:
    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    val rootDir = "/user/cloudera/rootDir"
    val rdd1 = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](rootDir)
    rdd1.count()
The exception I get is pasted below. The biggest problem I am facing is that it doesn't tell me which file is not a data file; I have to go into HDFS and scan through thousands of files to see which one it is.
Is there a more efficient way to debug/solve this?
    15/11/01 19:01:49 WARN TaskSetManager: Lost task 1084.0 in stage 14.0 (TID 11562, datanode): java.io.IOException: Not a data file.
        at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
        at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:183)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:94)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
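One way to find the offending file(s) without scanning them by hand is to walk the directory tree and test each file's first four bytes against the Avro magic header ("Obj" followed by the byte 0x01). A minimal sketch, assuming the same root path as above and an edge node with the Hadoop classpath available (FindNonAvroFiles is a hypothetical helper, not part of the original job):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object FindNonAvroFiles {
      // Avro data files begin with the 4-byte magic header: 'O', 'b', 'j', 1
      val AvroMagic: Array[Byte] = Array('O'.toByte, 'b'.toByte, 'j'.toByte, 1.toByte)

      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new Configuration())
        // Recursive listing of every file under the root directory
        val files = fs.listFiles(new Path("/user/cloudera/rootDir"), true)
        while (files.hasNext) {
          val status = files.next()
          val in = fs.open(status.getPath)
          val header = new Array[Byte](4)
          // A single read is usually enough for 4 bytes; a robust version would loop
          val bytesRead = try in.read(header, 0, 4) finally in.close()
          if (bytesRead < 4 || !header.sameElements(AvroMagic))
            println(s"Not an Avro data file: ${status.getPath}")
        }
      }
    }

This prints only the paths that fail the magic-byte check, which narrows the search to the bad file(s) instead of all files under the root.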
One of the nodes in the cluster where the block was located was down. The data could not be found because of that, which produces this error. The solution is to repair and bring the nodes back up in the cluster.
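If you suspect missing or corrupt blocks rather than a malformed file, HDFS can report them directly, for example (using the hypothetical root path from the question):

    hdfs fsck /user/cloudera/rootDir -list-corruptfileblocks

fsck lists the files whose blocks currently have no live replicas, which distinguishes a cluster problem from a bad input file.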
I was getting the exact error below from a Java MapReduce program that uses Avro input. Below is a rundown of the issue.
    Error: java.io.IOException: Not a data file.
        at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
        at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:183)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:94)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
I decided to cat the file, because I was able to run the program on another file in the same HDFS folder, and I received the following:
    INFO hdfs.DFSClient: No node available for <block location in cluster> from any node: java.io.IOException: No live nodes contain block BP-6168826450-10.1.10.123-1457116155679:blk_1073853378_112574 after checking nodes = [], ignoredNodes = null. No live nodes contain current block. Block locations: . Dead nodes: . Will get new block locations from namenode and retry...
We had been having problems with our cluster, and unfortunately some nodes were down. After remedying that problem, the error was resolved.
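For reference, the check above is just reading the file back out of HDFS, e.g. (the path is a placeholder following the question's layout):

    hdfs dfs -cat /user/cloudera/rootDir/<year>/<month>/<day>/<hour>/file.avro

If the blocks are unavailable, the cat itself fails with the "No live nodes contain block" error shown above, which points at a cluster problem rather than a malformed file.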