Parse Timeout?

Michael Chen
Hi,

I've been getting a strange timeout exception while parsing a large
sitemap XML document. I've set the timeout in nutch-site.xml to -1 and to
large numbers, and ran "ant clean && ant runtime" before deploying the
parse job, to no avail. Restarting the cluster didn't help either. The
strange thing is that the error happens exactly 30 seconds after the job
starts, so something must be wrong with the config. Here's the log:

2017-08-18 06:05:11,257 WARN [main] org.apache.nutch.parse.ParseUtil: Error parsing https://www.mscdirect.com/detail16.xml
java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:174)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:163)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:146)
        at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:337)
        at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:151)
        at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:88)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2017-08-18 06:05:11,258 WARN [main] org.apache.nutch.parse.ParseUtil: Unable to successfully parse content https://www.mscdirect.com/detail16.xml of type application/xml
2017-08-18 06:05:11,349 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1502934094771_0059_m_000000_0 is done. And is in the process of committing
2017-08-18 06:05:11,398 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1502934094771_0059_m_000000_0' done.
2017-08-18 06:05:11,400 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2017-08-18 06:05:11,401 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2017-08-18 06:05:11,401 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
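
For context, the FutureTask.get frame in the trace suggests the parser runs under a timed wait. Below is a minimal sketch of that pattern, not the actual Nutch code: the class and helper names are hypothetical, the key "parser.timeout" and the 30-second fallback are assumptions chosen to match the delay I'm seeing, and treating a non-positive value as "no limit" is likewise an assumption. It shows how a timeout would stay at the fallback value if the override never reaches the job's configuration.

```java
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names and defaults are illustrative assumptions.
class ParseTimeoutSketch {
    // Assumed property key and fallback (30 matches the delay seen in the log).
    static final String KEY = "parser.timeout";
    static final int DEFAULT_TIMEOUT_SECONDS = 30;

    // Returns the configured timeout, or the fallback when the key is absent,
    // which is what happens if the site override is not visible to the job.
    public static int resolveTimeout(Properties conf, String key) {
        String raw = conf.getProperty(key);
        return raw == null ? DEFAULT_TIMEOUT_SECONDS : Integer.parseInt(raw.trim());
    }

    // Runs the task on a worker thread. Returns null if it exceeds the timeout;
    // a non-positive timeout waits indefinitely (assumption of this sketch).
    public static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(task);
            if (timeout <= 0) {
                return future.get();
            }
            try {
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the slow task
                return null;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        // No override present: the fallback value is used.
        System.out.println("timeout without override: " + resolveTimeout(conf, KEY));

        conf.setProperty(KEY, "-1");
        System.out.println("timeout with override: " + resolveTimeout(conf, KEY));

        // A task that outlasts its limit is reported as timed out.
        String result = runWithTimeout(() -> {
            Thread.sleep(2000);
            return "parsed";
        }, 200, TimeUnit.MILLISECONDS);
        System.out.println(result == null ? "timed out" : result);
    }
}
```

If the real code behaves like this, a failure at exactly 30 seconds would mean the value from nutch-site.xml is not the one the map task reads.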

The Hadoop version is 2.6.0 and the Nutch version is 2.x, running on a
5-node AWS cluster managed by Cloudera Manager.

Any help would be appreciated, thanks!

Michael