[jira] [Commented] (NUTCH-2071) A parser failure on a single document may fail crawling job if parser.timeout=-1

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537099#comment-16537099 ]

ASF GitHub Bot commented on NUTCH-2071:
---------------------------------------

sebastian-nagel opened a new pull request #358: NUTCH-2071 A parser failure on a single document may fail crawling job if parser.timeout=-1
URL: https://github.com/apache/nutch/pull/358
 
 
   - also catch any Throwable if parser.timeout == -1 (in that case the parser is called directly, not via the ExecutorService)
   - improve log message: show the full class name of the called parser

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


>  A parser failure on a single document may fail crawling job if parser.timeout=-1
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-2071
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2071
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Arkadi Kosmynin
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.14, 1.15
>
>         Attachments: NUTCH-2071.diff
>
>
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
>         <...>
> Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
>         at java.lang.ClassLoader.defineClass1(Native Method)
>         at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>         at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
>
> Suggested fix in ParseUtil:
>
> Replace
>
>     if (maxParseTime != -1)
>       parseResult = runParser(parsers[i], content);
>     else
>       parseResult = parsers[i].getParse(content);
>
> with
>
>     try {
>       if (maxParseTime != -1)
>         parseResult = runParser(parsers[i], content);
>       else
>         parseResult = parsers[i].getParse(content);
>     } catch (Throwable e) {
>       LOG.warn("Parsing " + content.getUrl() + " with "
>           + parsers[i].getClass().getName() + " failed: " + e.getMessage());
>       parseResult = null;
>     }
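
The failure in the stack trace is an IncompatibleClassChangeError, which is a
java.lang.Error and not an Exception, so a "catch (Exception e)" around the
direct parser call would not stop it from failing the job. The following is a
minimal, self-contained sketch of the idea behind the suggested fix, not the
actual Nutch code: DocParser, BrokenParser, EchoParser, parseSafely and the
warnings list are hypothetical stand-ins for Nutch's Parser, its plugins,
ParseUtil and the logger. Unlike the suggested fix, the sketch logs the whole
Throwable rather than only getMessage(), purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SafeParseSketch {

    /** Hypothetical stand-in for a parser plugin interface. */
    interface DocParser {
        String getParse(String content);
    }

    /** Collected warnings; a real implementation would use a logger instead. */
    static final List<String> warnings = new ArrayList<>();

    /**
     * Calls the parser directly (the parser.timeout == -1 case) and turns any
     * Throwable, including Errors such as linkage failures, into a null result
     * so that one bad document does not abort the whole run.
     */
    static String parseSafely(DocParser parser, String url, String content) {
        try {
            return parser.getParse(content);
        } catch (Throwable t) {
            warnings.add("Parsing " + url + " with "
                + parser.getClass().getName() + " failed: " + t);
            return null;
        }
    }

    /** Simulates a parser that hits a class-loading problem at parse time. */
    static class BrokenParser implements DocParser {
        @Override
        public String getParse(String content) {
            throw new IncompatibleClassChangeError("simulated linkage failure");
        }
    }

    /** Trivial working parser used as the fallback. */
    static class EchoParser implements DocParser {
        @Override
        public String getParse(String content) {
            return content.trim();
        }
    }

    public static void main(String[] args) {
        DocParser[] parsers = { new BrokenParser(), new EchoParser() };
        String result = null;
        // Try each parser in turn and stop at the first one that succeeds.
        for (DocParser p : parsers) {
            result = parseSafely(p, "http://example.org/doc", "  hello  ");
            if (result != null) {
                break;
            }
        }
        System.out.println("result: " + result);            // prints "result: hello"
        System.out.println("warnings: " + warnings.size()); // prints "warnings: 1"
    }
}
```

Catching Throwable is a deliberate trade-off here: it also swallows serious
conditions such as OutOfMemoryError, so a narrower catch (for example
Exception plus LinkageError) may be preferable where that distinction matters.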



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)