[jira] [Reopened] (NUTCH-2071) A parser failure on a single document may fail crawling job

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reopened NUTCH-2071:
------------------------------------

See also [discussion @dev|https://lists.apache.org/thread.html/d64cf8d04f73fbe253a9cf1988f69b46e77cf63a937af476b7649d2a@%3Cdev.nutch.apache.org%3E] and NUTCH-1993. I misunderstood the objective: it is not about the error shown in the provided stack trace, which has been fixed in parse-tika. The objective is
{quote}because people may use their own or third party parsers and Nutch should be protected from parsers problems{quote}

>  A parser failure on a single document may fail crawling job
> ------------------------------------------------------------
>
>                 Key: NUTCH-2071
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2071
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Arkadi Kosmynin
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.14, 1.15
>
>         Attachments: NUTCH-2071.diff
>
>
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
>         <...>
> Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
>                 at java.lang.ClassLoader.defineClass1(Native Method)
>                 at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>                 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>                 at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>                 at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>                 at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>                 at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>                 at java.security.AccessController.doPrivileged(Native Method)
>                 at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>                 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>                 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>                 at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
>                 at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
>                 at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
> Suggested fix in ParseUtil:
> Replace
>     if (maxParseTime != -1)
>       parseResult = runParser(parsers[i], content);
>     else
>       parseResult = parsers[i].getParse(content);
> with
>     try {
>       if (maxParseTime != -1)
>         parseResult = runParser(parsers[i], content);
>       else
>         parseResult = parsers[i].getParse(content);
>     } catch (Throwable e) {
>       LOG.warn("Parsing " + content.getUrl() + " with " + parsers[i].getClass().getName() + " failed: " + e.getMessage());
>       parseResult = null;
>     }
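
The idea behind the suggested fix can be tried outside Nutch with a small standalone sketch. Everything in it is hypothetical and for illustration only: `SafeParseSketch`, `DocParser`, and `parseSafely` are made-up names, not Nutch classes, and the "parsers" are plain lambdas. It only shows that catching Throwable around each parser call lets a later parser (or the caller) continue after one parser throws an Error such as IncompatibleClassChangeError.

```java
import java.util.Arrays;
import java.util.List;

public class SafeParseSketch {

    /** Hypothetical stand-in for a parser plugin; not the Nutch Parser interface. */
    interface DocParser {
        String parse(String content) throws Exception;
    }

    /**
     * Tries each parser in order and returns the first non-null result.
     * Any Throwable raised by a parser (including Errors such as
     * IncompatibleClassChangeError) is logged and treated as "no result",
     * so one failing parser does not abort the caller.
     */
    static String parseSafely(String url, String content, List<DocParser> parsers) {
        for (DocParser parser : parsers) {
            String result;
            try {
                result = parser.parse(content);
            } catch (Throwable t) {
                System.err.println("Parsing " + url + " with "
                        + parser.getClass().getName() + " failed: " + t);
                result = null;
            }
            if (result != null) {
                return result;
            }
        }
        return null; // no parser produced a result
    }

    public static void main(String[] args) {
        // Simulates a third-party parser that fails with an Error, not an Exception.
        DocParser broken = c -> { throw new IncompatibleClassChangeError("simulated"); };
        DocParser working = c -> c.toUpperCase();

        // The broken parser is skipped and the working one supplies the result.
        System.out.println(parseSafely("http://example.com/", "hello",
                Arrays.asList(broken, working)));
        // With only the broken parser, the result is null instead of a crash.
        System.out.println(parseSafely("http://example.com/", "hello",
                Arrays.asList(broken)));
    }
}
```

Note that catching Throwable also swallows conditions such as OutOfMemoryError, so whether that is acceptable inside a long-running job is a design choice rather than a given.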



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)