[jira] Created: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

[jira] Created: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

JIRA jira@apache.org
OutlinkExtractor extremely slow on some non-plain text
------------------------------------------------------

         Key: NUTCH-150
         URL: http://issues.apache.org/jira/browse/NUTCH-150
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
 Environment: All
    Reporter: Paul Baclace
    Priority: Minor


While using MIME settings that aggressively parsed everything by default, rather than having conf/parse-plugins.xml associate parse-default with *, some parse tasks took an extremely long time to finish.  For instance, a single PostScript file took 9 hours to parse.  Stack traces showed the time being spent in OutlinkExtractor.getOutlinks(...), inside the regular-expression match() call.

Analysis:  The regular-expression matching in OutlinkExtractor.getOutlinks(...) hits pathological cases with extremely long runtimes when it processes non-plain-text content.

Workaround 1:  Avoid treating non-plain-text content, especially PostScript files, as text or HTML.

Workaround 2:  kill -SIGQUIT the child TaskRunner process.  This interrupts the match(), and the process continues.  It might need to be done more than once.  (In theory, SIGQUIT is not supposed to do this, but in practice it does.)
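
As an illustration of Workaround 1, here is a minimal sketch of a guard that lets only text-like content reach a regex-based link extractor. The class and method names (PlainTextGuardSketch, looksLikeText) are hypothetical and are not part of Nutch; in Nutch itself this decision is made through the MIME and parse-plugin configuration, not through code like this.

```java
import java.util.Locale;

// Hypothetical helper, not Nutch code: decides from a Content-Type value
// whether the content is safe to hand to a regex-based outlink extractor.
class PlainTextGuardSketch {

    // Returns true only for "text/*" media types (text/plain, text/html, ...).
    // Anything else, such as application/postscript, is rejected.
    static boolean looksLikeText(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.startsWith("text/");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeText("text/html; charset=UTF-8"));
        System.out.println(looksLikeText("application/postscript"));
    }
}
```

A caller would skip outlink extraction entirely whenever looksLikeText(...) returns false.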



--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]

Paul Baclace updated NUTCH-150:
-------------------------------

    Attachment: OutlinkExtractor.java.patch

This patch makes three changes:

1. Adds a comment noting that non-plain-text input can be a problem.
2. Adds quantifiers to the regular expression to limit the length of matched text.
3. Monitors the time spent matching; once more than 60 seconds have elapsed, it stops looking for additional matches (this does not prevent a lengthy first match).
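
The second and third changes can be sketched roughly as follows. This is not the contents of OutlinkExtractor.java.patch: the pattern, the length limits (15 and 2000), and the names BoundedOutlinkSketch and extract are assumptions chosen for illustration. Only the 60-second budget comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; the real patch and the real Nutch pattern differ.
class BoundedOutlinkSketch {

    // Hypothetical URL pattern. The bounded quantifiers {0,15} and {1,2000}
    // cap how much text a single match attempt is allowed to consume.
    private static final Pattern URL = Pattern.compile(
        "[A-Za-z][A-Za-z0-9+.-]{0,15}://[^\\s<>\"']{1,2000}");

    // Collects URL-like strings from text, giving up on further matches once
    // budgetMillis has elapsed. The check runs between matches, so it cannot
    // cut short a single find() call that is itself slow.
    static List<String> extract(CharSequence text, long budgetMillis) {
        List<String> urls = new ArrayList<>();
        long start = System.currentTimeMillis();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
            if (System.currentTimeMillis() - start > budgetMillis) {
                break; // time budget exhausted: stop looking for more links
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "see http://example.com/a and ftp://host/file.txt end";
        System.out.println(extract(text, 60_000L)); // 60-second budget
    }
}
```

Because the elapsed time is checked only after each successful find(), the sketch shares the limitation noted in item 3: a pathological first match still runs to completion.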




[jira] Resolved: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
     
Doug Cutting resolved NUTCH-150:
--------------------------------

    Fix Version: 0.7.2-dev
     Resolution: Fixed

I just committed this.  Thanks, Paul!

