[jira] Created: (NUTCH-182) Log when db.max configuration limits reached

[jira] Created: (NUTCH-182) Log when db.max configuration limits reached

Steve Loughran (Jira)
Log when db.max configuration limits reached
--------------------------------------------

         Key: NUTCH-182
         URL: http://issues.apache.org/jira/browse/NUTCH-182
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Versions: 0.8-dev    
    Reporter: Matt Kangas
    Priority: Trivial


Followup to http://www.nabble.com/Re%3A-Can%27t-index-some-pages-p2480833.html

There are three "db.max" parameters currently in nutch-default.xml:
 * db.max.outlinks.per.page
 * db.max.anchor.length
 * db.max.inlinks

Values that are too low can cause a site to be under-crawled. However, nothing is currently written to the log when these limits are hit, so users have to guess when they need to raise them.

I suggest that we add three new log messages at the appropriate points:
 * "Exceeded db.max.outlinks.per.page for URL "
 * "Exceeded db.max.anchor.length for URL "
 * "Exceeded db.max.inlinks for URL "
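For illustration only, here is a minimal sketch of what logging at each cutoff point might look like. It is not the Nutch source: the class name DbMaxLimitLogging, the method names, and the use of java.util.logging are all assumptions made for this example. Only the three message strings come from the proposal above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical illustration; not the Nutch source or the attached patches.
final class DbMaxLimitLogging {
    private static final Logger LOG =
        Logger.getLogger(DbMaxLimitLogging.class.getName());

    // Keeps at most maxOutlinks links and logs once if any were dropped.
    static List<String> limitOutlinks(String url, List<String> outlinks, int maxOutlinks) {
        if (outlinks.size() <= maxOutlinks) {
            return outlinks;
        }
        LOG.info("Exceeded db.max.outlinks.per.page for URL " + url);
        return new ArrayList<>(outlinks.subList(0, maxOutlinks));
    }

    // Truncates anchor text to maxAnchorLength characters and logs if it was cut.
    static String limitAnchor(String url, String anchor, int maxAnchorLength) {
        if (anchor.length() <= maxAnchorLength) {
            return anchor;
        }
        LOG.info("Exceeded db.max.anchor.length for URL " + url);
        return anchor.substring(0, maxAnchorLength);
    }

    // Returns true if another inlink may be recorded; logs when the cap is reached.
    static boolean acceptInlink(String url, int currentInlinks, int maxInlinks) {
        if (currentInlinks < maxInlinks) {
            return true;
        }
        LOG.info("Exceeded db.max.inlinks for URL " + url);
        return false;
    }

    public static void main(String[] args) {
        List<String> links = List.of("a", "b", "c");
        // Three links with a limit of two: one link is dropped and a message is logged.
        System.out.println(limitOutlinks("http://example.com/", links, 2));
    }
}
```

With messages like these in the log, a user who sees them repeatedly for one site has a concrete signal that the corresponding db.max value may be too low.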

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-182) Log when db.max configuration limits reached

Steve Loughran (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-182?page=all ]

Matt Kangas updated NUTCH-182:
------------------------------

    Attachment: ParseData.java.patch
                LinkDb.java.patch

Two patches are attached for nutch/trunk (0.8-dev).

LinkDb.java.patch adds two new LOG.info() statements:
 * "Exceeded db.max.anchor.length for URL <url>"
 * "Exceeded db.max.inlinks for URL <url>"

ParseData.java.patch adds a private static LOG variable, plus one LOG.info() statement:
 * "Exceeded db.max.outlinks.per.page"

I would have preferred to print the URL too on the latter, but it's not available in the method where the cutoff is performed (afaik).
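As a rough sketch of the limitation described above, and of one possible way around it, the method that performs the cutoff could stay unaware of the URL while a caller that does know the URL compares sizes and logs the fuller message. This is not what the attached patches do. OutlinkCutoff, truncate and truncateAndLog are hypothetical names, and java.util.logging is assumed here only to keep the example self-contained.

```java
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch; class and method names are not from Nutch.
final class OutlinkCutoff {
    private static final Logger LOG =
        Logger.getLogger(OutlinkCutoff.class.getName());

    // Stands in for the method that performs the cutoff:
    // it sees the links but not the URL of the page they came from.
    static List<String> truncate(List<String> outlinks, int max) {
        return outlinks.size() > max ? outlinks.subList(0, max) : outlinks;
    }

    // A caller that does know the URL compares sizes and logs the fuller message.
    static List<String> truncateAndLog(String url, List<String> outlinks, int max) {
        List<String> kept = truncate(outlinks, max);
        if (kept.size() < outlinks.size()) {
            LOG.info("Exceeded db.max.outlinks.per.page for URL " + url);
        }
        return kept;
    }
}
```

Whether a caller with access to the URL exists at a convenient point in the real code path is not something this thread establishes.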

