[jira] [Commented] (NUTCH-1106) Options to skip url's based on length

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538409#comment-16538409 ]

ASF GitHub Bot commented on NUTCH-1106:
---------------------------------------

sebastian-nagel opened a new pull request #359: NUTCH-1106 Options to skip url's based on length
URL: https://github.com/apache/nutch/pull/359
 
 
   - add property db.max.outlink.length to limit length of outlinks and redirects (default = 8192 characters)
   - check length in ParseOutputFormat and FetcherThread for outlinks and redirects
   - also add a rule (not active, commented out) to regex-urlfilter.txt.template
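
The length check described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class `OutlinkLengthFilter`, its methods, and the convention that a negative limit means "no limit" are all hypothetical. Only the default of 8192 characters comes from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not Nutch's actual implementation.
final class OutlinkLengthFilter {

    // Mirrors the default stated for db.max.outlink.length in the description.
    static final int DEFAULT_MAX_LENGTH = 8192;

    // Returns true if the URL exceeds the limit.
    // Assumption of this sketch: a negative limit disables the check.
    static boolean isTooLong(String url, int maxLength) {
        return maxLength >= 0 && url.length() > maxLength;
    }

    // Keeps only the URLs that do not exceed the limit, preserving order.
    static List<String> keepShortEnough(List<String> urls, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (!isTooLong(url, maxLength)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build one URL that is longer than the default limit.
        StringBuilder longUrl = new StringBuilder("https://example.org/");
        while (longUrl.length() <= DEFAULT_MAX_LENGTH) {
            longUrl.append('a');
        }
        List<String> urls = Arrays.asList("https://example.org/", longUrl.toString());
        System.out.println(keepShortEnough(urls, DEFAULT_MAX_LENGTH));
    }
}
```

Applying the same check to both outlinks and redirect targets, as the description suggests, would keep overly long URLs out regardless of how they were discovered.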
   



> Options to skip url's based on length
> -------------------------------------
>
>                 Key: NUTCH-1106
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1106
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>         Attachments: NUTCH-1106-1.4-1.patch
>
>
> Adds an option to skip URLs exceeding a certain length. At first we used a regex to impose this limit, but having this option configurable is more convenient. Comments?
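
For comparison, the regex-based approach mentioned in the issue description might look like the sketch below. The class `RegexLengthRule` and the exact pattern are assumptions made for illustration; the issue does not show the regex that was originally used.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of a regex-based length limit; not taken from Nutch.
final class RegexLengthRule {

    private final Pattern tooLong;

    // Builds a pattern matching any single-line string longer than maxLength characters.
    RegexLengthRule(int maxLength) {
        this.tooLong = Pattern.compile("^.{" + (maxLength + 1) + ",}$");
    }

    // Returns true if the URL should be skipped because it is too long.
    boolean rejects(String url) {
        return tooLong.matcher(url).find();
    }

    public static void main(String[] args) {
        RegexLengthRule rule = new RegexLengthRule(30);
        System.out.println(rule.rejects("https://example.org/"));
        System.out.println(rule.rejects("https://example.org/a/very/long/path/segment"));
    }
}
```

With this approach the limit is baked into the pattern itself, so changing it means editing the rule, whereas a dedicated configuration property keeps the limit in one visible place.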


