More fetcher speed increases

More fetcher speed increases

Doug Cook
Hi, folks,

I, too, was slowed down by reduce operations in fetch. Some benchmarking showed that in my case, the limiting operation was filtering. A distant second was the time spent calculating Levenshtein distances, presumably part of the spellchecking that Sami just removed to speed things up, though I haven't looked at that yet.

I've fixed the problem, and my reduce speed is better by about a factor of three. However, the fix is limited to certain usage patterns.

In my case, I have tens of thousands of sites and subsites I'm crawling, and I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I essentially use the prefix filter to limit to the set of sites, and then automaton to pattern-match within those sites. I only have subsite matches on < 10% of the sites, however, so I was clearly wasting a lot of time running the automaton patterns on URLs from sites that didn't need them. And automaton, though much faster than RegexURLFilter, is still dog-slow with that many patterns.

A simple fix was to extend the current "AND all the filters together" model to have the notion of a "short-circuit" match, which allows a filter to say "let this URL through and DON'T run the other filters" by returning a special token to URLFilters. Now I have a version of PrefixURLFilter that can return both "normal" matches and "short-circuit" matches, and only returns "normal" matches for those sites that need to run subsite patterns. It seems to work well: the overhead is negligible when not in use, and the speedup is massive for my usage pattern.
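As a rough illustration of the short-circuit idea described above, here is a minimal Java sketch. It is not the actual patch and does not use the real Nutch URLFilter API: the `Verdict` enum, the `Filter` interface, the `ShortCircuitChain` class, and the example.com URLs are all hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a filter chain with a short-circuit verdict. */
final class ShortCircuitChain {

    /** Outcome of one filter. ACCEPT_FINAL plays the role of the "special token". */
    enum Verdict { REJECT, ACCEPT, ACCEPT_FINAL }

    /** Illustrative filter interface (an assumption, not Nutch's URLFilter). */
    interface Filter {
        Verdict check(String url);
    }

    /** Returns the URL if it passes the chain, or null if any filter rejects it. */
    static String filter(List<Filter> filters, String url) {
        for (Filter f : filters) {
            Verdict v = f.check(url);
            if (v == Verdict.REJECT) {
                return null;            // classic AND semantics: one rejection is enough
            }
            if (v == Verdict.ACCEPT_FINAL) {
                return url;             // short-circuit: skip all remaining filters
            }
        }
        return url;                     // every filter returned a normal ACCEPT
    }

    public static void main(String[] args) {
        // Cheap prefix check: whole-site matches short-circuit, sites with
        // subsite rules return a normal match so the next filter still runs.
        Filter prefix = url -> {
            if (url.startsWith("http://plain.example.com/")) return Verdict.ACCEPT_FINAL;
            if (url.startsWith("http://subsite.example.com/")) return Verdict.ACCEPT;
            return Verdict.REJECT;
        };
        // Stand-in for the expensive pattern filter.
        Filter pattern = url -> url.contains("/docs/") ? Verdict.ACCEPT : Verdict.REJECT;

        List<Filter> chain = Arrays.asList(prefix, pattern);
        System.out.println(filter(chain, "http://plain.example.com/anything"));
        System.out.println(filter(chain, "http://subsite.example.com/docs/a"));
        System.out.println(filter(chain, "http://subsite.example.com/other"));
    }
}
```

In this sketch the expensive second filter only ever sees URLs from sites that returned a normal match, which is where the savings would come from when most sites have no subsite patterns.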

I'd like to contribute it back, if people would find this useful (not that it's rocket science!).

First, is there anyone out there besides me who would find this useful?

Second, I've been thinking about the best way to handle PrefixURLFilter configuration. I can see a few options:

1. Have two different config files, one for "normal" matches, and one for "short-circuit" matches.
2. Have one config file, with a syntax to say "make this pattern a short-circuit match," and make the default be a "normal" match, so it is backwards compatible with the current version.
3. Make a new type of filter which internally combines Prefix and Automaton, takes one config file, and decides internally which patterns should generate automaton inputs vs "normal" or "short-circuit" prefix matches.

Approach #3 requires no changes to the URLFilter model, and makes it difficult to screw up by making config files which are inconsistent (e.g. forgetting to put in a prefix pattern for one of the automaton patterns). It is also the least flexible, requires the most code, and introduces yet another kind of filter.

I tend to like the changed URLFilter model; it's more flexible, even if it requires a little more care in configuration (a simple Perl script, in my case, to generate the config files correctly and consistently). I'm leaning towards approach #2. I'm thinking something simple, syntax-wise, like putting SHORTCIRCUIT: before the patterns which should short-circuit. Any suggestions for a better syntax? Or reasons why I should consider a different approach?
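To make the proposed syntax for approach #2 concrete, here is a small Java sketch of how such a config file might be parsed. It is only an illustration under assumptions: the one-prefix-per-line layout, the `#` comment handling, the `PrefixConfigSketch` class, and the example.com URLs are hypothetical, and the syntax in the eventual patch may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for a prefix config with an optional SHORTCIRCUIT: marker. */
final class PrefixConfigSketch {

    static final String MARKER = "SHORTCIRCUIT:";

    /**
     * Parses one prefix per line. Returns prefix -> true for short-circuit
     * matches and prefix -> false for normal matches. Lines without the
     * marker stay normal, so an unmarked (old-style) file behaves as before.
     */
    static Map<String, Boolean> parse(String config) {
        Map<String, Boolean> prefixes = new LinkedHashMap<>();
        for (String raw : config.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;               // assumption: blank lines and # comments are ignored
            }
            boolean shortCircuit = line.startsWith(MARKER);
            String prefix = shortCircuit ? line.substring(MARKER.length()).trim() : line;
            if (!prefix.isEmpty()) {
                prefixes.put(prefix, shortCircuit);
            }
        }
        return prefixes;
    }

    public static void main(String[] args) {
        String config = String.join("\n",
            "# sites with subsite patterns: normal match, later filters still run",
            "http://subsite.example.com/",
            "# whole-site matches: skip the remaining filters",
            "SHORTCIRCUIT:http://plain.example.com/");
        parse(config).forEach((prefix, sc) ->
            System.out.println((sc ? "short-circuit " : "normal        ") + prefix));
    }
}
```

Treating unmarked lines as normal matches is what keeps this variant backwards compatible with existing config files.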

Doug

Re: More fetcher speed increases

scott green
Hi Doug,

Your idea about combining PrefixURLFilter and AutomatonURLFilter
sounds interesting. Could you please attach the patch to JIRA? Thanks.

- Scott

On 11/17/06, Doug Cook <[hidden email]> wrote:

> [...]

Re: More fetcher speed increases

Doug Cook

Done. See http://issues.apache.org/jira/browse/NUTCH-409

This is my first Nutch contribution, so hopefully I've got it right ;-) Any suggestions/questions/feedback welcome.

Hope this is useful to others.

D
