urlfilter-db plugin usage...

urlfilter-db plugin usage...

bparker
I have approx. 14k urls (in a mysql db) that I extracted from the dmoz
content file. I intended to dump it to a txt file in order to seed the webDb
using:  bin/nutch inject db/ -urlfile urls.txt

I was wondering... If instead, I did a whole web crawl using the full dmoz
content file, but filtered it using the urlfilter-db plugin, using my 14k
urls in mysql.... would I obtain similar results?

I am a bit unsure as to what is going on under the hood, so I am looking for
the best approach.  If they do in fact give similar results, is one more
efficient, etc.?

Thanks for any/all advice!
Brent Parker


RE: urlfilter-db plugin usage...

Richard Braman
> I was wondering... If instead, I did a whole web crawl using the full
> dmoz content file, but filtered it using the urlfilter-db plugin, using
> my 14k urls in mysql.... would I obtain similar results?

My gut tells me this has to be slower. I would put the URLs in the URL
db; the fewer URLs you have in the filter the better, because the filter
uses regular expressions to check each URL it comes across against the
list. With 14 thousand regexps it would be very slow.
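Richard's point can be sketched in a small, hypothetical Java snippet (this is not Nutch code; the URL names and sizes are made up for illustration): checking a URL against a list of N compiled regex patterns is linear in N per URL, while an exact-URL hash set answers in roughly constant time.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical cost illustration: regex-list filtering scans every
// pattern per URL; a hash set of exact URLs does a single lookup.
public class FilterCost {
    static boolean matchesRegexList(List<Pattern> patterns, String url) {
        for (Pattern p : patterns) {        // worst case: try all 14k patterns
            if (p.matcher(url).matches()) return true;
        }
        return false;
    }

    static boolean matchesSet(Set<String> urls, String url) {
        return urls.contains(url);          // single hash lookup
    }

    public static void main(String[] args) {
        int n = 14_000;                     // roughly the size of the DMOZ list
        List<Pattern> patterns = new ArrayList<>();
        Set<String> urls = new HashSet<>();
        for (int i = 0; i < n; i++) {
            String url = "http://site" + i + ".example/";
            patterns.add(Pattern.compile(Pattern.quote(url)));
            urls.add(url);
        }
        // A URL near the end of the list is the regex filter's worst case.
        String candidate = "http://site13999.example/";

        long t0 = System.nanoTime();
        boolean viaRegex = matchesRegexList(patterns, candidate);
        long regexNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        boolean viaSet = matchesSet(urls, candidate);
        long setNs = System.nanoTime() - t0;

        System.out.println("regex list: " + regexNs + " ns (match=" + viaRegex
                + "), set: " + setNs + " ns (match=" + viaSet + ")");
    }
}
```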


-----Original Message-----
From: Brent Parker [mailto:[hidden email]]
Sent: Tuesday, February 28, 2006 6:43 PM
To: [hidden email]
Subject: urlfilter-db plugin usage...




Re: urlfilter-db plugin usage...

Thomas Delnoij-3
In reply to this post by bparker
You need to do both: seed the WebDB with the 14k URLs extracted from the
dmoz content file AND filter newly found URLs against the URLs in the
MySQL database using urlfilter-db.

This is significantly faster than adding the 14k URLs to the
regex-urlfilter.txt file and checking against that.

If you want to boost performance further, consider pre-initializing the
cache in urlfilter-db when the plugin loads, and removing the code that
goes to the database every time a URL is not found in the cache. In my
experience this improves performance even more.

HTH Thomas D.
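Thomas's suggestion could look roughly like the following sketch (this is an assumption about the plugin's shape, not the actual urlfilter-db source): populate the cache once when the filter is constructed, so a cache miss simply means "reject" instead of triggering a per-URL database query.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of a pre-initialized URL filter. In the real plugin the
// constructor would run one SELECT against MySQL at plugin load; here a
// List of strings stands in for that result set.
public class PreloadedUrlFilter {
    private final Set<String> allowed = new HashSet<>();

    public PreloadedUrlFilter(List<String> urlsFromDb) {
        allowed.addAll(urlsFromDb);   // cache fully built up front
    }

    // Nutch-style filter contract: return the URL to keep it, null to drop it.
    public String filter(String url) {
        return allowed.contains(url) ? url : null;  // no DB round trip on a miss
    }
}
```

With the whole 14k-URL list preloaded, every lookup is an in-memory set check; the database is touched exactly once, at load time.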

On 3/1/06, Brent Parker <[hidden email]> wrote:
