> I was wondering... If instead, I did a whole web crawl using the full
> dmoz content file, but filtered it using the urlfilter-db plugin, using
> my 14k urls in mysql.... would I obtain similar results?
My gut tells me this has to be slower. I would put the urls in the url
db; the fewer urls you have in the filter, the better, because the filter
uses regular expressions to check each url it comes across against the
list. With 14 thousand regexps it would be very slow.
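To put some numbers behind that gut feeling, here is a small illustrative
Java sketch (not Nutch code; the urls and counts are made up) that compares
scanning 14k compiled patterns against a single hash lookup:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.regex.Pattern;

    public class FilterCostSketch {
        public static void main(String[] args) {
            List<Pattern> patterns = new ArrayList<>();
            Set<String> exact = new HashSet<>();
            for (int i = 0; i < 14000; i++) {      // stand-in for the 14k dmoz urls
                String url = "http://site" + i + ".example/";
                patterns.add(Pattern.compile(Pattern.quote(url)));
                exact.add(url);
            }
            String candidate = "http://site13999.example/";

            long t0 = System.nanoTime();
            boolean hitRegex = false;
            for (Pattern p : patterns) {            // regexp filter: try every pattern
                if (p.matcher(candidate).matches()) { hitRegex = true; break; }
            }
            long t1 = System.nanoTime();
            boolean hitSet = exact.contains(candidate); // exact lookup: one hash probe
            long t2 = System.nanoTime();

            System.out.println("14k-regexp scan: " + hitRegex + ", " + (t1 - t0) + " ns");
            System.out.println("hash lookup:     " + hitSet + ", " + (t2 - t1) + " ns");
        }
    }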
You need to do both: seed the WebDB with the 14k urls extracted from the
content file AND filter newly found urls against the urls in the mysql
database using the urlfilter-db.
This is significantly faster than adding the 14k urls to the
regex-urlfilter.txt file and checking against that.
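As a rough sketch of the seeding step, assuming a table named urls with a
single url column in a database called dmoz (all hypothetical names; adjust
to your schema), you could dump the 14k urls to a seed file with plain JDBC:

    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DumpSeedUrls {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection string and table/column names.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/dmoz", "user", "pass");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT url FROM urls");
                 PrintWriter out = new PrintWriter("urls.txt")) {
                while (rs.next()) {
                    out.println(rs.getString(1));  // one url per line for inject
                }
            }
            // Then seed the WebDB with: bin/nutch inject db/ -urlfile urls.txt
        }
    }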
If you'd like to boost performance further, consider pre-initializing the
cache in the urlfilter-db when the plugin loads, and removing the code that
goes to the database every time a url is not found in the cache. I've found
this improves performance even more.
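A minimal sketch of what that could look like, assuming the plugin's core
is a filter(String url) method that returns the url to accept it and null
to reject it (the actual urlfilter-db code may differ; table and connection
names are again hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.HashSet;
    import java.util.Set;

    public class PreloadedDbUrlFilter {
        private final Set<String> cache = new HashSet<>();

        // Fill the cache once when the plugin is loaded; after this,
        // no query is ever issued for urls missing from the cache.
        public PreloadedDbUrlFilter() {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/dmoz", "user", "pass");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT url FROM urls")) {
                while (rs.next()) {
                    cache.add(rs.getString(1));
                }
            } catch (SQLException e) {
                throw new RuntimeException("could not preload url cache", e);
            }
        }

        // Accept a url only if it was preloaded; never hit the database here.
        public String filter(String urlString) {
            return cache.contains(urlString) ? urlString : null;
        }
    }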
> I have approx. 14k urls (in a mysql db) that I extracted from the dmoz
> content file. I intended to dump them to a txt file in order to seed the
> WebDB using: bin/nutch inject db/ -urlfile urls.txt
> I was wondering... If instead, I did a whole web crawl using the full dmoz
> content file, but filtered it using the urlfilter-db plugin, using my 14k
> urls in mysql.... would I obtain similar results?
> I am a bit unsure as to what is going on under the hood, so I am looking
> for the best approach. If they do in fact give similar results, is one more
> efficient, etc.?
> Thanks for any/all advice!
> Brent Parker