Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java


Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Doug Cutting
[hidden email] wrote:
> Don't generate URLs that don't pass URLFilters.

Just to be clear, this is to support folks changing their filters while
they're crawling, right?  We already filter before we put things into
the db, so we're filtering twice now, no?  If so, then perhaps there
should be an option to disable this second filtering for folks who don't
change their filters?

Doug
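
For illustration, a minimal sketch of what such an optional second filtering pass might look like inside the Generator's map step. The "generate.filter" property name and the surrounding fields are hypothetical, not taken from the actual commit; imports and the rest of the Selector class are omitted.

    // Sketch only: re-apply URLFilters at generate time, guarded by a config flag
    // so that people who never change their filters can skip the second pass.
    private URLFilters filters;
    private boolean filter;

    public void configure(JobConf job) {
      filters = new URLFilters(job);
      filter = job.getBoolean("generate.filter", true);    // hypothetical property name
    }

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      String url = key.toString();                         // crawl db keys are the URLs
      if (filter) {
        try {
          if (filters.filter(url) == null)
            return;                                        // rejected by the current filters: skip
        } catch (URLFilterException e) {
          return;                                          // treat a filter error as a rejection
        }
      }
      output.collect(key, value);                          // eligible for fetch-list selection
    }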


Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Stefan Groschupf-2
I notice filtering urls is done in the output format until parsing.  
Wouldn't it be better to filter it until updating crawlDb?
Sure it would require to have some more disk space but since parsing  
is done until fetching it may be improve fetching speed.

Stefan

Am 08.03.2006 um 18:53 schrieb Doug Cutting:

> [hidden email] wrote:
>> Don't generate URLs that don't pass URLFilters.
>
> Just to be clear, this is to support folks changing their filters  
> while they're crawling, right?  We already filter before we put  
> things into the db, so we're filtering twice now, no?  If so, then  
> perhaps there should be an option to disable this second filtering  
> for folks who don't change their filters?
>
> Doug
>
>

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Andrzej Białecki-2
In reply to this post by Doug Cutting
Doug Cutting wrote:
> [hidden email] wrote:
>> Don't generate URLs that don't pass URLFilters.
>
> Just to be clear, this is to support folks changing their filters
> while they're crawling, right?  We already filter before we

Yes, and this seems to be the most common case. This is especially
important since there are no tools yet to clean up the DB.

> put things into the db, so we're filtering twice now, no?  If so, then
> perhaps there should be an option to disable this second filtering for
> folks who don't change their filters?

IMHO doing this here has a minimal impact while preventing a common
problem, but if you think this would harm many users then we should of
course make it optional.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Doug Cutting
Andrzej Bialecki wrote:
> IMHO doing this here has a minimal impact while preventing a common
> problem, but if you think this would harm many users then we should of
> course make it optional.

Let's just leave it as-is for now.  Thanks!

Doug

Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Rod Taylor-2
In reply to this post by Andrzej Białecki-2
On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
> Doug Cutting wrote:
> > [hidden email] wrote:
> >> Don't generate URLs that don't pass URLFilters.
> >
> > Just to be clear, this is to support folks changing their filters
> > while they're crawling, right?  We already filter before we
>
> Yes, and this seems to be the most common case. This is especially
> important since there are no tools yet to clean up the DB.

I have this situation now. There are over 100M urls in my DB from crap
domains that I want to get rid of.

Adding a --refilter option to updatedb seemed like the most obvious
course of action.

A completely separate command so it could be initiated by hand would
also work for me.

--
Rod Taylor <[hidden email]>
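
For what it's worth, a rough sketch of what such a standalone re-filter pass over an existing crawlDb could look like as a 0.8-style map function. None of this is an actual Nutch tool; the job setup, which would write a new crawlDb and swap it in, is omitted, and the field names are illustrative.

    // Hypothetical crawlDb re-filter map function: copy through only the entries
    // whose URL still passes the currently configured URLFilters.
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      String url = key.toString();
      try {
        if (filters.filter(url) == null)
          return;                          // rejected by the current filters: drop the entry
      } catch (URLFilterException e) {
        return;                            // a filter error also drops the entry
      }
      output.collect(key, value);          // still wanted: keep it in the new crawlDb
    }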


CrawlDb Filter tool, was Re: svn commit: r384219 -

Stefan Groschupf-2
Rod,
a few days ago I wrote a small tool that filters a crawlDb.
You can find it here now:
http://issues.apache.org/jira/browse/NUTCH-226
Give it a try and let me know if it works for you; in any case,
back up your crawlDb first!
I have only tested it with a small crawlDb, so use it at your own risk. :)

Stefan

Am 08.03.2006 um 19:47 schrieb Rod Taylor:

> On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> [hidden email] wrote:
>>>> Don't generate URLs that don't pass URLFilters.
>>>
>>> Just to be clear, this is to support folks changing their filters
>>> while they're crawling, right?  We already filter before we
>>
>> Yes, and this seems to be the most common case. This is especially
>> important since there are no tools yet to clean up the DB.
>
> I have this situation now. There are over 100M urls in my DB from crap
> domains that I want to get rid of.
>
> Adding a --refilter option to updatedb seemed like the most obvious
> course of action.
>
> A completely separate command so it could be initiated by hand would
> also work for me.
>
> --
> Rod Taylor <[hidden email]>
>
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

kangas
In reply to this post by Rod Taylor-2
Rod, I just posted my PruneDB.java file to:
http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

(104 lines, nutch 0.7 only.)

License is granted to anyone to hack/copy this as they wish. It should
be easy to adapt to 0.8.

> Usage: PruneDB <db> -s
> Where: db is the path of the nutch db to prune
> Usage: -s simulate: parses the db, but doesn't delete any pages

--Matt

On Mar 8, 2006, at 1:47 PM, Rod Taylor wrote:

> On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> [hidden email] wrote:
>>>> Don't generate URLs that don't pass URLFilters.
>>>
>>> Just to be clear, this is to support folks changing their filters
>>> while they're crawling, right?  We already filter before we
>>
>> Yes, and this seems to be the most common case. This is especially
>> important since there are no tools yet to clean up the DB.
>
> I have this situation now. There are over 100M urls in my DB from crap
> domains that I want to get rid of.
>
> Adding a --refilter option to updatedb seemed like the most obvious
> course of action.
>
> A completely separate command so it could be initiated by hand would
> also work for me.
>
> --
> Rod Taylor <[hidden email]>
>

--
Matt Kangas / [hidden email]



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

David Wallace-3-2
In reply to this post by Doug Cutting
I know this is off on a tangent, but:
 
One huge advantage to filtering in the FetchListTool (or is that the
Generator, I'm still on 0.7?) is that you can generate separate fetch
lists for separate "scopes", or subsets of your crawl data.  You can
then give your users some control over which of several scopes they're
actually searching in; all while having a single URL database.  I
suspect many people who are using Nutch over one or a small number of
sites are actually doing this.
 
Regards,
David.
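
As a rough illustration of the "scopes" idea in 0.8 terms: run the generator once per scope against the same db, pointing the regex URL filter at a scope-specific rules file each time. The loop, file names and scope names below are made up; "urlfilter.regex.file" is assumed to be the regex URL filter's rules-file property.

    // Sketch: one fetch list per scope, all generated from a single crawl db.
    String[] scopes = { "intranet", "partners" };            // made-up scope names
    for (int i = 0; i < scopes.length; i++) {
      JobConf job = new NutchJob(conf);                      // conf: the base configuration
      job.set("urlfilter.regex.file", "regex-urlfilter-" + scopes[i] + ".txt");
      // ... run the Generator with this job so the resulting fetch list
      // only contains URLs that this scope's rules accept
    }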
 

Date: Wed, 08 Mar 2006 10:42:50 -0800
From: Doug Cutting <[hidden email]>
To: [hidden email]
Subject: [Nutch-dev] Re: svn commit: r384219 -
/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
Reply-To: [hidden email]

Andrzej Bialecki wrote:
> IMHO doing this here has a minimal impact while preventing a common
> problem, but if you think this would harm many users then we should of
> course make it optional.

Let's just leave it as-is for now.  Thanks!

Doug




Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Andrzej Białecki-2
In reply to this post by Stefan Groschupf-2
Stefan Groschupf wrote:
> I notice filtering urls is done in the output format until parsing.
> Wouldn't it be better to filter it until updating crawlDb?

"Until" == "during" ?

As you observed, doing it at this stage saves space in segment data, and
in consequence saves on processing time (no CPU/IO needed to process
useless data, throw away junk as soon as possible).

> Sure it would require to have some more disk space but since parsing
> is done until fetching it may be improve fetching speed.

Parsing is not always done at the fetching stage (Fetcher.parsing == false).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
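
For context, the toggle Andrzej refers to looks roughly like this in 0.8-era code; the property and field names here are from memory, so check Fetcher.java and nutch-default.xml rather than relying on this sketch.

    // Sketch of the fetch-time parsing switch: when it is off, the fetcher only
    // stores raw content, and the filtering done in the parse output format
    // happens later, in a separate parse step over the segment.
    boolean parsing = job.getBoolean("fetcher.parse", true);
    if (parsing) {
      // parse inline while fetching and write ParseData/ParseText with the segment
    } else {
      // store content only; run the parse step (and its URL filtering) afterwards
    }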



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Stefan Groschupf-2
>> I notice filtering urls is done in the output format until  
>> parsing. Wouldn't it be better to filter it until updating crawlDb?
>
> "Until" == "during" ?
Sorry, yes during!
>
> As you observed, doing it at this stage saves space in segment  
> data, and in consequence saves on processing time (no CPU/IO needed  
> to process useless data, throw away junk as soon as possible).
Makes sense, thanks for the hint. I guess that now, with a published db
filter tool for Nutch 0.7 and 0.8, people will be able to clean up their
web and crawl databases.

Stefan

Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Doug Cutting
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:

> Stefan Groschupf wrote:
>
>> I notice filtering urls is done in the output format until parsing.
>> Wouldn't it be better to filter it until updating crawlDb?
>
>
> "Until" == "during" ?
>
> As you observed, doing it at this stage saves space in segment data, and
> in consequence saves on processing time (no CPU/IO needed to process
> useless data, throw away junk as soon as possible).

I think it is better not to filter at parse time, but at db insert time.
This way, if desired urls are accidentally filtered out, one only has to
re-update the db to include them, rather than re-parse and re-update.

Doug