Intranet crawl and re-fetch - newbie question


Intranet crawl and re-fetch - newbie question

carmmello
I have been using Nutch for over a year now, and this is a question I have always asked without getting an answer.  I have tried a lot of things and looked in the mailing lists, the tutorial, everywhere, but found no answer.  So it seems to me that the only way to stay up to date is to start everything over from scratch.  As far as I can tell, Nutch was not designed to let you update an existing set of index, db and segments with only new or modified pages.  If someone knows something about this issue, please let us know, because this point seems to me the biggest obstacle to really using Nutch on a regular basis on a "production site".
Thanks


Re: Intranet crawl and re-fetch - newbie question

Piotr Kosiorowski
Hello,
I am not sure I understood you correctly, but if you use the technique
described as "whole web crawling" in the tutorial, you are not starting
from scratch: you can fetch new pages and refetch and update existing
ones. But perhaps I misunderstood your question, so please give us more
details on what you want to achieve - e.g. do you plan to fetch from a
limited number of sites?
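
For reference, the whole-web cycle from the tutorial looks roughly like
this (a sketch based on the 0.x command-line tools; exact flags can
differ between versions, and urls.txt is just an example seed file):

    bin/nutch admin db -create              # create an empty webdb
    bin/nutch inject db -urlfile urls.txt   # seed it with your start urls

    # repeat this cycle as often as you like:
    bin/nutch generate db segments          # write a fetchlist into a new segment
    s=`ls -d segments/2* | tail -1`         # pick up the segment just generated
    bin/nutch fetch $s                      # fetch the pages in the fetchlist
    bin/nutch updatedb db $s                # fold new and updated links into the webdb

Each pass through the cycle picks up urls that are due for refetch as
well as links discovered in earlier rounds, so the db and segments grow
incrementally instead of being rebuilt.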

Regards
Piotr


Re: Intranet crawl and re-fetch - newbie question

carmmello
In reply to this post by carmmello
Hi,
I have about 300 sites on a specific subject to start with, and I have used both the crawl method and the whole-web method.  Once, for testing purposes, I crawled those sites to depth 2 with a refetch interval of just 1 day (I set this in the site.xml file) and got about 3,000 pages.  After that 1 day I used the command "bin/nutch generate db segments" with the single flag "-refetchonly". When I then fetched the generated segment, I got about 30,000 pages.  If, besides -refetchonly, I had used -topN 3000, for instance, I would get different pages, not the original ones. So I really don't know how, beginning with an initial set of fetched or crawled sites, to just maintain them, adding only modified or new pages to the ones I already have.
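
Concretely, the test setup was something like this (the interval
property name is what I believe the 0.x configuration uses; please
correct me if it differs in your version):

    <!-- in the site.xml file: consider pages due for refetch after 1 day -->
    <property>
      <name>db.default.fetch.interval</name>
      <value>1</value>
    </property>

and then, after the day had passed:

    bin/nutch generate db segments -refetchonly   # fetchlist of pages due for refetch (in theory)
    s=`ls -d segments/2* | tail -1`
    bin/nutch fetch $s                            # this fetched ~30,000 pages, not ~3,000
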
Thanks


Re: Intranet crawl and re-fetch - newbie question

Konstantin Ott
In reply to this post by Piotr Kosiorowski
Hi,
I too am not getting around this problem, but maybe I didn't understand
the documentation well. This is how I understood it:
- WebDBInjector injects urls into the webdb
- FetchListTool gets all the urls from the webdb and generates the
segment for the Fetcher
- the Fetcher fetches the pages in that segment
- UpdateDatabaseTool gets the crawled urls from the segment and saves
them into the webdb
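
As far as I can tell, those tools map to the command line roughly like
this (a sketch; urls.txt and the segment path are just placeholders):

    bin/nutch inject db -urlfile urls.txt         # WebDBInjector
    bin/nutch generate db segments                # FetchListTool
    bin/nutch fetch segments/<new segment>        # Fetcher
    bin/nutch updatedb db segments/<new segment>  # UpdateDatabaseTool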

This is what you want for the first crawl, but when I inject another url
and want to crawl only that one, the FetchListTool generates segments
for all the urls in the webdb, including the old ones.
So I started to inject into a second webdb and fetched only for that
one. After that I updated the original webdb, deleted the second webdb
and merged the segments. This works fine under Linux. But running under
Windows some locked files remain, so I can't delete it and I can't
recrawl, because the WebDBWriter waits for the lock to be released,
which never happens.
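
In other words, the workaround was something like this (a sketch of
what I did; the merge step assumes the SegmentMergeTool, whose exact
arguments may differ in your version):

    bin/nutch admin db2 -create                  # a second, temporary webdb
    bin/nutch inject db2 -urlfile new-urls.txt   # inject only the new url(s)
    bin/nutch generate db2 segments
    s=`ls -d segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch updatedb db $s                     # fold the results into the ORIGINAL webdb
    rm -rf db2                                   # this is the step that fails on Windows
    bin/nutch mergesegs ...                      # merge the new segment with the old ones
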
So did I understand you correctly, and is it possible to:
fetch new pages (with a depth of 3, for example)
and
refetch/update existing ones
as independent tasks?
I can't read the tutorial that way, so could you please explain it a
little more?
thx Konstantin



Deleting urls/Recurring urls

quovadis
Hi there

I'm experiencing a recurring url; for this example let's call
it xyz.com.

I've added a regex filter so that it would be excluded from
any crawls, added it to the banned-hosts file, and pruned the
segments regularly for any reference to the domain. Yet with
each and every fetch I see the url reappear a few times. This
is one of those sites that have a "nocache" parameter in the
url (xyz.com/adasdasd.asp?nc=329084723 etc.), which thus
creates thousands of pages to crawl for a 6-page site.
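
For reference, the filter entries I added look something like this
(with xyz.com standing in for the real domain; the second pattern is
the query-string rule from the default filter file):

    # skip everything on the problem host
    -^http://([a-z0-9-]+\.)*xyz\.com/
    # skip urls containing these characters (catches the ?nc=... cache-buster)
    -[?*!@=]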

Any ideas?

Thanks

Re: Intranet crawl and re-fetch - newbie question

Piotr Kosiorowski
In reply to this post by carmmello
Hello,

I spent some time analyzing this and I am a bit surprised by the
results. I had always assumed that -refetchonly works as described in
one of the emails on the list:
"-refetchonly generates a segment (FetchList) that only contains the
urls that need to be refetched based on your refetch interval.
Right, newly discovered links are not in the fetchlist that will be
generated by using this option."

But today, after reading the code and performing some experiments, I
know that is not true. It works the way you described it. I will post
an email later today with my findings to clarify this issue. So
-refetchonly is currently not working as expected: it also generates
new urls (even though they are not fetched later).
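
If anyone wants to verify this, you can dump the fetchlist of a freshly
generated segment and inspect the urls in it - a sketch, assuming the
segread tool of the 0.x line (check its usage for the exact flag order):

    bin/nutch generate db segments -refetchonly
    s=`ls -d segments/2* | tail -1`
    bin/nutch segread -dump $s    # the dump also lists urls never fetched before
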
Regards,
Piotr

