fetcher : some doubts


fetcher : some doubts

shrinivas patwardhan
hello,
while fetching a fetchlist of, say, about 2-3 million pages, there are
some errors that might cause the fetcher process to stop:
1: low disk space
2: a severe error (like an internet connection failure lasting a long time)
3: some more, like running out of Java heap space
These are some of the reasons that I have faced.
Now, the straight answer would be that I should take care of all of those
before I start fetching. I agree, but in case the fetcher stops:
Is there any way to continue fetching from where it stopped?
If not, can we all contribute towards that?
I have been through the re-fetching threads, but would that help me in this
case?
Example:
if I have fetched around a million pages and some 2 million pages are
still left to be fetched, and the fetcher stops due to low disk space, is there
any way to continue from where I stopped after I organise everything (arrange
for another disk or free a partition)?

Thanks & Regards
Shrinivas Patwardhan

Re: fetcher : some doubts

Sean Dean-3
Okay, I actually just wrote you a long email of what to do, step by step but when I tried to send it, my web mail session timed out and forced me to re-login, losing it all... I'm not happy :(
 
But straight to the point: since you're using the older 0.7 code-base, you can use partially fetched segments. When the fetcher dies, just continue on to the next step as if it had completed successfully.

You won't have all the pages in there, but you can always set up another fetch list, fetch it (fully or partially), and then merge the segments together and re-index.

There actually isn't much of a reason to generate "huge" multi-million page fetch lists when you can create lots of smaller ones and merge them together. This allows for more of a ladder-style approach and, on the Hadoop-based versions (0.8+), reduces the risk of a large fetch or a parse/reduce stage failing unrecoverably.
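
To make the "lots of smaller ones" idea concrete, it is just a loop of generate/fetch/updatedb rounds with a modest -topN, followed by a single segment merge at the end. A rough, untested sketch for the 0.8+ layout (the -topN value is only an example; on 0.7 the paths are "db" and "segments" instead of "crawl/crawldb" and "crawl/segments"):

bin/nutch generate crawl/crawldb crawl/segments -topN 250000
bin/nutch fetch <the segment just generated>
bin/nutch updatedb crawl/crawldb <the segment just generated>
... repeat the three steps above as many times as you need ...
bin/nutch mergesegs crawl/segments/merged -dir crawl/segments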
 
Hope this helps.



Re: fetcher : some doubts

Justin Hartman
On 1/2/07, Sean Dean <[hidden email]> wrote:
> There actually isn't much of a reason to generate "huge" multi-million page fetch lists when you can create lots of smaller ones and merge them together. This allows for more of a ladder-style approach and, on the Hadoop-based versions (0.8+), reduces the risk of a large fetch or a parse/reduce stage failing unrecoverably.

The problem I am faced with is that I'm not sure how to merge my indexes
together. For example, I run a fetch of about 200,000 pages in about 3
or 4 different fetches. Once done, I run the index command, everything goes
very well, and my index is built.

That said, if I try to run a new fetch and then index the new
fetch, I get an error saying "crawl/indexes" already exists.

How does one actually merge different fetches into the same index
without having to recreate the index each time?

Thanks!
Justin

Re: fetcher : some doubts

shrinivas patwardhan
Thank you, Sean Dean,
that sounds good... I will try it out.
Tell me if I am right: in the case where a dmoz index file is injected into the db, I then generate only
a few segments by using -subset and fetch them,
and then go on and generate the next set of segments. I hope I am heading the
right way.
And about the previous problem of the searching being slow: it wasn't my
hardware, my segments were corrupt. I fixed them and the search runs fine
now.

Thanks & Regards
Shrinivas Patwardhan

Re: fetcher : some doubts

Sean Dean-3
In reply to this post by shrinivas patwardhan
You need to delete the old index before you re-index when working within the same directory structure.
 
This is the procedure I follow, which is pretty much what you're doing. This assumes you already have at least one active segment and index. Edit as needed.
 
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
bin/nutch fetch $$
bin/nutch updatedb crawl/crawldb $$
bin/nutch invertlinks crawl/linkdb $$
 
bin/nutch mergesegs crawl/segments/merged -dir crawl/segments
 
rm -fdr crawl/indexes/
rm -fdr crawl/segments/2*
mv crawl/segments/merged/2* crawl/segments/
rm -fdr crawl/segments/merged/
 
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $$
bin/nutch dedup crawl/indexes

$$ = your current segment; note that after the merge takes place it will be a newly created directory.
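
If it helps, one untested way to fill in $$ from a Unix shell, assuming the newest timestamped directory under crawl/segments is the one generate just created:

s=`ls -d crawl/segments/2* | sort | tail -1`
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb $s

After the merge/rename steps you would recompute $s the same way before running the index step.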
 

Re: fetcher : some doubts

Sean Dean-3
In reply to this post by shrinivas patwardhan
I'm glad you got the slowness issue straightened out.
 
When you import the dmoz urls into your Nutch DB, the "-subset" option isn't really meant to limit the size of your fetch lists. This becomes even more true when you start re-fetching. You can actually skip the subset option and let all of the urls go in, unless you have your own custom filtering method/requirement.
 
You should use the "-topN" option instead when you generate your segment. This will create a segment with an exact number of urls. Below are examples of creating a segment with 1 million urls to fetch for each Nutch architecture:
 
(Nutch 0.7) bin/nutch generate db segments -topN 1000000

(Nutch 0.8+) bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
 

Re: fetcher : some doubts

shrinivas patwardhan
OK, I understand it now.
Well, thanks again for your help, Sean.
I was wondering if anyone would be interested in making a GUI to set up and run
the crawl, say for novice users.
I don't know if there is any.
I would be glad to help if people are keen on making one.
Thanks & Regards
Shrinivas Patwardhan

Re: fetcher : some doubts

Sean Dean-3
In reply to this post by shrinivas patwardhan
There is currently open development of a Nutch administration GUI for version 0.9. I have not tested it, or even really looked at it myself, but apparently most of the features work. It will not work on your version, but here is the link to the JIRA issue where you can find the patches and ongoing comments:
 
http://issues.apache.org/jira/browse/NUTCH-251
 
This is an old link to a non-working demo of what the patches do and how the interface looks. It's very nice, actually:
 
http://www.media-style.com/gfx/nutchadmin/index.html



RE: fetcher : some doubts

Alan Tanaman
In reply to this post by shrinivas patwardhan
There is an initiative to develop this:
http://issues.apache.org/jira/browse/NUTCH-251
Enis Söztutar, who is working on this, might be interested in your
assistance.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions


Re: fetcher : some doubts

Justin Hartman
In reply to this post by Sean Dean-3
On 1/2/07, Sean Dean <[hidden email]> wrote:
> You need to delete the old index before you re-index when working within the same directory structure.
> This is the procedure I follow, which is pretty much what you're doing. This assumes you already have at least one active segment and index. Edit as needed.

Thanks for the prompt and efficient response - it is much appreciated.
The procedure seems fine to me, with the exception of having to delete
the index before re-indexing. In a test environment I don't mind
this, but what happens when I go into production? I can't
delete the index, as people will have nothing to search while the
index is being rebuilt.

Is there another way of doing this, or am I missing the plot here big time?
--
Regards
Justin Hartman
PGP Key ID: 102CC123

RE: fetcher : some doubts

Alan Tanaman
As an interim solution when using the Nutch front end, what we did was
generate the new index in a temporary folder.  Then our script (Ant,
actually) would shut down the web server (Tomcat in our case) to free the
existing index from the Nutch bean, and do a quick switcheroo using OS
rename commands.  Then it would restart the web server and delete the old index.
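
Roughly, the swap itself is nothing more than the following (untested sketch; "crawl/indexes.new" is just an example name for the temporary folder, and the Tomcat paths depend on your install):

$TOMCAT_HOME/bin/shutdown.sh
mv crawl/indexes crawl/indexes.old
mv crawl/indexes.new crawl/indexes
$TOMCAT_HOME/bin/startup.sh
rm -rf crawl/indexes.old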

The outage time would be less than a second, but I agree with you that this
is not a great solution.  We are not entirely happy with the way Nutch
forces you to build a new index each time instead of incrementally changing
an existing index, and are interested in writing a modification to handle
this better (subject to our other scheduled work).

We would be happy if you are interested in collaborating on this.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions


Re: fetcher : some doubts

Sean Dean-3
In reply to this post by shrinivas patwardhan
Looking at what I wrote, yes, it will not be acceptable for a production environment.
 
What I failed to mention is that I copy the completed crawl directory somewhere else, and point Tomcat to look there instead via nutch-site.xml.
 
When you have completed all those steps and have your new index and segment created, just copy them over what you presently have and do a hot restart of the web application. This can be done by "touch"ing the web.xml file in your package.
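
In script form that might look something like this (sketch only; /search/crawl-live is just an example of wherever nutch-site.xml points Tomcat, and the webapp path depends on how you deployed Nutch):

cp -r crawl/* /search/crawl-live/
touch $TOMCAT_HOME/webapps/nutch/WEB-INF/web.xml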

