help - distributed crawl in 0.7.1

Olive g
Hi, I am new here.
Could someone please let me know the step-by-step instructions to set up
a distributed crawl in 0.7.1?
Thank you.

Re: help - distributed crawl in 0.7.1

Stefan Groschupf-2
You would be better off using Nutch 0.8 to run a crawl on several machines.
There is some documentation in the wiki now.


---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net


Re: help - distributed crawl in 0.7.1

Nuther
In reply to this post by Olive g
Hi, Olive.

Use www.nutch.org.
Although the tutorial is for 0.7, you can apply it to version 0.7.1.
If you have a more specific question, just ask :)

--
Regards,
 Dima                          mailto:[hidden email]

Re: help - distributed crawl in 0.7.1

Thomas Delnoij-3
In reply to this post by Olive g
You can start here: http://wiki.apache.org/nutch/NutchDistributedFileSystem

Also, I think there have been several posts on the mailing list that contain
such a step-by-step overview.

Rgrds, Thomas


Re[2]: help - distributed crawl in 0.7.1

Nuther
In reply to this post by Stefan Groschupf-2
Hi, Stefan.

I don't think so. 0.8 is more complicated.



Re: help - distributed crawl in 0.7.1

Olive g
In reply to this post by Stefan Groschupf-2
Thank you so much for your reply!
I just sent another message because I am having other issues with 0.8:
somehow TOTAL urls is always 1 when I crawl big sites such as www.yahoo.com.
I thought 0.7.1 might be more stable?

The stats:
060308 064418 Client connection to 9.2.13.8:8010 : starting
060308 064418 Client connection to 9.2.13.8:8009: starting
060308 064418 parsing file:/root/nutch/conf/nutch-default.xml
060308 064418 parsing file:/root/nutch/conf/nutch-site.xml
060308 064419 Running job: job_ljydgp
060308 064420  map 0%
060308 064427  map 100%
060308 064433  reduce 100%
060308 064433 Job complete: job_ljydgp
060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
060308 064436 TOTAL urls:       1
060308 064436 avg score:        1.0
060308 064436 max score:        1.0
060308 064436 min score:        1.0
060308 064436 retry 0:  1
060308 064436 status 2 (DB_fetched):    1
060308 064437 CrawlDb statistics: done






Re[2]: help - distributed crawl in 0.7.1

Nuther
Hi, Olive.

0.7.1 is more stable.
I spent a week learning the concepts behind 0.8,
but unfortunately rolled back to version 0.7.1.
The only thing I needed from 0.8 was the SWF parser.



Re: help - distributed crawl in 0.7.1

Thomas Delnoij-3
In reply to this post by Olive g
Detailed distributed crawl implementation:

http://www.mail-archive.com/nutch-user@.../msg02270.html

I am not sure it applies to 0.7, but it has a lot of info.

Rgrds, Thomas

Re: help - distributed crawl in 0.7.1

Olive g

Thanks! I saw that one too, but according to Doug, it was for 0.8 only. Does
anyone have step-by-step instructions for 0.7.1, like the ones for 0.8?
Also, does anyone know why the URL total is always 1 when I run 0.8?
060308 064420  map 0%
060308 064427  map 100%
060308 064433  reduce 100%
060308 064433 Job complete: job_ljydgp
060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
060308 064436 TOTAL urls:       1
060308 064436 avg score:        1.0
060308 064436 max score:        1.0
060308 064436 min score:        1.0
060308 064436 retry 0:  1
060308 064436 status 2 (DB_fetched):    1
060308 064437 CrawlDb statistics: done



Re: Re[2]: help - distributed crawl in 0.7.1

Stefan Groschupf-2
In reply to this post by Nuther
Personally, I have found the very latest source to be the most stable and
easiest-to-use Nutch version I have ever used.
Just my point of view.
A lot of MapReduce issues are fixed now; if distributed means running on
several machines, I suggest 0.8.


Re[4]: help - distributed crawl in 0.7.1

Nuther
Hi, Stefan.

Strange, I found it more complicated.
Never mind, it's just my point of view :)