Nutch crawler is breadth-first ?

7 messages
Nutch crawler is breadth-first ?

Jack.Tang
Hi All

Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
when I try breadth-first crawling with the depth set to 3.
Any comments?

Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Nutch crawler is breadth-first ?

Andrzej Białecki
Jack Tang wrote:
> Hi All
>
> Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
> when I try breadth-first crawling with the depth set to 3.
> Any comments?

Yes, and yes - there is a possibility that some URLs are lost if they
require maintaining a single session. If you encounter such sites, a
depth-first crawler would be better.

It's not too difficult to build one, using the tools already present in
Nutch. Contributions are welcome... ;-)
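
For the record, "breadth-first" here just means the fetch frontier is
expanded one level per round, which is why the -depth option bounds how many
rounds run. An illustrative sketch of the idea (plain Python, not Nutch code):

```python
from collections import deque

def bfs_crawl(seed, get_links, max_depth):
    """Breadth-first crawl: visit every page at depth d before depth d+1."""
    seen = {seed}
    frontier = deque([(seed, 0)])  # (url, depth) pairs, FIFO order
    order = []
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        # Only expand outlinks while we are above the depth limit.
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return order
```

With max_depth=3 the seed page, its outlinks, and their outlinks all get
fetched; anything further away is simply never enqueued.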

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch crawler is breadth-first ?

Jack.Tang
Hi Andrzej

First of all, thanks for your quick response.

On 9/7/05, Andrzej Bialecki <[hidden email]> wrote:

>
> Yes, and yes - there is a possibility that some URLs are lost if they
> require maintaining a single session. If you encounter such sites, a
> depth-first crawler would be better.

The website does not require maintaining a single session.
My experiment is set up like this:

X.html contains a list of URLs, say
http://www.a.com/x1.html
http://www.a.com/x2.html
http://www.a.com/x3.html
http://www.a.com/x4.html
http://www.a.com/x5.html
http://www.a.com/x6.html
http://www.a.com/x7.html
....
http://www.a.com/x30.html

I set the crawl depth to 3 and use X.html as the URL seed.
And I use urlfilter-prefix as the URL filter. (prefix=http://www.a.com)
In my parser I count the URLs, and only 10 come through.

However, if I put all 30 URLs into the URL seed file directly, the count
in the parser is right. Odd?
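
For reference, the prefix filter is just a plain-text list of allowed URL
prefixes, one per line; as far as I understand the urlfilter-prefix plugin,
its config file (conf/prefix-urlfilter.txt) for this setup would look like:

```
http://www.a.com
```

Anything not starting with a listed prefix is dropped before fetching.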

Regards
/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

nutch/mapred tutorial

Earl Cahill
howdy,

I have been looking around for a nutch/mapred tutorial
and haven't had much luck.  I found this one

http://lucene.apache.org/nutch/tutorial.html

which did help me get a crawl going on trunk, but no
such luck in branches/mapred.  I set the urls file and
the filter in the same way that I did for trunk and I
get

050907 013817 parsing file:/home/nutch/nutch/branches/mapred/conf/nutch-site.xml
java.io.IOException: No input files in: [Ljava.io.File;@32b0bad7
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:74)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:84)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:59)

Guess I am wondering if a detailed tutorial for mapred
exists.  Seems like Doug was saying that it didn't.  I
would be up for walking through getting a crawl going
and documenting my steps, but won't dive in if one
already exists.  Also wondering if I would/could put
my doc on the wiki.

Thanks,
Earl

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 

Re: nutch/mapred tutorial

Earl Cahill
Though, my last email was more about documenting the
whole setup process, it looks like the error I
mentioned was fixed by creating a directory and
putting a urls file in that directory.  It also looks
like the name of the file doesn't matter.  So I made a
myurls directory, put a urls file in there and then
ran

bin/nutch crawl myurls -dir crawl.test -depth 3

But, yeah, I would like to put such steps in a tutorial.

It looks like the front page got hit, and that's about
it, so there is more to do.

Earl


Re: nutch/mapred tutorial

Fredrik Andersson
This is a new feature in the 0.7 version. Previously, the url listing was a
file, but it's now a directory. It's most probably documented in the release
notes, but the change hasn't followed through to the tutorials just yet. If
you check the mailing list archive, there are a couple of threads on this
topic.

Fredrik


Re: Nutch crawler is breadth-first ?

Jack.Tang
Hi

I found the reason. The maximum number of outlinks that Nutch will
process for a page is 100 by default, and the page contains more than
300 URLs.
Now everything is OK.
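
For anyone hitting the same limit: the cap can be raised by overriding the
outlinks property in conf/nutch-site.xml. Something like the following (the
exact property name may vary by version - check nutch-default.xml for the
key your release uses):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Raise the cap on outlinks processed per page (default: 100). -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>500</value>
  </property>
</configuration>
```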

/Jack



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars