Effect of no topN argument in generate

Smith Norton
In the bin/generate command, if I omit the 'topN' argument, what is
the behavior?

Does it generate all possible URLs or does it assume a default topN value?

I tried omitting topN value in my crawl script and I find that my
crawl is running much faster. Earlier I had a -topN 2000 argument and
it used to take 4-5 days to finish a crawl of depth 5.

Now, without the topN argument, it finished a crawl of depth 5 in 6
hours. Can anyone explain what's going on?

Re: Effect of no topN argument in generate

Rikard Lindner
There is a default value in nutch-default.xml

/Rikard

2007/9/6, Smith Norton <[hidden email]>:

>
> In the bin/generate command, if I omit the 'topN' argument, what is
> the behavior?
>
> Does it generate all possible URLs or does it assume a default topN value?
>
> I tried omitting topN value in my crawl script and I find that my
> crawl is running much faster. Earlier I had a -topN 2000 argument and
> it used to take 4-5 days to finish a crawl of depth 5.
>
> Now, without the topN argument, it finished a crawl of depth 5 in 6
> hours. Can anyone explain what's going on?
>

Re: Effect of no topN argument in generate

Smith Norton
Thanks for the response. What is the property name for this default
value of topN in nutch-default.xml?

On 9/6/07, Rikard Lindner <[hidden email]> wrote:

> There is a default value in nutch-default.xml
>
> /Rikard

Re: Effect of no topN argument in generate

Rikard Lindner
Now I'm getting a bit uncertain, but I think you can add crawl.topN to your
nutch-site.xml. I couldn't find it in nutch-default.xml either, but I'm quite
sure it is set somewhere!

/Rikard

2007/9/6, Smith Norton <[hidden email]>:

>
> Thanks for the response. What is the property name for this default
> value of topN in nutch-default.xml?

Re: Effect of no topN argument in generate

Smith Norton
I have not added any such thing in my nutch-site.xml and I have
omitted -topN argument in bin/generate command.

So my question is what would be the effect in this case. I was
expecting that it would be same as -topN <infinity>. So it should
generate all possible URLs in the generate phase.

I tried omitting topN value in my crawl script and I find that my
crawl is running much faster. Earlier I had a -topN 2000 argument and
it used to take 4-5 days to finish a crawl of depth 5.

Now, without the topN argument, it finished a crawl of depth 5 in 6
hours. How?

On 9/7/07, Rikard Lindner <[hidden email]> wrote:

> Now I'm getting a bit uncertain, but I think you can add crawl.topN to your
> nutch-site.xml. I couldn't find it in nutch-default.xml either, but I'm quite
> sure it is set somewhere!
>
> /Rikard

Re: Re: Effect of no topN argument in generate

Marcin Okraszewski-3
According to http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20generate
the value is Long.MAX_VALUE.

Did you run both tests under the same conditions? Or did you perhaps first run the crawl with -topN 2000 and then run it again without the parameter on the same crawl db? In that case there may not be much left to crawl.
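
To illustrate what a Long.MAX_VALUE default means in practice, here is a minimal Java sketch. It is not Nutch's actual Generator code: the class TopNSketch, both method names, and the example URLs are hypothetical, and the candidate list is assumed to be sorted by score already. It only shows that a missing -topN behaves like an unlimited cap.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch; this is not Nutch's actual Generator code. */
final class TopNSketch {

    /** Returns the value following "-topN", or Long.MAX_VALUE when the flag is absent. */
    static long parseTopN(String[] args) {
        long topN = Long.MAX_VALUE; // default: effectively no limit
        for (int i = 0; i < args.length - 1; i++) {
            if ("-topN".equals(args[i])) {
                topN = Long.parseLong(args[i + 1]);
            }
        }
        return topN;
    }

    /** Keeps at most topN candidates; the input is assumed to be sorted by score. */
    static List<String> select(List<String> candidates, long topN) {
        return candidates.stream().limit(topN).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for entries that are due for fetching.
        List<String> due = Arrays.asList(
                "http://central/a", "http://central/b", "http://central/c");

        // With -topN 2 only two URLs are selected.
        System.out.println(select(due, parseTopN(new String[] {"-topN", "2"})).size());
        // Without -topN every due URL is selected.
        System.out.println(select(due, parseTopN(new String[] {})).size());
    }
}
```

With a cap this large, the size of a generated segment is bounded only by how many URLs in the crawl db are due for fetching.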

Regards,
Marcin


> I have not added any such thing in my nutch-site.xml and I have
> omitted -topN argument in bin/generate command.
>
> So my question is what would be the effect in this case. I was
> expecting that it would be same as -topN <infinity>. So it should
> generate all possible URLs in the generate phase.
>
> I tried omitting topN value in my crawl script and I find that my
> crawl is running much faster. Earlier I had a -topN 2000 argument and
> it used to take 4-5 days to finish a crawl of depth 5.
>
> Now, without the topN argument, it finished a crawl of depth 5 in 6
> hours. How?


Re: Re: Effect of no topN argument in generate

misc

Hello-

    I don't know if this is the same problem, but as I reported a couple of
days ago, I am seeing very inconsistent generate times.  Generating URLs has
taken anywhere from minutes to many hours.  I think this is a bug in the
current version of Nutch, but I have not been able to track it down yet.

    In my case, when generate is slow it seems to generate a bunch of URLs,
pause for a second, and repeat, over and over.  When it is fast it just
generates them in one batch.  Try changing the log level to debug and watch
the URLs being processed.  If you see a scroll-halt-scroll-halt pattern, you
are seeing the same behavior I am.  If you see constant scrolling, the
problem is not present and you should get quick results.
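
If watching the log by eye is impractical, the check can be roughed out in code. The Java sketch below is only an illustration under stated assumptions: it assumes you have already extracted one timestamp in milliseconds per logged URL (the extraction step depends on your log format and is not shown), and the class name PauseDetector, the method findPauses, and the 1000 ms threshold are all hypothetical choices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for spotting a scroll-halt pattern in log timestamps. */
final class PauseDetector {

    /**
     * Returns every index i where the gap between entry i-1 and entry i
     * is at least thresholdMillis.
     */
    static List<Integer> findPauses(long[] timestampsMillis, long thresholdMillis) {
        List<Integer> pauses = new ArrayList<>();
        for (int i = 1; i < timestampsMillis.length; i++) {
            if (timestampsMillis[i] - timestampsMillis[i - 1] >= thresholdMillis) {
                pauses.add(i);
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        // Made-up data: bursts of entries separated by one-second halts.
        long[] bursty = {0, 5, 10, 1010, 1015, 1020, 2020};
        // Made-up data: entries arriving at a constant rate.
        long[] steady = {0, 5, 10, 15, 20, 25, 30};

        System.out.println(findPauses(bursty, 1000)); // [3, 6]
        System.out.println(findPauses(steady, 1000)); // []
    }
}
```

Many regularly spaced pauses would match the scroll-halt behavior described above; an empty result would match the constant-scroll case.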

                        thanks
                            -Jim


----- Original Message -----
From: "Marcin Okraszewski" <[hidden email]>
To: <[hidden email]>
Sent: Thursday, September 06, 2007 12:28 PM
Subject: Re: Re: Effect of no topN argument in generate


> According to
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20generate
> the value is Long.MAX_VALUE.
>
> Did you run both tests under the same conditions? Or did you perhaps first
> run the crawl with -topN 2000 and then run it again without the parameter on
> the same crawl db? In that case there may not be much left to crawl.
>
> Regards,
> Marcin


Re: Re: Effect of no topN argument in generate

Smith Norton
In reply to this post by Marcin Okraszewski-3
There is a small difference in the conditions.

A. First condition when a complete crawl of depth 5 takes around 5 days:-

  1. Only 7 URLs in the seed URL file 'urls/url'.
  2. -topN 2000 is the argument to generate

B. Second condition when a complete crawl of depth 5 takes around 6 hours:-

  1. Around 60 URLs in the seed URL file 'urls/url'.
  2. No '-topN' argument for generate. This argument is omitted.

I would also like to mention what the extra 53 URLs are in case B.

In case A, one of the seed URLs is 'http://central/'. Its home page has
a sidebar with many links to other important pages of the 'central'
site. As with most sidebars, this set of links appears on every page of
the 'central' site.

I collected these sidebar URLs (53 in total) and added them to the seed
URL file in case B.

Can anyone explain why case B should drastically reduce crawl duration
from 5 days to 6 hours?
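
For reference, the two cases also differ in how many URLs a crawl can select at most. The Java sketch below works out that upper bound, assuming one generate step per depth level, each capped at topN. The class CrawlBound and its method are hypothetical names for illustration, and the bound says nothing by itself about how long a crawl takes.

```java
/** Hypothetical helper: upper bound on URLs selected over a whole crawl. */
final class CrawlBound {

    /**
     * Assumes one generate step per depth level, each selecting at most
     * topN URLs. The result saturates at Long.MAX_VALUE instead of overflowing.
     */
    static long maxUrlsSelected(int depth, long topN) {
        if (depth <= 0 || topN < 0) {
            throw new IllegalArgumentException("depth must be > 0 and topN >= 0");
        }
        if (topN > Long.MAX_VALUE / depth) {
            return Long.MAX_VALUE; // effectively unlimited
        }
        return depth * topN;
    }

    public static void main(String[] args) {
        // Case A: depth 5 with -topN 2000.
        System.out.println(maxUrlsSelected(5, 2000L)); // 10000
        // Case B: depth 5 with no -topN, i.e. Long.MAX_VALUE per step.
        System.out.println(maxUrlsSelected(5, Long.MAX_VALUE) == Long.MAX_VALUE); // true
    }
}
```

So case A could select at most 10,000 URLs in total, while case B had no practical cap, which makes the shorter run time in case B all the more surprising.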

On 9/7/07, Marcin Okraszewski <[hidden email]> wrote:

> According to http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20generate
> the value is Long.MAX_VALUE.
>
> Did you run both tests under the same conditions? Or did you perhaps first run the crawl with -topN 2000 and then run it again without the parameter on the same crawl db? In that case there may not be much left to crawl.
>
> Regards,
> Marcin