"db.max.outlinks.per.page" is misunderstood?


Jack.Tang
Hi All

Here is the "db.max.outlinks.per.page" property and its description in
nutch-default.xml
        <property>
          <name>db.max.outlinks.per.page</name>
          <value>100</value>
          <description>The maximum number of outlinks that we'll process for a page.
          </description>
        </property>

I don't think the description is right.
Say my crawler's feed URLs are:
http://www.a.com/index.php (90 outlinks)
http://www.b.com/index.jsp  (80 outlinks)
http://www.c.com/index.html (50 outlinks)

and the number of crawler threads is 30. Do you think the remaining URLs
((80 - 10) outlinks + 50 outlinks) will be fetched?

I think the description should be "The maximum number of outlinks in
one fetching phase."


Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: "db.max.outlinks.per.page" is misunderstood?

AJ Chen
My understanding is that, when the web db is updated, only up to the
maximum number of outlinks are processed for a page. I assume the same
page won't be fetched and processed again in the next fetch/update
cycles, so you won't get the outlinks beyond the maximum no matter how
many cycles you run.

To make sure all of the outlinks are processed for a page,
db.max.outlinks.per.page must be set to a number larger than the number
of outlinks on the page. If this is true, then the max number has to be
determined at run time, since the number of outlinks varies from page
to page.
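
The per-page reading described above can be sketched in a few lines of Python. This is only a toy model, not Nutch code: the helper name `cap_outlinks` and the example URLs are hypothetical, and it assumes the limit simply keeps the first N outlinks found on a page.

```python
# Toy model of a per-page outlink cap; not Nutch's actual implementation.
MAX_OUTLINKS_PER_PAGE = 100  # mirrors the default value of db.max.outlinks.per.page


def cap_outlinks(outlinks, limit=MAX_OUTLINKS_PER_PAGE):
    """Return (kept, dropped), assuming the first `limit` outlinks are kept."""
    return outlinks[:limit], outlinks[limit:]


# Hypothetical page with 150 outlinks: 100 are kept, 50 are dropped.
page_links = [f"http://www.example.com/link{i}" for i in range(150)]
kept, dropped = cap_outlinks(page_links)
print(len(kept), len(dropped))
```

Under this assumption, a page with fewer outlinks than the limit loses nothing, and the limit only matters for pages that exceed it.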

Is my understanding correct?

AJ


Jack Tang wrote:

>Hi All
>
>Here is the "db.max.outlinks.per.page" property and its description in
>nutch-default.xml
>  <property>
>    <name>db.max.outlinks.per.page</name>
>    <value>100</value>
>    <description>The maximum number of outlinks that we'll process for a page.
>    </description>
>  </property>
>
>I don't think the description is right.
>Say my crawler's feed URLs are:
>http://www.a.com/index.php (90 outlinks)
>http://www.b.com/index.jsp  (80 outlinks)
>http://www.c.com/index.html (50 outlinks)
>
>and the number of crawler threads is 30. Do you think the remaining URLs
>((80 - 10) outlinks + 50 outlinks) will be fetched?
>
>I think the description should be "The maximum number of outlinks in
>one fetching phase."
>
>Regards
>/Jack

--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [hidden email]
---------------------------------------------------

Re: "db.max.outlinks.per.page" is misunderstood?

Jack.Tang
Hi Chen

I don't think it is a limit on ONE page, but on ONE fetching phase (cycle).
In my previous example,

feed URLs:
http://www.a.com/index.php (90 outlinks)
http://www.b.com/index.jsp  (80 outlinks)
http://www.c.com/index.html (50 outlinks)
90 + 80 + 50 = 220 outlinks, and they are all different. I used the
protocol-httpclient plugin.
In one fetching cycle, once the total number of fetched outlinks reaches 100,
the others will be abandoned. Right?
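
The two readings give different numbers for the three example pages. The Python sketch below is plain arithmetic under stated assumptions, not Nutch code: `kept_per_page` and `kept_per_cycle` are hypothetical helper names, and which reading matches Nutch's real behaviour is exactly what this thread is trying to settle.

```python
# Toy arithmetic for the three example pages; not Nutch code.
LIMIT = 100  # default value of db.max.outlinks.per.page

outlink_counts = {
    "http://www.a.com/index.php": 90,
    "http://www.b.com/index.jsp": 80,
    "http://www.c.com/index.html": 50,
}


def kept_per_page(counts, limit=LIMIT):
    """Per-page reading: each page keeps at most `limit` of its own outlinks."""
    return sum(min(n, limit) for n in counts.values())


def kept_per_cycle(counts, limit=LIMIT):
    """Per-cycle reading (the hypothesis in this message): one shared budget."""
    return min(sum(counts.values()), limit)


total = sum(outlink_counts.values())
print("total outlinks:", total)
print("kept under per-page reading:", kept_per_page(outlink_counts))
print("kept under per-cycle reading:", kept_per_cycle(outlink_counts))
```

With these counts, the per-page reading keeps all 220 outlinks because no single page exceeds 100, while the per-cycle reading would keep only 100 and abandon 120.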

/Jack

On 9/8/05, AJ Chen <[hidden email]> wrote:

> My understanding is that, when the web db is updated, only up to the
> maximum number of outlinks are processed for a page. I assume the same
> page won't be fetched and processed again in the next fetch/update
> cycles, so you won't get the outlinks beyond the maximum no matter how
> many cycles you run.
>
> To make sure all of the outlinks are processed for a page,
> db.max.outlinks.per.page must be set to a number larger than the number
> of outlinks on the page. If this is true, then the max number has to be
> determined at run time, since the number of outlinks varies from page
> to page.
>
> Is my understanding correct?
>
> AJ



Re: "db.max.outlinks.per.page" is misunderstood?

Stefan Groschupf-2
In reply to this post by Jack.Tang
Jack,
That is the maximum number of outlinks per HTML page.
All your example pages have fewer than 100 outlinks, right?
Stefan

On 07.09.2005 at 18:43, Jack Tang wrote:

> Hi All
>
> Here is the "db.max.outlinks.per.page" property and its description in
> nutch-default.xml
>     <property>
>       <name>db.max.outlinks.per.page</name>
>       <value>100</value>
>       <description>The maximum number of outlinks that we'll process for a page.
>       </description>
>     </property>
>
> I don't think the description is right.
> Say my crawler's feed URLs are:
> http://www.a.com/index.php (90 outlinks)
> http://www.b.com/index.jsp  (80 outlinks)
> http://www.c.com/index.html (50 outlinks)
>
> and the number of crawler threads is 30. Do you think the remaining URLs
> ((80 - 10) outlinks + 50 outlinks) will be fetched?
>
> I think the description should be "The maximum number of outlinks in
> one fetching phase."
>
>
> Regards
> /Jack

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: "db.max.outlinks.per.page" is misunderstood?

Jack.Tang
Yes, Stefan.
But it missed some URLs. After I set the value to 3000, everything was OK.

/Jack

On 9/8/05, Stefan Groschupf <[hidden email]> wrote:

> Jack,
> That is the maximum number of outlinks per HTML page.
> All your example pages have fewer than 100 outlinks, right?
> Stefan



Re: "db.max.outlinks.per.page" is misunderstood?

AJ Chen
Jack,
Set the max to 100, but run 10 cycles (i.e., depth=10) with the
CrawlTool. You may see that all the outlinks are collected toward the end.
3 cycles are usually not enough.
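
One possible way more cycles could help is that a link dropped from one page by the cap may still be listed on another page that gets fetched in a later cycle. The Python sketch below is a toy breadth-first model of that idea, not the CrawlTool itself: the `crawl` function, the link graph, and the cap of 2 are all hypothetical, and it assumes each cycle fetches every known but unfetched URL and keeps only the first `limit` outlinks per page.

```python
# Toy breadth-first crawl with a per-page outlink cap; not Nutch's CrawlTool.
def crawl(graph, seeds, limit, depth):
    """Return the set of URLs known after `depth` fetch/update cycles."""
    known = set(seeds)
    fetched = set()
    for _ in range(depth):
        to_fetch = sorted(known - fetched)  # discovered but not yet fetched
        if not to_fetch:
            break
        for url in to_fetch:
            fetched.add(url)
            # Assumption: only the first `limit` outlinks of a page are kept.
            known.update(graph.get(url, [])[:limit])
    return known


# Hypothetical link graph: "seed" lists four outlinks, but with a cap of 2
# the links to "c" and "d" are dropped there; "a" and "b" also link to them.
graph = {
    "seed": ["a", "b", "c", "d"],
    "a": ["c"],
    "b": ["d"],
}

for depth in (1, 2, 3):
    print(depth, sorted(crawl(graph, ["seed"], limit=2, depth=depth)))
```

In this model, a URL that no other page links to would stay missing however many cycles are run, and raising the limit instead recovers it in the first cycle.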
-AJ

Jack Tang wrote:

>Yes, Stefan.
>But it missed some URLs. After I set the value to 3000, everything was OK.
>
>/Jack


Re: "db.max.outlinks.per.page" is misunderstood?

Jack.Tang
Thanks Chen, I will try that:)

On 9/8/05, AJ Chen <[hidden email]> wrote:

> Jack,
> Set the max to 100, but run 10 cycles (i.e., depth=10) with the
> CrawlTool. You may see that all the outlinks are collected toward the end.
> 3 cycles are usually not enough.
> -AJ