problems with link limits

problems with link limits

wynz lo
Hi everyone,

I've spent hours searching around trying to solve this and it's starting to
drive me a little nuts. You all might be my last hope in staying out of a
padded room.

I have one small site I'm trying to crawl. The site is a handful of
different JSPs that are essentially templates for people's profiles. The
different profile pages are generated by passing a uri parameter. Nutch is
actually doing a fine job of crawling the smaller pages, but the main index
is causing trouble.

The main index has a single list of 772 links in alphabetical order like
this:

http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2111&name=Adams+Rebecca
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual4421&name=Decker+Alice
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual5602&name=Lincoln+Robert
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2452&name=Small+Harry
http://localhost:8080/person.jsp?uri=http%3a%2f%2fmy.domain.com%2fns%2findividual2431&name=Whittaker+Bob
...and so on...

Nutch fetches only about the first 90-110 links (usually all the A's and
B's) and then stops. I got really excited when I found that
db.max.outlinks.per.page defaults to 100. However, changing it to -1 or a
high value doesn't fix the problem. When I change it to a small value, like
15, the fetcher grabs even fewer links, so the setting is definitely being
applied.

Any suggestions? Thanks so much.

Wynz

Re: problems with link limits

Otis Gospodnetic-2-2
Hi,

There is also a setting for the maximum number of bytes to fetch per page. If your main index page is large, it may simply be getting cut off at that limit. The property has "content" in the name, I believe, so look for it in nutch-default.xml.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----

> From: wynz lo <[hidden email]>
> To: [hidden email]
> Sent: Tuesday, June 17, 2008 6:18:26 PM
> Subject: problems with link limits


Re: problems with link limits

wynz lo
Otis,
Thank you so much, that fixed the problem immediately! The default for
http.content.limit was 64k and my index list was over 400k (those long URIs
really beef up the document size).

-Wynz
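
For anyone who hits the same truncation, an override along the following lines in conf/nutch-site.xml (which takes precedence over nutch-default.xml) should lift the limit. This is only a sketch, not the exact configuration used in this thread: the 1 MB value is an assumption chosen to clear a page of roughly 400k, and the description text is illustrative. As far as I know, a negative value such as -1 disables truncation entirely.

```xml
<!-- conf/nutch-site.xml: overrides the defaults in nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Assumed value: 1048576 bytes (1 MB), comfortably above a ~400k page.
         The default mentioned in this thread is 64k (65536 bytes). -->
    <value>1048576</value>
    <description>Maximum number of bytes to download per page over HTTP;
    longer content is truncated.</description>
  </property>
</configuration>
```

As a rough sanity check on the symptom: 400k spread over 772 links is a little over 500 bytes per link, so a 64k cutoff would leave room for something like 100-120 links, which is in line with the 90-110 that were being fetched.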

On Wed, Jun 18, 2008 at 12:45 AM, Otis Gospodnetic <[hidden email]>
wrote:

> Hi,
>
> There is also a setting for the maximal number of bytes to fetch.  If your
> main index page is large, maybe it's just getting cut off because of that.
>  The property has "content" in the name, I believe, so look for that in
> nutch-default.xml.
>
> Otis