can not deal too many files under one folder


宫照
Hi all,

I have posted this problem before, but it was not solved.

I use Nutch to crawl documents on our intranet. Some URLs have many
documents under them.

After crawling, I find that if a folder contains more than 32 files, I
can only search the first 32 documents; the rest cannot be found. I
checked the index with Luke and it shows the same thing.

It seems that only the first 32 documents are processed. If a folder has
more than 32 documents, the rest are not crawled, and every URL has the
same problem.

Does anybody know the reason?

Regards,
Gong Zhao

Re: can not deal too many files under one folder

Onur Deniz
Hi,

If 32 is a limit for all URLs, I think you may not have edited nutch-default.xml.

Take a look at that file under the conf folder.

Setting db.max.outlinks.per.page to -1 may solve your problem. Also take a look at the other variables; some of them, like http.content.limit, may cause problems in the future.

Hope this helps.

Regards,

Onur Deniz
 


--- On Tue, 9/2/08, 宫照 <[hidden email]> wrote:

> From: 宫照 <[hidden email]>
> Subject: can not deal too many files under one folder
> To: [hidden email]
> Date: Tuesday, September 2, 2008, 6:43 AM




Re: can not deal too many files under one folder

Srinivas Gokavarapu
In reply to this post by 宫照
Hi,

First check whether you have configured the settings for crawling the
intranet correctly. Here is a link; check it out:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

Also, try one thing: index just one folder containing more than 32
files.

Regards,
Srinivas.


Re: can not deal too many files under one folder

宫照
In reply to this post by Onur Deniz
Hi,

Thank you for your help.
I set db.max.outlinks.per.page to -1. Now there is no limit on the
number of files under one folder.
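
For anyone finding this thread later, the change described above can be sketched as a configuration fragment. This is a minimal sketch under assumptions: it places the override in conf/nutch-site.xml, the usual place for local overrides, whereas the thread only mentions nutch-default.xml. The http.content.limit entry is optional and its value of -1 is an illustration, not something confirmed in this thread.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- A negative value means all outlinks found on a page are processed,
       so a folder listing is no longer cut off after a fixed number of links. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>

  <!-- Optional, assumed value: a negative limit disables truncation of
       fetched content. Large listings or documents may otherwise be cut short. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

Pages that were already fetched under the old limit will probably need to be crawled again before the additional documents show up in the index.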


2008/9/2 Onur Deniz <[hidden email]>