Why is lucene so slow indexing in nfs file system ?

Why is lucene so slow indexing in nfs file system ?

Ariel Isaac Romero Cartaya
Hi:
I have seen the post at
http://www.mail-archive.com/lucene-user@.../msg12700.html and
I am implementing a similar application in a distributed environment, a
cluster of only 5 nodes. The operating system I use is Linux (CentOS),
so I am also using an NFS file system to access the home directory where the
documents to be indexed reside, and I would like to know how much time an
application should spend to index a large amount of documents, say 10 GB.
I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU and 512 MB
of RAM, and the LAN is 1 Gbit/s.

The problem I have is that my application spends a lot of time indexing all
the documents: the delay to index 10 GB of PDF documents is about 2 days (to
convert PDF to text I am using PDFBox), which is of course a lot of time;
other applications based on Lucene, for instance IBM OmniFind, take only 5
hours to index the same amount of PDF documents. I would like to find out
why my application has this big indexing delay; any help is welcome.
Do you know of other distributed-architecture applications that use Lucene to
index large amounts of documents? How long do they take to index?
I hope you can help me.
Greetings

Re: Why is lucene so slow indexing in nfs file system ?

Erick Erickson
<<< would like to find out why my application has this big
delay to index>>>

Well, then you have to measure <G>. The first thing I'd do
is pinpoint where the time is being spent. Until you have
that answered, you simply cannot take any meaningful action.

1> Don't do any of the indexing. No new Documents, don't
add any fields, etc. This will just time the PDF parsing.
(I'd run this for a set number of documents rather than the
whole 10 GB.) This'll tell you whether the issue is indexing or
PDFBox; see the timing sketch after this list.

2> Perhaps try the above with local files rather than files
on the NFS mount.

3> Put back some of the indexing and measure each
step. For instance, create the new Documents but don't
add them to the index.

4> Then go ahead and add them to the index.

The numbers you get for these measurements will tell
you a lot. At that point, perhaps folks will have more useful
suggestions.
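
To make step 1> concrete, here is a minimal timing sketch (not from the original
messages). It assumes the pre-Apache PDFBox 0.7.x package names (org.pdfbox.*)
and a hypothetical local directory containing only PDFs; it extracts text and
throws it away, so the elapsed time is parsing cost only:

import java.io.File;
import java.io.FileInputStream;

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class TimePdfParsing {
    public static void main(String[] args) throws Exception {
        File[] pdfs = new File(args[0]).listFiles();   // assumed: a directory of PDFs only
        PDFTextStripper stripper = new PDFTextStripper();

        long start = System.currentTimeMillis();
        long chars = 0;
        for (int i = 0; i < pdfs.length; i++) {
            FileInputStream in = new FileInputStream(pdfs[i]);
            PDDocument doc = PDDocument.load(in);
            try {
                // Extract the text but build no Lucene Documents and add nothing to any index.
                chars += stripper.getText(doc).length();
            } finally {
                doc.close();
                in.close();
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(pdfs.length + " PDFs, " + chars + " chars extracted in " + elapsed + " ms");
    }
}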

The reason I'm being so unhelpful is that without lots more
detail there's really nothing we can help with; there are
so many variables that it's just impossible to say
which one is the problem. For instance, is it a single
10 GB document and you're swapping like crazy? Are you
CPU bound or I/O bound? Have you tried profiling your
process at all to find the choke points?

Best
Erick



RE: Why is lucene so slow indexing in nfs file system ?

steve_rowe
In reply to this post by Ariel Isaac Romero Cartaya
Hi Ariel,

On 01/09/2008 at 8:50 AM, Ariel wrote:
> Do you know of other distributed-architecture applications that
> use Lucene to index large amounts of documents?

Apache Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat.

http://lucene.apache.org/solr/

Steve



Re: Why is lucene so slow indexing in nfs file system ?

Grant Ingersoll-2
There's also Nutch. However, 10 GB isn't that big... Perhaps you can
index on the machine where the docs/index live, then just make the index
available via NFS? Or, better yet, use rsync to replicate it the way Solr does.

-Grant


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: Why is lucene so slow indexing in nfs file system ?

adb
In reply to this post by Ariel Isaac Romero Cartaya
Ariel wrote:

> The problem I have is that my application spends a lot of time indexing all
> the documents: the delay to index 10 GB of PDF documents is about 2 days (to
> convert PDF to text I am using PDFBox), which is of course a lot of time;
> other applications based on Lucene, for instance IBM OmniFind, take only 5
> hours to index the same amount of PDF documents. I would like to find out

If you are using log4j, make sure you have the PDFBox log4j categories set to
INFO or higher, otherwise logging really slows it down (by a factor of 10), or
make sure you are using the non-log4j version.  See
http://sourceforge.net/forum/message.php?msg_id=3947448
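
For reference, one way to apply that tip programmatically (a sketch, not from
the thread; it assumes log4j 1.2 and the pre-Apache PDFBox package prefix
org.pdfbox, so adjust the category name to whatever your PDFBox build uses):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class QuietPdfBox {
    // Call once before any parsing starts so PDFBox's DEBUG logging does not
    // dominate the run time. The equivalent log4j.properties line would be:
    //   log4j.logger.org.pdfbox=INFO
    public static void raisePdfBoxLogLevel() {
        Logger.getLogger("org.pdfbox").setLevel(Level.INFO);
    }
}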

Antony




Re: Why is lucene so slow indexing in nfs file system ?

Otis Gospodnetic-2
In reply to this post by Ariel Isaac Romero Cartaya
Ariel,

I believe PDFBox is not the fastest thing and was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author might still be on this list and might comment).  Pulling data from NFS to index seems like a bad idea.  I hope at least the indices are local and not on a remote NFS...

We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and indexing over NFS was slooooooow.

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Why is lucene so slow indexing in nfs file system ?

Ariel Isaac Romero Cartaya
Thank you all for your answers; I am going to change a few things in my
application and run tests.
One thing: I haven't found another good PDF-to-text converter like PDFBox.
Do you know of any faster one?
Greetings
Thanks for your answers
Ariel


Re: Why is lucene so slow indexing in nfs file system ?

Ariel Isaac Romero Cartaya
In a distributed environment the application is bound to make heavy use of
the network, and there is no other way to access the documents in a
remote repository than through the NFS file system.
One thing I must clarify: I index the documents in memory, using a
RAMDirectory; when the RAMDirectory reaches a limit (I have
set about 10 MB), I serialize the index to disk (NFS) and merge it with
the central index (the central index is on the NFS file system). Is that correct?
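
For readers following along, the batching pattern described above looks roughly
like this against the Lucene 2.2 API (a minimal sketch; the field name, analyzer,
paths and the ~10 MB threshold are assumptions, and the replies below argue a
plain FSDirectory is simpler):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamThenMergeSketch {
    private static final long FLUSH_BYTES = 10 * 1024 * 1024; // the ~10 MB limit mentioned above

    public static void indexBatch(String[] texts, String centralIndexPath) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);

        for (int i = 0; i < texts.length; i++) {
            Document doc = new Document();
            doc.add(new Field("contents", texts[i], Field.Store.NO, Field.Index.TOKENIZED));
            ramWriter.addDocument(doc);

            if (ramDir.sizeInBytes() > FLUSH_BYTES) {     // in-RAM segment reached the limit
                ramWriter.close();
                mergeIntoCentral(ramDir, centralIndexPath, analyzer);
                ramDir = new RAMDirectory();              // start a fresh in-RAM index
                ramWriter = new IndexWriter(ramDir, analyzer, true);
            }
        }
        ramWriter.close();
        mergeIntoCentral(ramDir, centralIndexPath, analyzer);
    }

    private static void mergeIntoCentral(RAMDirectory ramDir, String centralIndexPath,
                                         StandardAnalyzer analyzer) throws Exception {
        // The central index lives on the NFS mount in the setup described above;
        // create=false assumes it already exists.
        IndexWriter central = new IndexWriter(FSDirectory.getDirectory(centralIndexPath),
                                              analyzer, false);
        central.addIndexes(new Directory[] { ramDir });   // note: addIndexes also optimizes
        central.close();
    }
}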
I hope you can help me.
I have taken the suggestions you made into consideration; I am
going to run some tests.
Ariel



Re: Why is lucene so slow indexing in nfs file system ?

Erick Erickson
This seems really clunky, especially if your merge step also optimizes.

There's not much point in indexing into RAM and then merging explicitly.
Just use an FSDirectory rather than a RAMDirectory. There is *already*
buffering built into FSDirectory, and your merge factor etc. control
how much RAM is used before flushing to disk. There's considerable
discussion of this on the Wiki I believe, but in the mail archive for sure.
And I believe there's a RAM-usage-based flushing policy somewhere.
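
As a rough illustration of that suggestion (a sketch under assumed paths, field
names and settings, not code from the thread): open one IndexWriter on a local
FSDirectory and let its internal buffering do the work. setMergeFactor and
setMaxBufferedDocs are the Lucene 2.2 knobs; 2.3 adds setRAMBufferSizeMB.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class LocalFsIndexingSketch {
    public static void main(String[] args) throws Exception {
        // Index on local disk; copy (or rsync) the finished index to NFS afterwards.
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/local/disk/index"),
                                             new StandardAnalyzer(), true);
        writer.setMergeFactor(10);        // how many segments accumulate before a merge
        writer.setMaxBufferedDocs(1000);  // docs buffered in RAM before flushing to disk

        for (int i = 0; i < args.length; i++) {  // args stand in for the extracted PDF text
            Document doc = new Document();
            doc.add(new Field("contents", args[i], Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);             // buffering and flushing handled internally
        }
        writer.optimize();                       // optional, once, at the very end
        writer.close();
    }
}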

You're adding complexity where it's probably not necessary. Did you
adopt this scheme because you *thought* it would be faster or because
you were addressing a *known* problem? Don't *ever* write complex code
to support a theoretical case unless you have considerable certainty
that it really is a problem. "It would be faster" is a weak argument when
you don't know whether you're talking about saving 1% or 95%. The
added maintenance is just not worth it.

There's a famous quote about that from Donald Knuth
(paraphrasing Hoare) "We should forget about small efficiencies,
say about 97% of the time: premature optimization is the root of
all evil." It's true.

So the very *first* measurement I'd take is to get rid of the in-RAM
stuff and just write the index to local disk. I suspect you'll be *far*
better off doing this and then just copying your index to the NFS mount.

Best
Erick


Re: Why is lucene so slow indexing in nfs file system ?

Michael McCandless-2

If possible you should also test the soon-to-be-released version 2.3,  
which has a number of speedups to indexing.

Also try the steps here:

   http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

You should also try an A/B test: A) writing your index to the NFS
directory and then B) writing it to a local disk, to see how much NFS is
really slowing you down.
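
A crude harness for that A/B test might look like this (a sketch with assumed
mount points; it indexes the same synthetic documents into the NFS mount and
into a local directory and prints both times):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class NfsVsLocalTimer {
    public static void main(String[] args) throws Exception {
        System.out.println("A) nfs:   " + indexInto("/mnt/nfs/test-index") + " ms");
        System.out.println("B) local: " + indexInto("/local/disk/test-index") + " ms");
    }

    static long indexInto(String path) throws Exception {
        long start = System.currentTimeMillis();
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(path),
                                             new StandardAnalyzer(), true);
        for (int i = 0; i < 10000; i++) {        // identical synthetic docs for both runs
            Document doc = new Document();
            doc.add(new Field("contents", "sample text for document number " + i,
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();
        return System.currentTimeMillis() - start;
    }
}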

Mike


Re: Why is lucene so slow indexing in nfs file system ?

Ariel Isaac Romero Cartaya
In reply to this post by Erick Erickson
I am indexing into RAM and then merging explicitly because my application demands
it: I have designed it for a distributed environment, so many threads or
workers on different machines index into RAM and serialize to disk, and
another thread on another machine accesses each segment index to merge it with
the principal one. That is faster than if I had just one thread indexing the
documents, isn't it?
Your suggestions are very useful.
I hope you can help me.
Greetings
Ariel


Re: Why is lucene so slow indexing in nfs file system ?

Otis Gospodnetic-2
In reply to this post by Ariel Isaac Romero Cartaya
Ariel,
 
Comments inline.


----- Original Message ----
From: Ariel <[hidden email]>
To: [hidden email]
Sent: Thursday, January 10, 2008 10:05:28 AM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

> In a distributed environment the application is bound to make heavy use of
> the network, and there is no other way to access the documents in a
> remote repository than through the NFS file system.

OG: What about a SAN connected over FC, for example?

> One thing I must clarify: I index the documents in memory, using a
> RAMDirectory; when the RAMDirectory reaches a limit (I have set about
> 10 MB), I serialize the index to disk (NFS) and merge it with the central
> index (the central index is on the NFS file system). Is that correct?

OG: Nah, don't bother with RAMDirectory; just use FSDirectory and it will do the in-memory thing for you.  Make good use of your RAM and use 2.3, which gives you more control over RAM use during indexing.  Parallelizing indexing over multiple machines and merging at the end is faster, so that's a good approach.  Also, if your boxes have multiple CPUs, write your code so that it has multiple worker threads that do indexing and feed docs to IndexWriter.addDocument(Document) to keep the CPUs fully utilized.
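
A minimal sketch of that multi-threaded feeding pattern (the thread count, field
name, paths and the text-extraction placeholder are assumptions; IndexWriter is
documented as safe for concurrent addDocument calls):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class MultiThreadedFeeder {
    public static void main(String[] args) throws Exception {
        final IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/local/disk/index"),
                                                   new StandardAnalyzer(), true);
        final File[] pdfs = new File(args[0]).listFiles();  // assumed: a directory of PDFs
        final int nThreads = 4;                             // e.g. one worker per CPU core
        Thread[] workers = new Thread[nThreads];

        for (int t = 0; t < nThreads; t++) {
            final int offset = t;
            workers[t] = new Thread() {
                public void run() {
                    // Each worker takes every nThreads-th file: the CPU-heavy parsing runs
                    // in parallel, while the single shared writer serializes addDocument calls.
                    for (int i = offset; i < pdfs.length; i += nThreads) {
                        try {
                            String text = extractText(pdfs[i]);
                            Document doc = new Document();
                            doc.add(new Field("contents", text,
                                              Field.Store.NO, Field.Index.TOKENIZED));
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }
            };
            workers[t].start();
        }
        for (int t = 0; t < nThreads; t++) {
            workers[t].join();
        }
        writer.close();
    }

    static String extractText(File pdf) {
        return "";  // placeholder: plug in the PDF-to-text step discussed elsewhere in the thread
    }
}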

OG: Oh, something faster than PDFBox?  There is (can't remember the name now... itextstream or something like that?), though it may not be free like PDFBox.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Why is lucene so slow indexing in nfs file system ?

Ariel Isaac Romero Cartaya
Thanks for your suggestions.

I'm sorry, I didn't know: what do you mean by "SAN"
and "FC"?

Another thing: I have visited the Lucene home page and the 2.3 version has not
been released yet; could you tell me where the download link is?

Thanks in advance.
Ariel


Re: Why is lucene so slow indexing in nfs file system ?

chrislusf
SAN is Storage Area Network. FC is Fibre Channel.

I can confirm from one customer's experience that using a SAN scales
pretty well and is pretty simple. Well, it costs some money.

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request)
got 2.6 Million Euro funding!



Re: Why is lucene so slow indexing in nfs file system ?

Otis Gospodnetic-2
In reply to this post by Ariel Isaac Romero Cartaya
2.3 is in the process of being released.  Give it another week to 10 days and it will be out.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
