Making lucene indexing multi threaded

nischal reddy
Hi,

I am thinking of making my Lucene indexing multi-threaded. Can someone shed
some light on the best approach for achieving this?

Here is a short summary of what I am trying to do; please suggest the best
way to tackle it.

What am I trying to do?

I am building an index for around 30,000 files, and will later use this
index to search the contents of the files. The usual sequential approach
works fine but takes a huge amount of time: around 30 minutes. Is this the
expected time, or am I doing something wrong somewhere?

What am I thinking of doing?

To improve performance, I am thinking of making my application
multi-threaded.

Need suggestions :)

Please suggest the best ways to do this. How long does Lucene normally
take to index 30k files?

Please also point me to some examples (or best practices for
multithreading Lucene) that would make my application more robust.

TIA,
Nischal Y

Re: Making lucene indexing multi threaded

Adrien Grand
Hi,

Lucene's IndexWriter can safely accept updates coming from several
threads. Just make sure to share the same IndexWriter instance across
all threads; no external locking is necessary.

30 minutes sounds like a lot for 30000 files unless they are large.
You can have a look at
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed which gives
good advice on how to improve Lucene indexing speed.

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Making lucene indexing multi threaded

Erick Erickson
In reply to this post by nischal reddy
Stop. Back up. Test. <G>....

The very _first_ thing I'd do is just comment out the bit that
actually indexes the content. I'm guessing you have some
loop like:

while (more files) {
   read the file
   transform the data
   create a Lucene document
   index the document
}

Just comment out the "index the document" line and see how
long _that_ takes. 9 times out of 10, the bottleneck is in
reading and transforming the data, not in the indexing itself.
As a comparison, I can index 3-4K docs/second on my laptop.
This is using Solr on the Wikipedia dump, so the docs
are several KB each.
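
A minimal sketch of that measurement, for illustration only: instead of commenting code out, it times the two halves of the loop separately. StageTimer and its acquire/index parameters are hypothetical stand-ins (not Lucene APIs) for "read the file + transform the data" and "index the document".

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Times the "acquire" and "index" halves of an indexing loop separately. */
class StageTimer {
    long acquireNanos; // total time spent reading + transforming
    long indexNanos;   // total time spent in the indexing call

    /**
     * @param acquire hypothetical stand-in for "read the file + transform the data"
     * @param index   hypothetical stand-in for "index the document"
     */
    <I, D> void run(List<I> inputs, Function<I, D> acquire, Consumer<D> index) {
        for (I input : inputs) {
            long t0 = System.nanoTime();
            D doc = acquire.apply(input);
            long t1 = System.nanoTime();
            index.accept(doc);
            long t2 = System.nanoTime();
            acquireNanos += t1 - t0;
            indexNanos += t2 - t1;
        }
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Dummy stages; plug in the real file reading and the real indexing call here.
        timer.run(Arrays.asList("a.txt", "b.txt"), name -> name.toUpperCase(), doc -> { });
        System.out.printf("acquire: %d ms, index: %d ms%n",
                timer.acquireNanos / 1_000_000, timer.indexNanos / 1_000_000);
    }
}
```

Whichever total dominates shows where the effort should go before any threads are added.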

So, if you're going to multi-thread, you'll probably want to
multi-thread the acquisition of the data and feed that
through a separate thread that actually does the indexing;
you don't want multiple IndexWriters active at once.
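
One possible shape for this arrangement, as a sketch using only java.util.concurrent: several reader threads prepare documents and hand them over a bounded queue to a single thread that does the indexing. AcquireThenIndex, the queue capacity of 1000 and the acquire/index parameters are assumptions made for illustration; in a real application the index step would wrap the one shared IndexWriter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

class AcquireThenIndex {
    /** Runs acquire on `readers` threads and index on exactly one thread. */
    static <I, D> void run(List<I> inputs, int readers,
                           Function<I, D> acquire, Consumer<D> index) {
        // Bounded queue so readers cannot run arbitrarily far ahead of the indexer.
        BlockingQueue<Optional<D>> queue = new ArrayBlockingQueue<>(1000);

        // Single consumer: the only thread that ever calls the indexing step.
        Thread indexer = new Thread(() -> {
            try {
                // An empty Optional marks the end of the input.
                for (Optional<D> doc = queue.take(); doc.isPresent(); doc = queue.take()) {
                    index.accept(doc.get());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        // Producers: read and transform in parallel (acquire must not return null).
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        for (I input : inputs) {
            pool.execute(() -> {
                try {
                    queue.put(Optional.of(acquire.apply(input)));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
            queue.put(Optional.empty()); // tell the indexer there is nothing more
            indexer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        run(Arrays.asList("a.txt", "b.txt", "c.txt"), 2,
            name -> "doc for " + name,
            doc -> System.out.println("indexing " + doc));
    }
}
```

The order in which documents reach the indexing thread is not deterministic, so this only fits when indexing order does not matter.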

FWIW,
Erick




Re: Making lucene indexing multi threaded

nischal reddy
Hi Erick,

I have commented out the indexing part (indexwriter.addDocument()) in
my application and it takes around 90 seconds, but when I uncomment the
indexing part it takes a lot of time.

My machine specs are:

Windows 7, Intel i7 processor, 4 GB RAM, and it doesn't have an SSD.

Can you please tell me how you are able to index 3-4k files in 1 second?
What approach are you following?

Is reading the files (I/O) eating up a lot of time?

Any suggestions would help me a lot.

Thanks,
Nischal Y



Re: Making lucene indexing multi threaded

nischal reddy
In reply to this post by Erick Erickson
Hi,

Some more updates on my progress:

I have made the indexing in my application multithreaded. I used a
ThreadPoolExecutor with a pool size of 4, but saw only a negligible
increase in performance; it still takes around 20 minutes to index
around 30k files.

Some more info on what I am doing.

The method where indexing is done:

private void indexAllFields(IResource resource) {
    IFile ifile = (IFile) resource;
    File file = resource.getLocation().toFile();
    Document doc = new Document();
    try {
        doc.add(new StringField(FIELD_FILE_PATH, getIndexFilePath(resource), Store.YES));
        doc.add(new StringField(FIELD_FILE_TYPE, ifile.getFileExtension().toLowerCase(), Store.YES));
        //indexContents(file, doc);
        /**
         * Calling updateDocument will make sure that only one indexed
         * document will be added per IFile.
         * Because this method deletes any existing document with the
         * given Term and adds a new document.
         * This Fixes Sonic00039677
         */
        //iWriter.addDocument(doc);
        iWriter.updateDocument(new Term(FIELD_FILE_PATH, getIndexFilePath(resource)), doc);
        iWriter.commit();
    } catch (FileNotFoundException e) {

    } catch (IOException e) {

    }
}


//Runnable to schedule an indexing job
class IndexingJob implements Runnable {

    private IResource resource;

    public IndexingJob(IResource resource) {
        this.resource = resource;
    }

    @Override
    public void run() {
        indexAllFields(resource);
    }

}

//method to queue files to be indexed

void doJob() {

    ThreadPoolExecutor executor = new ThreadPoolExecutor(4, 6, Long.MAX_VALUE, TimeUnit.SECONDS, workQueue);
    for (IResource iResource : files) {
        addToIndexQueue(iResource, executor);
        //updateBasedOnTimeStamp(iResource);
    }
    executor.shutdown();

    try {
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

}

Even with the multithreaded approach it still takes very long.

TIA,
Nischal Y





Re: Making lucene indexing multi threaded

Danil ŢORIN
Don't commit after adding each and every document.
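
A rough sketch of what that could look like with a thread pool: every task updates its document through the one shared writer, and commit is called a single time after all tasks have finished. The nested Writer interface is a hypothetical stand-in for the IndexWriter.updateDocument and IndexWriter.commit calls used in the code above (the real methods take a Term and a document and throw IOException); CommitOnce and the pool size are likewise assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CommitOnce {
    /** Hypothetical, thread-safe stand-in for the shared IndexWriter. */
    interface Writer {
        void updateDocument(String path, String doc);
        void commit();
    }

    static void indexAll(List<String> paths, Writer writer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String path : paths) {
            // Each task only adds/updates its document; there is no commit here.
            pool.execute(() -> writer.updateDocument(path, "contents of " + path));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return; // interrupted before all documents were added: do not commit
        }
        writer.commit(); // one commit for the whole batch
    }

    public static void main(String[] args) {
        Writer printing = new Writer() {
            public void updateDocument(String path, String doc) {
                System.out.println("update " + path);
            }
            public void commit() {
                System.out.println("commit");
            }
        };
        indexAll(Arrays.asList("a.txt", "b.txt", "c.txt"), printing, 2);
    }
}
```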





Re: Making lucene indexing multi threaded

Jason Wu
In reply to this post by nischal reddy
Hi Nischal,

I had a similar indexing issue. My Lucene indexing took 22 minutes for 70 MB of docs. When I debugged the problem, I found that indexWriter.addDocument(doc) was taking a really long time.

Have you already found a solution for it?

Thank you,
Jason

RE: Making lucene indexing multi threaded

Fuad Efendi
I believe there have been many reports of many thousands of docs per second
on average.
I experienced similar Solr speeds many years ago too, with small documents
(512 bytes each).

First, check hard drive performance (use an SSD, etc.); second,
check your indexing architecture: is it multithreaded? I used multiple
threads (64-128) from a client workstation to submit Solr documents
concurrently to a (local or remote) Solr instance, via the SolrJ client.

- Fuad Efendi


RE: Making lucene indexing multi threaded

Jason Wu
Hi Fuad,

Thanks for your suggestions and quick response. I am adding docs with single-threaded indexing. I will try multi-threaded indexing to see if that resolves my issue.

This issue has only existed since I upgraded Lucene from 2.4.1 (with Java 1.6) to 4.8.1 (with Java 1.7). I didn't have this problem with the old Lucene version.

Indexing is fast when I start the application: it takes only 3 minutes. But after my application has been running for a while (a day or so), once I send a JMX call to my application to reindex the docs, indexing slows down and takes 22 minutes.

Have you experienced anything similar before?

Thank you,
Jason

Re: Making lucene indexing multi threaded

G.Long
Like Nischal, did you check that you don't call the commit() method
after each indexed document? :)

Regards,
Gary Long


Re: Making lucene indexing multi threaded

Jason Wu
Hi Gary,

Thanks for your response. I only call commit once all my docs have been added.

Here is the procedure for my Lucene indexing and re-indexing:

   1. If index data exists inside the index directory, remove all the index
      data.
   2. Create an IndexWriter with a 256 MB RAM buffer size.
   3. Process the DB result set:
      - When I loop the result set, I reuse the same Document instance.
      - At the end of each loop, I call indexWriter.addDocument(doc).
   4. After all docs are added, call IndexWriter.commit().
   5. Call IndexWriter.close().

Thank you,
Jason

Indexing Weighted Tags per Document

Ralf Bierig-2
In reply to this post by G.Long
I want to index documents together with a list of tags (usually between
10-30) that represent meta information about this document. Normally, I
would create an extra field "tag", store every tag, by its name, inside
that field (creating my 10-30 fields that way), and add them to the document
before adding the document to the index and writing the index.

However, I have the following extra requirements:

a) I need to have a weight in the range of [0,1] being associated with
the tag that represents the probability of this tag being true.

b) These tags must be associated with the document and not with the
terms of the document.

c) I must be able to associate many tags to a document instance.

d) I must be able to use the weight in the weighting process of the
search engine.

e) The weight must be for the document instance, as the weight
represents the probability for that tag for that particular document. E.g.

fieldname: tag
fieldvalue: tree
fieldweight: 0.8

meaning that this particular document is about trees with a probability
of 0.8.

What is the best way to do that?
Can somebody point me to an example or something quite similar that
captures such a problem?

Best,
Ralf


Re: Indexing Weighted Tags per Document

Ramkumar R. Aiyengar
There are a few approaches possible here; we had a similar use case and
went for the second one below. I primarily deal with Solr, so I don't know
of Lucene-only examples, but hopefully you can dig this up.

(1) You can attach payloads to each occurrence of the tag, and modify the
scoring to use the payload.

(2) Use term frequency as a proxy. You could scale the probability by a
factor and introduce the term as many times as the scaled value
(essentially making it the term frequency). Scoring will now account for
this. The advantage is that you can also achieve score normalisation with
keywords and amongst tags, and you can also filter results by probability.

(3) There is potentially also a solution using child documents and block
join, but I may be mistaken; I haven't explored this a lot.

Re: Indexing Weighted Tags per Document

Ralf Bierig-2
The second solution sounds great and a lot more natural than payloads.

I know how to override the Similarity class, but it would only be
called at search time and would then use the already existing term frequency.
Looking up the probabilities every time a search is performed would
probably not perform well either. So I suspect I would somehow need to
find a way to store the term frequency directly in the index at the
time when I am indexing documents. Is that correct?

Do you have a code snippet that would highlight that part of your
elegant solution?

Thanks in advance,
Ralf


Re: Making lucene indexing multi threaded

Erick Erickson
In reply to this post by Jason Wu
bq: When I loop the result set, I reuse the same Document instance.

I really, really, _really_ hope you're calling new for the Document in
the loop. Otherwise that single document will eventually contain all
the data from your entire corpus! I'd expect some other errors to pop
out if you are really doing something like
doc = new Document
for (row in result set) {
  add all the fields
  index the doc
}

but the way you phrased it made me wonder....
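
For contrast, a small sketch of the safe pattern, with a plain list standing in for a Lucene Document (FreshDocumentPerRow and its parameters are hypothetical): a new "document" is created inside the loop, so each one carries only the fields of its own row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FreshDocumentPerRow {
    /** Builds one "document" (a list of field values) per row of the result set. */
    static List<List<String>> buildDocs(List<String[]> rows) {
        List<List<String>> indexed = new ArrayList<>(); // stand-in for the index
        for (String[] row : rows) {
            List<String> doc = new ArrayList<>(); // new "Document" for this row only
            for (String value : row) {
                doc.add(value); // stand-in for adding a field to the document
            }
            indexed.add(doc); // stand-in for indexWriter.addDocument(doc)
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "first row"},
                new String[] {"2", "second row"});
        System.out.println(buildDocs(rows));
    }
}
```

Reusing a single instance is only safe if the previous row's fields are removed or overwritten before the next add.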

BTW, please post the code, it's much easier to see what you're doing that way.

Best,
Erick

