Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

Ning Li-2

Today, applications have to open/close an IndexWriter and open/close an
IndexReader directly or indirectly (via IndexModifier) in order to handle a
mix of inserts and deletes. This performs well when inserts and deletes
come in fairly large batches. However, the performance can degrade
dramatically when inserts and deletes are interleaved in small batches.
This is because the ramDirectory is flushed to disk whenever an IndexWriter
is closed, causing a lot of small segments to be created on disk, which
eventually need to be merged.

We would like to propose a small API change to eliminate this problem. We
are aware that this kind of change has come up in discussions before:
http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
The difference this time is that we have implemented the change and
tested its performance, as described below.

API Changes
-----------
We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
Using this method, inserts and deletes can be interleaved using the same
IndexWriter.

Note that with this change, it would be very easy to add another method to
IndexWriter for updating documents, allowing applications to avoid a
separate delete and insert to update a document.
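
A minimal sketch of what such an update method could look like. This is a toy in-memory model, not Lucene code: the class UpdatingWriter, the method name updateDocument, and the map-based documents are all assumptions made for illustration, and deletes are applied immediately here rather than buffered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Toy in-memory stand-in for an index writer; hypothetical, not Lucene code. */
class UpdatingWriter {
    private final List<Map<String, String>> docs = new ArrayList<>();

    synchronized void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc)); // store a copy so later caller edits do not leak in
    }

    /** Deletes every document whose field holds exactly this value. */
    synchronized void deleteDocuments(String field, String value) {
        docs.removeIf(d -> value.equals(d.get(field)));
    }

    /** Hypothetical update: delete by key term, then add, under a single lock. */
    synchronized void updateDocument(String keyField, Map<String, String> doc) {
        String key = Objects.requireNonNull(doc.get(keyField), "document lacks key field");
        deleteDocuments(keyField, key);
        addDocument(doc);
    }

    synchronized int numDocs() {
        return docs.size();
    }

    /** Returns wantedField of the first document matching the key, or null. */
    synchronized String lookup(String keyField, String key, String wantedField) {
        for (Map<String, String> d : docs) {
            if (key.equals(d.get(keyField))) {
                return d.get(wantedField);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UpdatingWriter writer = new UpdatingWriter();
        Map<String, String> doc = new HashMap<>();
        doc.put("id", "42");
        doc.put("body", "first version");
        writer.updateDocument("id", doc); // nothing to delete yet, so a plain insert
        doc.put("body", "second version");
        writer.updateDocument("id", doc); // replaces the earlier copy
        // prints "1 doc, body = second version"
        System.out.println(writer.numDocs() + " doc, body = " + writer.lookup("id", "42", "body"));
    }
}
```

Holding one lock across the delete and the add is the point of the sketch: no other thread can observe the index between the two steps.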

Also note that this change can co-exist with the existing APIs for deleting
documents using an IndexReader. But if our proposal is accepted, we think
those APIs should probably be deprecated.

Coding Changes
--------------
Coding changes are localized to IndexWriter. Internally, the new
deleteDocuments() method works by buffering the terms to be deleted.
Deletes are deferred until the ramDirectory is flushed to disk, either
because it becomes full or because the IndexWriter is closed. Using Java
synchronization, care is taken to ensure that an interleaved sequence of
inserts and deletes for the same document is properly serialized.
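
The buffering idea can be sketched with a toy in-memory model. This is not the attached IndexWriter code. The class BufferedDeleteWriter, its fields, and the way each buffered delete remembers how many documents were buffered when it was issued are assumptions chosen to illustrate one possible way of keeping a delete from removing a document added after it.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of deferred deletes; hypothetical, not the attached IndexWriter. */
class BufferedDeleteWriter {
    private final int maxBufferedDocs;
    private final List<Map<String, String>> flushed = new ArrayList<>();  // stands in for on-disk segments
    private final List<Map<String, String>> buffered = new ArrayList<>(); // stands in for the ramDirectory
    // (field, value) -> number of buffered docs that existed when the delete was issued
    private final Map<SimpleImmutableEntry<String, String>, Integer> pendingDeletes = new HashMap<>();
    private int flushCount = 0;

    BufferedDeleteWriter(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    synchronized void addDocument(Map<String, String> doc) {
        buffered.add(new HashMap<>(doc));
        if (buffered.size() >= maxBufferedDocs) {
            flush(); // buffer is full
        }
    }

    /** Only records the term; nothing is deleted until the next flush. */
    synchronized void deleteDocuments(String field, String value) {
        pendingDeletes.put(new SimpleImmutableEntry<>(field, value), buffered.size());
    }

    synchronized void close() {
        flush();
    }

    synchronized int flushedDocs() {
        return flushed.size();
    }

    synchronized int flushCount() {
        return flushCount;
    }

    private void flush() {
        if (buffered.isEmpty() && pendingDeletes.isEmpty()) {
            return;
        }
        // Every pending delete applies to all documents flushed earlier.
        for (SimpleImmutableEntry<String, String> term : pendingDeletes.keySet()) {
            flushed.removeIf(d -> term.getValue().equals(d.get(term.getKey())));
        }
        // A buffered document survives unless a matching delete was issued after it.
        for (int i = 0; i < buffered.size(); i++) {
            if (!isDeleted(buffered.get(i), i)) {
                flushed.add(buffered.get(i));
            }
        }
        buffered.clear();
        pendingDeletes.clear();
        flushCount++;
    }

    private boolean isDeleted(Map<String, String> doc, int position) {
        for (Map.Entry<SimpleImmutableEntry<String, String>, Integer> e : pendingDeletes.entrySet()) {
            SimpleImmutableEntry<String, String> term = e.getKey();
            if (position < e.getValue() && term.getValue().equals(doc.get(term.getKey()))) {
                return true;
            }
        }
        return false;
    }
}
```

In this model, adding a document with id 1, deleting the term id:1, and adding a document with id 1 again leaves exactly one copy after close(), and the whole sequence costs a single flush.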

We have attached a modified version of the IndexWriter from Release 1.9.1
with these changes. Only a few hundred lines of code changes are needed. All
changes are marked with a "CHANGE" comment. We have also attached a modified
version of an example from Chapter 2.2 of Lucene in Action.

Performance Results
-------------------
To test the performance of our proposed changes, we ran some experiments using
the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel
Xeon server running Linux. The disk storage was configured as a RAID0 array
with 5 drives. Before indexes were built, the input documents were parsed
to remove the HTML from them (i.e., only the text was indexed). This was
done to minimize the impact of parsing on performance. A simple
WhitespaceAnalyzer was used during index build.

We experimented with three workloads:
  - Insert only. 1.6M documents were inserted and the final
    index size was 2.3GB.
  - Insert/delete (big batches). The same documents were
    inserted, but 25% were deleted. 1000 documents were
    deleted for every 4000 inserted.
  - Insert/delete (small batches). In this case, 5 documents
    were deleted for every 20 inserted.

                                current       current          new
Workload                      IndexWriter  IndexModifier   IndexWriter
-----------------------------------------------------------------------
Insert only                     116 min       119 min        116 min
Insert/delete (big batches)       --          135 min        125 min
Insert/delete (small batches)     --          338 min        134 min

As the experiments show, with the proposed changes, the running time
dropped by about 60% (from 338 min to 134 min) when inserts and deletes
were interleaved in small batches.
(See attached file: IndexWriter.java)(See attached file: TestWriterDelete.java)
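
For reference, the percentage follows directly from the running times in the table above. The small helper below is only a worked calculation; the class and method names are made up for illustration.

```java
import java.util.Locale;

/** Worked arithmetic on the running times reported in the table. */
class RelativeReduction {
    /** Percentage by which running time drops from baseline to improved. */
    static double reductionPercent(double baselineMin, double improvedMin) {
        return 100.0 * (baselineMin - improvedMin) / baselineMin;
    }

    /** How many times faster the improved run is. */
    static double speedup(double baselineMin, double improvedMin) {
        return baselineMin / improvedMin;
    }

    public static void main(String[] args) {
        // IndexModifier vs. new IndexWriter, small batches: 338 min vs. 134 min
        System.out.println(String.format(Locale.ROOT, "small batches: %.1f%% less time, %.2fx",
                reductionPercent(338, 134), speedup(338, 134))); // 60.4% less time, 2.52x
        // IndexModifier vs. new IndexWriter, big batches: 135 min vs. 125 min
        System.out.println(String.format(Locale.ROOT, "big batches: %.1f%% less time, %.2fx",
                reductionPercent(135, 125), speedup(135, 125))); // 7.4% less time, 1.08x
    }
}
```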


Regards,
Ning


Ning Li
Search Technologies
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

Doug Cutting
This sounds very promising.  Can you please attach it to a bug in Jira?

Thanks!

Doug


Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

Ning Li-2
I will create a bug in Jira.

Let me try to attach the two files here again.
(See attached file: IndexWriter.changed)(See attached file:
TestWriterDelete.changed)


Regards,
Ning



Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

Nadav Har'El-2
Ning Li <[hidden email]> wrote on 09/05/2006 02:07:26 AM:
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.

It appears that this separation of IndexWriter and IndexReader indeed
bothers most new Lucene users, forcing each one to come up with their own
buffering tricks to implement document updates. Questions like this, and
workaround-type solutions, appear on the java-user mailing list very often.
So it's probably a great idea to eradicate this problem once and for all,
and do it in an integrated way (inside IndexWriter), like you did, rather
than in a roundabout way in external objects which do more buffering.

I have a couple of small questions on your proposed changes:

> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.

This will indeed be enough for most uses, but I was wondering whether
we can, and should, provide even more "reading" functionality in the
IndexWriter.

Consider this use case (which actually happened to me): You want to index
mail messages, which have attachments. The mail message, and the text of
each attachment, get indexed as separate Lucene documents so they can be
found independently. When we delete a mail message, we also need to delete
the attached documents, so we keep a list of attachments in the message
document. When deleting a mail message, we need to read that list field
first and then delete all the attachment documents as well. The problem is
that this requires not only a deleteDocuments() method, but also a method
which finds the document and returns it (or better yet, just the one field
we need).
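
To make the use case concrete, here is a rough sketch against a toy in-memory index. Nothing here is Lucene API: the class MailIndex, the getField method, the field names "id" and "attachments", and the comma-separated attachment list are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory index for the mail-with-attachments case; hypothetical, not Lucene code. */
class MailIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    synchronized void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    synchronized void deleteDocuments(String field, String value) {
        docs.removeIf(d -> value.equals(d.get(field)));
    }

    /** The extra "reading" method: one field of the first document matching the term, or null. */
    synchronized String getField(String termField, String termValue, String wantedField) {
        for (Map<String, String> d : docs) {
            if (termValue.equals(d.get(termField))) {
                return d.get(wantedField);
            }
        }
        return null;
    }

    /** Reads the attachment list first, then deletes the attachments and the message. */
    synchronized void deleteMessage(String messageId) {
        String attachments = getField("id", messageId, "attachments");
        if (attachments != null && !attachments.isEmpty()) {
            for (String attachmentId : attachments.split(",")) {
                deleteDocuments("id", attachmentId.trim());
            }
        }
        deleteDocuments("id", messageId);
    }

    synchronized int numDocs() {
        return docs.size();
    }

    public static void main(String[] args) {
        MailIndex index = new MailIndex();
        index.addDocument(doc("id", "m1", "attachments", "a1,a2"));
        index.addDocument(doc("id", "a1"));
        index.addDocument(doc("id", "a2"));
        index.addDocument(doc("id", "m2"));
        index.deleteMessage("m1");
        System.out.println(index.numDocs()); // prints 1 (only m2 is left)
    }
}
```

The sketch needs getField() to see the message document before deleteDocuments() removes it, which is exactly the read-before-delete dependency described above.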

So I wonder if the IndexWriter shouldn't contain more reading features
that previously were only found in IndexReader. In the long run, should
our goal be perhaps to leave only one object, say call it simply "Index",
which is basically the old IndexWriter with all of IndexReader's
capabilities added to it?

> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.

I agree.
IndexModifier should perhaps also be deprecated (or just become an empty
shell around IndexWriter).

> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
>     index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
>     inserted, but 25% were deleted. 1000 documents were
>     deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
>     were deleted for every 20 inserted.

Thanks, these benchmarks are very important.

If you can do it, I'd love to see the results of a fourth benchmark,
which represents a typical situation (which you also mentioned)
of document updates: every single insert is preceded by a delete,
25% of which actually delete (the updated document existed previously)
and the rest end up not finding an old document and not deleting
anything. I expect this benchmark to show an even greater improvement
of your approach over the naive IndexModifier.


--
Nadav Har'El



Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

Otis Gospodnetic-2
I agree - a delete (typically for a Term that represents a "primary key" for a Document in an index) followed by a re-add of the Document is a very common scenario, and I'd love to see the numbers for that.

Thanks,
Otis


Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

Ning Li-2
The machine is swamped with tests. I will run the experiment when the
machine is free.


Regards,
Ning



Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

Ning Li-2
The fourth workload:
  - Upsert. Every operation is a delete followed by an insert.
    75% of the deletes do not match any document already
    inserted. 25% of the deletes match some document inserted.

The new IndexWriter took 136 min. The current IndexModifier has
been running for 18 hours and hasn't finished...
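
The shape of this workload can be sketched as follows. How the keys were actually chosen in the experiment is not stated, so the rule used here (every fourth operation reuses the key inserted just before it) is an assumption, as is the class UpsertWorkload itself. A set of keys stands in for the index.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy generator for an upsert workload in which 25% of the deletes hit an existing key. */
class UpsertWorkload {
    /** Returns {number of deletes that matched, number of keys left in the index}. */
    static int[] run(int operations) {
        Set<Integer> index = new HashSet<>(); // stands in for the index, one key per document
        int matched = 0;
        for (int i = 0; i < operations; i++) {
            // Assumed key rule: every fourth operation reuses the previous key, the rest are fresh.
            int key = (i % 4 == 3) ? i - 1 : i;
            if (index.remove(key)) { // the delete half of the upsert
                matched++;
            }
            index.add(key);          // the insert half of the upsert
        }
        return new int[] {matched, index.size()};
    }

    public static void main(String[] args) {
        int[] result = run(1000);
        // prints "250 of 1000 deletes matched, 750 documents remain"
        System.out.println(result[0] + " of 1000 deletes matched, " + result[1] + " documents remain");
    }
}
```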


For your convenience, here are the performance results for the
first three workloads again.

                                current       current          new
Workload                      IndexWriter  IndexModifier   IndexWriter
-----------------------------------------------------------------------
Insert only                     116 min       119 min        116 min
Insert/delete (big batches)       --          135 min        125 min
Insert/delete (small batches)     --          338 min        134 min


Regards,
Ning

