deleteDocuments by Term[] for ALL terms

classic Classic list List threaded Threaded
3 messages Options
adb
Reply | Threaded
Open this post in threaded view
|

deleteDocuments by Term[] for ALL terms

adb
Hi,

I'm using IndexReader.deleteDocuments(Term) to delete documents in batches.  I
need the deleted count, so I cannot use IndexWriter.deleteDocuments().

What I want to do is delete documents based on more than one term, but not like
IndexWriter.deleteDocuments(Term[]) which deletes all documents with ANY term.
I want it to delete documents which have ALL terms, e.g.

Term("owner", "ownerUID") AND Term("subject", "something");

I have a new reader being used, so I could make a new IndexSearcher and query
the documents with all terms in a BooleanQuery and then iterate the results, but
that would either mean using the Hits mechanism or TopDocs and seems like a
heavyweight way to do things.

What I want to be able to do is to delete sequentially without storing up a
result set as I may want to delete all and ALL may be rather big.

I see the implementation uses reader.termDocs() to do the deletion for a single
Term, which of course is easy, but is there a simple way to make a deletion for
multiple terms with AND via the reader using, say termDocs, that will not
potentially use large amounts of memory, or should I just go with the searcher
TopDocs mechanism and do that also in batches to avoid the risk of a large
memory hit.

I know there's lots of clever 'expert-mode' stuff under the Lucene API hood, but
does anyone know any good way to do this or have I missed anything obvious in
the API docs?

Thanks
Antony




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: deleteDocuments by Term[] for ALL terms

Michael McCandless-2

You can just create a query with your and'd terms, and then do this:

  Weight weight = query.weight(indexSearcher);
  IndexReader reader = indexSearcher.getIndexReader();
  Scorer scorer = weight.scorer(reader);
  int delCount = 0;
  while(scorer.next()) {
    reader.deleteDocument(scorer.doc());
    delCount++;
  }

that iterates over all the docIDs without scoring them and without
building up a Hit for each, etc.

Mike

"Antony Bowesman" <[hidden email]> wrote:

> Hi,
>
> I'm using IndexReader.deleteDocuments(Term) to delete documents in
> batches.  I
> need the deleted count, so I cannot use IndexWriter.deleteDocuments().
>
> What I want to do is delete documents based on more than one term, but
> not like
> IndexWriter.deleteDocuments(Term[]) which deletes all documents with ANY
> term.
> I want it to delete documents which have ALL terms, e.g.
>
> Term("owner", "ownerUID") AND Term("subject", "something");
>
> I have a new reader being used, so I could make a new IndexSearcher and
> query
> the documents with all terms in a BooleanQuery and then iterate the
> results, but
> that would either mean using the Hits mechanism or TopDocs and seems like
> a
> heavyweight way to do things.
>
> What I want to be able to do is to delete sequentially without storing up
> a
> result set as I may want to delete all and ALL may be rather big.
>
> I see the implementation uses reader.termDocs() to do the deletion for a
> single
> Term, which of course is easy, but is there a simple way to make a
> deletion for
> multiple terms with AND via the reader using, say termDocs, that will not
> potentially use large amounts of memory, or should I just go with the
> searcher
> TopDocs mechanism and do that also in batches to avoid the risk of a
> large
> memory hit.
>
> I know there's lots of clever 'expert-mode' stuff under the Lucene API
> hood, but
> does anyone know any good way to do this or have I missed anything
> obvious in
> the API docs?
>
> Thanks
> Antony
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

adb
Reply | Threaded
Open this post in threaded view
|

Re: deleteDocuments by Term[] for ALL terms

adb
Thanks Mike, just what I was after.
Antony


Michael McCandless wrote:

> You can just create a query with your and'd terms, and then do this:
>
>   Weight weight = query.weight(indexSearcher);
>   IndexReader reader = indexSearcher.getIndexReader();
>   Scorer scorer = weight.scorer(reader);
>   int delCount = 0;
>   while(scorer.next()) {
>     reader.deleteDocument(scorer.doc());
>     delCount++;
>   }
>
> that iterates over all the docIDs without scoring them and without
> building up a Hit for each, etc.
>
> Mike
>
> "Antony Bowesman" <[hidden email]> wrote:
>> Hi,
>>
>> I'm using IndexReader.deleteDocuments(Term) to delete documents in
>> batches.  I
>> need the deleted count, so I cannot use IndexWriter.deleteDocuments().
>>
>> What I want to do is delete documents based on more than one term, but
>> not like
>> IndexWriter.deleteDocuments(Term[]) which deletes all documents with ANY
>> term.
>> I want it to delete documents which have ALL terms, e.g.
>>
>> Term("owner", "ownerUID") AND Term("subject", "something");
>>
>> I have a new reader being used, so I could make a new IndexSearcher and
>> query
>> the documents with all terms in a BooleanQuery and then iterate the
>> results, but
>> that would either mean using the Hits mechanism or TopDocs and seems like
>> a
>> heavyweight way to do things.
>>
>> What I want to be able to do is to delete sequentially without storing up
>> a
>> result set as I may want to delete all and ALL may be rather big.
>>
>> I see the implementation uses reader.termDocs() to do the deletion for a
>> single
>> Term, which of course is easy, but is there a simple way to make a
>> deletion for
>> multiple terms with AND via the reader using, say termDocs, that will not
>> potentially use large amounts of memory, or should I just go with the
>> searcher
>> TopDocs mechanism and do that also in batches to avoid the risk of a
>> large
>> memory hit.
>>
>> I know there's lots of clever 'expert-mode' stuff under the Lucene API
>> hood, but
>> does anyone know any good way to do this or have I missed anything
>> obvious in
>> the API docs?
>>
>> Thanks
>> Antony
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]