Delete problems O.O

Cesar Ronchese
Hey All.

I'm running into problems with document deletion. I tried both the DeleteDocuments() and DeleteDocument() methods, and both have problems, as explained below:


1) DeleteDocuments(term)

This simply doesn't delete anything from the Index.

//see the code sample:
//"theFieldName" was previously stored as Field.Store.YES and Field.Index.TOKENIZED.
Term t = new Terms("theFieldName", "theFieldContent");
objIndexReader.DeleteDocuments(t);

I ask: Am I doing anything wrong here?



2) DeleteDocument(numDoc) <== this problem is a woot problem

After being frustrated ( :P ) with DeleteDocuments(), I ran to test DeleteDocument(). Then I made a query to return a Hits collection. I did a loop through the Hits collection and I called DeleteDocument(docNum) for every document in the Hits collection.

Let's talk about the problem now..... this method DOES delete a document from the Index, BUT it is actually deleting the wrong documents. I noticed that instead of deleting the documents found in the Hits collection, it is deleting documents based on their insertion order!!

I mean, if I call objIndexReader.DeleteDocument(0), it will delete the first document from the entire INDEX, not the first document in the Hits collection. So, it deleted the first documents I have inserted some days ago, in previous indexing sessions.

I ask: is there a way to get the correct docNum from the document retrieved in the Hits collection?
OR: is there a safe way to delete documents?
OR: what am I doing wrong?

Thanks in advance.
Cesar


RE: Delete problems O.O

steve_rowe
Hi Cesar,

On 02/11/2008 at 2:19 PM, Cesar Ronchese wrote:
> I'm running into problems with document deletion.
> [...]
> This simply doesn't delete anything from the Index.
>
> //see the code sample:
> //"theFieldName" was previously stored as Field.Store.YES and Field.Index.TOKENIZED.
> Term t = new Terms("theFieldName", "theFieldContent");
> objIndexReader.DeleteDocuments(t);

(You have two typos here - "new Term/s/" and /D/eleteDocuments() - I assume that this is just a transcription error, since you must have gotten this code to run...)

When you construct a Term instance, no analysis will be performed on "theFieldContent".  Since "theFieldName" is TOKENIZED, it was analyzed, and this is likely where the mismatch is occurring.  From <http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/index/IndexReader.html#deleteDocuments(org.apache.lucene.index.Term)>:

    This is useful if one uses a document field to
    hold a unique ID string for the document.

If you're trying to delete documents based on a document ID held as the entire value of a field, then you should be using Field.Index.UN_TOKENIZED.  From <http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/document/Field.Index.html#UN_TOKENIZED>:

   Index the field's value without using an Analyzer,
   so it can be searched. As no analyzer is used the
   value will be stored as a single term. This is
   useful for unique Ids like product numbers.
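
To make the mismatch concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class TermDeleteSketch and its methods are hypothetical, and analyze() is only a stand-in that assumes an analyzer which lowercases and splits on non-alphanumeric characters. It shows that an exact, unanalyzed term lookup misses a value that was analyzed at index time, but finds one that was stored as a single term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names belong to Lucene.
public class TermDeleteSketch {
    // term text -> numbers of the documents that contain it (one field, for brevity)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> deleted = new HashSet<>();
    private int nextDocNum = 0;

    // Stand-in for an analyzer (assumption): lowercase, split on non-alphanumerics.
    public static List<String> analyze(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Models a TOKENIZED field: the index holds the analyzed tokens.
    public int addTokenized(String value) {
        int docNum = nextDocNum++;
        for (String token : analyze(value)) {
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(docNum);
        }
        return docNum;
    }

    // Models an UN_TOKENIZED field: the whole value is one term.
    public int addUntokenized(String value) {
        int docNum = nextDocNum++;
        postings.computeIfAbsent(value, k -> new HashSet<>()).add(docNum);
        return docNum;
    }

    // Exact term lookup with no analysis; returns how many documents were newly deleted.
    public int deleteByTerm(String termText) {
        int count = 0;
        for (int docNum : postings.getOrDefault(termText, new HashSet<>())) {
            if (deleted.add(docNum)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        TermDeleteSketch tokenized = new TermDeleteSketch();
        tokenized.addTokenized("theFieldContent");
        System.out.println(tokenized.deleteByTerm("theFieldContent")); // 0

        TermDeleteSketch untokenized = new TermDeleteSketch();
        untokenized.addUntokenized("theFieldContent");
        System.out.println(untokenized.deleteByTerm("theFieldContent")); // 1
    }
}
```

In the toy model the tokenized value is indexed as "thefieldcontent", so the term "theFieldContent" matches nothing. What a real analyzer produces depends on which analyzer was used at index time.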

> 2) DeleteDocument(numDoc) <== this problem is a woot problem
>
> [...]
>
> I mean, if I call objIndexReader.DeleteDocument(0), it will
> delete the first document from the entire INDEX, not the
> first document in the Hits collection. So, it deleted the
> first documents I have inserted some days ago, in previous
> indexing sessions.

Yes, this is how this method is designed to function.  The javadoc description is perhaps too brief: "Deletes the document numbered 'docNum'".  As you have discovered, "docNum" is the one-up number assigned internally by Lucene to each document as it is added to the index.
 
> I ask: is there a way to get the correct docNum from the
> document retrieved in the Hits collection?

Check out Hits.id(int): <http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/search/Hits.html#id(int)>

The "id" returned by Hits.id(int) is the same thing as the "docNum" parameter to IndexReader.deleteDocument(int).
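
The difference between a position in the hit list and an index-wide document number can be sketched in plain Java. Again this is a toy model with hypothetical names (HitIdSketch, search, deleteByHitPosition, deleteByHitId), not the Lucene API; the list returned by search() plays the role of the values Hits.id(int) would return, and hits come back in insertion order here rather than by score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these names belong to Lucene.
public class HitIdSketch {
    // Returns the index-wide numbers of the documents containing the word.
    public static List<Integer> search(List<String> index, String word) {
        List<Integer> hitIds = new ArrayList<>();
        for (int docNum = 0; docNum < index.size(); docNum++) {
            if (index.get(docNum).contains(word)) {
                hitIds.add(docNum);
            }
        }
        return hitIds;
    }

    // The mistake: treating the loop position i as the document number.
    public static Set<Integer> deleteByHitPosition(List<Integer> hitIds) {
        Set<Integer> deleted = new TreeSet<>();
        for (int i = 0; i < hitIds.size(); i++) {
            deleted.add(i);
        }
        return deleted;
    }

    // The fix: look up the document number of hit i before deleting.
    public static Set<Integer> deleteByHitId(List<Integer> hitIds) {
        Set<Integer> deleted = new TreeSet<>();
        for (int i = 0; i < hitIds.size(); i++) {
            deleted.add(hitIds.get(i));
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList(
                "old report", "old invoice", "new invoice", "new memo");
        List<Integer> hitIds = search(index, "new");
        System.out.println(deleteByHitPosition(hitIds)); // [0, 1]
        System.out.println(deleteByHitId(hitIds));       // [2, 3]
    }
}
```

Deleting by position removes the two oldest documents, which is the behaviour described in the question; deleting by the looked-up number removes the two documents that actually matched.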

It sounds like the documentation could benefit from some more discussion of the "docNum"/document "id" feature...

Steve


RE: Delete problems O.O

Cesar Ronchese
Cool man.

The Hits.id(int) worked fine. Thanks for the detailed info.

And hopefully your answer is going to be useful for future Google searches. ;)

Cesar

