Reuse single document and fields

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Reuse single document and fields

Jay-98
Hi,

I am trying to use the latest 2.3 API on  Field to improve the indexing
performance by reusing Documents and  Fields.
After reading lucene-java wiki and the java doc on Field, I have a
couple of questions about the comment in Field.setValue(), namely,
"Note that you should only use this method after the Field has been
consumed (ie, the Document  containing this Field has been added to the
index)":

1. Does "consumed" here  include IndexWriter.deleteDocument(...) and
IndexWriter.updateDocument(...)?

2.  Why does it require that the  Field has been consumed first before
it can be modified? For example, I may pre-allocate in a constructor  a
Document and related fields with some default values but do not (or
cannot) add the document. Before I add my first document, I need to set
the fields with valid values.
It looks like the new Lucene core would null some of the fields if I do
so.  I do not understand the logic behind the requirement.

I wonder if any expert can elaborate this new API in more details here?

Thanks!

Jay  

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Reuse single document and fields

Michael McCandless-2

yu wrote:

> Hi,
>
> I am trying to use the latest 2.3 API on  Field to improve the  
> indexing performance by reusing Documents and  Fields. After  
> reading lucene-java wiki and the java doc on Field, I have a couple  
> of questions about the comment in Field.setValue(), namely,
> "Note that you should only use this method after the Field has been  
> consumed (ie, the Document  containing this Field has been added to  
> the index)":
>
> 1. Does "consumed" here  include IndexWriter.deleteDocument(...)  
> and IndexWriter.updateDocument(...)?

Yes.  Actually, just add/updateDocument.

> 2.  Why does it require that the  Field has been consumed first  
> before it can be modified? For example, I may pre-allocate in a  
> constructor  a Document and related fields with some default values  
> but do not (or cannot) add the document. Before I add my first  
> document, I need to set the fields with valid values.
> It looks like the new Lucene core would null some of the fields if  
> I do so.  I do not understand the logic behind the requirement.

That use case is fine.  You can absolutely change its value before  
it's consumed when the un-consumed value doesn't matter to you.  That  
pre-allocation is a normal & fine use case, and fields should not be  
null'd by Lucene.

The warning should instead state that "if the field holds a real  
value, then make sure it's consumed first before you change it".

So, for example, if you add docs to a queue, and then use a thread  
pool to pull docs from that queue and call addDocument on them, you  
can't re-use the fields in the Document until a thread has added it  
to the index.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Reuse single document and fields

Jay-98
Thanks, Michael, for your quick reply and explanation.
One related question: is it true that Lucene indexer will reject a field
  that has the empty string value? (I saw an IllegalArgumentException).
Will be nice if lucene just skip such a field silently, esp, for the new
2.3 api.

Jay
Michael McCandless wrote:

>
> yu wrote:
>
>> Hi,
>>
>> I am trying to use the latest 2.3 API on  Field to improve the
>> indexing performance by reusing Documents and  Fields. After reading
>> lucene-java wiki and the java doc on Field, I have a couple of
>> questions about the comment in Field.setValue(), namely,
>> "Note that you should only use this method after the Field has been
>> consumed (ie, the Document  containing this Field has been added to
>> the index)":
>>
>> 1. Does "consumed" here  include IndexWriter.deleteDocument(...) and
>> IndexWriter.updateDocument(...)?
>
> Yes.  Actually, just add/updateDocument.
>
>> 2.  Why does it require that the  Field has been consumed first before
>> it can be modified? For example, I may pre-allocate in a constructor  
>> a Document and related fields with some default values but do not (or
>> cannot) add the document. Before I add my first document, I need to
>> set the fields with valid values.
>> It looks like the new Lucene core would null some of the fields if I
>> do so.  I do not understand the logic behind the requirement.
>
> That use case is fine.  You can absolutely change its value before it's
> consumed when the un-consumed value doesn't matter to you.  That
> pre-allocation is a normal & fine use case, and fields should not be
> null'd by Lucene.
>
> The warning should instead state that "if the field holds a real value,
> then make sure it's consumed first before you change it".
>
> So, for example, if you add docs to a queue, and then use a thread pool
> to pull docs from that queue and call addDocument on them, you can't
> re-use the fields in the Document until a thread has added it to the index.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Reuse single document and fields

Michael McCandless-2

As far as I know, Lucene should accept a field with an empty string  
value -- how did you hit the IllegalArgumentException?

Mike

Jay wrote:

> Thanks, Michael, for your quick reply and explanation.
> One related question: is it true that Lucene indexer will reject a  
> field  that has the empty string value? (I saw an  
> IllegalArgumentException).
> Will be nice if lucene just skip such a field silently, esp, for  
> the new 2.3 api.
>
> Jay
> Michael McCandless wrote:
>> yu wrote:
>>> Hi,
>>>
>>> I am trying to use the latest 2.3 API on  Field to improve the  
>>> indexing performance by reusing Documents and  Fields. After  
>>> reading lucene-java wiki and the java doc on Field, I have a  
>>> couple of questions about the comment in Field.setValue(), namely,
>>> "Note that you should only use this method after the Field has  
>>> been consumed (ie, the Document  containing this Field has been  
>>> added to the index)":
>>>
>>> 1. Does "consumed" here  include IndexWriter.deleteDocument(...)  
>>> and IndexWriter.updateDocument(...)?
>> Yes.  Actually, just add/updateDocument.
>>> 2.  Why does it require that the  Field has been consumed first  
>>> before it can be modified? For example, I may pre-allocate in a  
>>> constructor  a Document and related fields with some default  
>>> values but do not (or cannot) add the document. Before I add my  
>>> first document, I need to set the fields with valid values.
>>> It looks like the new Lucene core would null some of the fields  
>>> if I do so.  I do not understand the logic behind the requirement.
>> That use case is fine.  You can absolutely change its value before  
>> it's consumed when the un-consumed value doesn't matter to you.  
>> That pre-allocation is a normal & fine use case, and fields should  
>> not be null'd by Lucene.
>> The warning should instead state that "if the field holds a real  
>> value, then make sure it's consumed first before you change it".
>> So, for example, if you add docs to a queue, and then use a thread  
>> pool to pull docs from that queue and call addDocument on them,  
>> you can't re-use the fields in the Document until a thread has  
>> added it to the index.
>> Mike
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Reuse single document and fields

Jay-98
You are right, Lucene only gives IllegalArgumentException when the value
is null. I assume it won't skip the field is the value is empty or null?

Thanks!

Jay

Michael McCandless wrote:

>
> As far as I know, Lucene should accept a field with an empty string
> value -- how did you hit the IllegalArgumentException?
>
> Mike
>
> Jay wrote:
>
>> Thanks, Michael, for your quick reply and explanation.
>> One related question: is it true that Lucene indexer will reject a
>> field  that has the empty string value? (I saw an
>> IllegalArgumentException).
>> Will be nice if lucene just skip such a field silently, esp, for the
>> new 2.3 api.
>>
>> Jay
>> Michael McCandless wrote:
>>> yu wrote:
>>>> Hi,
>>>>
>>>> I am trying to use the latest 2.3 API on  Field to improve the
>>>> indexing performance by reusing Documents and  Fields. After reading
>>>> lucene-java wiki and the java doc on Field, I have a couple of
>>>> questions about the comment in Field.setValue(), namely,
>>>> "Note that you should only use this method after the Field has been
>>>> consumed (ie, the Document  containing this Field has been added to
>>>> the index)":
>>>>
>>>> 1. Does "consumed" here  include IndexWriter.deleteDocument(...) and
>>>> IndexWriter.updateDocument(...)?
>>> Yes.  Actually, just add/updateDocument.
>>>> 2.  Why does it require that the  Field has been consumed first
>>>> before it can be modified? For example, I may pre-allocate in a
>>>> constructor  a Document and related fields with some default values
>>>> but do not (or cannot) add the document. Before I add my first
>>>> document, I need to set the fields with valid values.
>>>> It looks like the new Lucene core would null some of the fields if I
>>>> do so.  I do not understand the logic behind the requirement.
>>> That use case is fine.  You can absolutely change its value before
>>> it's consumed when the un-consumed value doesn't matter to you.  That
>>> pre-allocation is a normal & fine use case, and fields should not be
>>> null'd by Lucene.
>>> The warning should instead state that "if the field holds a real
>>> value, then make sure it's consumed first before you change it".
>>> So, for example, if you add docs to a queue, and then use a thread
>>> pool to pull docs from that queue and call addDocument on them, you
>>> can't re-use the fields in the Document until a thread has added it
>>> to the index.
>>> Mike
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Reuse single document and fields

Michael McCandless-2

Right, the value in a Field cannot be null.

Mike

Jay wrote:

> You are right, Lucene only gives IllegalArgumentException when the  
> value is null. I assume it won't skip the field is the value is  
> empty or null?
>
> Thanks!
>
> Jay
>
> Michael McCandless wrote:
>> As far as I know, Lucene should accept a field with an empty  
>> string value -- how did you hit the IllegalArgumentException?
>> Mike
>> Jay wrote:
>>> Thanks, Michael, for your quick reply and explanation.
>>> One related question: is it true that Lucene indexer will reject  
>>> a field  that has the empty string value? (I saw an  
>>> IllegalArgumentException).
>>> Will be nice if lucene just skip such a field silently, esp, for  
>>> the new 2.3 api.
>>>
>>> Jay
>>> Michael McCandless wrote:
>>>> yu wrote:
>>>>> Hi,
>>>>>
>>>>> I am trying to use the latest 2.3 API on  Field to improve the  
>>>>> indexing performance by reusing Documents and  Fields. After  
>>>>> reading lucene-java wiki and the java doc on Field, I have a  
>>>>> couple of questions about the comment in Field.setValue(), namely,
>>>>> "Note that you should only use this method after the Field has  
>>>>> been consumed (ie, the Document  containing this Field has been  
>>>>> added to the index)":
>>>>>
>>>>> 1. Does "consumed" here  include IndexWriter.deleteDocument
>>>>> (...) and IndexWriter.updateDocument(...)?
>>>> Yes.  Actually, just add/updateDocument.
>>>>> 2.  Why does it require that the  Field has been consumed first  
>>>>> before it can be modified? For example, I may pre-allocate in a  
>>>>> constructor  a Document and related fields with some default  
>>>>> values but do not (or cannot) add the document. Before I add my  
>>>>> first document, I need to set the fields with valid values.
>>>>> It looks like the new Lucene core would null some of the fields  
>>>>> if I do so.  I do not understand the logic behind the requirement.
>>>> That use case is fine.  You can absolutely change its value  
>>>> before it's consumed when the un-consumed value doesn't matter  
>>>> to you.  That pre-allocation is a normal & fine use case, and  
>>>> fields should not be null'd by Lucene.
>>>> The warning should instead state that "if the field holds a real  
>>>> value, then make sure it's consumed first before you change it".
>>>> So, for example, if you add docs to a queue, and then use a  
>>>> thread pool to pull docs from that queue and call addDocument on  
>>>> them, you can't re-use the fields in the Document until a thread  
>>>> has added it to the index.
>>>> Mike
>>>> -------------------------------------------------------------------
>>>> --
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>
>>> --------------------------------------------------------------------
>>> -
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]