Empty Sink Tokenizer

Empty Sink Tokenizer

Grant Ingersoll
Has the way fields get added changed recently?  http://www.lucidimagination.com/search/document/954555c478002a3/empty_sinktokenizer

See also:
http://www.lucidimagination.com/search/document/274ec8c1c56fdd54/order_of_field_objects_within_document#5ffce4509ed32511

http://www.lucidimagination.com/search/document/d6b19ab1bd87e30a/order_of_fields_returned_by_document_getfields#d6b19ab1bd87e30a

http://www.lucidimagination.com/search/document/deda4dd3f9041bee/the_order_of_fields_in_document_fields#bb26d84091aebcaa


The following little program confirms that they are indeed in alpha  
order now and not in added order:
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.LuceneTestCase;

public class TestFieldOrdering extends LuceneTestCase {
  protected RAMDirectory dir;

  protected void setUp() throws Exception {
    super.setUp();
    dir = new RAMDirectory();
  }

  public void testAddFields() throws Exception {
    IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(),
        true, IndexWriter.MaxFieldLength.LIMITED);

    // Add fields deliberately out of alphabetical order.
    Document doc = new Document();
    doc.add(new Field("id", "one", Field.Store.YES, Field.Index.NO));
    doc.add(new Field("z", "document z", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("a", "document a", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("e", "document e", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("b", "document b", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // Read the document back and print the fields in the order returned.
    IndexReader reader = IndexReader.open(dir);
    Document retrieved = reader.document(0);
    assertTrue("retrieved is null and it shouldn't be", retrieved != null);
    List fields = retrieved.getFields();
    for (Iterator iterator = fields.iterator(); iterator.hasNext();) {
      Field field = (Field) iterator.next();
      System.out.println("Field: " + field);
    }
    reader.close();
  }
}
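
For reference, on 2.3+ the loop prints the fields in name order
(a, b, e, id, z) rather than the added order (id, z, a, e, b); the
Field.toString() output looks roughly like this (exact flags depend
on how each field was reconstructed):

Field: stored/uncompressed,indexed,tokenized<a:document a>
Field: stored/uncompressed,indexed,tokenized<b:document b>
Field: stored/uncompressed,indexed,tokenized<e:document e>
Field: stored/uncompressed<id:one>
Field: stored/uncompressed,indexed,tokenized<z:document z>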



Re: Empty Sink Tokenizer

Michael McCandless
Uh-oh: I think this happened as part of LUCENE-843, which landed in 2.3.

IndexWriter now first collates each Field instance, by name, and then
visits those fields in sorted order.  Multiple instances of the same
field name are written in the order that they appeared in the
document.

StoredFieldsWriter taps into the indexing chain after that per-field collation.
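
For concreteness, a rough sketch of that collation step (illustrative
only, not IndexWriter's actual code; the class and method names are
made up):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;

// Sketch only: group fields by name (TreeMap iterates names in sorted
// order), keeping multiple values of one name in their added order.
public class FieldCollationSketch {
  public static SortedMap<String, List<Fieldable>> collate(Document doc) {
    SortedMap<String, List<Fieldable>> byName =
        new TreeMap<String, List<Fieldable>>();
    for (Iterator it = doc.getFields().iterator(); it.hasNext();) {
      Fieldable f = (Fieldable) it.next();
      List<Fieldable> same = byName.get(f.name());
      if (same == null) {
        same = new ArrayList<Fieldable>();
        byName.put(f.name(), same);
      }
      same.add(f);  // added order is preserved within a single name
    }
    return byName;  // iteration visits field names in alpha order
  }
}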

But, if getting back to this is important, we should be able to move
StoredFieldsWriter up in the chain so that it visits the original
document, instead.  Offhand, I'm not sure if there are any tradeoffs
in doing that.

Mike

On Tue, Mar 31, 2009 at 9:30 AM, Grant Ingersoll <[hidden email]> wrote:

> Has the way fields get added changed recently?
>  http://www.lucidimagination.com/search/document/954555c478002a3/empty_sinktokenizer



Re: Empty Sink Tokenizer

Grant Ingersoll
Well, we don't make any guarantees about it in the docs, AFAICT, but
we have in the past advertised it (via the mailing lists) as such.  It
sounds like the Tee/Sink stuff does rely on what had been the de facto
behavior up until 2.3.  The snippet of code I included can easily be
converted to a test case if we wish to enforce it going forward.

What's the benefit of collation?  I don't know if this is considered a
back-compatibility break or not (likely not), but this issue does come
up from time to time and there are people who have relied on our
answer.

In the end, we should document whichever way it is going to be and
then make sure the Tee/Sink stuff documents it as well.



On Mar 31, 2009, at 10:51 AM, Michael McCandless wrote:

> Uh-oh: I think this happened as part of LUCENE-843, which landed in  
> 2.3.
>
> IndexWriter now first collates each Field instance, by name, and then
> visits those fields in sorted order.  Multiple instances of the same
> field name are written in the order that they appeared in the
> document.
>
> StoredFieldsWriter taps into the indexing chain after that per-field
> collation.
>
> But, if getting back to this is important, we should be able to move
> StoredFieldsWriter up in the chain so that it visits the original
> document, instead.  Offhand, I'm not sure if there are any tradeoffs
> in doing that.
>
> Mike




Re: Empty Sink Tokenizer

Yonik Seeley
On Tue, Mar 31, 2009 at 12:26 PM, Grant Ingersoll <[hidden email]> wrote:
> What's the benefit of collation?

AFAIK, the main reason is to handle multi-valued fields.  The need to
sort stems partly from the fact that the Document class does not
explicitly represent multi-valued fields.

Solr must also sort/hash the Document returned from Lucene because of
this inability to represent a field with multiple values.
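
Illustratively, the regrouping a client ends up doing looks something
like this sketch (made-up names, not Solr's actual code):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;

// Sketch only: rebuild a name -> values view of a retrieved document,
// since Document has no first-class multi-valued representation.
public class MultiValuedViewSketch {
  public static Map<String, List<String>> byName(Document storedDoc) {
    Map<String, List<String>> view =
        new LinkedHashMap<String, List<String>>();
    for (Iterator it = storedDoc.getFields().iterator(); it.hasNext();) {
      Fieldable f = (Fieldable) it.next();
      List<String> vals = view.get(f.name());
      if (vals == null) {
        vals = new ArrayList<String>();
        view.put(f.name(), vals);
      }
      vals.add(f.stringValue());  // values of one name keep stored order
    }
    return view;
  }
}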

Perhaps this is something that should be fixed in Lucene 3, along with
the input/output document issues (documents you get back shouldn't
have things like boosts)?


-Yonik
http://www.lucidimagination.com



Re: Empty Sink Tokenizer

Michael McCandless
There are two separate things here.

First, indexed fields are now processed in alpha order (a
stable/partial sort for multi-valued fields), as of 2.3.  That, I
think, is something internal to Lucene, and I'm not sure we should
make promises one way or another about what order Lucene visits the
fields of a document (maybe someday multiple threads will run on
fields... who knows).

This, I think, was the original problem reported on java-user: Lucene
tried to pull tokens from the sink before it was filled.

I'm not sure how best to fix Sink/TeeTokenizer to "be flexible".
Maybe we could change it so that whichever TokenStream is pulled first
pulls from the true source and populates the other as a sink.  Then
when the other is used, it's always populated.

So a TeeTokenFilter would take a single source and export two copies
(TokenFilters) and you'd use those copies in your fields.
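
Sketched against the old TokenStream API, that might look something
like this (BufferingTee is a made-up name, not the shipped
Tee/SinkTokenizer, and there are surely details to work out):

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Sketch only: whichever branch is consumed first drains the shared
// source on demand and buffers a copy of each token for the other
// branch, so the order in which the fields are processed no longer
// matters.
public class BufferingTee {
  private final TokenStream source;
  private final LinkedList<Token> bufA = new LinkedList<Token>();
  private final LinkedList<Token> bufB = new LinkedList<Token>();

  public BufferingTee(TokenStream source) {
    this.source = source;
  }

  public TokenStream branchA() { return new Branch(bufA); }
  public TokenStream branchB() { return new Branch(bufB); }

  // Pull one token from the true source and queue a copy for each branch.
  private boolean pull() throws IOException {
    Token t = source.next();
    if (t == null) {
      return false;  // source exhausted
    }
    bufA.add((Token) t.clone());
    bufB.add((Token) t.clone());
    return true;
  }

  private class Branch extends TokenStream {
    private final LinkedList<Token> buf;

    Branch(LinkedList<Token> buf) {
      this.buf = buf;
    }

    public Token next() throws IOException {
      if (buf.isEmpty() && !pull()) {
        return null;  // nothing buffered and the source is done
      }
      return buf.removeFirst();
    }
  }
}

Each field would then consume its own branch, and it would work no
matter which field IndexWriter happens to visit first.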

The second issue is that when you store fields in a document and
retrieve the document later, the fields have been sorted by name.  I
think this is basically another case of "the document you provided
during indexing isn't the same thing as what you retrieve at search
time" (as Yonik said).

I'm also not sure we should promise it (though reverting to the
pre-2.3 approach is possible, and easier than changing the order in
which we invert fields), or maybe we should wait until we work out
input vs. output documents.  And, apparently, not many people have
noticed this second issue...

Mike

On Tue, Mar 31, 2009 at 12:26 PM, Grant Ingersoll <[hidden email]> wrote:

> Well, we don't make any guarantees about it in the docs, AFAICT, but
> we have in the past advertised it (via the mailing lists) as such.  It
> sounds like the Tee/Sink stuff does rely on what had been the de facto
> behavior up until 2.3.  The snippet of code I included can easily be
> converted to a test case if we wish to enforce it going forward.
>
> What's the benefit of collation?  I don't know if this is considered a
> back-compatibility break or not (likely not), but this issue does come
> up from time to time and there are people who have relied on our
> answer.
>
> In the end, we should document whichever way it is going to be and
> then make sure the Tee/Sink stuff documents it as well.



Re: Empty Sink Tokenizer

Grant Ingersoll

On Mar 31, 2009, at 2:38 PM, Michael McCandless wrote:

> I'm not sure how best to fix Sink/TeeTokenizer to "be flexible".
> Maybe we could change it so that whichever TokenStream is pulled first
> pulls from the true source and populates the other as a sink.  Then
> when the other is used, it's always populated.
>

Interesting, that could work, but it would require some work to
achieve.  Of course, the workaround is to just document the collation
factor and then people can rename fields, I guess.  I'll try to find
some spare time to play around with this.  Personally, I think the
Sink/Tee stuff could be used in more places than it is, for things
like copy fields and extraction problems.
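
For the record, a sketch of that rename workaround (the field names
are made up, and it leans on the undocumented 2.3+ alpha order, so it
could break in a future release):

import java.io.StringReader;
import java.util.ArrayList;

import org.apache.lucene.analysis.SinkTokenizer;
import org.apache.lucene.analysis.TeeTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only: name the tee'd field so it sorts before the sink field,
// so the sink is already filled when IndexWriter reads from it.
public class TeeRenameWorkaround {
  public static Document build(String text) {
    SinkTokenizer sink = new SinkTokenizer(new ArrayList());
    TokenStream tee = new TeeTokenFilter(
        new StandardAnalyzer().tokenStream("a_body", new StringReader(text)),
        sink);

    Document doc = new Document();
    doc.add(new Field("a_body", tee));   // "a..." sorts first, fills the sink
    doc.add(new Field("b_copy", sink));  // processed second, already populated
    return doc;
  }
}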



> So a TeeTokenFilter would take a single source and export two copies
> (TokenFilters) and you'd use those copies in your fields.
>
> The second issue is that when you store fields in a document and
> retrieve the document later, the fields have been sorted by name.  I
> think this is basically another case of "the document you provided
> during indexing isn't the same thing as what you retrieve at search
> time" (as Yonik said).
>
> I'm also not sure we should promise it (though reverting to the
> pre-2.3 approach is possible, and easier than changing the order in
> which we invert fields), or maybe we should wait until we work out
> input vs. output documents.  And, apparently, not many people have
> noticed this second issue...
>

Agreed.  I think 3.0 warrants some rework of Documents as Hoss, Yonik  
and others have suggested, but are we then supposed to have it figured  
out by 2.9 in order to deprecate the existing approach?

-Grant



Re: Empty Sink Tokenizer

Michael McCandless
On Wed, Apr 1, 2009 at 10:28 AM, Grant Ingersoll <[hidden email]> wrote:

> Interesting, that could work, but it would require some work to
> achieve.  Of course, the workaround is to just document the collation
> factor and then people can rename fields, I guess.  I'll try to find
> some spare time to play around with this.  Personally, I think the
> Sink/Tee stuff could be used in more places than it is, for things
> like copy fields and extraction problems.

I'm not sure we should document it, since that implies we won't change it.

I think on the lists we can say "this is how it currently works but it
could change in any release", i.e., you're relying on an "undocumented
feature" when you depend on the order in which Lucene processes the
fields.

> Agreed.  I think 3.0 warrants some rework of Documents as Hoss, Yonik and
> others have suggested, but are we then supposed to have it figured out by
> 2.9 in order to deprecate the existing approach?

(Yes, we'd need to have the new way working, deprecating the old way,
in 2.9.)

I too would love to see this done in time for 2.9... any volunteers
out there?

Mike
