multiple instances of fields or attributes

André Warnier (tomcat)
Hi.

I am totally new to Lucene, and currently investigating the usage of
Lucene for a new development project. In fact, for evaluation I am using
the C port of Lucene, through the Perl "Lucene" module.  I believe my
question is generic, but please tell me if it is otherwise.
(Please adapt this question to the Java environment if needed; what I
want to know is the fundamentals of Lucene.)

In Perl, to add items to the Lucene index, I do something like
my $doc = new Lucene::document;
$doc->addfield('title','value1');
$doc->addfield('author','value2');
$doc->addfield('subject','value3');
$lucene_writer->addDocument($doc);
and that works fine.

Now my question is : can I have separate "instances" of the field
'author' in the same document, like
'author' = 'Einstein, Albert'
'author' = 'Newton, Isaac'
'author' = 'Freud, Sigmund'

Could I just do several times
$doc->addfield('author','name');
and would Lucene index separate "instances" of this field for the same
document ?

The reason being that I would like to search something like "Einstein
Albert"~1  (adjacent), but without finding another document which would
have a concatenated field like "Thomas, Albert; Einstein, Joseph".
(The same case occurs for instance for a field "keywords".)

Does this question make sense with Lucene ?
If the above is not possible, then is this type of case usually handled
otherwise in Lucene, and how ?

Thank you in advance,
aw


Re: multiple instances of fields or attributes

Erik Hatcher
Yes, Lucene supports multiple instances of same-named fields.   There  
is one trick you'll need to leverage for the proximity operators to  
work as you expect - positionIncrementGap.

For example, if you index "Doe, John" and "Smith, Fred" as separate  
name field instances on the same document, a phrase query for  
name:"john smith" would match that document.  Setting the position  
increment gap to something greater than your desired phrase slop  
would prevent this.
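
To see why the gap matters, here is a minimal, library-free sketch in Java. It does not call Lucene; it only imitates the position numbering discussed in this thread, assuming lowercase tokens split on punctuation and whitespace, and assuming each new instance of a field starts at the previous last position plus one plus the gap. The names `GapAdjacencyDemo`, `index` and `adjacent` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GapAdjacencyDemo {

    // Assign a position to every token of every instance of one field.
    // Assumption: instance N+1 starts at (last position of instance N) + 1 + gap.
    static Map<String, List<Integer>> index(int gap, String... instances) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int next = 0;
        for (String instance : instances) {
            if (next > 0) next += gap; // no gap before the very first token
            for (String token : instance.toLowerCase().split("[^a-z0-9]+")) {
                if (token.isEmpty()) continue;
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(next++);
            }
        }
        return positions;
    }

    // True if `second` occurs exactly one position after `first` (exact phrase, slop 0).
    static boolean adjacent(Map<String, List<Integer>> idx, String first, String second) {
        List<Integer> secondPositions = idx.getOrDefault(second, Collections.emptyList());
        for (int p : idx.getOrDefault(first, Collections.emptyList())) {
            if (secondPositions.contains(p + 1)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] names = {"Doe, John", "Smith, Fred"};
        System.out.println(adjacent(index(0, names), "john", "smith"));   // gap 0
        System.out.println(adjacent(index(100, names), "john", "smith")); // gap 100
    }
}
```

With a gap of 0 the tokens of both names are numbered consecutively, so "john" and "smith" end up next to each other; with a gap of 100 they do not.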

        Erik


On Feb 3, 2008, at 4:14 PM, André Warnier wrote:

> Hi.
>
> I am totally new to Lucene, and currently investigating the usage of
> Lucene for a new development project. In fact, for evaluation I am using
> the C port of Lucene, through the Perl "Lucene" module.  I believe my
> question is generic, but please tell me if it is otherwise.
> (Please adapt this question to the Java environment if needed; what I
> want to know is the fundamentals of Lucene.)
>
> In Perl, to add items to the Lucene index, I do something like
> my $doc = new Lucene::document;
> $doc->addfield('title','value1');
> $doc->addfield('author','value2');
> $doc->addfield('subject','value3');
> $lucene_writer->addDocument($doc);
> and that works fine.
>
> Now my question is : can I have separate "instances" of the field
> 'author' in the same document, like
> 'author' = 'Einstein, Albert'
> 'author' = 'Newton, Isaac'
> 'author' = 'Freud, Sigmund'
>
> Could I just do several times
> $doc->addfield('author','name');
> and would Lucene index separate "instances" of this field for the same
> document ?
>
> The reason being that I would like to search something like "Einstein
> Albert"~1  (adjacent), but without finding another document which would
> have a concatenated field like "Thomas, Albert; Einstein, Joseph".
> (The same case occurs for instance for a field "keywords".)
>
> Does this question make sense with Lucene ?
> If the above is not possible, then is this type of case usually handled
> otherwise in Lucene, and how ?
>
> Thank you in advance,
> aw
>

Re: multiple instances of fields or attributes

André Warnier (tomcat)
Erik Hatcher wrote:
> Yes, Lucene supports multiple instances of same-named fields.   There is
> one trick you'll need to leverage for the proximity operators to work as
> you expect - positionIncrementGap.
>
Thanks.
Could you point me to some item of documentation that mentions this
"positionIncrementGap" ?  That's a Query parameter I presume ?

André


Re: multiple instances of fields or attributes

Erik Hatcher

On Feb 4, 2008, at 2:36 PM, André Warnier wrote:
> Erik Hatcher wrote:
>> Yes, Lucene supports multiple instances of same-named fields.    
>> There is one trick you'll need to leverage for the proximity  
>> operators to work as you expect - positionIncrementGap.
> Thanks.
> Could you point me to some item of documentation that mentions this  
> "positionIncrementGap" ?  That's a Query parameter I presume ?

It is an Analyzer setting:

<http://lucene.apache.org/java/2_3_0/api/core/org/apache/lucene/analysis/Analyzer.html#getPositionIncrementGap(java.lang.String)>


Re: multiple instances of fields or attributes

André Warnier (tomcat)

Erik Hatcher wrote:

>
> On Feb 4, 2008, at 2:36 PM, André Warnier wrote:
>> Erik Hatcher wrote:
>>> Yes, Lucene supports multiple instances of same-named fields.   There
>>> is one trick you'll need to leverage for the proximity operators to
>>> work as you expect - positionIncrementGap.
>> Thanks.
>> Could you point me to some item of documentation that mentions this
>> "positionIncrementGap" ?  That's a Query parameter I presume ?
>
> It is an Analyzer setting:
>
> <http://lucene.apache.org/java/2_3_0/api/core/org/apache/lucene/analysis/Analyzer.html#getPositionIncrementGap(java.lang.String)>
>

All right, I admit, I'm thick: I read it, but I don't get it.
Does anyone have an example of how this works ?
(or an explanation in plain French-speaker-friendly tutorial-like English ?)

Thanks,
André


Re: multiple instances of fields or attributes

Doron Cohen-2
On Thu, Feb 7, 2008 at 6:03 PM, André Warnier <[hidden email]> wrote:

> ...
> Does anyone have an example of how this works ?
> (or an explanation in plain French-speaker-friendly tutorial-like English
> ?)
>

Do you mean "how to make it work for you" or "how does it work inside"?
The first option is easier to explain (though I know no French :))
When you create an IndexWriter you provide it with an Analyzer.
That analyzer is used when a document is added to the index.
The analyzer's getPositionIncrementGap() method specifies the position
gap between separate additions of the same field. By default it
returns 0 (which does not work well in your example). To modify this
you can override this method in "your" analyzer to return a nonzero gap,
for example 5. This is easy when subclassing any existing analyzer.
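
As a rough illustration of the override described above, the sketch below uses stand-in classes rather than Lucene itself, so that it runs without any library on the classpath. `SketchAnalyzer` is NOT Lucene's Analyzer; it only mimics the one method discussed here, `getPositionIncrementGap(String fieldName)`, which is the method name given in the Javadoc linked earlier. `GappedAnalyzer` and `firstPositionOfNextInstance` are hypothetical names. With Lucene, one would subclass an existing analyzer and override the same method in the same way.

```java
final class OverrideGapSketch {

    // Stand-in for an analyzer base class (not Lucene's Analyzer).
    static class SketchAnalyzer {
        int getPositionIncrementGap(String fieldName) {
            return 0; // the default mentioned in this thread
        }
    }

    // "Your" analyzer: identical otherwise, but with a nonzero gap.
    static class GappedAnalyzer extends SketchAnalyzer {
        @Override
        int getPositionIncrementGap(String fieldName) {
            return 5;
        }
    }

    // Hypothetical helper: the position at which the next instance of a field
    // would start, given how many tokens that field already holds
    // (positions 0 .. tokensSoFar - 1 are taken).
    static int firstPositionOfNextInstance(SketchAnalyzer analyzer, String field, int tokensSoFar) {
        return tokensSoFar + analyzer.getPositionIncrementGap(field);
    }

    public static void main(String[] args) {
        // "Doe, John" yields two tokens, at positions 0 and 1.
        System.out.println(firstPositionOfNextInstance(new SketchAnalyzer(), "name", 2));
        System.out.println(firstPositionOfNextInstance(new GappedAnalyzer(), "name", 2));
    }
}
```

The method receives the field name, so a real override could return a gap for some fields only (for instance "author" or "keywords") and 0 for the rest.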

Doron


Re: multiple instances of fields or attributes

André Warnier (tomcat)


Doron Cohen wrote:

> On Thu, Feb 7, 2008 at 6:03 PM, André Warnier <[hidden email]> wrote:
>
>> ...
>> Does anyone have an example of how this works ?
>> (or an explanation in plain French-speaker-friendly tutorial-like English
>> ?)
>>
>
> Do you mean "how to make it work for you" or "how does it work inside"?
> The first option is easier to explain (though I know no French :))
> When you create an IndexWriter you provide it with an Analyzer.
> That analyzer is used when a document is added to the index.
> The analyzer's getPositionIncrementGap() method specifies the position
> gap between separate additions of the same field. By default it
> returns 0 (which does not work well in your example). To modify this
> you can override this method in "your" analyzer to return a nonzero gap,
> for example 5. This is easy when subclassing any existing analyzer.
>
> Doron
>

Now I may be starting to get it (although we French-speaking guys are
slow (but thorough)).  Do you mean the following (add question mark at
end) :
- imagine that I would create a field "descriptors" for each of my documents
- prior to adding a "phrase" to the "descriptors" field, I pass it
through an Analyser, the Analyser breaks it down into words, and notes
for each word the position in the phrase...
- then the Analyser feeds it into the index, where the individual words
are stored, together with their relative position in the "phrase"...
- so that, for instance (ignoring any stripping of stopwords), the
phrase "the white cat jumped over the sleeping dog"  is now stored in
the "descriptors" index as "1:the 2:white 3:cat 4:jumped 5:over 6:the
7:sleeping 8:dog", the "n:" prefixes (so to speak) being the positions
in the phrase/field..
- so that, if I later search for "white cat"~1 in "descriptors", it will
find this document, because the "distance" between "white" and "cat" is
1 (or 0, depending how one counts) ..
- now, if I (forcefully) specify a "PositionIncrementGap" of 10 to my
Analyser, then for the second addition to the same "descriptors" field,
it will start the numbering at 19 (?).
- thus if for instance the second instance of "descriptors" is the
phrase "the cow bit the cat", this will be indexed as "19:the 20:cow
21:bit 22:the 23:cat".
- and when searching for "dog cow"~5, it would not find this document,
because the gap between "8:dog" and "20:cow" is greater than 5 ?

Is it something like that, or have I not got it at all ?

To generalise my question, what I would like to know is this : assuming
I have two "descriptors" for the same document : "Electrical and
Electronic Engineering" and "Engineering Studies".
Is there a way to index this document (among others), and to later do a
search which will find the documents which have a "descriptors"
containing both "Electronic" and "Studies" in the same instance of
"descriptors", thus not finding this one ?

Thanks

Re: multiple instances of fields or attributes

Doron Cohen-2
See below...

On Tue, Feb 12, 2008 at 10:08 PM, André Warnier <[hidden email]> wrote:

> Now I may be starting to get it (although we French-speaking guys are
> slow (but thorough)).  Do you mean the following (add question mark at
> end) :
> - imagine that I would create a field "descriptors" for each of my
> documents
> - prior to adding a "phrase" to the "descriptors" field, I pass it
> through an Analyser, the Analyser breaks it down into words, and notes
> for each word the position in the phrase...


This is true. Just note that (1) passing the text through the analyzer is
usually done for you by the IndexWriter, and (2) you are adding text (rather
than a phrase), and that text, depending on the field properties, is
analyzed into tokens.

> - then the Analyser feeds it into the index, where the individual words
> are stored, together with their relative position in the "phrase"...
> - so that, for instance (ignoring any stripping of stopwords), the
> phrase "the white cat jumped over the sleeping dog"  is now stored in
> the "descriptors" index as "1:the 2:white 3:cat 4:jumped 5:over 6:the
> 7:sleeping 8:dog", the "n:" prefixes (so to speak) being the positions
> in the phrase/field..


Yes, though usually starting at position 0.

> - so that, if I later search for "white cat"~1 in "descriptors", it will
> find this document, because the "distance" between "white" and "cat" is
> 1 (or 0, depending how one counts) ..


Yes, though the default slop is 0, so "white jumped" would not match
but "white jumped"~1 will match.
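
The slop rule can be tried out with a small stand-alone sketch. It is a simplification: it assumes the two terms must appear in query order and counts slop as the number of tokens sitting between them. Lucene's actual sloppy phrase matching is more general (for example, it can also match terms out of order at a higher slop). `SlopDemo` and `matches` are illustrative names, not Lucene API.

```java
import java.util.Arrays;
import java.util.List;

final class SlopDemo {

    // In-order proximity check for a two-term phrase with the given slop.
    static boolean matches(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = i + 1; j < tokens.size(); j++) {
                // j - i - 1 is the number of tokens between the two terms
                if (tokens.get(j).equals(second) && j - i - 1 <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "the", "white", "cat", "jumped", "over", "the", "sleeping", "dog");
        System.out.println(matches(tokens, "white", "cat", 0));    // adjacent terms
        System.out.println(matches(tokens, "white", "jumped", 0)); // one token in between
        System.out.println(matches(tokens, "white", "jumped", 1)); // slop 1 allows it
    }
}
```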

> - now, if I (forcefully) specify a "PositionIncrementGap" of 10 to my
> Analyser, then for the second addition to the same "descriptors" field,
> it will start the numbering at 19 (?).


Yes

> - thus if for instance the second instance of "descriptors" is the
> phrase "the cow bit the cat", this will be indexed as "19:the 20:cow
> 21:bit 22:the 23:cat".
> - and when searching for "dog cow"~5, it would not find this document,
> because the gap between "8:dog" and "20:cow" is greater than 5 ?
>
> Is it something like that, or have I not got it at all ?


Yes it is.
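
The arithmetic confirmed here can be written out as a short sketch. It assumes, as in the exchange above, that the next instance of a field starts at the last position used plus one plus the gap. The `firstPosition` parameter exists only so that both the 1-based numbering of the question and the 0-based numbering mentioned in the reply can be reproduced. The names are illustrative and this is not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

final class GapNumberingDemo {

    // Returns "position:token" labels for all instances of one field.
    static List<String> number(int firstPosition, int gap, String... instances) {
        List<String> out = new ArrayList<>();
        int next = firstPosition;
        for (int i = 0; i < instances.length; i++) {
            if (i > 0) next += gap; // skip `gap` positions between instances
            for (String token : instances[i].split(" ")) {
                out.add(next + ":" + token);
                next++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 1-based numbering with a gap of 10, as in the question.
        List<String> labels = number(1, 10,
                "the white cat jumped over the sleeping dog",
                "the cow bit the cat");
        System.out.println(labels);
        // "dog" is at 8 and "cow" at 20, so 11 positions separate them:
        // more than the slop of 5 in "dog cow"~5.
        System.out.println(20 - 8 - 1 > 5);
    }
}
```

Starting at position 0 instead, the same call places the second instance at 18 rather than 19; the distance between "dog" and "cow" is unchanged.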

> To generalise my question, what I would like to know is this : assuming
> I have two "descriptors" for the same document : "Electrical and
> Electronic Engineering" and "Engineering Studies".
> Is there a way to index this document (among others), and to later do a
> search which will find the documents which have a "descriptors"
> containing both "Electronic" and "Studies" in the same instance of
> "descriptors", thus not finding this one ?


Yes, you can do this by specifying a large enough gap and then using either
a sloppy phrase query (as above) or span-near queries, with a slop smaller
than the gap.
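
As a sketch of that idea, assume a gap of 100 and an unordered proximity window of 10 (both arbitrary values chosen for illustration; the window should cover the longest descriptor and stay below the gap). Two terms then count as being in the same instance only when their positions are close together. This imitates the effect of a sloppy phrase or span-near query without using Lucene. `SameInstanceDemo`, `positions` and `near` are hypothetical names, and the single descriptor "Electronic Engineering Studies" is a made-up contrast case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SameInstanceDemo {

    // Number the tokens of every instance of one field, leaving `gap`
    // unused positions between consecutive instances.
    static Map<String, List<Integer>> positions(int gap, String... instances) {
        Map<String, List<Integer>> idx = new HashMap<>();
        int next = 0;
        for (String instance : instances) {
            if (next > 0) next += gap;
            for (String token : instance.toLowerCase().split(" ")) {
                idx.computeIfAbsent(token, k -> new ArrayList<>()).add(next++);
            }
        }
        return idx;
    }

    // True if some occurrence of a and some occurrence of b lie at most
    // `window` positions apart, in either order.
    static boolean near(Map<String, List<Integer>> idx, String a, String b, int window) {
        for (int pa : idx.getOrDefault(a, Collections.emptyList())) {
            for (int pb : idx.getOrDefault(b, Collections.emptyList())) {
                if (Math.abs(pa - pb) <= window) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> twoDescriptors = positions(100,
                "Electrical and Electronic Engineering", "Engineering Studies");
        Map<String, List<Integer>> oneDescriptor = positions(100,
                "Electronic Engineering Studies");
        System.out.println(near(twoDescriptors, "electronic", "studies", 10)); // different instances
        System.out.println(near(oneDescriptor, "electronic", "studies", 10));  // same instance
    }
}
```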

Luke is a tool that allows you to search and inspect a Lucene index.
I think you will find it useful.

- Doron