How to combine multiple fields to a single field for indexing


Suba Suresh
The book "Lucene in Action" says it is better practice to combine two
fields into one field and index that than to use MultiFieldQueryParser.
Do I index both fields individually first and then index them again together?
When I index them together, do I index the field names or the values? Can
someone give me an example of how to do it?

thanks,
suba suresh.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: How to combine multiple fields to a single field for indexing

Erik Hatcher

On Aug 23, 2006, at 11:36 AM, Suba Suresh wrote:
> In "Lucene In Action" book it says it is better practice to combine  
> two fields into one field and index it than use the  
> MultiFieldQueryParser. Do I initially index both the fields and  
> then index them again together? When I index them together do I  
> index the fieldnames or values? Can someone give me an example of  
> how to do it?

What I do is simply index all the fields individually that need to be  
searchable or just stored, but also index a general-purpose  
"contents" field with all of that same text.

You can add multiple fields of the same name to a document, making it  
easy to just keep appending to a "contents" field for a document.  
You can see how this is done in the Lucene in Action code in the  
TestDataDocumentHandler.java - however I took a cruder approach and  
appended the fields together with a space in between them rather than  
using the multiple valued field approach.  Either technique will work  
just fine.

        Erik
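
A minimal sketch of the two approaches described above, written in plain Java rather than against the Lucene API. The `Entry` class, the method names and the sample title/author values are all hypothetical stand-ins; the only thing taken from the message is the idea of keeping the individual fields and also adding a catch-all "contents" field, either as one concatenated value or as several same-named instances. Note that in both cases it is the field values, not the field names, that go into "contents".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a document as an ordered list of (name, value) fields.
// Hypothetical stand-in for illustration; this is not Lucene's Document API.
class CatchAllFieldSketch {

    // One field instance: a field name and its text value.
    static final class Entry {
        final String name;
        final String value;

        Entry(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Approach 1: keep the individual fields and add a single "contents"
    // field whose value is all the field values joined by a space.
    static List<Entry> withConcatenatedContents(List<Entry> fields) {
        List<Entry> doc = new ArrayList<>(fields);
        List<String> values = new ArrayList<>();
        for (Entry f : fields) {
            values.add(f.value);
        }
        doc.add(new Entry("contents", String.join(" ", values)));
        return doc;
    }

    // Approach 2: keep the individual fields and add one "contents"
    // instance per source field (a multi-valued field).
    static List<Entry> withMultiValuedContents(List<Entry> fields) {
        List<Entry> doc = new ArrayList<>(fields);
        for (Entry f : fields) {
            doc.add(new Entry("contents", f.value));
        }
        return doc;
    }

    public static void main(String[] args) {
        // Sample values chosen only for the demonstration.
        List<Entry> fields = Arrays.asList(
                new Entry("title", "Lucene in Action"),
                new Entry("author", "Erik Hatcher"));
        for (Entry e : withMultiValuedContents(fields)) {
            System.out.println(e.name + " = " + e.value);
        }
    }
}
```

With real Lucene the same shape would presumably be one `Document` to which the individual fields and the "contents" instances are added; the exact field constructors depend on the Lucene version, so they are left out here.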



Re: How to combine multiple fields to a single field for indexing

KEGan
Erik,

What is generally the reason for indexing both the individual fields and the
general-purpose "content" field?

Also, if we search the general-purpose "content" field, wouldn't the
following problem occur? Let's say we have 2 fields with the following values:

name : John Smith
food : subway sandwich

So the general-purpose "content" field would have the following value:

John Smith subway sandwich

Hence, if the user searches for "smith subway" (with quotation marks), this
document will be returned. On the other hand, if both fields were indexed
separately, this document would not be returned, since no field
contains the value "smith subway".

How do we work around this problem?



Re: How to combine multiple fields to a single field for indexing

Chris Hostetter-3

: What is generally the reason for indexing both individual fields, and the
: general-purpose "content" field ?

so you can explicitly query for "name:paris" or "city:paris" instead of
just "paris"

: name : John Smith
: food  : subway sandwich
:
: So the general-purpose "content" would have the following values:
:
: John Smith subway sandwich
:
: Hence, if the user search for "smith subway" (with quotation), the said

not exactly ... this is where the position increment gap of your Analyzer
comes in.  you can say how much of a gap exists between two separate values in
the same field, so if your gap is 10 then contents:"smith subway"~5 won't
match ... but contents:(smith subway) will


-Hoss
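
To make the gap arithmetic concrete, here is a small plain-Java model of it. It is a sketch under stated assumptions, not Lucene code: tokens are simply lower-cased whitespace-separated words, each additional value of the field starts `gap` extra positions after the previous value's last token, and a two-term phrase matches when the second term is within `slop` positions of where an exact phrase would expect it. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified model of token positions in a multi-valued field.
class PositionGapSketch {

    // Maps each token to the positions where it occurs. Tokens inside one
    // value are consecutive; every further value is shifted by `gap`.
    static Map<String, List<Integer>> positions(List<String> values, int gap) {
        Map<String, List<Integer>> index = new HashMap<>();
        int pos = -1;                 // position of the last token seen so far
        boolean firstValue = true;
        for (String value : values) {
            if (!firstValue) {
                pos += gap;           // the gap between two values
            }
            firstValue = false;
            String text = value.trim().toLowerCase(Locale.ROOT);
            if (text.isEmpty()) {
                continue;
            }
            for (String token : text.split("\\s+")) {
                pos++;
                index.computeIfAbsent(token, k -> new ArrayList<>()).add(pos);
            }
        }
        return index;
    }

    // True if `second` occurs within `slop` positions of the slot directly
    // after some occurrence of `first` (slop 0 means an exact phrase).
    static boolean phraseMatches(Map<String, List<Integer>> index,
                                 String first, String second, int slop) {
        List<Integer> firstPositions = index.getOrDefault(first, Collections.emptyList());
        List<Integer> secondPositions = index.getOrDefault(second, Collections.emptyList());
        for (int p1 : firstPositions) {
            for (int p2 : secondPositions) {
                if (Math.abs(p2 - (p1 + 1)) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("John Smith", "subway sandwich");
        Map<String, List<Integer>> gapped = positions(values, 10);
        System.out.println(phraseMatches(gapped, "smith", "subway", 5));   // false
        System.out.println(phraseMatches(gapped, "smith", "subway", 10));  // true
    }
}
```

In this model "smith" sits at position 1 and "subway" at position 12 when the gap is 10, so a slop of 5 cannot bridge the two values, while a plain term query for either word is unaffected because it ignores positions.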



Re: How to combine multiple fields to a single field for indexing

Gopikrishnan Subramani
In reply to this post by KEGan
Erik has used a space as the field separator. Maybe you can use a
different field separator that your analyzer won't eat up, so that it
changes the token position by 1.

Gopi
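
A rough plain-Java illustration of the separator idea, under the assumption that the analyzer keeps the separator as an ordinary token. The separator string `xsepx`, the class and the method names are invented for the example. It shows that one surviving separator token shifts the following tokens by one position, which blocks an exact phrase across the boundary but not a phrase with a slop of 1.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Model of concatenating field values with a separator token that is
// assumed to survive analysis. Not Lucene code.
class SeparatorTokenSketch {

    // Hypothetical separator; any token the analyzer does not drop would do.
    static final String SEPARATOR = "xsepx";

    // Joins the values with the separator and splits on whitespace.
    static List<String> tokens(String... values) {
        String joined = String.join(" " + SEPARATOR + " ", values);
        return Arrays.asList(joined.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Smallest slop at which the phrase "first second" would match in this
    // model, or -1 if one of the terms is missing.
    static int requiredSlop(List<String> tokens, String first, String second) {
        int best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second)) {
                    int distance = Math.abs(j - (i + 1));
                    if (best < 0 || distance < best) {
                        best = distance;
                    }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> separated = tokens("John Smith", "subway sandwich");
        System.out.println(separated);
        System.out.println(requiredSlop(separated, "smith", "subway")); // 1
    }
}
```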


Re: How to combine multiple fields to a single field for indexing

KEGan
I think I am starting to understand this :) Thanks, guys.

~KEGan



Re: How to combine multiple fields to a single field for indexing

Erik Hatcher
In reply to this post by Gopikrishnan Subramani
Yeah, I used a cruder form by appending all the text together into a  
single string with a space separator in that LIA example.

Given the position increment gap between instances of same-named  
fields that is now part of Lucene, I recommend using multiple field  
instances instead.

        Erik




Re: How to combine multiple fields to a single field for indexing

Suba Suresh
Thanks for everyone's help. I understand how it works now. I can get rid
of MultiFieldQueryParser in search.

thanks
suba suresh.



Re: How to combine multiple fields to a single field for indexing

KEGan
In reply to this post by Erik Hatcher
Erik,

"Given the position increment gap between instances of same-named
fields that is now part of Lucene, I recommend using multiple field
instances instead."

Did you mean ... recommend "NOT" using multiple field instances?

If we want to do a query like "name:John", or boosting of fields ... then we
have to use multiple field instances, right?



Re: How to combine multiple fields to a single field for indexing

Erik Hatcher

On Aug 26, 2006, at 5:11 AM, KEGan wrote:
> Erik,
>
> "Given the position increment gap between instances of same-named
> fields that is now part of Lucene, I recommend using multiple field
> instances instead."
>
> Did you mean ... recommend "NOT" using multiple field ?

I said what I meant accurately.  There are two ways to build a single
aggregate search field, say "contents": concatenate the text into a
single string in a single field instance, or add multiple "contents"
instances that can be separated by a position increment gap.  I
recommend the second approach.

But...

> If we want to do a query like "name:John", or boosting of fields ...
> then we
> have to use multiple field instances, right ?

of course.

        Erik




Re: How to combine multiple fields to a single field for indexing

KEGan
Thanks. I think I grasp the concept now :)


Re: How to combine multiple fields to a single field for indexing

Mark Miller-3
In reply to this post by Erik Hatcher
How do you set the position increment gap between each addition to the
same field name? Should you set it as high as possible to prevent
proximity queries from crossing it? I have been looking for the code to
find out how to put a gap between each same-name field addition, but I
have been unable to find what I am looking for. Also, when using a
nearspan, things blow up if you look for something within
Integer.maximum--sic :) -- Will this be the same case for setting the
positional gap, and if so, is there a good max to use to keep a query from
ever crossing it?

Thanks,

Mark


Re: How to combine multiple fields to a single field for indexing

Chris Hostetter-3

: How do you set the position increment gap between each addition to the

it doesn't have an explicit setter, you just subclass the Analyzer of your
choosing and override getPositionIncrementGap to return the value of your
choosing -- it could be a fixed value, or your Analyzer could be
sophisticated and know to put in a larger gap after it sees special marker
values/tokens (ie: a gap of 10 after each "sentence", a gap of 100 after
each "paragraph", a gap of 100 after each "page", ...)
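
The subclass-and-override pattern can be sketched in plain Java as follows. `BaseAnalyzer` and `GappedAnalyzer` are hypothetical stand-ins, not Lucene classes; only the method name getPositionIncrementGap comes from the message above, and taking the field name as its argument is an assumption of this model. The helper shows where each value of a multi-valued field would start once the gap is applied.

```java
// Plain-Java model of overriding the position increment gap in a subclass.
class GapAnalyzerSketch {

    // Stand-in base class: by default values follow each other with no gap.
    static class BaseAnalyzer {
        int getPositionIncrementGap(String fieldName) {
            return 0;
        }
    }

    // Subclass that returns a fixed gap, but only for the "contents" field.
    static class GappedAnalyzer extends BaseAnalyzer {
        private final int gap;

        GappedAnalyzer(int gap) {
            this.gap = gap;
        }

        @Override
        int getPositionIncrementGap(String fieldName) {
            return "contents".equals(fieldName) ? gap : 0;
        }
    }

    // Given the number of tokens in each value of a field, returns the
    // position of the first token of every value.
    static int[] valueStartPositions(BaseAnalyzer analyzer, String fieldName,
                                     int[] tokenCounts) {
        int[] starts = new int[tokenCounts.length];
        int next = 0;
        for (int i = 0; i < tokenCounts.length; i++) {
            if (i > 0) {
                next += analyzer.getPositionIncrementGap(fieldName);
            }
            starts[i] = next;
            next += tokenCounts[i];
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] starts = valueStartPositions(new GappedAnalyzer(100), "contents",
                new int[] {2, 2});
        System.out.println(starts[0] + ", " + starts[1]); // 0, 102
    }
}
```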

: same field name. Should you set it as high as possible to prevent
: proximity queries from crossing it? I have been looking for the code to
        ...
: nearspan, things blow up if you look for something within
: Integer.maximum--sic :) -- Will this be the same case for setting the
: positional gap and if so is there a good max to use to keep a query from
: ever crossing it?

How big of a gap you should use depends entirely on how you want to use it
-- you could say that a gap of "10" is big enough if you know your
application will never ask for phrase/span queries with slop greater than
"10" ... or you could pick 100, or 1000 .. it's entirely up to you; the
question is do you ever *want* your clients to be able to "bridge the
gap"?  if so, then they need to know how big the gap is, if not then they
need to be prevented from asking for slop bigger than the gap.




-Hoss
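
One way to keep clients from bridging the gap is to cap the slop they may request before the query is built. The sketch below is plain Java with made-up names. It assumes a simplified position model in which a slop exactly equal to the gap is just enough to match two terms that sit on either side of a value boundary, so the cap is set one below the gap; real phrase and span queries may count distance differently, so treat the bound as an assumption to verify.

```java
// Caps a client-supplied slop so it stays below the position increment gap.
class SlopLimitSketch {

    // Largest slop that cannot bridge a gap of the given size in the
    // simplified model described in the lead-in.
    static int maxSafeSlop(int gap) {
        return Math.max(0, gap - 1);
    }

    // Returns the requested slop, reduced if it would reach the gap.
    static int clampSlop(int requestedSlop, int gap) {
        if (requestedSlop < 0) {
            throw new IllegalArgumentException("slop must not be negative");
        }
        return Math.min(requestedSlop, maxSafeSlop(gap));
    }

    public static void main(String[] args) {
        System.out.println(clampSlop(5, 10));   // 5
        System.out.println(clampSlop(50, 10));  // 9
    }
}
```

Capping the slop this way also means the gap only has to exceed the largest slop the application allows, so there is no need to push it toward the integer maximum.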

