Preventing phrase queries from matching across lines

Eric Jain
What is the best way to prevent a phrase query such as "eggs white"
from matching "fried eggs\nwhite snow"?

Two possibilities I have thought about:

1. Replace all line breaks with a special string, e.g. "newline".
2. Have an analyzer somehow increment the position of a term for each line
break it encounters.

The latter seems a bit more complicated to implement, but it would also be
more efficient, right? Or are there better options?
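
For readers of the archive, here is a minimal sketch of the problem and of option 1. It is a toy model in Python, not Lucene code: the `tokenize` and `phrase_matches` helpers are hypothetical names, and the tokenizer is assumed to split on any non-word character, which is why the line break disappears.

```python
import re

def tokenize(text):
    """Split on non-word characters, so a line break vanishes like any other whitespace."""
    return re.findall(r"\w+", text.lower())

def phrase_matches(tokens, phrase):
    """Return True if the phrase terms occur at consecutive positions in tokens."""
    terms = phrase.lower().split()
    return any(tokens[i:i + len(terms)] == terms
               for i in range(len(tokens) - len(terms) + 1))

text = "fried eggs\nwhite snow"

# Plain tokenization: "eggs" and "white" become neighbours, so the phrase matches.
plain = tokenize(text)

# Option 1: turn every line break into a marker term before tokenizing.
marked = tokenize(text.replace("\n", " newline "))

print(phrase_matches(plain, "eggs white"))   # unwanted match across the line break
print(phrase_matches(marked, "eggs white"))  # the marker term sits between the two words
```

One side effect of option 1 in this model is that a search for the word "newline" itself would hit the marker, so a marker string that cannot occur in real text is probably the safer choice.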

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Preventing phrase queries from matching across lines

Erik Hatcher

On Apr 28, 2006, at 5:35 AM, Eric Jain wrote:

> What is the best way to prevent a phrase query such as "eggs white"  
> matching "fried eggs\nwhite snow"?
>
> Two possibilities I have thought about:
>
> 1. Replace all line breaks with a special string, e.g. "newline".
> 2. Have an analyzer somehow increment the position of a term for  
> each line break it encounters.
>
> Latter seems a bit more complicated to implement, but it would also  
> be more efficient, right? Or are there better options?

#2 shouldn't be too hard to implement, but you'll need to catch
newlines in the initial tokenizer.  I'm not sure about the efficiency:
both options would require a tokenizer that detects newlines and either
injects a new term or sets a flag such that the next term gets a
position increment bump.

        Erik
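
The flag-based approach described above can be sketched roughly as follows. This is a Python model of the idea, not the Lucene tokenizer API (in Lucene the equivalent knob is the token's position increment). The names `tokenize_with_increments`, `positions`, `phrase_matches` and the `LINE_GAP` value are all assumptions made for illustration.

```python
import re

# Hypothetical gap size: number of unused positions left at each line break.
# One is enough for exact phrases; it should exceed any phrase slop in use.
LINE_GAP = 10

def tokenize_with_increments(text, line_gap=LINE_GAP):
    """Yield (term, position_increment); the first term after a line break gets a bump."""
    pending_gap = False
    for match in re.finditer(r"\w+|\n", text.lower()):
        piece = match.group()
        if piece == "\n":
            pending_gap = True      # remember the break, emit nothing for it
            continue
        yield piece, (1 + line_gap if pending_gap else 1)
        pending_gap = False

def positions(text):
    """Turn position increments into absolute positions, starting at 0."""
    pos, out = -1, []
    for term, inc in tokenize_with_increments(text):
        pos += inc
        out.append((term, pos))
    return out

def phrase_matches(indexed, phrase):
    """Exact phrase match: term i of the phrase must sit at position start + i."""
    terms = phrase.lower().split()
    lookup = {}
    for term, pos in indexed:
        lookup.setdefault(term, set()).add(pos)
    return any(all(start + i in lookup.get(t, ()) for i, t in enumerate(terms))
               for start in lookup.get(terms[0], ()))

indexed = positions("fried eggs\nwhite snow")
print(indexed)
print(phrase_matches(indexed, "eggs white"))
print(phrase_matches(indexed, "white snow"))
```

Several consecutive line breaks produce a single gap here because the flag is simply set again; whether blank lines should widen the gap is a design choice the thread does not settle.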


Re: Preventing phrase queries from matching across lines

Eric Jain
Erik Hatcher wrote:

> On Apr 28, 2006, at 5:35 AM, Eric Jain wrote:
>> What is the best way to prevent a phrase query such as "eggs white"
>> matching "fried eggs\nwhite snow"?
>>
>> Two possibilities I have thought about:
>>
>> 1. Replace all line breaks with a special string, e.g. "newline".
>> 2. Have an analyzer somehow increment the position of a term for each
>> line break it encounters.
>>
>> Latter seems a bit more complicated to implement, but it would also be
>> more efficient, right? Or are there better options?
>
> #2 shouldn't be too hard to implement, but you'll need to catch new
> lines in the initial tokenizer.  I'm not sure about the efficiency, both
> options would require a tokenizer detecting new lines and either
> injecting a new term or setting a flag such that the next term gets a
> position increment bump.

Thanks, #2 turned out to be easier to implement than expected. I should
have specified that the "efficiency" I was concerned about was not the
efficiency of the tokenization, but the impact of having all those
additional "newline" terms (and their positions) in the index.
