how to support "implicit trailing wildcards"

how to support "implicit trailing wildcards"

yandong yao
Hi everyone,

How can I support an "implicit trailing wildcard" (*) in Solr? For example,
when searching Google for 'umoun', 'umount' is matched, and when searching
for 'mounta', 'mountain' is matched.

As far as I can see there are several ways, each with disadvantages:

1) Use EdgeNGramFilterFactory, so that 'umount' is indexed as 'u', 'um',
'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size
increases dramatically, and b) unrelated terms will match, e.g. a search for
'mount' will also match 'mountain'.

2) Use two-pass searching: the first pass looks up the given keyword in the
term dictionary through TermsComponent, and the second pass searches again
using the first matching term. E.g. when the user enters 'umoun',
TermsComponent matches 'umount', and 'umount' is then used for the search.
The disadvantages are: a) the query string has to be parsed to recognize
operators such as 'AND', 'OR', '+', '-', '"' (which is more complex because
I am using a PHP client), and b) the returned hit count is not for the
original search string, which affects other components such as an
auto-suggest component based on user search history and hit counts.

3) Write a custom SearchComponent, but I have no idea where or how to start.

Is there any other way to do this in Solr? Any feedback or suggestions are
welcome!

Thanks very much in advance!
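
To make the trade-off in option 1 concrete, here is a rough Python sketch of what edge n-gram indexing produces for a single token. It only imitates the idea; it is not Solr's implementation, and the function name and the min_gram/max_gram parameters are made up for this illustration.

```python
def edge_ngrams(term, min_gram=1, max_gram=None):
    """Return the leading-edge n-grams of `term`, shortest first.

    Illustration only: this mimics the kind of tokens an edge n-gram
    filter would add to the index, not Solr's actual code.
    """
    if max_gram is None:
        max_gram = len(term)
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


grams = edge_ngrams("umount")
print(grams)  # one indexed token per prefix, hence the larger index

# Disadvantage (b): the query term 'mount' is itself an edge n-gram of
# 'mountain', so a document containing only 'mountain' would match it.
print("mount" in edge_ngrams("mountain"))
```

A six-letter token turns into six indexed tokens, which is where the index growth comes from.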

Re: how to support "implicit trailing wildcards"

Bastian S.
Wildcard search is already built in; just use:

?q=umoun*
?q=mounta*
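
From a client, the same request could be assembled roughly as follows. This is only a sketch: the endpoint URL is an assumed local default, not something given in this thread, and wildcard_url is a made-up helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: a Solr select handler at this address; adjust to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def wildcard_url(prefix, base=SOLR_SELECT):
    """Build a select URL whose q parameter is `prefix` plus a trailing '*'."""
    return base + "?" + urlencode({"q": prefix + "*"})


url = wildcard_url("umoun")
print(url)
# Decoding the query string again shows the q parameter the server receives.
print(parse_qs(urlsplit(url).query)["q"])
```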


Re: how to support "implicit trailing wildcards"

yandong yao
Hi Bastian,

Sorry for not making it clear: I also want an exact match to score higher
than a wildcard match. That is, when searching for 'mount', documents
containing 'mount' should score higher than documents containing 'mountain',
whereas 'mount*' seems to treat 'mount' and 'mountain' the same.

Besides, I also want the query to be processed by the analyzer, but according to
http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer. The
rationale is that when I search for 'mounted', I also want documents
containing 'mount' to match.

So it seems the built-in wildcard search cannot satisfy my requirements, if I
understand correctly.

Thanks very much!
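
Because the wildcard term skips the analyzer, any normalization has to happen before the query is sent. A minimal client-side sketch, assuming the field's index-time analysis does nothing beyond lowercasing (normalize_prefix is a made-up helper, not a Solr API):

```python
def normalize_prefix(user_term):
    """Lowercase and trim a user term, then append the trailing wildcard.

    Assumption: the indexed tokens are lowercased and otherwise unchanged.
    Stemming (e.g. 'mounted' -> 'mount') is NOT reproduced here, so this
    does not cover the 'mounted' case described above.
    """
    return user_term.strip().lower() + "*"


print(normalize_prefix("  Mounta "))
```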



Re: how to support "implicit trailing wildcards"

britske
You could satisfy this by making two fields:
1. exactmatch
2. wildcardmatch

Use copyField in your schema to copy 1 --> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
This would score exact matches above (solely) wildcard matches.

Geert-Jan
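
For illustration, the request parameters for this two-field query could be built like this. It is a sketch only: the field names are the ones suggested above, and two_field_params is a made-up helper. The '+' in the URL above is the URL-encoded space between the two clauses.

```python
from urllib.parse import urlencode, parse_qs


def two_field_params(term, exact_field="exactmatch",
                     wildcard_field="wildcardmatch"):
    """Return an encoded query string with one exact and one wildcard clause.

    The two clauses are separated by a space and combined with q.op=OR,
    so a document matching both clauses collects both score contributions.
    """
    q = f"{exact_field}:{term} {wildcard_field}:{term}*"
    return urlencode({"q": q, "q.op": "OR"})


encoded = two_field_params("mount")
print(encoded)
print(parse_qs(encoded))  # decoded view of the same parameters
```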


Re: how to support "implicit trailing wildcards"

Jan Høydahl / Cominvent
Hi,

You don't need to duplicate the content into two fields to achieve this. Try this:

q=mount OR mount*

The exact match will always get a higher score than the wildcard match because wildcard matches use "constant score".

Making this work for multi-term queries is a bit trickier, but something along these lines should do:

q=(mount OR mount*) AND (everest OR everest*)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com
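
A client could generate the multi-term form above mechanically. The sketch below assumes plain whitespace-separated terms; it does not handle operators, quotes or special characters in the user input, and implicit_wildcard_query is a made-up helper name.

```python
def implicit_wildcard_query(user_input):
    """Expand each term into '(term OR term*)' and AND the groups together.

    Assumption: the input is a list of bare words. Query operators and
    quoted phrases would need real parsing and escaping first.
    """
    terms = user_input.split()
    clauses = [f"({term} OR {term}*)" for term in terms]
    return " AND ".join(clauses)


print(implicit_wildcard_query("mount everest"))
# (mount OR mount*) AND (everest OR everest*)
```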


Re: how to support "implicit trailing wildcards"

yandong yao
Hi Jan,

It seems q=mount OR mount* sorts the documents containing 'mount' in a
different order than q=mount does.
I changed it to q=mount^100 OR (mount?* -mount)^1.0, and it tests well.

Thanks very much!
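
The query above can be produced for any single term with a small helper. This is a sketch with the boosts from the message as defaults; boosted_prefix_query is a made-up name, and the term is assumed to be a bare word that needs no escaping.

```python
def boosted_prefix_query(term, exact_boost=100, wildcard_boost=1.0):
    """Build 'term^B1 OR (term?* -term)^B2'.

    The first clause boosts exact matches. The second matches longer words
    starting with `term` ('?' requires at least one more character) and
    excludes documents that already contain the exact term.
    """
    return f"{term}^{exact_boost} OR ({term}?* -{term})^{wildcard_boost}"


print(boosted_prefix_query("mount"))
# mount^100 OR (mount?* -mount)^1.0
```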


Re: how to support "implicit trailing wildcards"

Jan Høydahl / Cominvent
I guess q=mount OR (mount*)^0.01 would work equally well, i.e. by diminishing the effect of wildcard matches.
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com
