Wildcards / Binary searches

Wildcards / Binary searches

galo-2
Hi,

Three questions:

1. I want to use Solr for a kind of live search: querying with an
incomplete term plus a wildcard and getting back anything similar, so
that Radioh* would return everything containing that string. The DisMax
request handler doesn't accept wildcards in the q param, so I'm trying
the simple one, but I still have a problem: all my results come back
with score = 1 and I need them sorted by relevance. Is there a way of
doing this? Why doesn't * work in DisMax (nor ~, by the way)?

2. What do the phrase slop params do?

3. I'm trying to implement another index where I store a number of int
values for each document. Everything works OK as integers, but I'd like
to have some sort of fuzzy search based on the bit representation of
the numbers. Essentially, this number:

1001001010100

would be compared to these two

1011001010100
1001001010111

And the first would get a bigger score than the second, as it has only 1
flipped bit while the second has 2.

Is it possible to implement this in Solr?


Cheers,
galo


RE: Wildcards / Binary searches

Xuesong Luo
I have a similar question about dismax, here is what Chris said:

the dismax handler uses a much simpler query syntax than the
standard request handler.  Only +, -, and " are special characters, so
wildcards are not supported.


HTH

-----Original Message-----
From: galo [mailto:[hidden email]]
Sent: Wednesday, June 06, 2007 8:41 AM
To: [hidden email]
Subject: Wildcards / Binary searches



Re: Wildcards / Binary searches

Yonik Seeley-2
In reply to this post by galo-2
On 6/6/07, galo <[hidden email]> wrote:

> 3. I'm trying to implement another index where I store a number of int
> values for each document. Everything works ok as integers but i'd like
> to have some sort of fuzzy searches based on the bit representation of
> the numbers. Essentially, this number:
>
> 1001001010100
>
> would be compared to these two
>
> 1011001010100
> 1001001010111
>
> And the first would get a bigger score than the second, as it has only 1
> flipped bit while the second has 2.

You could store the numbers as a string field with the binary representation,
then try a fuzzy search.

  myfield:1001001010100~
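
To see how close this suggestion comes to the original bit-flip requirement, here is a minimal Python sketch (plain Python, not Solr code; the function names are made up for illustration). It compares the number of flipped bits (Hamming distance) with the Levenshtein edit distance that Lucene's fuzzy queries are based on. For the three numbers in the question the two measures agree, but for equal-length strings the edit distance can be smaller than the flipped-bit count, so a fuzzy query only approximates the ranking that was asked for.

```python
def hamming(a: str, b: str) -> int:
    """Count flipped bits between two equal-length binary strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    # XOR leaves a 1 bit wherever the two numbers differ.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


query = "1001001010100"
for candidate in ("1011001010100", "1001001010111"):
    print(candidate, hamming(query, candidate), levenshtein(query, candidate))

# A shifted pattern: every bit differs, yet fewer edits are enough.
print(hamming("0101", "1010"), levenshtein("0101", "1010"))
```

If exact flipped-bit ranking matters, the Hamming version is the one to reproduce in custom scoring code; the fuzzy query is the cheap approximation.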

-Yonik

Re: Wildcards / Binary searches

jjlarrea
In reply to this post by galo-2
At 4:40 PM +0100 6/6/07, galo wrote:
>1. I want to use solr for some sort of live search, querying with incomplete terms + wildcard and getting any similar results. Radioh* would return anything containing that string. The DisMax req. hander doesn't accept wildcards in the q param so i'm trying the simple one and still have problems as all my results are coming back with score = 1 and I need them sorted by relevance.. Is there a way of doing this? Why doesn't * work in dismax (nor ~ by the way)??

DisMax was written with the intent of supporting a simple search box in which one could type or paste some text, e.g. a title like

    Santa Clause: Is he Real (and if so, what is "real")?

and get meaningful results.  To do that it pre-processes the query string by removing unbalanced quotation marks and escaping characters that would otherwise be treated by the query parser as operators:

    \ ! ( ) : ^ [ ] { } ~ * ?
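
As a rough illustration of that pre-processing, here is a small Python sketch. It is not the DisMax source: the function names are hypothetical, and the policy of dropping every quotation mark when their count is odd is an assumption about what "removing unbalanced quotation marks" means.

```python
# Characters listed above that the query parser would treat as operators.
SPECIAL_CHARS = set('\\!():^[]{}~*?')


def strip_unbalanced_quotes(query: str) -> str:
    """Assumed policy: if the number of double quotes is odd, drop them all."""
    return query.replace('"', '') if query.count('"') % 2 else query


def escape_operators(query: str) -> str:
    """Put a backslash in front of every special character."""
    return ''.join('\\' + ch if ch in SPECIAL_CHARS else ch for ch in query)


def preprocess(query: str) -> str:
    return escape_operators(strip_unbalanced_quotes(query))


print(preprocess('Santa Clause: Is he Real (and if so, what is "real")?'))
print(preprocess('Radioh*'))
```

The second call shows why Radioh* finds nothing useful through DisMax: the asterisk reaches the parser escaped, as a literal character rather than a wildcard.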

I have a local version of DisMax which parameterizes the escaping so certain operators can be allowed through, which I'd be happy to contribute to you or the codebase, but I expect SimpleRH may be a better tool for your application than DisMaxRH, as long as you get it to score as you wish.

Both Standard and DisMax request handlers use SolrQueryParser, an extension of the Lucene query parser which introduces a small number of changes, one of which is that prefix queries e.g. Radioh* are evaluated with ConstantScorePrefixQuery rather than the standard PrefixQuery.

In issue SOLR-218 developers have been discussing per-field control of query parser options (some of it Solr's, some of it Lucene's).  When that is implemented there should additionally be a property useConstantScorePrefixQuery analogous to the unfortunately-named QueryParser useOldRangeQuery, but handled by SolrQueryParser (until CSPQs are implemented as an option in Lucene QP).

Until that time, well, Chris H. posted a clever and rather timely workaround on the solr-dev list:

>one workaround people may want to consider ... is to force the use of a WildcardQuery in what would otherwise be interpreted as a PrefixQuery by putting a "?" before the "*"
>
>ie: auto?* instead of auto*
>
>(yes, this does require that at least one character follow the prefix)

Perhaps that would help in your case?
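
If the query string is built by client code, the rewrite can be applied mechanically. The helper below is a hypothetical Python sketch (the name force_wildcard is not part of Solr); it only touches terms that are pure prefix queries and leaves everything else alone.

```python
def force_wildcard(term: str) -> str:
    """Rewrite a pure prefix query 'abc*' as 'abc?*'; return other terms unchanged."""
    stem = term[:-1]
    is_prefix_query = (
        term.endswith("*")
        and bool(stem)                            # a bare "*" is left alone
        and not any(ch in "*?" for ch in stem)    # already a wildcard query
    )
    return stem + "?*" if is_prefix_query else term


for term in ("auto*", "auto?*", "auto", "*"):
    print(term, "->", force_wildcard(term))
```

As the quoted note says, the rewritten form requires at least one character after the prefix, so auto?* no longer matches the bare term auto.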

- J.J.


Re: Wildcards / Binary searches

galo-2
In reply to this post by Yonik Seeley-2
Yeah, I thought of that solution, but this is a 20G index with each
document having around 300 of those numbers, so I was a bit worried about
the performance. I'll try anyway, thanks!



Re: Wildcards / Binary searches

galo-2
In reply to this post by galo-2
OK, further to my email below, I've been testing with q=radioh?*

The problem is that, when searching artists, even though Radiohead has a
big boost, entries with less boost such as "Radiohead+Ani Di Franco" or
"Radiohead+Michael Stipe" are returned before it.

The debug output is below, but basically, for Radiohead and one of the
others we get this:

radiohead+ani - 655391.5  * 0.046359334
radiohead     - 1150991.9 * 0.025442434

So it's fairly clear where the difference is. Looking at the numbers,
the cause seems to be in this line:

8.781371 = idf(docFreq=4096)

While Radiohead+Ani is getting

16.000769 = idf(docFreq=2)

If I can alter this I think I'm sorted. What are idf and docFreq?


   <str name="id=1200360,internal_docid=159496">
30383.514 = (MATCH) sum of:
   30383.514 = (MATCH) weight(text:radiohead+ani in 159496), product of:
     0.046359334 = queryWeight(text:radiohead+ani), product of:
       16.000769 = idf(docFreq=2)
       0.0028973192 = queryNorm
     655391.5 = (MATCH) fieldWeight(text:radiohead+ani in 159496), product of:
       1.0 = tf(termFreq(text:radiohead+ani)=1)
       16.000769 = idf(docFreq=2)
       40960.0 = fieldNorm(field=text, doc=159496)
</str>
   <str name="id=979,internal_docid=9799640">
29284.035 = (MATCH) sum of:
   29284.035 = (MATCH) weight(text:radiohead in 9799640), product of:
     0.025442434 = queryWeight(text:radiohead), product of:
       8.781371 = idf(docFreq=4096)
       0.0028973192 = queryNorm
     1150991.9 = (MATCH) fieldWeight(text:radiohead in 9799640), product of:
       1.0 = tf(termFreq(text:radiohead)=1)
       8.781371 = idf(docFreq=4096)
       131072.0 = fieldNorm(field=text, doc=9799640)
</str>

Thanks a lot,

galo


galo wrote:

> I was doing a different trick, basically searching q=radioh*+radioh~,
> and the results are slightly better than with ?*, but not great. By the
> way, the case sensitivity of wildcards has an effect here, of course.
>
> I'd like to have a look at that DisMax you have if you can post it, at
> least to compare results. The way I get scoring to work, as I say, is
> far from perfect.
>
> By the way, I'm seeing that highlighting disappears when using these
> wildcards; is that normal?
>
> Thanks for your help,
>
> galo
>



Re: Wildcards / Binary searches

Chris Hostetter-3
In reply to this post by jjlarrea

: I have a local version of DisMax which parameterizes the escaping so
: certain operators can be allowed through, which I'd be happy to
: contribute to you or the codebase, but I expect SimpleRH may be a better

That sounds like it would be a really useful patch if you'd be interested
in posting it to Jira.



-Hoss


Re: Wildcards / Binary searches

Chris Hostetter-3
In reply to this post by galo-2

Side Note: It's my opinion that "type ahead" or "auto complete" style
functionality is best addressed by customized logic (most likely using
specially built fields containing all of the prefixes of the key words up
to N characters as separate tokens).  simple uses of PrefixQueries are
only going to get you so far, particularly under heavy load or in an index
with a large number of unique terms.


: If I can alter this I think sorted.. what's idf and docFreq?

people who really want to get into the nitty gritty of scoring should
really familiarize themselves with the details of the Lucene scoring
mechanisms...

   http://lucene.apache.org/java/docs/scoring.html

(this is linked to from the question "How are documents scored" in the
SolrRelevancyFAQ .. any edits from users to improve this FAQ would be
greatly appreciated:  http://wiki.apache.org/solr/SolrRelevancyFAQ  )
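
As a short answer to the idf/docFreq question: docFreq is the number of documents that contain the term, and idf (inverse document frequency) grows as docFreq shrinks, so rarer terms weigh more. The Python sketch below simply re-multiplies the factors from the debug output posted earlier in the thread. The formula idf = 1 + ln(numDocs / (docFreq + 1)) is, to the best of my knowledge, the one in Lucene's default similarity of that period; numDocs does not appear in the output, so the value used here is back-solved from the idf(docFreq=2) line and should be read as an assumption.

```python
import math

QUERY_NORM = 0.0028973192  # queryNorm, copied from the debug output


def score(idf, field_norm, tf=1.0, query_norm=QUERY_NORM):
    """score = queryWeight * fieldWeight, as laid out in the explain tree."""
    query_weight = idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


def classic_idf(doc_freq, num_docs):
    """Assumed default-similarity formula: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


ani = score(idf=16.000769, field_norm=40960.0)        # text:radiohead+ani
radiohead = score(idf=8.781371, field_norm=131072.0)  # text:radiohead
print(ani, radiohead)

# Back-solve numDocs from idf(docFreq=2), then check it against docFreq=4096.
NUM_DOCS = round(3 * math.exp(16.000769 - 1.0))
print(NUM_DOCS, classic_idf(2, NUM_DOCS), classic_idf(4096, NUM_DOCS))
```

Note that idf enters the product twice, once in queryWeight and once in fieldWeight. The squared idf ratio between the two terms is about 3.32, slightly more than the 3.2 ratio between the two fieldNorm values (131072 / 40960), which is why the rarer "radiohead+ani" term narrowly outranks the more heavily boosted "radiohead" document.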

NOTE: in a "type ahead" style situation, you may actually want an IDF
function that's the inverse of typical search usages (which i guess would
make it just a "DF" function) since unique terms really aren't "better" in
this use case.
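
A toy Python sketch of that "DF" idea, outside of Lucene: candidate completions are ordered by how many documents contain them, most common first. The two document frequencies are the ones from the debug output earlier in the thread; the function name and the dictionary layout are assumptions made for illustration.

```python
# term -> number of documents containing it (values from the explain output)
doc_freq = {"radiohead": 4096, "radiohead+ani": 2}


def rank_by_df(prefix, doc_freq):
    """Return terms starting with prefix, most frequent first, ties alphabetical."""
    matches = [term for term in doc_freq if term.startswith(prefix)]
    return sorted(matches, key=lambda term: (-doc_freq[term], term))


print(rank_by_df("radioh", doc_freq))
```

With this ordering the plain "radiohead" entry comes first, which is the behaviour the type-ahead use case was after.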


-Hoss


Re: Wildcards / Binary searches

jjlarrea
In reply to this post by Chris Hostetter-3
Hi, Hoss.

I have a number of things I'd like to post... but the generally-useful stuff is unfortunately a bit interwoven with the special-case stuff, and I need to get out of breathing-down-my-back deadline mode to find the time to separate them, clean up and comment, make test cases, etc.  Hopefully next week I can post at least a modest contribution including this.

- J.J.

At 11:31 AM -0700 6/6/07, Chris Hostetter wrote:

>: I have a local version of DisMax which parameterizes the escaping so
>: certain operators can be allowed through, which I'd be happy to
>: contribute to you or the codebase, but I expect SimpleRH may be a better
>
>That sounds like it would be a really usefull patch if you be interested
>in posting it to Jira.
>
>
>
>-Hoss


Re: Wildcards / Binary searches

Frédéric Glorieux
In reply to this post by Chris Hostetter-3



Sorry to jump on a "Side note" of the thread, but the topic matches one
of my current needs.

> Side Note: It's my opinion that "type ahead" or "auto complete' style
> functionality is best addressed by customized logic (most likely using
> specially built fields containing all of the prefixes of the key words up
> to N characters as seperate tokens).  

Do you mean something like below ?
<field name="autocomplete">w wo wor word</field>

> simple uses of PrefixQueries are
> only going ot get you so far particularly under heavy load or in an index
> with a large number of unique terms.

For a bibliographic app built on Lucene, I implemented suggestions on
different fields (especially "subject" terms, like topic or place), to
populate a form with values already in use. I used the Lucene IndexReader
to get, very quickly, a list of terms in sorted order without duplicate
values.

<http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)>

This approach has a serious drawback: "The enumeration is ordered by
Term.compareTo()", so the sort order is raw character-code (ASCII-style)
order, with uppercase before lowercase. I had to patch Lucene's
Term.compareTo() for this project, which is definitely not good practice
for the portability of indexes. A duplicate field with an analyzer that
produces a sortable ASCII version would be better.
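
To make the ordering problem concrete, here is a small Python sketch (not Lucene code; the sample terms and the sort_key helper are invented for illustration). Sorting by raw character codes puts capitalized and accented terms in the wrong place, while sorting on a normalized key, the kind of value a duplicate "sortable" field could hold, gives the expected order.

```python
import unicodedata


def sort_key(term: str) -> str:
    """Lowercased, accent-stripped version of a term, used only for ordering."""
    decomposed = unicodedata.normalize("NFKD", term)
    without_marks = "".join(ch for ch in decomposed
                            if not unicodedata.combining(ch))
    return without_marks.casefold()


terms = ["Zola", "abbaye", "École", "archives", "Paris"]

print(sorted(terms))                # raw code-point order
print(sorted(terms, key=sort_key))  # order on the normalized key
```

Keeping the original term for display and the normalized key for ordering avoids patching the comparison itself.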

Opinions of the list on this topic would be welcome.

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique

Re: Wildcards / Binary searches

Chris Hostetter-3

: Do you mean something like below ?
: <field name="autocomplete">w wo wor word</field>

yeah, but there are some Tokenizers that make this trivial
(EdgeNGramTokenizer i think is the name)
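
For readers unfamiliar with the idea, this Python sketch shows the kind of tokens such a tokenizer would be expected to produce. It is not the Lucene class, and the function and parameter names are hypothetical.

```python
def edge_ngrams(word, min_len=1, max_len=10):
    """Return the prefixes of word from min_len up to max_len characters."""
    longest = min(len(word), max_len)
    return [word[:n] for n in range(min_len, longest + 1)]


def autocomplete_field(text, max_len=10):
    """Build a space-separated field value holding every prefix of every word."""
    return " ".join(gram
                    for word in text.lower().split()
                    for gram in edge_ngrams(word, max_len=max_len))


print(autocomplete_field("word"))
print(autocomplete_field("Radiohead", max_len=5))
```

With a field like this, a type-ahead lookup becomes an ordinary term query on the typed prefix instead of a PrefixQuery that has to expand over many terms at search time.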


: project, definitively not a good practice for portability of indexes. A
: duplicate field with an analyser to produce a sortable ASCII version
: would be better.

exactly ... I think conceptually the methodology for solving the problem
is very similar to the way the SpellChecker contrib works: use a very
custom index designed for the application (not just look at the terms in
the main corpus) and custom logic for using that index.



-Hoss


Re: Wildcards / Binary searches

Frédéric Glorieux
Hi Chris,

The skills on this list are really very stimulating. I'm sad to say I
will probably not be able to contribute. Solr may not be the chosen
technology for the project I'm working on, because of server
administration issues (Java). I know there is no performance argument
against it (Lucene is incredible, and Solr stays nicely close to it), but
that's real life. So I will not find time for the idea below.

 > : project, definitively not a good practice for portability of indexes. A
 > : duplicate field with an analyser to produce a sortable ASCII version
 > : would be better.
 >
 > exactly ... I think conceptually the methodology for solving the problem
 > is very similar to the way the SpellChecker contrib works: use a very
 > custom index designed for the application (not just look at the terms in
 > the main corpus) and custom logic for using that index.

Could it be a useful request handler? Given a field with a displayable
stored value and a sortable indexed one, you need the analyzer to parse
the user's entry, build a term from it, and very quickly get a pointer
into the internal Lucene index at exactly the right place for w, wo, wor
or word. From the iterator you can display a list of suggestions; it's
also possible to get one or more of the attached docs directly, for
example to display a count. It seems interesting for things like the
topic or the author of a doc.
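
The lookup described above can be sketched in plain Python, with a sorted list standing in for the index's term dictionary and a binary search standing in for the seek. The sample terms, the counts and the function name are invented for illustration; this is not a Solr request handler.

```python
from bisect import bisect_left

# Stand-in for the sorted term dictionary of one field, with document counts.
doc_counts = {"apple": 4, "wolf": 2, "word": 7, "work": 3, "world": 5, "zebra": 1}
sorted_terms = sorted(doc_counts)


def suggest(sorted_terms, prefix, limit=5):
    """Seek to the first term >= prefix, then walk forward while terms match."""
    suggestions = []
    position = bisect_left(sorted_terms, prefix)
    while position < len(sorted_terms) and len(suggestions) < limit:
        term = sorted_terms[position]
        if not term.startswith(prefix):
            break  # left the block of terms sharing the prefix
        suggestions.append(term)
        position += 1
    return suggestions


for typed in ("w", "wo", "wor", "word"):
    print(typed, [(term, doc_counts[term]) for term in suggest(sorted_terms, typed)])
```

Because the terms are sorted, all candidates for a prefix sit next to each other, so the cost is one seek plus a short scan regardless of how large the term list is.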

> : Do you mean something like below ?
> : <field name="autocomplete">w wo wor word</field>
>
> yeah, but there are some Tokenizers that make this trivial
> (EdgeNGramTokenizer i think is the name)




--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique

Re: Wildcards / Binary searches

Chris Hostetter-3

: It could be a useful request handler ? Giving a field, with a

perhaps, but as i said -- i think it requires more than just a special
request handler, you want a special index as well.

FYI: there is an ongoing thread on this general topic on the java-user
list, i didn't have the time/energy to follow it but the concepts
discussed there might prove interesting for you (most of the people
involved have spent a lot more time on problems like this than i have)...

http://www.nabble.com/How-to-implement-AJAX-search%7ELucene-Search-part--tf3887286.html




-Hoss


Re: Wildcards / Binary searches

Frédéric Glorieux
Chris Hostetter wrote:

> : It could be a useful request handler ? Giving a field, with a
>
> perhaps, but as i said -- i think it requires more then just a special
> request handler, you want a special index as well.
>
> FYI: there is an ongoing thread on this general topic on the java-user
> list, i didn't have the time/energy to follow it but the concepts
> discussed there might prove interesting for you (most of the people
> involved have spent a lot more time on problems like this then i have)...
>
> http://www.nabble.com/How-to-implement-AJAX-search%7ELucene-Search-part--tf3887286.html

Interesting. Here is my idea: "WildcardTermEnum (NOT query)"

<http://www.nabble.com/Re%3A-How-to-implement-AJAX-search%7ELucene-Search-part--p11027221.html>


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique