leading and trailing wildcard query

leading and trailing wildcard query

A. Steven Anderson
I've scoured the archives and JIRA, but the answer to my question is just
not clear to me.

With all the new Solr 1.4 features, is there any way to do a leading and
trailing wildcard query on an *untokenized* field?

e.g. q=myfield:*abc* would return a doc with myfield=xxxabcxxx

Yes, I know how expensive such a query would be, but we have the user
requirement, nonetheless.

If not, any suggestions on how to implement a custom solution using Solr?
Using an external data structure?

--
A. Steven Anderson

Re: leading and trailing wildcard query

A. Steven Anderson
No thoughts on this? Really!?

I would hate to admit to my Oracle DBE that Solr can't be customized to do a
common query that a relational database can do. :-(


On Wed, Nov 4, 2009 at 6:01 PM, A. Steven Anderson <
[hidden email]> wrote:

> I've scoured the archives and JIRA , but the answer to my question is just
> not clear to me.
>
> With all the new Solr 1.4 features, is there any way  to do a leading and
> trailing wildcard query on an *untokenized* field?
>
> e.g. q=myfield:*abc* would return a doc with myfield=xxxabcxxx
>
> Yes, I know how expensive such a query would be, but we have the user
> requirement, nonetheless.
>
> If not, any suggestions on how to implement a custom solution using Solr?
> Using an external data structure?
>
>
--
A. Steven Anderson

Re: leading and trailing wildcard query

Otis Gospodnetic-2
The guilt trick is not the best thing to try on public mailing lists. :)

The first thing that popped to my mind is to use 2 fields, where the second one contains the desrever string of the first one.
The second idea is to use n-grams (if it's OK to tokenize), more specifically edge n-grams.
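
A minimal sketch of the reversed-field idea, in plain Python rather than Solr: storing each value reversed in a second field turns a leading-wildcard query into an ordinary prefix match. The document values and the helper name `leading_wildcard` are hypothetical, chosen only for illustration.

```python
# Hypothetical field values; not taken from the thread.
docs = {1: "xxxabc", 2: "abcxxx", 3: "xxxabcxxx"}

# The "second field": the same value, stored reversed.
reversed_field = {doc_id: value[::-1] for doc_id, value in docs.items()}

def leading_wildcard(suffix):
    """Answer q=myfield:*<suffix> as a prefix match on the reversed field."""
    prefix = suffix[::-1]
    return sorted(d for d, rev in reversed_field.items() if rev.startswith(prefix))

print(leading_wildcard("abc"))  # [1]
```

Document 3 (xxxabcxxx) is not returned: reversal alone covers the leading wildcard, not a pattern with wildcards on both ends.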

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----

> From: A. Steven Anderson <[hidden email]>
> To: [hidden email]
> Sent: Thu, November 5, 2009 3:04:32 PM
> Subject: Re: leading and trailing wildcard query
>
> No thoughts on this? Really!?
>
> I would hate to admit to my Oracle DBE that Solr can't be customized to do a
> common query that a relational database can do. :-(
>
> --
> A. Steven Anderson


Re: leading and trailing wildcard query

A. Steven Anderson
>
> The guilt trick is not the best thing to try on public mailing lists. :)
>

Point taken, although that was not my intention.  I guess I have been spoiled
by quick replies and was starting to think it was a stupid question.

Plus, I'm literally gonna get trash talk from my Oracle DBE if I can't make
this work. ;-)

We've basically relegated Oracle to handling ingest; from there we index into
Solr, which provides all search features.  I'd hate to have to succumb to using
Oracle to service this one special query.


> The first thing that popped to my mind is to use 2 fields, where the second
> one contains the desrever string of the first one.
>

Please elaborate. What do you mean by *desrever* string?


> The second idea is to use n-grams (if it's OK to tokenize), more
> specifically edge n-grams.
>

Well, that's the problem.  The field may contain non-Latin characters,
possibly with neither whitespace nor punctuation.


--
A. Steven Anderson

RE: leading and trailing wildcard query

bernieh
In reply to this post by Otis Gospodnetic-2
I've just set up something similar (many thanks to Avesh!):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="5" maxGramSize="25" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>

<fieldType name="doubleedgytext" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.NGramFilterFactory" minGramSize="5" maxGramSize="25" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>
.
.
   <field name="beginswith" type="edgytext" indexed="true" stored="false" multiValued="true"/>
   <field name="contains" type="doubleedgytext" indexed="true" stored="false" multiValued="true"/>
.
.
   <!-- Copy for BEGINSWITH search -->
   <copyField source="content" dest="beginswith"/>
   <copyField source="*_t" dest="beginswith"/>
   <copyField source="*_mt" dest="beginswith"/>
   
   <!-- Copy for CONTAINS search -->
   <copyField source="content" dest="contains"/>
   <copyField source="*_t" dest="contains"/>
   <copyField source="*_mt" dest="contains"/>
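
To make the effect of the two field types concrete, here is a rough Python model of which terms end up indexed. It is only an approximation of the analyzers above (keyword tokenizer, lowercasing, then gram generation); the function names are made up for illustration. Under this model, with minGramSize="5" a three-character query such as abc would find nothing, so minGramSize would presumably need lowering to match the shortest query you want to support.

```python
def ngrams(text, min_size, max_size):
    """All substrings of length min_size..max_size (roughly the NGram filter's output)."""
    text = text.lower()  # stands in for LowerCaseFilterFactory
    return {text[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(text) - n + 1)}

def edge_ngrams(text, min_size, max_size):
    """Prefixes only (front-edge grams, roughly the EdgeNGram filter's output)."""
    text = text.lower()
    return {text[:n] for n in range(min_size, min(max_size, len(text)) + 1)}

value = "xxxabcxxx"
print("abc" in ngrams(value, 5, 25))      # False: 3 characters is below minGramSize
print("abc" in ngrams(value, 3, 25))      # True
print("xabcx" in ngrams(value, 5, 25))    # True
print(sorted(edge_ngrams(value, 5, 25)))  # ['xxxab', 'xxxabc', 'xxxabcx', 'xxxabcxx', 'xxxabcxxx']
```

Because the query analyzer does no gram generation, the whole query string has to equal one indexed gram, which is why the gram sizes bound the query lengths that can match.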

bern

-----Original Message-----
From: Otis Gospodnetic [mailto:[hidden email]]
Sent: Friday, 6 November 2009 9:13 AM
To: [hidden email]
Subject: Re: leading and trailing wildcard query

The guilt trick is not the best thing to try on public mailing lists. :)

The first thing that popped to my mind is to use 2 fields, where the second one contains the desrever string of the first one.
The second idea is to use n-grams (if it's OK to tokenize), more specifically edge n-grams.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR




Re: leading and trailing wildcard query

Walter Underwood
In reply to this post by Otis Gospodnetic-2
Doesn't it work to call SolrQueryParser.setAllowLeadingWildcard?

It can be really slow, what an RDBMS person would call a full table  
scan.

There is an open bug to make that settable in a config file, but this  
is a pretty tiny change to the source.

    http://issues.apache.org/jira/browse/SOLR-218

wunder


Re: leading and trailing wildcard query

A. Steven Anderson
In reply to this post by bernieh
Thanks for the solution, but could you elaborate on how it would find
something like *abc* in a field that contains xxxxabcxxxx.

Steve

On Thu, Nov 5, 2009 at 5:25 PM, Bernadette Houghton <
[hidden email]> wrote:

> I've just set up something similar (much thanks to Avesh!)-
>
> <fieldType name="edgytext" class="solr.TextField"
> positionIncrementGap="100">
>  <analyzer type="index">
>   <tokenizer class="solr.KeywordTokenizerFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.EdgeNGramFilterFactory" minGramSize="5"
> maxGramSize="25" />
>  </analyzer>
>  <analyzer type="query">
>   <tokenizer class="solr.KeywordTokenizerFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>  </analyzer>
> </fieldType>
>
> <fieldType name="doubleedgytext" class="solr.TextField"
> positionIncrementGap="100">
>  <analyzer type="index">
>   <tokenizer class="solr.KeywordTokenizerFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.NGramFilterFactory" minGramSize="5" maxGramSize="25"
> />
>  </analyzer>
>  <analyzer type="query">
>   <tokenizer class="solr.KeywordTokenizerFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>  </analyzer>
> </fieldType>
> .
> .
>   <field name="beginswith" type="edgytext" indexed="true" stored="false"
> multiValued="true"/>
>   <field name="contains" type="doubleedgytext" indexed="true"
> stored="false" multiValued="true"/>
> .
> .
>   <!-- Copy for BEGINSWITH search -->
>   <copyField source="content" dest="beginswith"/>
>   <copyField source="*_t" dest="beginswith"/>
>   <copyField source="*_mt" dest="beginswith"/>
>
>   <!-- Copy for CONTAINS search -->
>   <copyField source="content" dest="contains"/>
>   <copyField source="*_t" dest="contains"/>
>   <copyField source="*_mt" dest="contains"/>
>
> bern

Re: leading and trailing wildcard query

Erick Erickson
Because that is the semantics of Solr/Lucene wildcard syntax. * stands for
"any number of any character". Basically, it enumerates all the terms in the
field for all the documents and assembles a list of all of them that contain
the substring "abc" and uses that as one of the clauses of your search...
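
The enumeration described above can be sketched in a few lines of Python. This is a toy model, not Lucene code: the term dictionary is hypothetical, and `fnmatchcase` from the standard library stands in for the wildcard matcher.

```python
from fnmatch import fnmatchcase

# Hypothetical term dictionary for one untokenized field: term -> doc ids.
term_dict = {"xxxabcxxx": [1], "abcdef": [2], "zzz": [3], "xabx": [4]}

def expand_wildcard(pattern):
    """Scan every term in the field and keep those matching the pattern."""
    return sorted(t for t in term_dict if fnmatchcase(t, pattern))

def search(pattern):
    """Union the postings of all matching terms, like an OR over the expanded terms."""
    docs = set()
    for term in expand_wildcard(pattern):
        docs.update(term_dict[term])
    return sorted(docs)

print(expand_wildcard("*abc*"))  # ['abcdef', 'xxxabcxxx']
print(search("*abc*"))           # [1, 2]
```

Every term is visited regardless of how many match, which is the "full table scan" cost mentioned elsewhere in the thread.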

Best
Erick

On Thu, Nov 5, 2009 at 6:07 PM, A. Steven Anderson <
[hidden email]> wrote:

> Thanks for the solution, but could you elaborate on how it would find
> something like *abc* in a field that contains xxxxabcxxxx.
>
> Steve

Re: leading and trailing wildcard query

A. Steven Anderson
In reply to this post by Walter Underwood
> Doesn't it work to call SolrQueryParser.setAllowLeadingWildcard?


Good question.  Anyone?


> It can be really slow, what an RDBMS person would call a full table scan.


Understood.


> There is an open bug to make that settable in a config file, but this is a
> pretty tiny change to the source.
>   http://issues.apache.org/jira/browse/SOLR-218
>

Unfortunately, we can only use official releases (not even snapshots) since
it's a government-related project.

--
A. Steven Anderson

RE: leading and trailing wildcard query

bernieh
In reply to this post by A. Steven Anderson
Hi Steve, a query such as *abc* would need the NGramFilterFactory, hence the doubleedgytext type; the document would then be retrievable by a query such as contains:abc. Note that you can set the minimum and maximum size of the strings that get indexed.

bern


Re: leading and trailing wildcard query

Walter Underwood
In reply to this post by A. Steven Anderson
Ah. With that restriction, it is impossible.

If it is OK to pay Lucid to make a one-line change, you might be able  
to do it. Otherwise, get ready to spend a lot of money for a search  
engine.

wunder

On Nov 5, 2009, at 3:18 PM, A. Steven Anderson wrote:

> Unfortunately, we can only use official releases (not even  
> snapshots) since
> it's a government-related project.
>
> --
> A. Steven Anderson


Re: leading and trailing wildcard query

A. Steven Anderson
In reply to this post by bernieh
> Hi Steve, a query such as *abc* would need the NGramFilterFactory, hence the
> doubleedgytext, and would be retrievable by a query such as contains:abc.
> Note that you can set the max and minimum size of strings that get indexed.
>

Excellent!  Just to clarify though, NGramFilterFactory is a Solr 1.4 feature
only, correct?

--
A. Steven Anderson

Re: leading and trailing wildcard query

Walter Underwood
In reply to this post by bernieh
Note that N-grams are limited to specific string lengths. I presume  
that you need to search for arbitrary strings, not just three-letter  
ones.
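
As a back-of-the-envelope illustration of that trade-off, the number of grams indexed per value can be counted directly. The helper below is a hypothetical sketch, not part of Solr, and counts positions rather than distinct grams.

```python
def gram_count(length, min_size, max_size):
    """Number of n-grams an NGram-style filter emits for one value of `length` characters."""
    return sum(length - n + 1 for n in range(min_size, min(max_size, length) + 1))

print(gram_count(9, 5, 25))   # 15
print(gram_count(9, 1, 25))   # 45
print(gram_count(25, 1, 25))  # 325
```

Widening the gram range to cover more query lengths grows the index roughly quadratically in the value length.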

wunder

On Nov 5, 2009, at 3:23 PM, Bernadette Houghton wrote:

> Hi Steve, a query such as *abc* would need the NGramFilterFactory,
> hence the doubleedgytext, and would be retrievable by a query such
> as contains:abc. Note that you can set the max and minimum size of
> strings that get indexed.
>
> bern

Re: leading and trailing wildcard query

A. Steven Anderson
In reply to this post by Walter Underwood
> Ah. With that restriction, it is impossible.
> If it is OK to pay Lucid to make a one-line change, you might be able to do
> it. Otherwise, get ready to spend a lot of money for a search engine.
>

Well, now that Lucid is getting In-Q-Tel $$$, they will soon learn that
official releases are all that matter, and that 12-18 month release cycles
are not acceptable. ;-)

--
A. Steven Anderson

Re: leading and trailing wildcard query

A. Steven Anderson
In reply to this post by Walter Underwood
> Note that N-grams are limited to specific string lengths. I presume that
> you need to search for arbitrary strings, not just three-letter ones.
>

Understood, but that is a limitation that we can live with.

Thanks!
--
A. Steven Anderson

RE: leading and trailing wildcard query

bernieh
In reply to this post by A. Steven Anderson
Not sure what version it was supported from, but we're on 1.3.
bern

-----Original Message-----
From: A. Steven Anderson [mailto:[hidden email]]
Sent: Friday, 6 November 2009 10:25 AM
To: [hidden email]
Subject: Re: leading and trailing wildcard query

> Hi Steve, a query such as *abc* would need the NGramFilterFactory, hence the
> doubleedgytext, and would be retrievable by a query such as contains:abc.
> Note that you can set the max and minimum size of strings that get indexed.
>

Excellent!  Just to clarify though, NGramFilterFactory is a Solr 1.4 feature
only, correct?

--
A. Steven Anderson

Re: leading and trailing wildcard query

A. Steven Anderson
> Not sure what version it was supported from, but we're on 1.3.


Really!? Great answer!

Thanks!
--
A. Steven Anderson

Re: leading and trailing wildcard query

Andrzej Białecki-2
In reply to this post by A. Steven Anderson
You can use ReversedWildcardFilterFactory, which creates additional tokens
(in your case, a single additional token :) ) that are reversed, _and_
also triggers setAllowLeadingWildcard in the QueryParser. It won't
help much with performance though, due to the trailing wildcard in
your original query. Please see the discussion in SOLR-1321 (this will
be available in 1.4, but it should be easy to patch 1.3 to use it).

If you really need to support such queries efficiently, you should
implement full permuterm indexing, i.e. a token filter that rotates
tokens and adds all rotations (with a special marker for the
beginning of the word), and a query plugin that detects such query terms
and rotates the query term appropriately.
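
A minimal sketch of the permuterm idea in plain Python, to show why rotation turns every single-wildcard pattern (and the *X* case) into a prefix lookup. None of this is Solr API: the `$` marker, the function names, and the sample terms are assumptions for illustration, and the sketch only handles patterns with one `*` or the form *X*.

```python
from bisect import bisect_left

END = "$"  # word-boundary marker; assumed never to occur in the data

def rotations(term):
    """All rotations of term + END, e.g. abc$ -> abc$, bc$a, c$ab, $abc."""
    marked = term + END
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_index(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, t) for t in terms for rot in rotations(t))

def rotate_query(pattern):
    """Turn a wildcard pattern into a prefix to look up among the rotations."""
    if pattern.startswith("*") and pattern.endswith("*") and len(pattern) > 2:
        return pattern[1:-1]          # *X*  ->  X
    head, _, tail = pattern.partition("*")
    return tail + END + head          # X*Y  ->  Y$X

def search(index, pattern):
    prefix = rotate_query(pattern)
    i = bisect_left(index, (prefix,))  # first rotation >= prefix
    hits = set()
    while i < len(index) and index[i][0].startswith(prefix):
        hits.add(index[i][1])
        i += 1
    return sorted(hits)

index = build_index(["xxxabcxxx", "abcdef", "zzz", "xyzabc"])
print(search(index, "*abc*"))  # ['abcdef', 'xxxabcxxx', 'xyzabc']
print(search(index, "abc*"))   # ['abcdef']
print(search(index, "*abc"))   # ['xyzabc']
```

The price is index size: each term of length n contributes n + 1 rotations.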


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: leading and trailing wildcard query

Otis Gospodnetic-2
In reply to this post by A. Steven Anderson
> Please elaborate. What do you mean by *desrever* string?

Try reading in reverse ;).

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----

> From: A. Steven Anderson <[hidden email]>
> To: [hidden email]
> Sent: Thu, November 5, 2009 5:23:48 PM
> Subject: Re: leading and trailing wildcard query
>
> >
> > The guilt trick is not the best thing to try on public mailing lists. :)
> >
>
> Point taken, although not my intention.  I guess I have been spoiled by
> quick replies and was getting to think it was a stupid question.
>
> Plus, I'm literally gonna get trash talk from my Oracle DBE if I can't make
> this work. ;-)
>
> We've basically relegated Oracle to handling ingest from which we index Solr
> and provide all search features.  I'd hate to have to succumb to using
> Oracle to service this one special query.
>
>
> > The first thing that popped to my mind is to use 2 fields, where the second
> > one contains the desrever string of the first one.
> >
>
> Please elaborate. What do you mean by *desrever* string?
>
>
> > The second idea is to use n-grams (if it's OK to tokenize), more
> > specifically edge n-grams.
> >
>
> Well, that's the problem.  The field may have non-Latin characters that may
> not have whitespace nor punctuation.
>
>
> --
> A. Steven Anderson


Re: leading and trailing wildcard query

Chantal Ackermann
In reply to this post by Andrzej Białecki-2
Just for the record - this works like a charm:

.../select?q=*potter*&qt=dismax

<response>

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">93</int>

<lst name="params">
<str name="q">*potter*</str>
<str name="qt">dismax</str>
</lst>
</lst>

<result name="response" numFound="572" start="0" maxScore="5.3375173">
...
<str name="title">L'année où on a découvert «Harry Potter» au cinéma</str>
...

<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf"> all_text_de^0.5 all_text_en^0.5 all_text_es^0.5
all_text_fr^0.5 all_text_it^0.5 all_text_nl^0.5 all_text_nolang^0.5
channel_name_tokens^1.0 role_tokens^1.0 participant_tokens^1.0</str>
    <str name="pf"> title_de^2 title_en^2 title_es^2 title_fr^2
title_it^2 title_nl^2 title_nolang^2 channel_name_tokens^2 role_tokens^2
participant_tokens^2</str>
    <str name="fl"> *,score </str>
    <str name="mm"> 2&lt;-1 5&lt;80%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>

And the funny thing: ReversedWildcardFilterFactory is still commented
out (I had forgotten that I never reactivated it). And NGram was never part
of my schema.

Happy user of 1.4RC - I'm sure our milestones won't beat the SOLR 1.4
release date.

Cheers,
Chantal