how to do auto-suggest w/ case-insensitive search and suggesting original mixed case field values

Leandro Hermida
Hi everyone,

New to forum and to Solr, doing my first major project with it and enjoying it so far, great software.

In my web application I want to set up auto-suggest-as-you-type functionality that searches case-insensitively yet returns the original-case terms.  It doesn't seem like TermsComponent can do this, as it can only return the lowercase indexed terms you are searching against, not the original-case terms.

There was one post on this forum, http://old.nabble.com/Auto-suggest..-how-to-do-mixed-case-td24106666.html#a24143981, where someone asked the same question, and the reply was:

There is no way to do this right now using TermsComponent. You can index
lower case terms and store the mixed case terms. Then you can use a prefix
query which will return documents (and hence stored field values).


So this got me started: I set out to use a regular Solr query instead of TermsComponent to try to do this.  I did the following as mentioned:

<fieldType name="test" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="test_lc" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="test" type="test" indexed="false" stored="true" multiValued="true" />
<field name="test_lc" type="test_lc" indexed="true"  stored="false" multiValued="true" />

And used copyField to populate the test_lc field:

<copyField source="test" dest="test_lc"/>

This is the easy part (the forum user didn't explain the hard part!). It is very hard to get the same information that TermsComponent returns using the regular Solr query functionality!  For example:

http://localhost:8983/solr/terms?terms.fl=test_lc&terms.prefix=a&terms.sort=count&terms.limit=5&omitHeader=true

<lst name="terms">
  <lst name="test_lc">
    <int name="a-kinase anchor protein 13">15</int>
    <int name="accn5">6</int>
    <int name="actin-binding">3</int>
    <int name="activator">1</int>
    <int name="agie-bp1">1</int>
  </lst>
</lst>

which provides useful sorting by and returning of term frequency counts in your index.  So I then ran the following regular prefix query with faceting:

http://localhost:8983/solr/select?q=test_lc%3Aa*&facet=true&facet.field=test_lc&facet.prefix=a&facet.sort=count&facet.limit=5&omitHeader=true&rows=0

And you get the same thing as what TermsComponent returns (inside a slightly different structure):

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="test_lc">
      <int name="a-kinase anchor protein 13">15</int>
      <int name="accn5">6</int>
      <int name="actin-binding">3</int>
      <int name="activator">1</int>
      <int name="agie-bp1">1</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>

I then added fl=test and rows=5 to the query above, and you get docs back as well:

<doc>
  <arr name="test">
    <str>3D-structure</str>
    <str>acetylation</str>
    <str>alternative promoter usage</str>
    <str>HLC-7</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>alternative splicing</str>
    <str>complete proteome</str>
    <str>DNA-binding</str>
    <str>RACK1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>acetylation</str>
    <str>AIG21</str>
    <str>WD repeat</str>
    <str>GNB2L1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>3D-structure</str>
    <str>apoptosis</str>
    <str>cathepsin G-like 1</str>
    <str>ATSGL1</str>
    <str>CTLA-1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>autoantigen Ge-1</str>
    <str>autoantigen RCD-8</str>
    <str>HERV-H LTR-associating protein 3</str>
    <str>HHLA3</str>
  </arr>
</doc>
</result>

How do I even know how many rows to return in order to be able to extract from the stored values the original case terms for each of the ones listed in lower case in the facet counts?
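
For illustration, the client-side step described above can be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name, the dict-shaped docs and the sample values are hypothetical, and Python's lower() is assumed to behave like the lowercase filter for these values. It does not answer the question of how many rows to request; it only shows which lowercase facet terms are still unresolved after scanning the docs that did come back.

```python
def resolve_original_case(facet_terms, docs, field="test"):
    """Map each lowercase facet term to the original-case stored values seen.

    facet_terms: lowercase terms taken from the facet counts.
    docs: parsed result docs, each a dict of field name -> list of stored values
          (a hypothetical shape, not a Solr client API).
    Returns (resolved, missing): a dict of term -> set of original values,
    and the set of facet terms not found in any returned doc.
    """
    wanted = set(facet_terms)
    resolved = {}
    for doc in docs:
        for value in doc.get(field, []):
            key = value.lower()
            if key in wanted:
                resolved.setdefault(key, set()).add(value)
    # Terms that never showed up in the returned rows stay unresolved.
    missing = wanted.difference(resolved)
    return resolved, missing


# Made-up sample data, only to show the shape of the result.
sample_docs = [
    {"test": ["ACCN5", "WD repeat"]},
    {"test": ["Actin-binding"]},
]
resolved, missing = resolve_original_case(
    ["accn5", "actin-binding", "activator"], sample_docs
)
```

With the made-up data above, "activator" ends up in missing, which is exactly the difficulty: there is no obvious way to know in advance how many rows are needed before every facet term is covered.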


Sorry for the long message, just wanted to fully explain, thanks for any help!

leandro

Re: how to do auto-suggest case-insensitive match and return original case field values

Leandro Hermida
Hi again,

Just pinging again to any Solr experts out there... sorry that my previous message was a bit long (wanted to fully explain what I've already done and where the exact difficulty arises)... but to summarize:

Does anyone know how to use Solr querying with faceting to do an auto-suggest that searches case-insensitively yet returns the original mixed case values?

thanks for any help,
Leandro

Re: how to do auto-suggest case-insensitive match and return original case field values

hossman
In reply to this post by Leandro Hermida

: In my web application I want to set up auto-suggest as you type
: functionality which will search case-insensitively yet return the original
: case terms.  It doesn't seem like TermsComponent can do this as it can only
: return the lowercase indexed terms you are searching against, not the
        ...
: which provides useful sorting by and returning of term frequency counts in
: your index.  How does one get this same information with regular Solr Query?
: I set up the following prefix query, searching by the indexed lowercased
: field and returning the other:

The type of approach you are describing (doing a prefix based query for
autosuggest) probably won't work very well unless your index is 100%
designed just for the autosuggest ... if it's an index about products, and
you're just using one of the fields for autosuggest, you aren't going to
get good autosuggest results because the same word is going to appear in
multiple products.  what you need is an index of *words* that you want to
autosuggest, with fields indicating how important those words are that you
can use in a function query (this replaces the term freq that
TermsComponent would use)

the fact that your "test" field is multivalued and stores wildly different
things in each doc is an example of what i mean.

Have you considered the possibility of just indexing the lowercase value
concatenated with the regular case value using a special delimiter, and
then returning to your TermsComponent based solution?  index "PowerPoint"
as "powerpoint|PowerPoint" and just split on the "|" character when you
get the data back from your prefix based term lookup.
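
A minimal sketch of that delimiter idea in Python, for illustration only. The function names are hypothetical, and it assumes the "|" character never occurs in a field value and that the field is indexed as a single token without a lowercase filter (otherwise the original-case half would be lowercased too).

```python
DELIMITER = "|"  # assumed never to occur inside a real field value


def to_indexed_term(original: str) -> str:
    """Build the value to index: lowercase form, delimiter, original form."""
    if DELIMITER in original:
        raise ValueError(f"value contains the delimiter {DELIMITER!r}: {original!r}")
    return f"{original.lower()}{DELIMITER}{original}"


def to_prefix(user_input: str) -> str:
    """Lowercase what the user typed so it matches the lowercase half."""
    return user_input.lower()


def to_suggestion(indexed_term: str) -> str:
    """Recover the original-case value from a term returned by the lookup."""
    # Split at the first delimiter; everything after it is the original value.
    _, _, original = indexed_term.partition(DELIMITER)
    return original
```

At index time each value goes through to_indexed_term, at query time the typed text goes through to_prefix before being used as the terms.prefix value, and each returned term goes through to_suggestion before being shown to the user.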


-Hoss


Re: how to do auto-suggest case-insensitive match and return original case field values

Uri Boness
Just updated SOLR-1625 to support regexp hints.

https://issues.apache.org/jira/browse/SOLR-1625

Cheers,
Uri


Re: how to do auto-suggest case-insensitive match and return original case field values

Leandro Hermida
In reply to this post by hossman
Hello,

Thanks for the reply (see below)

hossman wrote
The type of approach you are describing (doing a prefix based query for
autosuggest) probably won't work very well unless your index is 100%
designed just for the autosuggest ... if it's an index about products, and
you're just using one of the fields for autosuggest, you aren't going to
get good autosuggest results because the same word is going to appear in
multiple products.  what you need is an index of *words* that you want to
autosuggest, with fields indicating how important those words are that you
can use in a function query (this replaces the term freq that
TermsComponent would use)

the fact that your "test" field is multivalued and stores wildly different
things in each doc is an example of what i mean.
I am using Solr to index biological annotations about proteins (which are my documents). There is no tokenization or special analysis of the annotation text strings, as they are not free text; each annotation is a single token.  Also, for the purpose of my auto-suggest and searching there are actually no different types of annotations, which is why they all go into the same multivalued field for each protein document.  I want to use the auto-suggest and search to help biologists (who know the annotation terminology) find all the protein documents with the annotation they are thinking of, and to suggest what is available as they type.  The thing is that in my field letter case can be important in defining the meaning of an annotation, but the biologist might not remember the exact case.  Therefore I want them to be able to type in whatever case, and the auto-suggest will pull up annotations with the correct case as they type to assist them.

Let's just take the fundamental question, independent of any example:  is it possible to do a case-insensitive prefix search using faceting (to get the term suggestions) that also returns the original mixed case terms for *all* the terms listed in lowercase in the facet list?  In the only other post I saw in this forum on this topic, a user seemed to think this was easily doable, but I don't think they actually tried it, because the faceted search doesn't seem possible; you run into all these problems.  It just doesn't seem to be something Solr/Lucene can do the way it is organized.

hossman wrote
Have you considered the possibility of just indexing the lowercase value
concatenated with the regular case value using a special delimiter, and
then returning to your TermsComponent based solution?  index "PowerPoint"
as "powerpoint|PowerPoint" and just split on the "|" character when you
get the data back from your prefix based term lookup.
I think this is a good workaround, will definitely try it!

leandro

Re: how to do auto-suggest case-insensitive match and return original case field values

Leandro Hermida
In reply to this post by Uri Boness
Uri Boness wrote
Just updated SOLR-1625 to support regexp hints.

https://issues.apache.org/jira/browse/SOLR-1625

Cheers,
Uri
This is perfect, exactly what is needed to make this functionality possible.  Is the patch already in trunk?

thanks,
leandro

Re: how to do auto-suggest case-insensitive match and return original case field values

Leandro Hermida
Hello,

Watched the JIRA issue and saw that it got committed recently.  Just tested it and it works *perfectly*, thanks Uri for adding such a nice feature to Solr!

For other users out there who want to do this:

1. Download the latest nightly build of Solr 1.5-dev at http://people.apache.org/builds/lucene/solr/nightly/
2. For the index field you were using to do terms auto-suggest, rebuild it without using LowerCaseFilterFactory so that it indexes the original mixed case terms
3. In your terms HTTP GET URL, replace terms.prefix=abc (where abc is actually what the user is typing in) with

terms.regex=%5Eabc.%2A&terms.regex.flag=case_insensitive

where %5E = ^ and %2A = *

Voila!
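
For anyone building that URL in code, here is a rough Python sketch of step 3. The function name, the base URL and the field name "test" are placeholders for illustration, not anything Solr prescribes. It also escapes regex metacharacters in what the user typed, on the assumption that backslash-escaped punctuation is read as a literal character by the regex engine on the Solr side.

```python
import re
from urllib.parse import urlencode


def terms_regex_url(base_url: str, field: str, user_input: str, limit: int = 5) -> str:
    """Build a terms request that matches user_input as a case-insensitive prefix."""
    # Escape regex metacharacters so input such as "a.b" is matched literally.
    pattern = "^" + re.escape(user_input) + ".*"
    params = {
        "terms.fl": field,
        "terms.regex": pattern,
        "terms.regex.flag": "case_insensitive",
        "terms.sort": "count",
        "terms.limit": limit,
        "omitHeader": "true",
    }
    # urlencode percent-encodes the pattern (^ becomes %5E, * becomes %2A).
    return base_url + "?" + urlencode(params)


url = terms_regex_url("http://localhost:8983/solr/terms", "test", "abc")
```

For the input abc this produces the same terms.regex=%5Eabc.%2A&terms.regex.flag=case_insensitive fragment shown above.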
