Ngram autocompleter and term frequency boosting

climbingrose
Hi guys,

I'm trying to build an n-gram-based autocompleter that takes term frequency
into account.

Let's say I have the following documents:

D1: title => "Java Developer"
D2: title => "Java Programmer"
D3: title => "Java Developer"

When the user types in "Java", I want to display

1. "Java Developer"
2. "Java Programmer"

Basically, "Java Developer" should rank first because it appears twice in the
index, while "Java Programmer" appears only once. Is this possible?

I'm using the following config for the "title" field:

    <fieldType name="text_pre" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
                maxGramSize="25" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Thanks

Re: Ngram autocompleter and term frequency boosting

Otis Gospodnetic-2
Cuong,

If you already know, when you index your autocomplete (AC) suggestions, that "Java Developer" appears twice in the index, why not give it an appropriate index-time boost? Wouldn't that work for you?
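
A minimal sketch of what such an index-time boost could look like, assuming a separate suggestions index with one document per distinct title and Solr's XML update format as it existed in the releases of this era (later releases dropped index-time boosts). The `id` values and the boost of 2.0 are illustrative assumptions, not taken from the thread. The boost is folded into the field norm, so it only has an effect when the field keeps norms, as the posted config does with omitNorms="false".

```xml
<!-- Hypothetical update message posted to the /update handler. -->
<!-- Assumes one suggestion document per distinct title.        -->
<add>
  <!-- Seen twice in the source data, so boosted above the default of 1.0 -->
  <doc boost="2.0">
    <field name="id">java-developer</field>
    <field name="title">Java Developer</field>
  </doc>
  <!-- Seen once: default boost -->
  <doc>
    <field name="id">java-programmer</field>
    <field name="title">Java Programmer</field>
  </doc>
</add>
```

This approach requires the counts to be known when the documents are sent to Solr, and a changed count means re-indexing the affected suggestion.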


Otis

----- Original Message -----

> From: Cuong Hoang <[hidden email]>
> To: [hidden email]
> Cc:
> Sent: Thursday, January 19, 2012 12:01 AM
> Subject: Ngram autocompleter and term frequency boosting

Re: Ngram autocompleter and term frequency boosting

Andrew Harvey
With Solr 4.0 you could use relevance functions to apply a query-time boost if you don't have the information at index time.

Alternatively, you could do term-facet-based autocomplete, which would let you sort suggestions by count rather than by any other input.
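
A rough sketch of the facet-based variant, under stated assumptions: the field type `text_facet`, the field `title_facet`, and the handler name `/autocomplete` are all hypothetical names, not from the thread. Faceting counts indexed terms, so the facet field keeps the whole lowercased title as a single term and omits the edge n-gram filter; otherwise every prefix gram would show up as its own facet value.

```xml
<!-- schema.xml (sketch): whole-title, lowercased copy used only for faceting -->
<fieldType name="text_facet" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_facet" type="text_facet" indexed="true" stored="false"/>
<copyField source="title" dest="title_facet"/>

<!-- solrconfig.xml (sketch): handler that returns only facet counts -->
<requestHandler name="/autocomplete" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>
    <int name="rows">0</int>
    <bool name="facet">true</bool>
    <str name="facet.field">title_facet</str>
    <!-- Order suggestions by how many documents carry each title -->
    <str name="facet.sort">count</str>
    <int name="facet.limit">10</int>
    <int name="facet.mincount">1</int>
  </lst>
</requestHandler>
```

The client would then pass what the user has typed as `facet.prefix`, for example `/autocomplete?facet.prefix=java`. The prefix is matched against the indexed terms without analysis, so with this field type it has to be lowercased by the caller, and the suggestions come back lowercased as well.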

Andrew

On 20/01/2012, at 15:45, Otis Gospodnetic <[hidden email]> wrote:

> Cuong,
>
> If when you are indexing your AC suggestions you know "Java Developer" appears twice in the index, why not give it appropriate index-time boost?  Wouldn't that work for you?
>
>
> Otis

Re: Ngram autocompleter and term frequency boosting

climbingrose
Thanks for your replies. I can't apply an index-time boost because I don't
know the term frequencies in advance. Additionally, new documents come in
every few minutes, which makes maintaining term frequencies outside Solr a
difficult task.

Facet prefix would probably help in this case. I had hoped there would be a
better way to achieve my goal without having to do a facet search.
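
For illustration, this is roughly the shape of the facet section a prefix-facet request would return for the three example documents, assuming a hypothetical field `title_facet` that holds the whole lowercased title as one term and a request with `facet.prefix=java` and `facet.sort=count`. The response is trimmed to the relevant part; the counts follow directly from D1 to D3.

```xml
<!-- Sketch of the facet_fields part of the response; other sections omitted -->
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="title_facet">
      <!-- D1 and D3 share this title -->
      <int name="java developer">2</int>
      <!-- D2 only -->
      <int name="java programmer">1</int>
    </lst>
  </lst>
</lst>
```

Because the counts are computed at request time, newly indexed documents are reflected after the next commit without any frequency bookkeeping outside Solr.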

@Andrew: still at Westfield?

Thanks,
Cuong

On Fri, Jan 20, 2012 at 6:43 PM, Andrew Harvey <[hidden email]> wrote:

> With Solr 4.0 you could use relevance functions to give a query time boost
> if you don't have the information at index time.
>
> Alternatively you could do term facet based autocomplete which would mean
> you could sort by count rather than any other input.
>
> Andrew