edge ngram/find as you type sorting


matthew sporleder
I have added an edge ngram field to my index and get decent results
with partial words but the results appear randomly sorted and all
contain the same score.  Ideally I would like to sort by shortest
ngram match within my other qualifiers.

Is there a canonical solution to this?

Thanks,
Matt

p.s. I mostly followed
https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/

schema bits:

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
 </analyzer>
</fieldType>

  <field name="slug" type="string_ci" indexed="true" stored="true" multiValued="false" />

  <field name="fayt" type="edgytext" indexed="true" stored="false" omitNorms="false" omitTermFreqAndPositions="true" multiValued="true" />

<copyField source="slug" dest="fayt" maxChars="65" />
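
To make the behaviour concrete, here is a minimal Python sketch (not Solr code) of roughly what the index-time analyzer above emits for a slug. The names edge_ngrams and matches and the sample slugs are made up for illustration, and the sketch assumes the query is lowercased but not n-grammed. It suggests why a query for "abc" hits "abc", "abcd" and "abcde" alike: each document matches through the same single indexed term, which is one plausible reason the scores come out identical.

```python
def edge_ngrams(value, min_gram=1, max_gram=25):
    """Approximate KeywordTokenizer + LowerCaseFilter + EdgeNGramFilter."""
    # The keyword tokenizer keeps the whole value as a single token.
    token = value.lower()
    upper = min(len(token), max_gram)
    # Emit every leading substring from min_gram up to max_gram characters.
    return [token[:n] for n in range(min_gram, upper + 1)]


# Hypothetical slugs standing in for indexed documents.
slugs = ["abc", "abcd", "abcde", "xyz"]
index = {slug: set(edge_ngrams(slug)) for slug in slugs}


def matches(query):
    """Return the slugs whose indexed grams contain the lowercased query."""
    term = query.lower()
    return [slug for slug in slugs if term in index[slug]]


print(edge_ngrams("My_Slug"))
print(matches("abc"))
```

Nothing in the match itself records how much of the slug was left over after the typed prefix, so any "shortest match first" ordering has to come from somewhere else, such as a sort.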

Re: edge ngram/find as you type sorting

Erick Erickson
Sort by the full field. You’ll need to copy to a field whose type uses keywordTokenizer and lowercaseFilter (string_ci? assuming it’s not really a “string” type).

Best,
Erick

Re: edge ngram/find as you type sorting

matthew sporleder
Oh maybe a schema bug!

my string_ci:
 <fieldType name="string_ci" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
     <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
  </fieldType>

going to try this instead:
  <fieldType name="string_lctoken" class="solr.StrField"
sortMissingLast="true" omitNorms="true">
     <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
  </fieldType>

Then I can probably kill the lowercasefilter on edgytext.

Re: edge ngram/find as you type sorting

Erick Erickson
Won’t work. String types are totally unanalyzed. Your string_ci fieldType is what I was looking for.

No, you shouldn’t kill the lowercasefilter unless you want all of your searches to be case-sensitive.

So you should try:

q=edgy_text:whatever&sort=string_ci asc

Please use the admin>>pick_core>>analysis page when thinking about changing your schema; it’ll answer a _lot_ of these questions immediately.

Best,
Erick
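
A rough Python analogy (not Solr code) for the point about unanalyzed string types: sorting on the verbatim value is case-sensitive, while sorting on a lowercased single token is not. The function names and sample values are hypothetical, and Python's code-point ordering is only assumed to agree with Solr's term ordering for plain ASCII values like these.

```python
def strfield_sort_key(value):
    # StrField-style: the value is used verbatim, with no analysis applied.
    return value


def string_ci_sort_key(value):
    # KeywordTokenizer + LowerCaseFilter style: one token, lowercased.
    return value.lower()


# Hypothetical slugs with mixed case.
slugs = ["cherry", "Banana", "apple"]

# Uppercase letters sort before lowercase ones when compared verbatim.
print(sorted(slugs, key=strfield_sort_key))
# Lowercasing first gives the case-insensitive order.
print(sorted(slugs, key=string_ci_sort_key))
```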

Re: edge ngram/find as you type sorting

matthew sporleder
Okay I appreciate you responding.

Switching "slug" from "string_ci" to class="solr.StrField" accomplished
about the same results, which makes sense to me now :)

The previous definition of string_ci was:
  <fieldType name="string_ci" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
     <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
  </fieldType>

So lowercase + KeywordTokenizerFactory;

I am trying again with omitNorms=false  to see if I can get the more
"exact" matches to score better this time around.


Re: edge ngram/find as you type sorting

matthew sporleder
Where I landed:

  <fieldType name="string_ci" class="solr.TextField"
sortMissingLast="true" omitNorms="false">
     <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
  </fieldType>

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory" />
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>


  <field name="slug" type="string_ci" indexed="true" stored="true"
multiValued="false" />
  <field name="fayt" type="edgytext" indexed="true" stored="false"
omitNorms="false" omitTermFreqAndPositions="false" multiValued="true"
/>
  <field name="qt_len" type="int" indexed="true" stored="true"
multiValued="false" />

---

I can then do a search for

q=fayt:my_article_slu&sort=qt_len asc

to get the shortest/most exact find-as-you-type match.  I couldn't get
around all results having the same score (can I boost proximity to the
end of a string?) in the edge ngram search but I am hoping this is the
fastest way to do this type of search since I can avoid wildcards
"my_article_slu*" and stuff.

More suggestions welcome and thanks for the help.  I will re-index
with omitNorms=true again to see if I can save a little space.
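
The approach above can be sketched in plain Python (not Solr code). It assumes qt_len simply holds the length of the slug, which the post does not actually state, and uses a prefix test as a stand-in for the fayt match. The function name and sample slugs are hypothetical.

```python
# Hypothetical documents; qt_len is ASSUMED to be the slug length.
slugs = [
    "my_article_slug_extended",
    "my_article_slug",
    "my_article_slug_2",
    "other_slug",
]
docs = [{"slug": s, "qt_len": len(s)} for s in slugs]


def find_as_you_type(prefix):
    """Filter by typed prefix, then order hits by the stored length."""
    typed = prefix.lower()
    hits = [d for d in docs if d["slug"].lower().startswith(typed)]
    # Shortest slug first, mirroring sort=qt_len asc.
    return [d["slug"] for d in sorted(hits, key=lambda d: d["qt_len"])]


print(find_as_you_type("my_article_slu"))
```

The slug closest in length to what was typed comes back first, which matches the "shortest/most exact" ordering described, without relying on score at all.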





Re: edge ngram/find as you type sorting

Erick Erickson
Why do you want to deal with score at all? Sorting
overrides score-based sorting. Well, unless you
specify score as a secondary sort. But since you’re
sorting by length anyway, trying to score
based on proximity to the end does nothing.

The weirdness you’re going to get here, though, is
that the order of the results will not be alphabetical.
Say you have two docs, one with abcd and one with
abce. Now say you search on abc. Whether abcd or
abce comes first is indeterminate.

If you simply stored the keyword-lowercased value
in a copyField and sorted on _that_, you wouldn’t have
this problem. But if you’re really worried about space,
that might not be an option.

Best,
Erick
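
The tie problem described above can be illustrated with a small Python sketch (not Solr code). Python's sort is stable, so tied items keep their input order; here that stands in, as an assumption, for whatever order Solr happens to return documents with equal sort values. The function names are hypothetical.

```python
def by_length(values):
    # Length only: values of equal length keep whatever order they arrived in.
    return sorted(values, key=len)


def by_length_then_value(values):
    # Length first, then the lowercased value as a tie-breaker.
    return sorted(values, key=lambda v: (len(v), v.lower()))


# The same three values arriving in two different orders.
first = ["abce", "abcd", "abc"]
second = ["abcd", "abce", "abc"]

print(by_length(first), by_length(second))
print(by_length_then_value(first), by_length_then_value(second))
```

In Solr terms the second variant would correspond to something like sort=qt_len asc,slug asc, assuming slug is the keyword-lowercased field.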

Re: edge ngram/find as you type sorting

matthew sporleder
My original goal was to avoid indexing the string length because I
wanted edge ngram to "score" based on how "exact" the match was:

q=abc
"abc" has a high score
"abcd" has a lower score
"abcde" has an even lower score

You say sorting by the original field will do that but in practice
it is not happening so I am probably missing something.

I *am* getting a close version of what I said above with sorting on
the length, which I added to the index.

Searching my keyword-lowercase field with field:abc* and sorting by
length also works, so maybe I can skip the edge ngram field entirely
and just do that, but I was hoping to trade some disk space for
performance.  This field will get queried a lot.
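
As a sanity check on the idea that the two approaches return the same hits, here is a small Python sketch (not Solr code) comparing an exact lookup in a precomputed edge-ngram index with a prefix scan. All names and sample slugs are hypothetical, and the sketch says nothing about relative speed in Solr, where prefix queries run against the term dictionary rather than scanning documents.

```python
MAX_GRAM = 25


def edge_ngrams(value, min_gram=1, max_gram=MAX_GRAM):
    # Leading substrings of the lowercased value, as the index analyzer would emit.
    token = value.lower()
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}


slugs = ["abc", "abcd", "abcde", "abx", "zabc"]

# Precompute gram -> slugs, trading space for a single exact lookup per query.
gram_index = {}
for slug in slugs:
    for gram in edge_ngrams(slug):
        gram_index.setdefault(gram, []).append(slug)


def ngram_lookup(query):
    # One exact term lookup, then shortest first.
    return sorted(gram_index.get(query.lower(), []), key=len)


def prefix_scan(query):
    # Wildcard-style match (abc*), then shortest first.
    typed = query.lower()
    return sorted((s for s in slugs if s.lower().startswith(typed)), key=len)


for q in ["a", "ab", "abc", "abcd", "z", "q"]:
    print(q, ngram_lookup(q), prefix_scan(q))
```

The two only agree while the typed prefix is no longer than the maximum gram size; beyond that the ngram lookup finds nothing.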


Re: edge ngram/find as you type sorting

Erick Erickson
What _is_ happening? Please provide examples of the inputs
and outputs that don’t work for you. ’cause
the sort order should be “nothing comes before something”,
so sorting ascending on a keywordtokenizer+lowercasefilter field
should give you exactly what you’re asking for with no
need for a length field.

Best,
Erick
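
A small Python sketch (not Solr code) of the "nothing comes before something" rule, using made-up values that all start with "abc". It also shows one case where alphabetical order and shortest-first order disagree, which may or may not be the discrepancy being reported in this thread.

```python
# Hypothetical hits for the typed prefix "abc".
hits = ["abcz", "abcde", "abc", "abcaa", "abcd"]

# Plain ascending sort: a value sorts before any longer value it is a prefix of.
alphabetical = sorted(hits)

# Length first, then value: the ordering a separate length field would give.
shortest_first = sorted(hits, key=lambda v: (len(v), v))

print(alphabetical)
print(shortest_first)
```

Both orderings put the exact match "abc" first. After that the alphabetical sort places "abcaa" ahead of the shorter "abcd" and "abcz", so it only matches shortest-first when the remaining hits happen to extend one another.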

> On Mar 25, 2020, at 11:07 AM, matthew sporleder <[hidden email]> wrote:
>
> My original goal was to avoid indexing the string length because I
> wanted edge ngram to "score" based on how "exact" the match was:
>
> q=abc
> "abc" has a high score
> "abcd" has a lower score
> "abcde" has an even lower score
>
> You say sorting by by the original field will do that but in practice
> it is not happening so I am probably missing something.
>
> I *am* getting a close version of what I said above with sorting on
> the length, which I added to the index.
>
> searching for my keyword-lowercase field:abc* + sorting by length is
> also working so maybe I can skip the edge ngram field entirely and
> just do that but I was hoping the trade some disk space for
> performance.  This field will get queried a lot.
>

Re: edge ngram/find as you type sorting

matthew sporleder
Okay.  I am getting pretty much a random order of documents containing
the prefix.

Does my "string_ci" defined below count as
"keywordtokenizer+lowercasefilter"?  (assumption)
Does my "fayt" copy field below look right? (assumption)

I have a bunch of web pages indexed with "slug" fields with the prefix
"what_is_lov"
so I search:
select?q=fayt:what_is_lov&fl=slug&rows=1000&sort=slug%20asc&wt=csv

and get:
slug
What_is_Lov_Holtz_known_for
What_is_lova_after_it_harddens
What_is_Lova_Moor's_birthday
What_is_lovable_in_Spanish
What_is_lovage
What_is_Lovagny's_population
What_is_lovan_for
What_is_lovanox
What_is_lovarstan_for
What_is_Lovasatin



On Wed, Mar 25, 2020 at 1:15 PM Erick Erickson <[hidden email]> wrote:

>
> What _is_ happening? Please provide examples of the inputs
> and outputs that don’t work for you. ‘cause
> the sort order should be “nothing comes before something"
> so sorting ascending on a keywordtokenizer+lowecasefilter
> should give you exactly what you’re asking for with no
> need for a length field.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 11:07 AM, matthew sporleder <[hidden email]> wrote:
> >
> > My original goal was to avoid indexing the string length because I
> > wanted edge ngram to "score" based on how "exact" the match was:
> >
> > q=abc
> > "abc" has a high score
> > "abcd" has a lower score
> > "abcde" has an even lower score
> >
> > You say sorting by by the original field will do that but in practice
> > it is not happening so I am probably missing something.
> >
> > I *am* getting a close version of what I said above with sorting on
> > the length, which I added to the index.
> >
> > searching for my keyword-lowercase field:abc* + sorting by length is
> > also working so maybe I can skip the edge ngram field entirely and
> > just do that but I was hoping the trade some disk space for
> > performance.  This field will get queried a lot.
> >
> >
> > On Wed, Mar 25, 2020 at 10:39 AM Erick Erickson <[hidden email]> wrote:
> >>
> >> Why do you want to deal with score at all? Sorting
> >> overrides score-based sorting. Well, unless you
> >> specify score as a secondary sort. But since you’re
> >> sorting by length anyway, trying to score
> >> based on proximity to the end does nothing.
> >>
> >> The weirdness you’re going to get here, though, is
> >> that the order of the results will not be alphabetical.
> >> Say you have two docs, one with abcd and one with
> >> abce. Now say you search on abc. Whether abcd or
> >> abce comes first is indeterminant.
> >>
> >> If you simply stored the keyword-lowercased value
> >> in a copyfield and sorted on _that_, you wouldn’t have
> >> this problem. But if you’re really worried about space,
> >> that might not be an option.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 25, 2020, at 9:49 AM, matthew sporleder <[hidden email]> wrote:
> >>>
> >>> Where I landed:
> >>>
> >>> <fieldType name="string_ci" class="solr.TextField"
> >>> sortMissingLast="true" omitNorms="false">
> >>>    <analyzer>
> >>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>    </analyzer>
> >>> </fieldType>
> >>>
> >>> <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
> >>> <analyzer type="index">
> >>>  <filter class="solr.LowerCaseFilterFactory" />
> >>>  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> >>> maxGramSize="25" />
> >>>  <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>> </analyzer>
> >>> <analyzer type="query">
> >>>  <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>  <filter class="solr.LowerCaseFilterFactory"/>
> >>> </analyzer>
> >>> </fieldType>
> >>>
> >>>
> >>> <field name="slug" type="string_ci" indexed="true" stored="true"
> >>> multiValued="false" />
> >>> <field name="fayt" type="edgytext" indexed="true" stored="false"
> >>> omitNorms="false" omitTermFreqAndPositions="false" multiValued="true"
> >>> />
> >>> <field name="qt_len" type="int" indexed="true" stored="true"
> >>> multiValued="false" />
> >>>
> >>> ---
> >>>
> >>> I can then do a search for
> >>>
> >>> q=fayt:my_article_slu&sort=qt_len asc
> >>>
> >>> to get the shortest/most exact find-as-you-type match.  I couldn't get
> >>> around all results having the same score (can I boost proximity to the
> >>> end of a string?) in the edge ngram search but I am hoping this is the
> >>> fastest way to do this type of search since I can avoid wildcards
> >>> "my_article_slu*" and stuff.
> >>>
> >>> More suggestions welcome and thanks for the help.  I will re-index
> >>> with omitNorms=true again to see if I can save a little space.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Mar 24, 2020 at 11:39 AM matthew sporleder <[hidden email]> wrote:
> >>>>
> >>>> Okay I appreciate you responding.
> >>>>
> >>>> Switching "slug" from "string_ci" class="solr.StrField" accomplished
> >>>> about the same results, which makes sense to me now :)
> >>>>
> >>>> The previous definition of string_ci was:
> >>>> <fieldType name="string_ci" class="solr.TextField"
> >>>> sortMissingLast="true" omitNorms="true">
> >>>>    <analyzer>
> >>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>>    </analyzer>
> >>>> </fieldType>
> >>>>
> >>>> So lowercase + KeywordTokenizerFactory;
> >>>>
> >>>> I am trying again with omitNorms=false  to see if I can get the more
> >>>> "exact" matches to score better this time around.
> >>>>
> >>>>
> >>>> On Tue, Mar 24, 2020 at 9:54 AM Erick Erickson <[hidden email]> wrote:
> >>>>>
> >>>>> Won’t work. String types are totally unanalyzed. Your string_ci fieldType is what I was looking for.
> >>>>>
> >>>>> No, you shouldn’t kill the lowercasefilter unless you want all of your searches will then be case-sensitive.
> >>>>>
> >>>>> So you should try:
> >>>>>
> >>>>> q=edgy_text:whatever&sort=string_ci asc
> >>>>>
> >>>>> Please use the admin>>pick_core>>analysis page when thinking about changing your schema, it’ll answer a _lot_ of these questions immediately.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>>> On Mar 24, 2020, at 8:37 AM, matthew sporleder <[hidden email]> wrote:
> >>>>>>
> >>>>>> Oh maybe a schema bug!
> >>>>>>
> >>>>>> my string_ci:
> >>>>>> <fieldType name="string_ci" class="solr.TextField"
> >>>>>> sortMissingLast="true" omitNorms="true">
> >>>>>>   <analyzer>
> >>>>>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>        <filter class="solr.LowerCaseFilterFactory" />
> >>>>>>   </analyzer>
> >>>>>> </fieldType>
> >>>>>>
> >>>>>> going to try this instead:
> >>>>>> <fieldType name="string_lctoken" class="solr.StrField"
> >>>>>> sortMissingLast="true" omitNorms="true">
> >>>>>>   <analyzer>
> >>>>>>        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>        <filter class="solr.LowerCaseFilterFactory" />
> >>>>>>   </analyzer>
> >>>>>> </fieldType>
> >>>>>>
> >>>>>> Then I can probably kill the lowercasefilter on edgeytext:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 24, 2020 at 7:44 AM Erick Erickson <[hidden email]> wrote:
> >>>>>>>
> >>>>>>> Sort by the full field. You’ll need to copy to a field with keywordTokenizer and lowercaseFilter (string_ci? assuming it’s not really a :”string”) type.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Erick
> >>>>>>>
> >>>>>>>> On Mar 24, 2020, at 7:10 AM, matthew sporleder <[hidden email]> wrote:
> >>>>>>>>
> >>>>>>>> I have added an edge ngram field to my index and get decent results
> >>>>>>>> with partial words but the results appear randomly sorted and all
> >>>>>>>> contain the same score.  Ideally I would like to sort by shortest
> >>>>>>>> ngram match within my other qualifiers.
> >>>>>>>>
> >>>>>>>> Is there a canonical solution to this?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Matt
> >>>>>>>>
> >>>>>>>> p.s. I mostly followed
> >>>>>>>> https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/
> >>>>>>>>
> >>>>>>>> schema bits:
> >>>>>>>>
> >>>>>>>> <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
> >>>>>>>> <analyzer type="index">
> >>>>>>>> <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>>>>>>> <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> >>>>>>>> maxGramSize="25" />
> >>>>>>>> </analyzer>
> >>>>>>>>
> >>>>>>>> <field name="slug" type="string_ci" indexed="true" stored="true"
> >>>>>>>> multiValued="false" />
> >>>>>>>>
> >>>>>>>> <field name="fayt" type="edgytext" indexed="true" stored="false"
> >>>>>>>> omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"
> >>>>>>>> />
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> <copyField source="slug" dest="fayt" maxChars="65" />
> >>>>>>>
> >>>>>
> >>
>

Re: edge ngram/find as you type sorting

Erick Erickson
You’re getting the correct sorted order… The underscore character is confusing you.

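The ASCII code for underscore is 0x5F (%5F when URL-encoded), which sorts after the uppercase letters but before any lowercase letter. Since the sort field is lowercased, underscore sorts before every letter in it.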

See the alphaOnlySort type for a way to remove this, although the output there can also
be confusing.
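
As a quick check of the ordering argument, the following Python sketch re-sorts the slugs from the question by their lowercased code points. It assumes (and this is only an approximation of what Solr does) that the sort on "slug" compares the lowercased whole-string token character by character.

```python
# The slugs exactly as they came back from the query in the question.
observed = [
    "What_is_Lov_Holtz_known_for",
    "What_is_lova_after_it_harddens",
    "What_is_Lova_Moor's_birthday",
    "What_is_lovable_in_Spanish",
    "What_is_lovage",
    "What_is_Lovagny's_population",
    "What_is_lovan_for",
    "What_is_lovanox",
    "What_is_lovarstan_for",
    "What_is_Lovasatin",
]

# Code points: uppercase letters < underscore < lowercase letters.
print(hex(ord("Z")), hex(ord("_")), hex(ord("a")))  # 0x5a 0x5f 0x61

# Shuffle the input (here: reverse it) and sort on the lowercased value.
resorted = sorted(reversed(observed), key=str.lower)
print(resorted == observed)  # True
```

So "lov_holtz" sorts ahead of "lova_..." because "_" compares lower than "a", not because the order is random.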

Best,
Erick

> On Mar 25, 2020, at 1:30 PM, matthew sporleder <[hidden email]> wrote:
>
> What_is_Lov_Holtz_known_for
> What_is_lova_after_it_harddens
> What_is_Lova_Moor's_birthday
> What_is_lovable_in_Spanish
> What_is_lovage
> What_is_Lovagny's_population
> What_is_lovan_for
> What_is_lovanox
> What_is_lovarstan_for
> What_is_Lovasatin

Re: edge ngram/find as you type sorting

matthew sporleder
Okay, confirmed.
I am getting a more predictable result set after adding an additional field:
  <fieldType name="string_alpha" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
     <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory" />
          <filter class="solr.PatternReplaceFilterFactory"
pattern="\p{Punct}" replacement=""/>
     </analyzer>
  </fieldType>

q=slug:what_is_lo*&fl=slug&rows=1000&wt=csv&sort=slug_alpha%20asc

So it appears I can skip edge ngram entirely using this method, as
slug:foo* appears to return exactly the same results as fayt:foo, but I have
the cost of the alphaOnly field :)
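
For reference, a rough Python sketch of the sort key this kind of field type produces. It assumes that Java's \p{Punct} covers the same ASCII punctuation as Python's string.punctuation; the helper name alpha_key is made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def alpha_key(slug: str) -> str:
    """Lowercase the whole value, then drop punctuation (underscores included)."""
    return slug.lower().translate(_STRIP_PUNCT)


slugs = [
    "What_is_Lov_Holtz_known_for",
    "What_is_lova_after_it_harddens",
    "What_is_Lova_Moor's_birthday",
    "What_is_lovable_in_Spanish",
    "What_is_lovage",
]

by_alpha = sorted(slugs, key=alpha_key)
for slug in by_alpha:
    print(slug)
```

With underscores gone, "What_is_Lov_Holtz_known_for" moves from first to last, since its key continues with "h" where the others continue with "a".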

I will try to figure out some benchmarks or something to decide how to go.

Thanks again for the help so far.


On Wed, Mar 25, 2020 at 2:39 PM Erick Erickson <[hidden email]> wrote:

>
> You’re getting the correct sorted order… The underscore character is confusing you.
>
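> The ASCII code for underscore is 0x5F (%5F when URL-encoded), which sorts after the uppercase letters but before any lowercase letter. Since the sort field is lowercased, underscore sorts before every letter in it.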
>
> See the alphaOnlySort type for a way to remove this, although the output there can also
> be confusing.
>
> Best,
> Erick
>

Re: edge ngram/find as you type sorting

Erick Erickson
The ngramming is a time/space tradeoff. Typically,
if you restrict the wildcards to have three or more
“real” characters, performance is fine. One real
character (i.e. a*) will be your worst case. I’ve
seen requiring two characters in the prefix work well
too. It Depends (tm).

Conceptually what happens here is that Lucene has
to enumerate all of the terms that start with the prefix
and create a ginormous OR clause. The term
enumeration will take longer the more terms there are.
Things are more efficient than that, but still...
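
A conceptual Python sketch of that tradeoff follows. It is a toy model, not Lucene's actual implementation, and the terms and function names are invented for illustration: the prefix query walks the sorted term list at query time, while the edge ngram index pays up front by storing every leading substring.

```python
from bisect import bisect_left
from collections import defaultdict


def prefix_terms(sorted_terms, prefix):
    """Query-time expansion: enumerate every term that starts with the prefix."""
    matches = []
    for i in range(bisect_left(sorted_terms, prefix), len(sorted_terms)):
        if not sorted_terms[i].startswith(prefix):
            break
        matches.append(sorted_terms[i])
    return matches


def build_edge_ngram_index(terms, min_gram=1, max_gram=25):
    """Index-time expansion: store each leading substring of each term."""
    index = defaultdict(set)
    for term in terms:
        for n in range(min_gram, min(len(term), max_gram) + 1):
            index[term[:n]].add(term)
    return index


terms = sorted(
    ["what_is_love", "what_is_lovage", "what_is_lava", "who_is_lovelace", "abc"]
)
index = build_edge_ngram_index(terms)

for prefix in ("w", "what_is_lov"):
    expanded = prefix_terms(terms, prefix)
    # Both approaches find the same terms; they differ in where the work is done.
    assert set(expanded) == index[prefix]
    print(prefix, "expands to", len(expanded), "terms")

print(len(terms), "terms in the dictionary,", len(index), "grams in the ngram index")
```

The shorter the prefix, the more terms the query has to enumerate, while the ngram side answers with a single lookup at the price of a much larger index.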

So make sure you’re testing with a real corpus. Having
a test index with just a few terms will be misleading.

Best,
Erick

> On Mar 25, 2020, at 9:37 PM, matthew sporleder <[hidden email]> wrote:
>
> Okay confirmed-
> I am getting a more predictable results set after adding an additional field:
>  <fieldType name="string_alpha" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
>     <analyzer>
>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>          <filter class="solr.LowerCaseFilterFactory" />
>          <filter class="solr.PatternReplaceFilterFactory"
> pattern="\p{Punct}" replacement=""/>
>     </analyzer>
>  </fieldType>
>
> q=slug:what_is_lo*&fl=slug&rows=1000&wt=csv&sort=slug_alpha%20asc
>
> So it appears I can skip edge ngram entirely using this method as
> slug:foo* appears to be the exact same results as fayt:foo, but I have
> the cost of the alphaOnly field :)
>
> I will try to figure out some benchmarks or something to decide how to go.
>
> Thanks again for the help so far.
>
>

Re: edge ngram/find as you type sorting

matthew sporleder
That explains the OOMs I've been getting in the initial test cycle.
I'm working with about 50M (small) documents.

On Thu, Mar 26, 2020 at 7:58 AM Erick Erickson <[hidden email]> wrote:

>
> the ngramming is a time/space tradeoff. Typically,
> if you restrict the wildcards to have three or more
> “real” characters performance is fine. One real
> character (i.e. a*) will be your worst-case. I’ve
> seen requiring two characters in the prefix work well
> too. It Depends (tm).
>
> Conceptually what happens here is that Lucene has
> to enumerate all of the terms that start with the prefix
> and create a ginormous OR clause. The term
> enumeration will take longer the more terms there are.
> Things are more efficient than that, but still...
>
> So make sure you’re testing with a real corpus. Having
> a test index with just a few terms will be misleading.
>
> Best,
> Erick
>

Re: edge ngram/find as you type sorting

Erick Erickson
From other mails, it looks like you’re inheriting something you had
no input in building. My sympathies ;)

Unless you’ve explicitly changed the heap size by specifying -Xmx and -Xms
at startup, you’re operating with 512M of memory, which is far too small
for most Solr installations. The -m parameter at startup will modify this.

The admin UI will also show you how much memory Solr is running with.

Best,
Erick

> On Mar 26, 2020, at 8:52 AM, matthew sporleder <[hidden email]> wrote:
>
> That explains the OOM's I've been getting in the initial test cycle.
> I'm working with about 50M (small) documents.
>