keepword file with phrases

lee carroll
Hi List,
I'm trying to achieve the following.

Text in: "this aisle contains preserves and savoury spreads"

The desired index entry, for a field to be used for faceting (i.e. a strict
set of normalised terms), is "jam" "savoury spreads", i.e. two facet terms.

The current setup for the field is:

<fieldType name="facet" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.KeepWordFilterFactory" words="goodForKeepWords.txt" ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.KeepWordFilterFactory" words="goodForKeepWords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>

The thinking here is:
get rid of any markup nonsense
split into tokens on whitespace => "this" "aisle" "contains"
"preserves" "and" "savoury" "spreads"
produce shingles of 1 or 2 tokens => "this", "this aisle", "aisle", "aisle
contains", "contains", "contains preserves", "preserves", "preserves and",
"and", "and savoury", "savoury", "savoury spreads", "spreads"

expand synonyms using a synonym file (preserves -> jam) =>

"this", "this aisle", "aisle", "aisle contains", "contains", "contains
preserves", "preserves", "jam", "preserves and", "and", "and savoury",
"savoury", "savoury spreads", "spreads"

produce a normalised term list using a keepword file containing jam and
"savoury spreads"

which should place "jam" and "savoury spreads" into the index field facet.

However, I don't get "savoury spreads" in the index. According to
analysis.jsp, everything goes to plan up to the last step, where the keepword
filter does not keep the phrase "savoury spreads". I've tried naively quoting
the phrase in the keepword file :-)

What is the best way to achieve the above? Is this the correct approach, or
is there a better way?

Thanks in advance, lee

Re: keepword file with phrases

lee carroll
Just to add: things are not going as expected even before the keepword step.
The synonym list is not being expanded for shingles. I think I don't
understand term positions....

On 5 February 2011 16:08, lee carroll <[hidden email]> wrote:

> Hi List
> I'm trying to achieve the following
>
> [...]

Re: keepword file with phrases

Billnbell
You need to switch the order. Do synonyms and expansion first, then
shingles.

Have you tried using analysis.jsp?
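
For reference, the reordering suggested above might look roughly like this in the index analyzer. This is only a sketch that reuses the filter classes, attributes and file names from the original post; it has not been run against the poster's data.

```xml
<!-- Sketch only: same filters as the original post, with synonyms applied before shingling -->
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- the synonym filter now receives single whitespace-delimited tokens -->
  <filter class="solr.SynonymFilterFactory" synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
  <!-- shingles are then built from whatever the synonym filter emits -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  <filter class="solr.KeepWordFilterFactory" words="goodForKeepWords.txt" ignoreCase="true"/>
</analyzer>
```

The query analyzer would presumably need the same reordering so that both sides produce comparable terms.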

On 2/5/11 10:31 AM, "lee carroll" <[hidden email]> wrote:

>Just to add: things are not going as expected even before the keepword step.
>The synonym list is not being expanded for shingles. I think I don't
>understand term positions....
>
>[...]

Re: keepword file with phrases

Chris Hostetter-3

: You need to switch the order. Do synonyms and expansion first, then
: shingles..

except then he would be building shingles out of all the permutations of
"words" in his synonyms -- including the multi-word synonyms.  i don't
*think* that's what he wants based on his example (but i may be wrong)

: Have you tried using analysis.jsp ?

he already mentioned he has, in his original mail, and that's how he can
tell it's not working.

lee: based on your followup post about seeing problems in the synonyms
output, i suspect the problem you are having is with how the SynonymFilter
"parses" the synonyms file -- by default it assumes it should split on
certain characters to create multi-word synonyms -- but in your case the
tokens you are feeding the synonym filter (the output of your shingle filter)
really do have whitespace in them

there is a "tokenizerFactory" option that Koji added a while back to the
SynonymFilterFactory that lets you specify the classname of a
TokenizerFactory to use when parsing the synonym rules -- that may be what
you need to get your synonyms with spaces in them (so they work properly
with your shingles)

(assuming of course that i really understand your problem)
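
a minimal sketch of what that option might look like on the synonym filter from the original post. the `tokenizerFactory` attribute is the one described above; choosing `solr.KeywordTokenizerFactory` is an assumption, picked because it does not split on whitespace, and whether it produces the desired expansion is untested.

```xml
<!-- Sketch only: tokenizerFactory controls how entries in the synonyms file
     are tokenized when the rules are parsed -->
<filter class="solr.SynonymFilterFactory"
        synonyms="goodForSynonyms.txt"
        ignoreCase="true"
        expand="true"
        tokenizerFactory="solr.KeywordTokenizerFactory"/>
```

the same attribute would go on the synonym filter in both the index and query analyzers.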


-Hoss

Re: keepword file with phrases

Billnbell
OK, that makes sense.

If you double-quote the phrases in the synonyms file, will that help with the
whitespace?

Bill


On 2/5/11 4:37 PM, "Chris Hostetter" <[hidden email]> wrote:

>
>[...]

Re: keepword file with phrases

lee carroll
Hi Bill,

quoting in the synonyms file did not produce the correct expansion :-(

Looking at Chris's comments now

cheers

lee

On 5 February 2011 23:38, Bill Bell <[hidden email]> wrote:

> OK that makes sense.
>
> If you double quote the synonyms file will that help for white space?
>
> Bill
>
>
> [...]

Re: keepword file with phrases

lee carroll
Hi Chris,

Yes, you've identified the problem :-)

I've tried using the keyword tokeniser, but that seems to merge all
comma-separated lists of synonyms into one.

The pattern tokeniser would seem to be a candidate, but can you pass the
pattern attribute to the tokeniser through the synonym filter?

Example synonym lines which are problematic:

termA1,termA2,termA3, phrase termA, termA4 => normalisedTermA
termB1,termB2,termB3 => normalisedTermB

When the synonym filter uses the keyword tokeniser,
only "phrase termA" ends up being matched as a synonym :-)
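
On the question of passing a pattern: later Solr reference documentation describes forwarding arguments prefixed with `tokenizerFactory.` to the tokenizer factory named in `tokenizerFactory`. Whether the Solr version in use here supports that is an assumption, and the pattern below is purely illustrative and untested.

```xml
<!-- Sketch only: assumes a Solr version that forwards tokenizerFactory.* arguments
     to the tokenizer factory used for parsing the synonyms file -->
<filter class="solr.SynonymFilterFactory"
        synonyms="goodForSynonyms.txt"
        ignoreCase="true"
        expand="true"
        tokenizerFactory="solr.PatternTokenizerFactory"
        tokenizerFactory.pattern="\s*,\s*"/>
<!-- hypothetical pattern: split only on commas, so an entry such as
     "phrase termA" would stay a single token -->
```

If the prefix is not supported, the filter would have no way to hand `pattern` to the tokenizer, so this should be checked against the documentation for the installed release.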


lee


On 6 February 2011 12:58, lee carroll <[hidden email]> wrote:

> Hi Bill,
>
> quoting in the synonyms file did not produce the correct expansion :-(
>
> Looking at Chris's comments now
>
> cheers
>
> lee
>
>
> [...]