Query on Synonyms feature in Solr

Rajinimaski
I want to enable the synonyms feature for documents in Solr.


I have one field in Solr that holds the content of a document (say the field
name is document_data).

The data in that field is :

"Tamil Nadu state private school fee determination committee headed by
Justice Raviraja has submitted the private schools fees structure to the
district educational officers on Monday"

The synonyms for private school in the synonym flat file are:

Private schools,NGO Schools,Unaided schools


Now, when I search this field with document_data=unaided schools, I need
to get this document back.

Which tokenizer and analyzer filters can I apply to the
document_data field to get that result?




This is the indexed document :
<add>
<doc>
<field name="ID">SOLR200</field>
<field name="document_data">Tamil Nadu state private school fee
determination committee headed by Justice Raviraja has submitted the private
schools fees structure to the district educational officers on
Monday</field>
</doc>
</add>


So far I have tried these two field types, and with neither could I get the
results above:

<fieldType name="Synonym_document" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilter" synonyms="Taxonomy.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>


<fieldType name="Synonym_document" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilter" synonyms="Taxonomy.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>


<field name="document_data" type="Synonym_document" indexed="true"
       multiValued="true"/>

Neither worked for my query.
Could anyone please tell me which tokenizer and analyzer filters I can apply to
the document_data field to get the results above?


Regards,
Rajani

Re: Query on Synonyms feature in Solr

Karsten R.
Hi rajini,

Multi-word synonyms like "private schools" usually cause problems.

See e.g. "Solr 1.4 Enterprise Search Server", page 56:
"For multi-word synonyms to work, the analysis must be applied at
index-time and with expansion so that both the original words and the
combined word get indexed. ..."

Your problem:
The input to the synonym filter must be the exact token "Private schools".

But "WhitespaceTokenizerFactory" generates two tokens, "private" and "schools",
and with "KeywordTokenizerFactory" the whole text is one token.

Best regards
  Karsten



-------- Original Message --------
> Date: Mon, 13 Jun 2011 16:07:35 +0530
> From: rajini maski <[hidden email]>
> To: [hidden email]
> Subject: Query on Synonyms feature in Solr


Re: Query on Synonyms feature in Solr

Rajinimaski
Karsten,

   I have tried both of the cases you mentioned.

With "WhitespaceTokenizerFactory", the text is first split into the two tokens
"private" and "schools", and only then does the synonym filter try to match.
The match fails because my synonym flat file has a list like this:
Private schools,NGO Schools,Unaided schools

So after the split, the filter looks for a synonym for "private" and not
for "Private schools", and the match fails.


With KeywordTokenizerFactory, the entire content of the
field is taken as one keyword.
eg: document_data = "Tamil Nadu state private school fee determination
committee headed by Justice Raviraja has submitted the private schools fees
structure to the district educational officers on Monday"

is treated as one keyword. But "private school" is just one part of the
sentence in that field, so this will not match our search either :(

Any other suggestions to fix this?

Regards,
Rajani Maski



On Mon, Jun 13, 2011 at 4:54 PM, <[hidden email]> wrote:


Re: Query on Synonyms feature in Solr

Erick Erickson
I think the point is that you need to expand synonyms at
index time but not at query time. In the field type definitions you
provided, the expansion happens both at index and query
time....

Or have you tried that already?

Best
Erick
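
A minimal sketch of the setup Erick describes, assuming the stock solr.SynonymFilterFactory and the Taxonomy.txt file from the original post. The field type name text_syn_index is hypothetical. The synonym filter appears only in the index analyzer, so expansion happens once, when a document is indexed, and the query analyzer leaves the query terms as typed:

```xml
<!-- Hypothetical field type: synonyms are expanded at index time only. -->
<fieldType name="text_syn_index" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand="true" indexes every term of a matching synonym group -->
    <filter class="solr.SynonymFilterFactory" synonyms="Taxonomy.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same chain as above, minus the synonym filter -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

Index-time analysis only affects documents added after the change, so existing documents would need to be re-indexed before a search for "unaided schools" could match them.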

On Mon, Jun 13, 2011 at 7:46 AM, rajini maski <[hidden email]> wrote:


Re: Query on Synonyms feature in Solr

roySolr
Maybe you can try escaping the spaces in the synonyms so they are not tokenized on whitespace:

Private\ schools,NGO\ Schools,Unaided\ schools

Re: Query on Synonyms feature in Solr

Rajinimaski
Erick: I have tried what you said, but I need some clarification. My question
is below.

Say I have this field type:

<fieldType name="Synonymdata" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="org.apache.solr.orchsynonym.OrchSynonymFilter"
            synonyms="BODYTaxonomy.txt,PalpClinLocObsTaxo.txt,MacroscopicTaxonomy.txt,MicroscopicTaxonomy.txt,SpecimenTaxonomy.txt,ParameterTaxonomy.txt,StrainTaxonomy.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="org.apache.solr.orchsynonym.OrchSynonymFilter"
            synonyms="BODYTaxonomy.txt,PalpClinLocObsTaxo.txt,MacroscopicTaxonomy.txt,MicroscopicTaxonomy.txt,SpecimenTaxonomy.txt,ParameterTaxonomy.txt,StrainTaxonomy.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>



The data indexed in this field is :

sentence 1 : " tissue devitalization was noted in hepalocytes of liver"
sentence 2 :  "Necrosis not found in liver"

Synonyms:
necrosis , tissue devitalization, cellular necrosis

How do the whitespace tokenizer and the synonym filter behave? I can't work it
out from the analysis page. Please let me know whether it works as described
below, and correct me if I am wrong.

sentence 1 : " tissue devitalization was noted in hepalocytes of liver"

Whitespace tokens:
tissue
devitalization
was
noted
in
hepalocytes
of
liver

Synonyms for the tokens:
No synonym for "tissue", no synonym for "devitalization", and so on.
So will "tissue devitalization" not become a synonym for
"necrosis", even though it is listed in the synonym file?

If it is added as a synonym, how does the analyzer split the sentence and
apply the filter? Which happens first?


Sentence 2: "Necrosis not found in liver"


Whitespace tokens:
Necrosis
not
found
in
liver


Synonyms for the tokens:
Synonyms for "Necrosis": tissue devitalization, cellular necrosis; no synonym
for "not", no synonym for "found", and so on.

Is this correct?


My main concern is the case where I have three sentences like these:

tissue devitalization was observed in hepalocytes of liver
necrosis was observed in liver
Necrosis not found in liver

When I search for "Necrosis not found", I need to get only the last sentence.

I can't work out which tokenizer and analyzer filters I need to
apply to achieve this.

Awaiting reply
Rajani Maski










On Tue, Jun 14, 2011 at 3:13 PM, roySolr <[hidden email]> wrote:

> Maybe you can try to escape the synonyms so it's not tokenized by whitespace..
>
> Private\ schools,NGO\ Schools,Unaided\ schools
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Query-on-Synonyms-feature-in-Solr-tp3058197p3062392.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Query on Synonyms feature in Solr

Erick Erickson
Well, first, it is usually unnecessary to specify the
synonym filter at both index and query time. I'd apply
it only at query time to start, then perhaps switch
to index time; see the discussion at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46
for why index-time is preferable.
Note you'll have to re-index.

That said, essentially what happens (and assuming
synonym filter is only in the query part) is you have
something like this as your search for "necrosis not
found".

Offset 0                 Offset 1    Offset 2
necrosis                 not         found
tissue devitalization
cellular necrosis


Note that one of your three synonyms must appear in position 0,
followed by the other two terms.

So your example should "just work". But as I said, it would probably
be best to put your synonym filter in only one place: index time or query time.

An analogous process happens if you add synonyms at index
time.

Best
Erick
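
Erick's suggested starting point, synonyms at query time only, might look like the sketch below. It assumes the stock solr.SynonymFilterFactory in place of the custom OrchSynonymFilter class from the question. The field type name text_syn_query and the single synonyms file synonyms.txt are hypothetical, and the stop filter from the question is left out to keep the two chains identical apart from the synonym step:

```xml
<!-- Hypothetical field type: synonyms are applied to the query only. -->
<fieldType name="text_syn_query" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- No synonym filter here: documents are indexed as written -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand="true" adds the other terms of a synonym group to the query -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

Because the index holds no synonym terms in this variant, the synonyms file can be edited without re-indexing. The analysis page can then be used to check which terms end up at each position for a query such as "necrosis not found".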

On Wed, Jun 15, 2011 at 8:14 AM, rajini maski <[hidden email]> wrote:


Re: Query on Synonyms feature in Solr

Rajinimaski
than

On Wed, Jun 15, 2011 at 9:42 PM, Erick Erickson <[hidden email]> wrote:


Re: Query on Synonyms feature in Solr

Rajinimaski
In reply to this post by Erick Erickson
OK, thank you. I will consider this.

One last question: how do I handle negation terms?

As I mentioned in the mail above, say I have three sentences like these:

1. tissue devitalization was observed in hepalocytes of liver
2. necrosis was observed in liver
3. Necrosis not found in liver

When I search for "Necrosis not found", I need to get only the last sentence,
but right now I get all three results.

I can't work out which tokenizer and analyzer filters I need to
apply to achieve this.

Awaiting reply
Rajani Maski




As explained in the above mail,

On Wed, Jun 15, 2011 at 9:42 PM, Erick Erickson <[hidden email]>wrote:

> Well, first it is usually unnecessary to specify the
> synonym filter both at index and query time, I'd apply
> it only at query time to start, then perhaps switch
> to index time, see the discussion at:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46
> for why index-time is preferable.
> Note you'll have to re-index.
>
> That said, essentially what happens (and assuming
> synonym filter is only in the query part) is you have
> something like this as your search for "necrosis not
> found".
>
> Offset 0                         offset1         offset 2
> necrosis
> tissue devitalization        not            found
> cellular necrosis
>
>
> Note that one of your three synonyms must appear in position 0,
> followed by the other two terms.
>
> So your example should "just work". But as I said, it would probably
> be best if you put your synonym filter only in at index or query time.
>
> An analogous process happens if you add synonyms at index
> time.
>
> Best
> Erick
>
> On Wed, Jun 15, 2011 at 8:14 AM, rajini maski <[hidden email]>
> wrote:
> > Erick: I have tried what you said. I needed clarification on this.. Below
> is
> > my doubt added:
> >
> > Say If i have field type :
> >
> > <fieldType name="Synonymdata" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="org.apache.solr.orchsynonym.OrchSynonymFilter"
> >             synonyms="BODYTaxonomy.txt,PalpClinLocObsTaxo.txt,MacroscopicTaxonomy.txt,MicroscopicTaxonomy.txt,SpecimenTaxonomy.txt,ParameterTaxonomy.txt,StrainTaxonomy.txt"
> >             ignoreCase="true" expand="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="org.apache.solr.orchsynonym.OrchSynonymFilter"
> >             synonyms="BODYTaxonomy.txt,PalpClinLocObsTaxo.txt,MacroscopicTaxonomy.txt,MicroscopicTaxonomy.txt,SpecimenTaxonomy.txt,ParameterTaxonomy.txt,StrainTaxonomy.txt"
> >             ignoreCase="true" expand="false"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> >
> > The data indexed in this field is :
> >
> > sentence 1 : " tissue devitalization was noted in hepalocytes of liver"
> > sentence 2 :  "Necrosis not found in liver"
> >
> > Synonyms:
> > necrosis , tissue devitalization, cellular necrosis
> >
> > How do the whitespace tokenizer and the synonym filter behave? I cannot
> > work it out from the analysis page. Is the following how it works?
> > Please correct me if I am wrong.
> >
> > sentence 1 : " tissue devitalization was noted in hepalocytes of liver"
> >
> > white space :
> > tissue
> >  devitalization
> >  was
> >  noted
> >  in
> >  hepalocytes
> >  of
> > liver
> >
> > Synonyms for the tokens:
> > No synonyms for tissue, no synonyms for devitalization, and so on.
> > So will the phrase "tissue devitalization" not be treated as a synonym for
> > Necrosis, even though it is listed in the synonym file?
> >
> > If it is treated as a synonym, then how does the analyzer split the sentence
> > and apply the filter? Which happens first?
> >
> >
> > Sentence 2: Necrosis not  found in liver
> >
> >
> > white space
> > Necrosis
> > not
> >  found
> >  in
> >  liver
> >
> >
> > Synonyms for the tokens:
> > synonyms for Necrosis: tissue devitalization, cellular necrosis; no synonyms
> > for not, no synonyms for found, and so on.
> >
> > Is this correct?
> >
> >
> > My main concern is when I have 3 sentences like this:
> >
> > tissue devitalization was observed in hepalocytes of liver
> > necrosis was observed in liver
> > Necrosis not found in liver
> >
> > When I search for "Necrosis not found" I need to get only the last sentence.
> >
> > I am not able to work out which tokenizer and filters I need to
> > apply in order to achieve this output.
> >
> > Awaiting reply
> > Rajani Maski
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Jun 14, 2011 at 3:13 PM, roySolr <[hidden email]> wrote:
> >
> >> Maybe you can try escaping the spaces in the synonyms so they are not
> >> tokenized on whitespace:
> >>
> >> Private\ schools,NGO\ Schools,Unaided\ schools
> >>
> >>
> >
>

Re: Query on Synonyms feature in Solr

Erick Erickson
Have you tried setting your default operator to AND
in schema.xml?
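
For readers following along, here is a minimal sketch of what that setting might look like. It assumes a Solr release from the same era as this thread, where the query parser default can still be declared in schema.xml; check the documentation for your own version before relying on it.

```xml
<!-- schema.xml (sketch): require every query term unless the query
     itself uses an explicit operator such as OR.
     Assumption: the Solr version in use still reads this element
     from schema.xml. -->
<solrQueryParser defaultOperator="AND"/>
```

The same behavior can also be requested for a single query by adding the q.op=AND request parameter, which avoids editing the schema while experimenting.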

Best
Erick

On Wed, Jun 15, 2011 at 12:36 PM, rajini maski <[hidden email]> wrote:

> OK, thank you. I will consider this.
>
> One last question: how do I handle negation terms?
>
> As I mentioned in the mail above, I have 3 sentences like this:
>
> 1. tissue devitalization was observed in hepalocytes of liver
> 2. necrosis was observed in liver
> 3. Necrosis not found in liver
>
> When I search for "Necrosis not found" I need to get only the last sentence,
> but right now I get all 3 results.
>
> I am not able to work out which tokenizer and filters I need to
> apply in order to achieve this output.
>
> Awaiting reply
> Rajani Maski