Applying synonyms increase the data size from MB to GBs

Rajinimaski
Applying synonyms increased the data size from 28 MB to 10.3 GB.

Before enabling synonyms on a field, the data size was 28 MB. Now, after
applying synonyms, I see that the data folder size has increased to 10.3 GB.

Here is the schema field type for that field:


<fieldType name="textBODY" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <filter class="solr.SynonymFilterFactory" synonyms="BODYTaxonomy.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="ObsTaxo.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="MTaxonomy.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="MicTaxo.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="SpTaxonomy.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="ParameterTaxonomy.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="STaxo.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

None of the attached synonym files is larger than 200 KB.


What might be the reason for this? Are any config changes needed?



Regards

Rajani

Re: Applying synonyms increase the data size from MB to GBs

Gora Mohanty-3
On Mon, Jun 6, 2011 at 10:34 AM, rajini maski <[hidden email]> wrote:

> Applying synonyms increased the data size from 28 mb to 10.3 gb
>
>   Before enabling synonyms to the a field , the data size was 28mb.  Now ,
> after applying synonyms I see that data folder size has increased to 10.3
> gb.
>
> Attached is schema field type for that field:
>
>
>  <fieldType name="textBODY" class="solr.TextField"
> positionIncrementGap="100" >
>      <analyzer>
>        <filter class="solr.SynonymFilterFactory"
> synonyms="BODYTaxonomy.txt" ignoreCase="true" expand="true"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="ObsTaxo.txt"
> ignoreCase="true" expand="true"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="MTaxonomy.txt"
> ignoreCase="true" expand="true"/>
[...]

Could you explain what you are trying to do with multiple SynonymFilterFactory
filters applied to the field?

Regards,
Gora

Re: Applying synonyms increase the data size from MB to GBs

Rajinimaski
I have flat files (synonym text files), each up to 200 KB. Merging all of
them made the combined text file huge, and I wanted to maintain them
separately. So, to apply all those synonyms to the same field type, I
created one filter tag for each synonym text file.

Is that not the right way to do it?

Is there a way to apply all those files in a single tag, separated by some
delimiter?

Like this:

<fieldType name="textBODY" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <filter class="solr.SynonymFilterFactory" synonyms="BODYTaxonomy.txt, ClinicalObs.txt, MicTaxo.txt, SPTaxo.txt" ignoreCase="true" expand="true"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>




Rajani


On Mon, Jun 6, 2011 at 11:01 AM, Gora Mohanty <[hidden email]> wrote:

> On Mon, Jun 6, 2011 at 10:34 AM, rajini maski <[hidden email]>
> wrote:
> > Applying synonyms increased the data size from 28 mb to 10.3 gb
> >
> >   Before enabling synonyms to the a field , the data size was 28mb.  Now
> ,
> > after applying synonyms I see that data folder size has increased to 10.3
> > gb.
> >
> > Attached is schema field type for that field:
> >
> >
> >  <fieldType name="textBODY" class="solr.TextField"
> > positionIncrementGap="100" >
> >      <analyzer>
> >        <filter class="solr.SynonymFilterFactory"
> > synonyms="BODYTaxonomy.txt" ignoreCase="true" expand="true"/>
> >       <filter class="solr.SynonymFilterFactory" synonyms="ObsTaxo.txt"
> > ignoreCase="true" expand="true"/>
> >       <filter class="solr.SynonymFilterFactory" synonyms="MTaxonomy.txt"
> > ignoreCase="true" expand="true"/>
> [...]
>
> Could you explain what you are trying to do with multiple
> SynonymFilterFactory
> filters applied to the field?
>
> Regards,
> Gora
>

Re: Applying synonyms increase the data size from MB to GBs

pravesh
Since you are using expand="true", every time a matching synonym entry is found, the analyzer expands the term into all of its synonyms and writes every one of them to the index. This can make the index grow considerably.
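
For contrast, here is a rough sketch of the opposite setting. With expand="false", Solr reduces each line of equivalent synonyms to the first term on that line instead of indexing all of them, so fewer tokens reach the index. The field type name textBODYReduced is hypothetical, only one of the synonym files is shown, and explicit "a => b" mappings in a synonym file are not affected by this attribute. Whether the reduced form still matches the searches you need is something to check on the analysis page.

```xml
<!-- Sketch only: "textBODYReduced" is a hypothetical name, not from the thread. -->
<fieldType name="textBODYReduced" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain runs at index time and at query time,
       so queries are reduced to the same first term as the indexed text. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand="false": equivalent synonyms collapse to the first entry of each line -->
    <filter class="solr.SynonymFilterFactory" synonyms="BODYTaxonomy.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

A change like this only takes effect for documents indexed after the schema is updated, so the existing data would need to be reindexed.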

Re: Applying synonyms increase the data size from MB to GBs

iorixxx
In reply to this post by Rajinimaski
> Is there a way where in I can apply all those file to same
> tag with some
> delimiter separated?
>
> like this:
>         <filter
> class="solr.SynonymFilterFactory"
> synonyms="BODYTaxonomy.txt
> , ClinicalObs.txt, MicTaxo.txt, SPTaxo.txt"
> ignoreCase="true"
> expand="true"/>


Yes, you can pass multiple text files, separated by commas, to the synonyms parameter.

synonyms="BODYTaxonomy.txt,ClinicalObs.txt,MicTaxo.txt,SPTaxo.txt"
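
Applied to the field type from the original post, that could look roughly like the sketch below. It assumes the seven file names from the first message and keeps the other filters as they were posted; the tokenizer is listed first only because that is the conventional layout.

```xml
<!-- Sketch only: one synonym filter loading all seven files from the original post. -->
<fieldType name="textBODY" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Comma-separated list, no spaces around the commas -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="BODYTaxonomy.txt,ObsTaxo.txt,MTaxonomy.txt,MicTaxo.txt,SpTaxonomy.txt,ParameterTaxonomy.txt,STaxo.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

One possible side benefit: in a chain of separate synonym filters, each later filter also sees the terms injected by the earlier ones, so a chain may expand a term further than a single merged filter would. The files can still be maintained separately on disk.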

Re: Applying synonyms increase the data size from MB to GBs

Erick Erickson
Have you considered query-time expansion rather than index-time expansion?
In general this will lead to more complex queries, but smaller indexes.
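
A minimal sketch of what query-time expansion could look like, assuming the same files and filters as in the original post. The field type name textBODYQuery is hypothetical. The index analyzer has no synonym filter, so each document is stored with only its own terms, and the synonyms are added to the query instead.

```xml
<!-- Sketch only: "textBODYQuery" is a hypothetical name, not from the thread. -->
<fieldType name="textBODYQuery" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: no synonym expansion, so the index stays small -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <!-- Query time: each query term is expanded into its synonyms -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="BODYTaxonomy.txt,ObsTaxo.txt,MTaxonomy.txt,MicTaxo.txt,SpTaxonomy.txt,ParameterTaxonomy.txt,STaxo.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

The existing documents would have to be reindexed for the smaller index to materialize. Multi-word synonyms are a known weak spot of query-time expansion, because the query parser splits on whitespace before the analyzer runs, so it is worth testing a few such entries first.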

Take a look at the analysis page available from the admin page to see exactly
what happens.

What is the high-level problem you're trying to solve? Having this huge an
expansion in index size is pretty unusual, and I'm wondering if there might be
another approach to the problem...

Best
Erick

On Mon, Jun 6, 2011 at 6:19 AM, Ahmet Arslan <[hidden email]> wrote:

>> Is there a way where in I can apply all those file to same
>> tag with some
>> delimiter separated?
>>
>> like this:
>>         <filter
>> class="solr.SynonymFilterFactory"
>> synonyms="BODYTaxonomy.txt
>> , ClinicalObs.txt, MicTaxo.txt, SPTaxo.txt"
>> ignoreCase="true"
>> expand="true"/>
>
>
> Yes, you can perfectly feed multiple text files separated by comma to synonyms parameter.
>
> synonyms="BODYTaxonomy.txt,ClinicalObs.txt,MicTaxo.txt,SPTaxo.txt"
>