
size of synonyms.txt


Bernd Fehling
While trying some synonyms.txt files I noticed a huge increase of heap usage.

synonyms_1.txt --> 6645 lines (2826104 bytes in size)
results in 66364 entries in SynonymMap with 730MB heap usage.
Startup time about 2 minutes.

synonyms_2.txt --> 6645 lines (5384884 bytes in size)
results in 115168 entries in SynonymMap with 3.3GB heap usage.
Startup time about 4 minutes.
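Back-of-the-envelope, the measurements above come out to roughly 11 KB of heap per SynonymMap entry for the first file and about 30 KB per entry for the second, which is enormous compared to the bytes-on-disk per line. A sketch of that arithmetic (class name is mine, the numbers are the ones reported above):

```java
// Rough per-entry heap cost implied by the measurements in this post.
// This is only an illustration of how disproportionate the growth is,
// not a measurement of SynonymMap internals.
public class SynonymHeapEstimate {
    public static void main(String[] args) {
        long entries1 = 66_364, heap1 = 730L * 1024 * 1024;                   // 730 MB
        long entries2 = 115_168, heap2 = (long) (3.3 * 1024 * 1024 * 1024);   // 3.3 GB
        System.out.printf("map 1: ~%d KB/entry%n", heap1 / entries1 / 1024);  // ~11 KB
        System.out.printf("map 2: ~%d KB/entry%n", heap2 / entries2 / 1024);  // ~30 KB
    }
}
```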


How large is your synonyms.txt?


Are there any limitations (e.g. file size, number of synonyms, ...)?


How do you deal with _really_ large numbers of synonyms?


To the experts:
Why not use the synonyms directly from the file? Just because memory is faster?


Regards,
Bernd

Re: size of synonyms.txt

Robert Muir
On Wed, Jun 22, 2011 at 10:14 AM, Bernd Fehling
<[hidden email]> wrote:

> [...]

Hi,

I think we should look at implementing synonyms with an FST, to reduce
the RAM usage.
I also think this would make it easier for us to minimize the number
of captureState/restoreState calls the filter does, because an FST would
be a more natural way to handle all the multi-word cases... this could
actually speed up the analysis time for this filter.
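As a rough, stdlib-only sketch of why a shared-prefix automaton helps here (this is my own illustration, not Lucene's actual FST code): a trie over token sequences shares common prefixes among multi-word rules, so the map stores each prefix once instead of a full String key per rule, and matching becomes a single left-to-right walk that tracks the longest rule seen so far.

```java
import java.util.*;

// Minimal sketch (names are mine, not Lucene's API): a trie keyed on
// token sequences. Multi-word rules like "new york" and "new york city"
// share the "new" -> "york" path, which is the prefix-sharing idea
// behind storing synonyms in an FST.
public class SynonymTrie {
    private final Map<String, SynonymTrie> children = new HashMap<>();
    private List<String> outputs; // non-null only on nodes that end a rule

    public void add(List<String> input, String output) {
        SynonymTrie node = this;
        for (String token : input) {
            node = node.children.computeIfAbsent(token, t -> new SynonymTrie());
        }
        if (node.outputs == null) node.outputs = new ArrayList<>();
        node.outputs.add(output);
    }

    // Longest rule matching at position `start`, or null if none matches.
    public List<String> longestMatch(List<String> tokens, int start) {
        SynonymTrie node = this;
        List<String> best = null;
        for (int i = start; i < tokens.size(); i++) {
            node = node.children.get(tokens.get(i));
            if (node == null) break;
            if (node.outputs != null) best = node.outputs;
        }
        return best;
    }

    public static void main(String[] args) {
        SynonymTrie trie = new SynonymTrie();
        trie.add(List.of("new", "york"), "nyc");
        trie.add(List.of("new", "york", "city"), "nyc");
        System.out.println(trie.longestMatch(List.of("new", "york", "city"), 0)); // [nyc]
    }
}
```

A real FST goes further than this trie by also sharing suffixes and packing the arcs into a byte array, but the lookup pattern is the same single walk, which is why it can avoid much of the per-candidate captureState/restoreState work.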

Re: size of synonyms.txt

project2501
I once tried to load the WordNet synsets as a synonym file and it was
prohibitively slow and unusable. FYI.

On 06/22/2011 12:23 PM, Robert Muir wrote:

> [...]


Re: size of synonyms.txt

Bernd Fehling
In reply to this post by Robert Muir

> [...]

Wow, you can read between the lines ;-)
That's exactly what I have in mind.