indexing synonyms / reducing the index size

indexing synonyms / reducing the index size

Pablo Gomes Ludermir
Hello all,

I know that we can expand a word into its synonyms with WordNet. I
was wondering if we could instead reduce the index size by indexing
one representative synonym in place of every word on the synonym list.

For instance, if "screen" shows up, I would like to replace it with
"monitor" (it is a stupid example, but it was the first thing that
crossed my mind). Thus, instead of having both entries in the index, I
would have only one.

I would then need to pre-process any queries as well, replacing the
words with the same synonyms. I was wondering if someone has already
done such a thing in an analyzer and could give me a little help.
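
To make the idea concrete, here is a minimal sketch in plain Java of the replacement step. It assumes a hand-built synonym table; the class and method names (SynonymCanonicalizer, addGroup, canonicalize) are hypothetical and are not part of Lucene. In a real setup this logic would probably sit in a token filter inside the analyzer, which is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps every word of a synonym group to one canonical term. */
class SynonymCanonicalizer {
    private final Map<String, String> canonical = new HashMap<>();

    /** Registers a group; the canonical term and all its synonyms map to the canonical term. */
    void addGroup(String canonicalTerm, String... synonyms) {
        String target = canonicalTerm.toLowerCase(Locale.ROOT);
        canonical.put(target, target);
        for (String synonym : synonyms) {
            canonical.put(synonym.toLowerCase(Locale.ROOT), target);
        }
    }

    /** Returns the canonical term for a token, or the lowercased token if it belongs to no group. */
    String canonicalize(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        return canonical.getOrDefault(key, key);
    }

    /** Applies canonicalize to each token, keeping order and length. */
    List<String> canonicalizeAll(List<String> tokens) {
        List<String> result = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            result.add(canonicalize(token));
        }
        return result;
    }
}
```

The same instance would have to be applied to both the document tokens and the query tokens; otherwise a search for "screen" would never match a document in which only "monitor" was indexed.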

My aim is to reduce the index as much as possible (I already have a
stemmer and a stopword filter in the analyzer). Could anyone point
out other ways to reduce the number of terms in an index?

The fact is that I would like to create "extra vectors" with my own
weighting scheme, and the algorithm is quite costly, so the fewer
terms I have, the better it performs.

Regards,
Pablo

--
Pablo Gomes Ludermir
[hidden email]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: indexing synonyms / reducing the index size

David Spencer
Pablo Gomes Ludermir wrote:

> Hello all,
>
> I know that we can expand a word to get its synonyms with Wordnet. I
> was wondering if we could reduce the index size by including a synonym
> instead of a word on the synonym list.
>
> For instance, if "screen" shows up, I would like to replace it by
> "monitor" (it is a stupid example, but it was the first thing that
> crossed my mind). Thus, instead of having both entries on the index, I
> would have only one.
>
> Thus, I would need to pre-process any queries, replacing the words by
> its synonyms as well. I was wondering if someone has done such a thing
> in an analyzer already and could give me a little help.

This is already done; I wrote it. See the bottom of this page and
search for "wordnet":
http://lucene.apache.org/java/docs/lucene-sandbox/

It runs in 2 phases:
[1] Parses some of Wordnet, stores synonyms in a Lucene index as a kind
of persistent Map. This is just run once.

[2] Query expansion, does things like expanding "monitor" to "monitor
screen". This runs against an unchanged index.
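
The expansion phase can be sketched roughly as follows. This is not the sandbox code: an in-memory map stands in for the synonym index built in phase [1], and the class and method names (SynonymQueryExpander, put, expand) are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the two phases: a synonym map plus query expansion. */
class SynonymQueryExpander {
    // Stand-in for the persistent word -> synonyms map (assumption: it fits in memory here).
    private final Map<String, List<String>> synonyms = new HashMap<>();

    /** Phase 1 stand-in: records the synonyms of one word. */
    void put(String word, String... syns) {
        synonyms.put(word.toLowerCase(Locale.ROOT), Arrays.asList(syns));
    }

    /** Phase 2: each query word is followed by its synonyms; duplicates are dropped, order is kept. */
    String expand(String query) {
        Set<String> terms = new LinkedHashSet<>();
        for (String word : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty or blank query yields one empty piece
            }
            terms.add(word);
            terms.addAll(synonyms.getOrDefault(word, Collections.<String>emptyList()));
        }
        return String.join(" ", terms);
    }
}
```

Note that this approach leaves the index untouched, so it makes queries larger rather than making the index smaller.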


>
> My aim is to reduce the index as much as possible (I already have a
> stemmer and a stopword filter on the analyzer). Could anyone point
> other ways to reduce the number of terms of an index?
>
> The fact is that I would like to create "extra vectors" with my own
> weighting scheme, and it is a quite costly algorithm, so the less
> terms I have the better it performs.
>
> Regards,
> Pablo
>


Re: indexing synonyms / reducing the index size

Luke Shannon
In reply to this post by Pablo Gomes Ludermir
Hi Pablo;

I handle synonyms in the Query rather than the Index. Whenever I build a
query I check to see if there is a synonym for each word, or a replacement
for the entire string the user is searching on. If there is (either or both
cases) I include all the synonyms/replacement strings applicable plus the
original word/string in the Query.

This keeps the index small (the synonyms are not in there), but it did
result in some queries exceeding the default max clause count for
BooleanQuery. I ended up having to increase that limit.
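
A rough sketch of why the clause count grows, in plain Java with no Lucene dependency: every original word and every synonym becomes one OR clause, so the total can pass the limit quickly. The class and method names here (ExpandedClauses, buildClauses) are hypothetical. As far as I know, Lucene of this era exposes the limit as the static BooleanQuery.setMaxClauseCount with a default of 1024, but please check that against your version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that counts the clauses a synonym-expanded query would need. */
class ExpandedClauses {
    /**
     * Collects one clause per original word and per synonym.
     * Throws IllegalStateException once the running total exceeds maxClauses,
     * mimicking a query that is rejected for having too many clauses.
     */
    static List<String> buildClauses(List<String> words,
                                     Map<String, List<String>> synonyms,
                                     int maxClauses) {
        List<String> clauses = new ArrayList<>();
        for (String word : words) {
            clauses.add(word);
            clauses.addAll(synonyms.getOrDefault(word, Collections.<String>emptyList()));
            if (clauses.size() > maxClauses) {
                throw new IllegalStateException(
                        "expanded query needs more than " + maxClauses + " clauses");
            }
        }
        return clauses;
    }
}
```

Checking the count before building the real query makes it possible to either raise the limit deliberately or cut the synonym list down, rather than finding out at search time.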

Luke

----- Original Message -----
From: "Pablo Gomes Ludermir" <[hidden email]>
To: "Lucene user list" <[hidden email]>
Sent: Wednesday, May 04, 2005 5:38 PM
Subject: indexing synonyms / reducing the index size



Re: indexing synonyms / reducing the index size

Andrew Boyd
In reply to this post by Pablo Gomes Ludermir
I have done the same as Luke, but I needed Lucene 1.9rc1 to accomplish it.
I tried it with 1.4.3, but the QueryParser could not handle it.

Andrew

-----Original Message-----
From: Luke Shannon <[hidden email]>
Sent: May 5, 2005 8:54 AM
To: [hidden email], Pablo Gomes Ludermir <[hidden email]>
Subject: Re: indexing synonyms / reducing the index size
