[lucy-user] Synonyms with Lucy

Nils Diewald
Hello,
I'm working with Lucene as well as with Lucy, and I'm wondering whether
it is possible to store multiple terms with independent offset
information in Lucy, as is possible with Lucene.

Example:
The string "This is an example" should be indexed with the
offset-information:
* this,0-4
* is,5-7
* an,8-10
* example,11-18
* examplification,11-18
so that when the user searches for "examplification", the highlighter
highlights the synonym "example".
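
A minimal sketch of the idea in plain Python, not Lucy or Lucene code: the `Token` type and the `highlight` helper are hypothetical names, offsets are treated as start-inclusive and end-exclusive, and matching spans are assumed not to overlap.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # inclusive character offset into the source string
    end: int    # exclusive character offset into the source string


def highlight(source: str, tokens: list[Token], term: str) -> str:
    """Wrap every span whose token text equals `term` in [brackets]."""
    # A set removes duplicate spans; sorting lets us walk the string once.
    spans = sorted({(t.start, t.end) for t in tokens if t.text == term})
    out, cursor = [], 0
    for start, end in spans:
        out.append(source[cursor:start])
        out.append("[" + source[start:end] + "]")
        cursor = end
    out.append(source[cursor:])
    return "".join(out)


source = "This is an example"
tokens = [
    Token("this", 0, 4),
    Token("is", 5, 7),
    Token("an", 8, 10),
    Token("example", 11, 18),
    Token("examplification", 11, 18),  # synonym sharing the offsets of "example"
]

print(highlight(source, tokens, "examplification"))
```

Because the synonym carries the offsets of "example", a hit on "examplification" marks the text that is actually in the document.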

I'd be glad of any hints in the right direction. Thank you all for this
awesome tool!
Best, Nils

Re: [lucy-user] Synonyms with Lucy

Nathan Kurz
Hi Nils --

I don't think this is directly supported, but it seems like a good addition.

Another approach might be to expand the synonyms in the query
rather than in the index. That is, expand a search for
[examplification] to [example OR examplification], which should
already highlight correctly.

You'd be trading a less efficient query for a smaller index.
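
A rough sketch of query-side expansion in plain Python. The `SYNONYMS` table and `expand_query` are hypothetical, and the result is only a query string; it assumes the query parser you hand it to accepts `OR` and parentheses.

```python
# Hypothetical synonym table: query term -> extra terms to search for.
SYNONYMS = {"examplification": ["example"]}


def expand_query(query: str, synonyms: dict[str, list[str]]) -> str:
    """Rewrite each term that has synonyms as an OR group."""
    clauses = []
    for term in query.lower().split():
        variants = [term] + [s for s in synonyms.get(term, []) if s != term]
        if len(variants) == 1:
            clauses.append(term)  # no synonyms: leave the term untouched
        else:
            clauses.append("(" + " OR ".join(variants) + ")")
    return " ".join(clauses)


print(expand_query("examplification", SYNONYMS))
```

The index stays unchanged; the cost moves to search time, where every expanded term adds another clause to evaluate.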

--nate

On Thu, Jul 4, 2013 at 6:07 AM, Nils Diewald <[hidden email]> wrote:

> Hello,
> I'm working with Lucene as well as with Lucy, and I'm wondering whether
> it is possible to store multiple terms with independent offset
> information in Lucy, as is possible with Lucene.
>
> Example:
> The string "This is an example" should be indexed with the
> offset-information:
> * this,0-4
> * is,5-7
> * an,8-10
> * example,11-18
> * examplification,11-18
> so that when the user searches for "examplification", the highlighter
> highlights the synonym "example".
>
> I'd be glad of any hints in the right direction. Thank you all for this
> awesome tool!
> Best, Nils

Re: [lucy-user] Synonyms with Lucy

Nils Diewald
Hi Nathan,

That's a good idea for synonyms!

I think that independent offsets would be a good addition to core, too
(if they are not already possible).
This would, for example, also allow for compound tokenization (like
https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html).
So if you have the word "Donaudampfschiff", you could index
"schiff" as well as "Donaudampfschiff" and, if you like, give
"schiff" the complete offset of "Donaudampfschiff" (as a
"Donaudampfschiff" is just a special type of "schiff").
This wouldn't be feasible with expanded queries, as the number of
possible "schiff" compounds is unlimited.
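
A simplified sketch of dictionary-based decompounding in plain Python. It is not the Lucene filter itself: `Token`, `decompound`, and the tiny dictionary are hypothetical. Every dictionary word found inside the compound is emitted with the offsets of the whole compound.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    start: int
    end: int


def decompound(token: Token, dictionary: set[str], min_len: int = 4) -> list[Token]:
    """Return the original token plus each dictionary word found inside it.

    Sub-words inherit the full offsets of the compound, so a hit on
    "schiff" highlights all of "Donaudampfschiff".
    """
    word = token.text.lower()
    found: list[str] = []
    for i in range(len(word)):
        for j in range(i + min_len, len(word) + 1):
            part = word[i:j]
            if part in dictionary and part != word and part not in found:
                found.append(part)
    return [token] + [Token(part, token.start, token.end) for part in found]


dictionary = {"donau", "dampf", "schiff"}  # hypothetical word list
parts = decompound(Token("Donaudampfschiff", 0, 16), dictionary)
for t in parts:
    print(t.text, t.start, t.end)
```

A plain search for "schiff" then matches any compound containing it, with no need to enumerate the compounds at query time.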

Best,
Nils

On 04.07.2013 22:41, Nathan Kurz wrote:

> Hi Nils --
>
> I don't think this is directly supported, but it seems like a good addition.
>
> Another approach might be to expand to the synonyms in the query
> rather than in the index.   That is, expand a search for
> [examplification]  to [example OR examplification], which should
> already highlight correctly.
>
> You'd be trading a less efficient query for a small index.
>
> --nate


Re: [lucy-user] Synonyms with Lucy

Nick Wellnhofer
It's easy to implement user-specified synonyms with a custom Analyzer. All you have to do is to map tokens to a synonym with a hash table. You can find some information on how to implement your own Analyzer in the mailing list archives.
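
As a rough illustration of the hash-table mapping in plain Python, not a Lucy Analyzer: the `CANONICAL` table, `tokenize`, and `normalize` are hypothetical. The same mapping is assumed to run at both index time and search time, so a synonym and its canonical form end up as the same term.

```python
import re

# Hypothetical table: each synonym maps to one canonical term.
CANONICAL = {"examplification": "example", "instance": "example"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def normalize(tokens: list[str], canonical: dict[str, str]) -> list[str]:
    """Replace each token by its canonical synonym, if it has one."""
    return [canonical.get(t, t) for t in tokens]


indexed_terms = normalize(tokenize("This is an example"), CANONICAL)
query_terms = normalize(tokenize("Examplification"), CANONICAL)

print(indexed_terms)
print(query_terms)
```

Only the token text changes in this sketch, so any offsets attached to the original token would still point at the text that appears in the document.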

Lucy's SnowballStopFilter already supports custom stoplists and could be leveraged to map synonyms with just a few changes. What do the Lucy developers think about supporting synonyms in core?

Nick


Re: [lucy-user] Synonyms with Lucy

Marvin Humphrey
On Thu, Jul 4, 2013 at 3:09 PM, Nick Wellnhofer <[hidden email]> wrote:
> It's easy to implement user-specified synonyms with a custom Analyzer. All
> you have to do is to map tokens to a synonym with a hash table. You can find
> some information on how to implement your own Analyzer in the mailing list
> archives.

The (non-public) Token class's position increment is also designed to support
multiple terms at the same position.  It defaults to 1, but if you set it to
0, the next term gets put at the same position.
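
A small sketch of that behaviour in plain Python. It follows the description above, where the increment on one token decides where the next token lands. `assign_positions` and the `(text, pos_inc)` tuples are hypothetical, and Lucy's internals may differ.

```python
def assign_positions(tokens: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Turn (text, position_increment) pairs into (text, position) pairs.

    Each token takes the current position; its increment then advances
    the position for the token that follows it.
    """
    position = 0
    placed = []
    for text, pos_inc in tokens:
        placed.append((text, position))
        position += pos_inc
    return placed


stream = [
    ("this", 1),
    ("is", 1),
    ("an", 1),
    ("example", 0),          # increment 0: the next term shares this position
    ("examplification", 1),
]

for text, position in assign_positions(stream):
    print(position, text)
```

With both terms at the same position, a phrase query such as "an examplification" can match as if the synonym were in the text.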

The advantage of handling the synonym expansion at index time is simplified
queries and streamlined performance at search-time.

> Lucy's SnowballStopFilter already supports custom stoplists and could be
> leveraged to map synonyms with just a few changes. What do the Lucy
> developers think about supporting synonyms in core?

I wish that we had completed compiled extension support by now.  This is the
kind of thing that it would be nice to see mature as a separately developed
extension under a different namespace, possibly going through multiple
iterations of API and implementation before taking on the backwards
compatibility burden that comes with putting something in core.

Since we're not there yet, I could see putting something under LucyX.

Marvin Humphrey