Linguistically-enhanced indexing for Lucene

Felipe Sánchez Martínez
The Transducens Group (http://transducens.dlsi.ua.es) at the University
of Alicante (http://www.ua.es) has developed a tool that
allows the Lucene search engine to use morphological information
while indexing and then to process smarter queries in which
morphological attributes can be used to specify query terms.

To that end, the tool makes use of morphological analyzers and
dictionaries developed for the open-source machine translation platform
Apertium (http://apertium.org) and, optionally, the part-of-speech
taggers developed for it. Currently there are morphological
dictionaries available for Spanish, Catalan, Galician, Portuguese,
Aranese, Romanian, French and English. In addition, new dictionaries
are being developed for Esperanto, Occitan, Basque, Swedish, Danish,
Welsh, Polish and Italian, among others; we hope more language pairs
will be added to the Apertium machine translation platform in the
near future.

We are interested in releasing this tool as open source, and we think
that the best way to do that would be to integrate it into Lucene's
contrib folder, like other third-party tools. Who is responsible
for that? To whom should we address this request?

Thank you very much.

========================== How it works ==========================

Indexing documents through this new framework involves the following
steps:

1. The texts to index must be analyzed using the morphological analyzer
and (optionally) the part-of-speech taggers of the Apertium machine
translation platform. Apertium supports files in plain text, rtf, odt,
sxw, html and doc.

2. The documents are then indexed as usual, using a Lucene analyzer
developed ad hoc to interpret the previously analyzed documents.
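
As a rough illustration of step 2, here is a minimal Python sketch of how
preprocessed text could be turned into per-word records. The format the
tool actually reads is not described in this message; the sketch assumes
Apertium's stream format, where each analyzed word is written as
^surface/lemma<tag1><tag2>$, and it keeps only the first analysis when
several are listed. The names Token and parse_lexical_units, and the
choice to join tags with dots, are illustrative assumptions.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    sf: str    # superficial form, lowercased
    lem: str   # lemma, lowercased
    tags: str  # morphological tags joined with dots, e.g. "vblex.inf"


# One lexical unit (assumed format): ^surface/analysis[/analysis...]$
UNIT = re.compile(r"\^([^/$]+)/([^$]+)\$")
# One analysis: a lemma followed by zero or more <tag> groups.
ANALYSIS = re.compile(r"([^<]*)((?:<[^>]+>)*)")
TAG = re.compile(r"<([^>]+)>")


def parse_lexical_units(analyzed: str) -> List[Token]:
    """Extract (sf, lem, tags) records from analyzed text."""
    tokens = []
    for surface, analyses in UNIT.findall(analyzed):
        first = analyses.split("/")[0]  # keep only the first analysis
        lemma, raw_tags = ANALYSIS.match(first).groups()
        tags = ".".join(TAG.findall(raw_tags))
        tokens.append(Token(surface.lower(), lemma.lower(), tags))
    return tokens


if __name__ == "__main__":
    sample = "^Blair/Blair<np><ant>$ ^does/do<vbdo><pri>$"
    for token in parse_lexical_units(sample):
        print(token)
```

A real analyzer would also have to decide what to do with ambiguous
words (several analyses) when the part-of-speech tagger is not used.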

During indexing, the following morphological information is obtained for
each word: its superficial form (the word as it appears in the unanalyzed
text), its lemma, and relevant morphological information such as
part of speech and verb tense (where appropriate). The following example
shows what is stored in the index for the English phrase
"Blair does not resign":

* "Blair"
   - Superficial form: blair
   - Lemma: blair
   - Morphological information: np.ant (proper noun, person name)

* "does"
   - Superficial form: does
   - Lemma: do
   - Morphological information: vbdo.pri (auxiliary verb, present tense)

* "not"
   - Superficial form: not
   - Lemma: not
   - Morphological information: adv (adverb)

* "resign"
   - Superficial form: resign
   - Lemma: resign
   - Morphological information: vblex.inf (verb, infinitive)
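
One way to picture this is as three parallel fields that share token
positions. The Python sketch below only illustrates that idea and is not
the tool's implementation; the field names sf, lem and tags are borrowed
from the query prefixes, build_index is a hypothetical helper, and the
entry for "not" is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (superficial form, lemma, tags) for "Blair does not resign".
# The surface form and lemma of "not" are assumed to be "not" here.
ANALYZED: List[Tuple[str, str, str]] = [
    ("blair", "blair", "np.ant"),
    ("does", "do", "vbdo.pri"),
    ("not", "not", "adv"),
    ("resign", "resign", "vblex.inf"),
]


def build_index(tokens: List[Tuple[str, str, str]]) -> Dict[str, Dict[str, List[int]]]:
    """Map each field to {term: [positions]}; positions align across fields."""
    index: Dict[str, Dict[str, List[int]]] = {
        "sf": defaultdict(list),
        "lem": defaultdict(list),
        "tags": defaultdict(list),
    }
    for position, (sf, lem, tags) in enumerate(tokens):
        index["sf"][sf].append(position)
        index["lem"][lem].append(position)
        index["tags"][tags].append(position)
    return index


if __name__ == "__main__":
    index = build_index(ANALYZED)
    # The same position can be reached through any of the three fields.
    print(index["sf"]["does"], index["lem"]["do"], index["tags"]["vbdo.pri"])
```

Because positions are shared, a term found through one field can be
related to its neighbours found through another.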

To search, the syntax accepted by the Lucene query parser can be used,
provided that a WhitespaceAnalyzer is used. A query can refer to
different kinds of information through the following field prefixes:
- "sf:" for the superficial form (e.g. "sf:resign")
- "lem:" for the lemma (e.g. "lem:resign")
- "tags:" for the morphological information (e.g. "tags:vblex.inf")

The following example illustrates the type of query that can be used
to search for a specific document:

- Query: "tags:np.loc lem:airline sf:with lem:destination tags:np.loc"

This query searches for documents that mention one or more airlines
flying from one place to another, for example "Argentine airlines with
destination Madrid" or "British airlines with destination New York".
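
To make the example concrete, the Python sketch below splits such a
query into (field, term) pairs and checks whether they match a run of
consecutive tokens. Treating the query as an ordered sequence is an
assumption drawn from the example phrases; Lucene's query parser may
combine whitespace-separated clauses differently. The token analysis
(including tags such as n.pl and pr) is hand-written for illustration,
and parse_query and matches_sequence are hypothetical helpers.

```python
from typing import Dict, List, Tuple

Token = Dict[str, str]  # keys: "sf", "lem", "tags"

QUERY = "tags:np.loc lem:airline sf:with lem:destination tags:np.loc"

# Hand-written analysis of "Argentine airlines with destination Madrid".
# Only the values the query touches come from the message; the rest is invented.
TOKENS: List[Token] = [
    {"sf": "argentine", "lem": "argentine", "tags": "np.loc"},
    {"sf": "airlines", "lem": "airline", "tags": "n.pl"},
    {"sf": "with", "lem": "with", "tags": "pr"},
    {"sf": "destination", "lem": "destination", "tags": "n.sg"},
    {"sf": "madrid", "lem": "madrid", "tags": "np.loc"},
]


def parse_query(query: str) -> List[Tuple[str, str]]:
    """Split on whitespace, then split each clause at its first colon."""
    clauses = []
    for clause in query.split():
        field, _, term = clause.partition(":")
        if field not in ("sf", "lem", "tags") or not term:
            raise ValueError(f"unsupported clause: {clause!r}")
        clauses.append((field, term))
    return clauses


def matches_sequence(clauses: List[Tuple[str, str]], tokens: List[Token]) -> bool:
    """Return True if the clauses match some run of consecutive tokens."""
    n = len(clauses)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(tok[field] == term for (field, term), tok in zip(clauses, window)):
            return True
    return False


if __name__ == "__main__":
    print(matches_sequence(parse_query(QUERY), TOKENS))
```

Each clause constrains a different field of the token at its position,
which is what lets one query mix literal words, lemmas and tags.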


--
Felipe Sánchez Martínez <[hidden email]>
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Linguistically-enhanced indexing for Lucene

Grant Ingersoll-2
The best way to do this is to create a patch and attach it to a JIRA  
issue.  http://wiki.apache.org/lucene-java/HowToContribute has the  
details.

Sounds like an interesting project.  What are the licensing terms for
Apertium?  On a side note, you might be interested in Mahout
(http://lucene.apache.org/mahout).

Cheers,
Grant

On Feb 12, 2008, at 6:25 AM, [hidden email] wrote:

> The Transducens Group (http://transducens.dlsi.ua.es) at the University
> of Alicante (http://www.ua.es) has developed a tool that
> allows the Lucene search engine to use morphological information
> while indexing and then process smarter queries in which
> morphological attributes can be used to specify query terms.

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Linguistically-enhanced indexing for Lucene

Felipe Sánchez Martínez
> The best way to do this is to create a patch and attach it to a JIRA  
> issue.  http://wiki.apache.org/lucene-java/HowToContribute has the  
> details.

OK, I will read it. Thanks.

> Sounds like an interesting project.  What are the licensing terms for
> Apertium?  On a side note, you might be interested in Mahout
> (http://lucene.apache.org/mahout)

Apertium is licensed under the GNU GPL license version 2.

regards
--
Felipe

Re: Linguistically-enhanced indexing for Lucene

Grant Ingersoll-2

On Feb 12, 2008, at 9:47 AM, [hidden email] wrote:

> Apertium is licensed under the GNU GPL license version 2.

OK, this means that the Jars can not be included in the contrib.  The  
way to handle this is to have the build script download them for the  
user.  See the contrib/db module for how it handles the Berkeley  
database.

-Grant

Re: Linguistically-enhanced indexing for Lucene

Felipe Sánchez Martínez
> >
> > Apertium is licensed under the GNU GPL license version 2.
>
> OK, this means that the Jars can not be included in the contrib.  The  
> way to handle this is to have the build script download them for the  
> user.  See the contrib/db module for how it handles the Berkeley  
> database.
>

Apertium is GPL, but the part that works with Lucene can be
Apache-compatible. It uses the Apertium dictionaries and some
Apertium tools to preprocess the files before indexing. The Java
classes that interpret the output of this preprocessing when
indexing are not part of the Apertium project.

--
Felipe

Re: Linguistically-enhanced indexing for Lucene

Grant Ingersoll-2
OK, we'll have to see the patch. Don't you just love licensing?  :-)

-Grant

On Feb 13, 2008, at 3:49 AM, [hidden email] wrote:

> Apertium is GPL, but the part that works with Lucene can be
> Apache-compatible. It uses the Apertium dictionaries and some
> Apertium tools to preprocess the files before indexing. The Java
> classes that interpret the output of this preprocessing when
> indexing are not part of the Apertium project.

Re: Linguistically-enhanced indexing for Lucene

DM Smith
I am very interested in Apertium, especially if it is possible to grow  
it for biblical Greek and Hebrew.

Licensing threads seem to generate more heat than light. I hope that  
my question won't. I develop code under many different licenses  
including GPL, and feel that following licenses properly is important.

With Apertium being GPL, not LGPL, the license is "viral".

I have a non-GPL library in which I'd like to use the code. Will the
code in contrib protect my code from the viral nature of the GPL?

About the only ways I have seen around this are (IANAL and I'm
probably very imprecise with the following):

1. Dual license with a GPL-compatible license, where the user is able to
choose the license.

2. Re-licensing: the owners have the right to license their code in any
fashion that they wish and can also grant a copy of the code under any
other license of their choice, even a GPL-incompatible one.

3. Plug-in, where the plug-in implements an "interface" of the
application's core code to adapt the GPLv2 library to it. Because the
plug-in implements the core application's interface, it is not viral,
but the plug-in is GPL.

-- DM


On Feb 13, 2008, at 6:44 AM, Grant Ingersoll wrote:

> OK, we'll have to see the patch. Don't you just love licensing?  :-)
>
> -Grant

Re: Linguistically-enhanced indexing for Lucene

Grant Ingersoll-2
This question is probably best answered by asking the Apertium folks,
asking a lawyer (sorry!), or reading some of the millions of opinions
out there on licensing...


On Feb 13, 2008, at 8:29 AM, DM Smith wrote:

> I have a non GPL library in which I'd like to use the code. Will the  
> code in contrib protect my code from the viral nature of the GPL?

Re: Linguistically-enhanced indexing for Lucene

Felipe Sánchez Martínez
> > I am very interested in Apertium, especially if it is possible to  
> > grow it for biblical Greek and Hebrew.
> >
> > Licensing threads seem to generate more heat than light. I hope that  
> > my question won't. I develop code under many different licenses  
> > including GPL, and feel that following licenses properly is important.
> >

Sure is!

> > With Apertium being GPL, not LGPL, the license is "viral".

Exactly

> > I have a non GPL library in which I'd like to use the code. Will the  
> > code in contrib protect my code from the viral nature of the GPL?

No. The code in contrib uses files preprocessed by Apertium (a GPL application)
but does not use any GPL code. They are different projects.

> >
> > About the only ways I have seen around this are (IANAL and I'm  
> > probably very imprecise with the following):
> > 1 Dual license with a GPL compatible license where the user is able  
> > to choose the license.
> >
> > 2 Re-licensing, the owners have the right to license their code in  
> > any fashion that they wish and can also grant a copy of the code  
> > under any other license of their choice, even a GPL incompatible one.

Apertium is currently licensed under the GPL. Only the owners of Apertium
could relicense it in this way; I do not think it will be relicensed.

> > 3 Plug-in, where the plug-in implements an "interface" of the  
> > application's core code to adapt the GPLv2  library to it. Because  
> > the plug-in implements the core application's interface it is not  
> > viral but the plugin is GPL.

 
If you have more questions or suggestions about Apertium (not the Lucene
development that uses files preprocessed by Apertium), please write to
[hidden email]

--
Felipe
