Developing experimental "more advanced" analyzers

7 messages

Developing experimental "more advanced" analyzers

Christian Becker
Hi There,

I'm new to Lucene (in fact I'm interested in Elasticsearch, but in this case
it's related to Lucene) and I want to run some experiments with enhanced
analyzers.

I have an external linguistic component which I want to connect to
Lucene / Elasticsearch. Before I produce a bunch of useless code, I
want to make sure that I'm going the right way.

The linguistic component needs at least a whole sentence as input (at best
it would be the whole text at once).

So as far as I can see, I would need to create a custom Analyzer and
override "createComponents" and "normalize".

Is that correct, or am I on the wrong track?

Best,
Chris

Re: Developing experimental "more advanced" analyzers

Robert Muir
On Mon, May 29, 2017 at 8:36 AM, Christian Becker
<[hidden email]> wrote:

> [...]
> The linguistic component needs at least a whole sentence as Input (at best
> it would be the whole text at once).
> [...]

There is a base class for tokenizers that want to see
sentences-at-a-time in order to divide into words:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java#L197-L201

There are two examples that use it in the test class:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestSegmentingTokenizerBase.java#L145
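
For illustration, a minimal subclass modeled on the whole-sentence example in that test class might look like the following sketch. It emits each sentence as a single token; a real implementation would hand the sentence to the linguistic component in setNextSentence() and emit its results one at a time from incrementWord(). The protected buffer and offset fields come from SegmentingTokenizerBase and should be checked against the linked source for your Lucene version.

```java
import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

// Emits each whole sentence as one token. The base class handles
// buffering and calls setNextSentence() once per sentence, then
// incrementWord() repeatedly until it returns false.
public final class WholeSentenceTokenizer extends SegmentingTokenizerBase {
  private int sentenceStart, sentenceEnd;
  private boolean hasSentence;

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public WholeSentenceTokenizer() {
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    // Offsets point into the base class's internal buffer.
    this.sentenceStart = sentenceStart;
    this.sentenceEnd = sentenceEnd;
    hasSentence = true;
  }

  @Override
  protected boolean incrementWord() {
    if (!hasSentence) {
      return false; // no more tokens for this sentence
    }
    hasSentence = false;
    clearAttributes();
    termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
    offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                        correctOffset(offset + sentenceEnd));
    return true;
  }
}
```

A linguistic component that tokenizes within sentences would simply buffer its results in setNextSentence() and drain them from incrementWord().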

Re: Developing experimental "more advanced" analyzers

Christian Becker
I'm sorry - I forgot to mention that my intention is to have linguistic
annotations like stems and maybe part-of-speech information. Tokenization
is certainly one of the things I want to do.


Re: Developing experimental "more advanced" analyzers

Chris Brown-3
If you used our products, which have Elastic plugins, POS, stems and
Leminisation, it would be much easier.

Kind Regards

Chris
VP International
E: [hidden email]
T: +44 208 622 2900
M: +44 7796946934
USA Number: +16173867107
Lakeside House, 1 Furzeground Way, Stockley Park, Middlesex, UB11 1BD, UK



Re: Developing experimental "more advanced" analyzers

Chris Collins-3
I am glad that basistech has tools to bring lemmings back :-}  I am guessing you also have lemmati[z|s]ation.


> On May 29, 2017, at 12:37 PM, Chris Brown <[hidden email]> wrote:
>
> If you used our products which have Elastic plugins, POS, Stems and
> Leminisation it would be much easier.
> [...]


Re: Developing experimental "more advanced" analyzers

Christian Becker
In this case it's more fun not going the easy way 😉


Chris Collins <[hidden email]> wrote on Mon, 29 May
2017 at 21:41:

> I am glad that basistech has tools to bring lemmings back :-}  I am
> guessing you also have lemmati[z|s]ation.
> [...]

RE: Developing experimental "more advanced" analyzers

Uwe Schindler
In reply to this post by Christian Becker
Hi,

as you are using Elasticsearch, there is no need to implement an Analyzer instance. In general this is never needed in plain Lucene either, as the CustomAnalyzer class uses a builder pattern to construct an analyzer the same way Elasticsearch or Solr do.
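
A sketch of that builder pattern, assuming the lucene-analysis-common module is on the classpath ("standard", "lowercase" and "stop" are the usual SPI factory names; check the factory names available in your version):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class BuildAnalyzer {
  // Assembles an analysis chain from registered factories by name,
  // much like an analyzer definition in Elasticsearch or Solr config.
  public static Analyzer build() throws IOException {
    return CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop")
        .build();
  }
}
```

Custom tokenizers and filters become available to this builder (and to Elasticsearch) once their factory classes are registered via the SPI mechanism.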

For your use case you need to implement a custom Tokenizer and/or several TokenFilters. In addition you need to create the corresponding factory classes and bundle everything as an Elasticsearch plugin; I'd suggest asking on the Elasticsearch mailing lists about that part. After that you can define your analyzer in the Elasticsearch mapping/index config.

The Tokenizer and TokenFilters can be implemented, e.g., as Robert Muir described; the sentence handling can be done as a SegmentingTokenizerBase subclass. Keep in mind that many tasks can already be done with existing TokenFilters and/or Tokenizers.

Lucene has no index support for POS tags; they are only used in the analysis chain. To somehow add them to the index, you may use a TokenFilter as the last stage that appends the POS tag to the term (e.g., term "Windmill" with pos "subject" could be combined in that last TokenFilter into a term "Windmill#subject" and indexed like that). For keeping track of POS tags during analysis (between the tokenizers and token filters), you may need to define custom attributes.
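
A sketch of those two pieces: a custom attribute carrying the POS tag between stages, and a last-stage filter folding it into the term. PartOfSpeechAttribute and its methods are invented for illustration (an earlier stage, e.g. the linguistic tokenizer, would have to populate it); only the Lucene base classes are real. Lucene locates the implementation by the naming convention InterfaceName + "Impl" in the same package.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;

// Hypothetical custom attribute carrying a POS tag through the chain.
interface PartOfSpeechAttribute extends Attribute {
  void setPartOfSpeech(String pos);
  String getPartOfSpeech();
}

// Implementation found via the "Impl" naming convention.
class PartOfSpeechAttributeImpl extends AttributeImpl implements PartOfSpeechAttribute {
  private String pos;
  @Override public void setPartOfSpeech(String pos) { this.pos = pos; }
  @Override public String getPartOfSpeech() { return pos; }
  @Override public void clear() { pos = null; }
  @Override public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
  }
  @Override public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(PartOfSpeechAttribute.class, "partOfSpeech", pos);
  }
}

// Last stage: folds the POS tag into the indexed term,
// e.g. "Windmill" + "subject" -> "Windmill#subject".
public final class PosSuffixFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);

  public PosSuffixFilter(TokenStream in) { super(in); }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    String pos = posAtt.getPartOfSpeech();
    if (pos != null) {
      termAtt.append('#').append(pos);
    }
    return true;
  }
}
```

Because every TokenFilter shares one AttributeSource with its input, the tag set by an upstream stage is visible here without any extra plumbing.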

Check the UIMA analysis module for more information on how to do this.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: [hidden email]
