plugin analyzer

plugin analyzer

Robert Benea
I think it would be neat to have the NutchAnalyzer as a plugin too. From my
understanding, right now if I want to analyze text in a different way I need to
hack the Nutch source code. Having different plugins for different analyzers
would help: one application may use a Porter analyzer, another may use Snowball
for Italian, and so on. With the plugin approach these can coexist nicely.

The same goes for providing summaries. For instance, if we enable clustering,
the way the search results are summarized helps to produce meaningful clusters.

Let me know if you find this an attractive feature ;-). I can find some free
time and do the coding.

Cheers,
R.

Re: plugin analyzer

Jérôme Charron
> I think would be neat to have the NutchAnalyzer also as a plugin, from my
> understanding right now if I want to analyze in a different way, I need to
> hack the nutch source code, if we are going to have different plugins for
> different analyzers that will help. Some specific application may use porter
> analyzer, some other uses Snowball for Italian ..., with the plugin approach
> these will coexist nicely.
>
> Same thing would be for providing summaries, for instance if we enable
> clustering the way is summarized the search result helps to have meaningful
> clusters.
>
> Let me know if you find it as an attractive feature ;-), I can find some
> free-time and do the coding.


Yes, it is definitely an attractive feature!

I have recently committed support for multi-lingual analyzer plugins to the
trunk.
There is an Analyzer extension point, so you can develop your own analysis
plugins.
For now, the analyzer factory chooses a plugin based on the result of the
language identifier.
I have committed two analysis plugins, one for French and one for German.
They are just wrappers around the Lucene French and German analyzers.
By default, these plugins are not deployed, since:
1. they are at an early testing stage.
2. these analyzers make sense only if matching query analyzers are provided too
(not yet done).
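
As a rough illustration of the factory behaviour described above (pick an analyzer from the language identifier's result, otherwise fall back to the default), here is a minimal Java sketch. All names in it (`AnalyzerSketch`, `TextAnalyzer`, `ToyFrenchAnalyzer`, `LanguageAnalyzerRegistry`) are hypothetical stand-ins and not the actual Nutch or Lucene classes, and the toy French analyzer only drops a few stop words instead of wrapping a real Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class AnalyzerSketch {

    /** Hypothetical stand-in for an analyzer extension point (not the Nutch API). */
    interface TextAnalyzer {
        List<String> tokens(String text);
    }

    /** Default analyzer: lowercases and splits on anything that is not a letter or digit. */
    static class DefaultTextAnalyzer implements TextAnalyzer {
        @Override
        public List<String> tokens(String text) {
            List<String> out = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) {
                    out.add(t);
                }
            }
            return out;
        }
    }

    /** Toy "French" analyzer: default tokenization, then a tiny stop-word filter. */
    static class ToyFrenchAnalyzer extends DefaultTextAnalyzer {
        private static final List<String> STOP_WORDS =
                Arrays.asList("le", "la", "les", "de", "des");

        @Override
        public List<String> tokens(String text) {
            List<String> out = super.tokens(text);
            out.removeAll(STOP_WORDS);
            return out;
        }
    }

    /** Chooses an analyzer by language code and falls back to the default one. */
    static class LanguageAnalyzerRegistry {
        private final Map<String, TextAnalyzer> byLanguage = new HashMap<>();
        private final TextAnalyzer fallback = new DefaultTextAnalyzer();

        void register(String languageCode, TextAnalyzer analyzer) {
            byLanguage.put(languageCode, analyzer);
        }

        TextAnalyzer forLanguage(String languageCode) {
            // An unknown or null language code (identification failed) gets the default.
            return byLanguage.getOrDefault(languageCode, fallback);
        }
    }

    public static void main(String[] args) {
        LanguageAnalyzerRegistry registry = new LanguageAnalyzerRegistry();
        registry.register("fr", new ToyFrenchAnalyzer());
        // prints [moteur, recherche]
        System.out.println(registry.forLanguage("fr").tokens("Le moteur de recherche"));
        // no analyzer registered for "de", so the default is used:
        // prints [le, moteur, de, recherche]
        System.out.println(registry.forLanguage("de").tokens("Le moteur de recherche"));
    }
}
```

In this sketch, adding a language means registering one more implementation rather than editing the default analyzer, which is the kind of coexistence the plugin approach is meant to give.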

You can take a look at the proposal I made earlier (not finished, since I
have been working on other issues for now):
http://wiki.apache.org/nutch/MultiLingualSupport

Cheers

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: plugin analyzer

Robert Benea
I read about MultiLingualSupport, but I didn't see it in the repository.
I think it is cool.


On 10/4/05, Jérôme Charron <[hidden email]> wrote:

> > I think would be neat to have the NutchAnalyzer also as a plugin, from my
> > understanding right now if I want to analyze in a different way, I need
> > to hack the nutch source code, if we are going to have different plugins
> > for different analyzers that will help. Some specific application may use
> > porter analyzer, some other uses Snowball for Italian ..., with the plugin
> > approach these will coexist nicely.
> >
> > Same thing would be for providing summaries, for instance if we enable
> > clustering the way is summarized the search result helps to have
> > meaningful clusters.
> >
> > Let me know if you find it as an attractive feature ;-), I can find some
> > free-time and do the coding.
>
> Yes, it is definitely an attractive feature!
>
> I have recently committed in the trunk a support for multi-lingual analyzer
> plugins.
> There is an Analyzer Extension point, so that you can develop your own
> analyze-plugins.
> For now, the analyzer factory uses a plugin depending on the result of the
> language identifier.
> I have committed two analyze plugins, one for french and one for german.
> They are just some wrappers of the Lucene french and german analyzers.
> By default, these plugins are not deployed, since:
> 1. they are at an early testing stage.
> 2. these analyzers make sense only if some query analyzers are provided
> too (not yet done).

Yes, I actually hacked the source code to provide stemming: I changed the
analyzer, added a new query-stemm plugin and changed the summarizer (as the
terms were no longer highlighted once the stemmer was in use).

I think once we use an analyzer (hopefully from the plugin), all the
subsequent classes should use that analyzer instead of assuming anything.
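
To illustrate the point about sharing one analyzer, here is a small self-contained Java sketch, assuming a toy stemmer that only strips a trailing "s" (it is not Porter or Snowball). The class and method names are made up for the example and are not Nutch APIs. Because the highlighter runs both the query and each word of the text through the same `analyze` chain, terms that differ only by the stemmed suffix still get highlighted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SharedAnalysisSketch {

    /** Toy stemmer: strips one trailing "s" from words longer than three letters. */
    static String stem(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    /** The single analysis chain shared by indexing, query parsing and summarizing. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(stem(t));
            }
        }
        return out;
    }

    /** Wraps in [ ] every word whose analyzed form equals an analyzed query term. */
    static String highlight(String text, String query) {
        Set<String> queryTerms = new HashSet<>(analyze(query));
        StringBuilder sb = new StringBuilder();
        for (String word : text.split(" ")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            List<String> analyzed = analyze(word);
            boolean hit = !analyzed.isEmpty() && queryTerms.contains(analyzed.get(0));
            sb.append(hit ? "[" + word + "]" : word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints [Plugins] make [analyzers] replaceable
        System.out.println(highlight("Plugins make analyzers replaceable", "analyzer plugin"));
    }
}
```

A highlighter that compared the raw words against the stemmed query terms would miss "Plugins" and "analyzers" here, which is the kind of mismatch described above.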

> You can take a look at the proposal I made earlier (not finished since I
> worked on other issues for now):
> http://wiki.apache.org/nutch/MultiLingualSupport
>
> Cheers
>
> Jérôme
>

Regards,
R.

> --
> http://motrech.free.fr/
> http://www.frutch.org/
>

Re: plugin analyzer

Jérôme Charron
> I read about the MultiLingualSupport, but I didn't see it in the
> repository, I think is cool.

The analyzer extension point is defined by the NutchAnalyzer abstract class:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java
The default analyzer is this one:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
The choice of which analyzer to use is made by the AnalyzerFactory:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
The German analyzer is located at:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/plugin/analysis-de/
and the French one at:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/plugin/analysis-fr/

> Yes, I actually hacked the src code to provide stemming and I changed the
> analyzer, added a new query-stemm plugin and changed the summarizer (as the
> terms were not highlighted after using the stemmer).
>
Sounds good!

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: plugin analyzer

Rajasekar Karthik
Is there an option I can add to nutch-site.xml that lets me choose which analyzer to use?
