Status of language plugin

T. Kuro Kurosaka
Hello Jérôme,
Because of other issues at work, I was away from Nutch.
Now I'm back, and I see from your notes in JIRA that you are
making progress.

Is there an API doc or design doc that I can read to
understand where you are? Is the language plugin architecture
already in the main trunk?

Here are some issues that I've been worried about:
* Support of multilingual plugin?
** If one plugin can support more than one language,
   the language needs to be passed at each analysis.
** This assumes language identification is done before
   analysis.  Is that the case?

* Support of a different analyzer for query than index
** The analyzer for queries may need to behave differently from
   the analyzer for indexing.  Can your architecture
   specify different analyzers for indexing and query?

Thanks.

-kuro

Re: Status of language plugin

Jérôme Charron
> Is there an API doc or design doc that I can read to
> understand where you are? Is the language plugin architecture
> already in the main trunk?

The only available document is
http://wiki.apache.org/nutch/MultiLingualSupport
and I also update this page from time to time:
http://wiki.apache.org/nutch/JeromeCharron


> Here are some issues that I've been worried about:
> * Support of multilingual plugin?
> ** If one plugin can support more than one language,
>    the language needs to be passed at each analysis.

I'm not sure I understand what you need.
But if you have an analysis plugin that can handle several languages, you
can simply declare several implementations in your plugin XML, e.g.:

<extension id="org.apache.nutch.analysis.cjk"
           name="CJKAnalyzer"
           point="org.apache.nutch.analysis.NutchAnalyzer">

  <implementation id="org.apache.nutch.analysis.cn.ChineseAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="cn"/>
  </implementation>

  <implementation id="org.apache.nutch.analysis.kr.KoreanAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="kr"/>
  </implementation>

  <implementation id="org.apache.nutch.analysis.jp.JapaneseAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="jp"/>
  </implementation>

</extension>
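
To illustrate how a declaration like the one above could be used once the
language of a document is known, here is a minimal sketch. It is not the
Nutch API: the class AnalyzerRegistrySketch, the Implementation record-like
class, and the methods register, forLanguage and fromExample are hypothetical
names made up for this example. Only the ids, the class name and the "lang"
values are taken from the XML.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, NOT the Nutch API. It mirrors the plugin XML above:
// several implementation ids share one analyzer class and differ only by "lang".
class AnalyzerRegistrySketch {

    // One <implementation> element: its id, its class name, its "lang" parameter.
    static final class Implementation {
        final String id;
        final String className;
        final String lang;

        Implementation(String id, String className, String lang) {
            this.id = id;
            this.className = className;
            this.lang = lang;
        }
    }

    // Implementations keyed by the value of their "lang" parameter.
    private final Map<String, Implementation> byLang = new LinkedHashMap<>();

    void register(Implementation impl) {
        byLang.put(impl.lang, impl);
    }

    // Returns the implementation declared for this language,
    // or null if no implementation was declared for it.
    Implementation forLanguage(String lang) {
        return byLang.get(lang);
    }

    // Builds a registry holding the three implementations from the XML example.
    static AnalyzerRegistrySketch fromExample() {
        String cjk = "org.apache.nutch.analysis.cjk.CJKAnalyzer";
        AnalyzerRegistrySketch registry = new AnalyzerRegistrySketch();
        registry.register(new Implementation(
                "org.apache.nutch.analysis.cn.ChineseAnalyzer", cjk, "cn"));
        registry.register(new Implementation(
                "org.apache.nutch.analysis.kr.KoreanAnalyzer", cjk, "kr"));
        registry.register(new Implementation(
                "org.apache.nutch.analysis.jp.JapaneseAnalyzer", cjk, "jp"));
        return registry;
    }
}
```

With this sketch, a lookup such as forLanguage("kr") returns the Korean
implementation entry, whose class is the shared CJKAnalyzer, so the language
is carried by the declaration rather than passed on each analysis call.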


> ** This assumes language identification is done before
>    analysis.  Is that the case?

Yes.


> * Support of a different analyzer for query than index
> ** The analyzer for queries may need to behave differently from
>    the analyzer for indexing.  Can your architecture
>    specify different analyzers for indexing and query?

In fact, to avoid adding a QueryAnalyser extension point,
the query uses the same Analyzer implementation as the one
used for document analysis.
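
As a rough illustration of what sharing one analyzer between indexing and
querying means in practice, here is a small self-contained sketch. It does
not use Nutch or Lucene classes: SharedAnalyzerSketch, analyze, index and
search are hypothetical names, and the "analyzer" is just lowercasing plus
whitespace splitting, chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, NOT Nutch or Lucene code: a toy inverted index in which
// documents and queries go through the very same analysis step.
class SharedAnalyzerSketch {

    // The single "analyzer": lowercase the text, then split on whitespace.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // term -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index time: terms are produced by analyze().
    void index(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Query time: the query is run through the same analyze(), so its terms
    // have the same form as the indexed ones. All query terms must match.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Because both sides share analyze(), a query typed as "LANGUAGE" still finds
documents indexed with "Language" or "language". The trade-off raised in the
question remains: with a single implementation, query-side behaviour cannot
differ from index-side behaviour.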

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/