Replace CJK language analyzer in nutch

jqq
NutchAnalysis segments CJK text character by character. To make Nutch
support Chinese well, I developed a plug-in for Chinese.
I placed ChineseAnalyzer.class (which extends NutchAnalyzer) and
ChineseTokenizer.class in a jar, nutch-0.8.1/plugins/analysis-zh.jar, and
configured plugin.xml and nutch-site.xml. I expected Nutch to use
ChineseAnalyzer in place of NutchDocumentAnalyzer, but it didn't. What's
wrong?
plugin.xml configuration:

<?xml version="1.0" encoding="UTF-8"?>

<plugin
   id="analysis-zh"
   name="Chinese Analysis Plug-in"
   version="1.0.0"
   provider-name="org.apache.nutch">

   <runtime>
      <library name="analysis-zh.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints" />
   </requires>

   <extension id="org.apache.nutch.analysis.zh"
              name="ChineseAnalyzer"
              point="org.apache.nutch.analysis.NutchAnalyzer">

      <implementation id="ChineseAnalyzer"
                      class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
         <parameter name="lang" value="zh" />
      </implementation>

   </extension>

</plugin>

Here is an excerpt from nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-zh</value>
  <description> indexing and search plugins.
  </description>
</property>

Here are some excerpts from the hadoop log:

2007-04-02 21:35:32,687 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository - Registered Plugins:
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  CyberNeko HTML Parser (lib-nekohtml)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Site Query Filter (query-site)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Html Parse Plug-in (parse-html)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Regex URL Filter Framework (lib-regex-filter)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Indexing Filter (index-basic)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Summarizer Plug-in (summary-basic)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Text Parse Plug-in (parse-text)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  JavaScript Parser (parse-js)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Regex URL Filter (urlfilter-regex)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Query Filter (query-basic)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  HTTP Framework (lib-http)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  URL Query Filter (query-url)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Chinese Analysis Plug-in (analysis-zh)

......

2007-04-02 21:36:26,234 INFO  indexer.Indexer -  Indexing [http://2008.163.com/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1d0d45b (null)
2007-04-02 21:36:26,359 INFO  indexer.Indexer -  Indexing [http://auto.163.com/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1d0d45b (null)
Re: Replace CJK language analyzer in nutch

wuqi-2
The analyzer is selected by the "lang" property of the doc object, and that property is set by the LanguageIdentifier plug-in. Unfortunately, LanguageIdentifier doesn't support Chinese yet. If you only need to handle Chinese and English documents, you can hardcode the lang property of the doc to "zh". In "Indexer.java", modify the code as below:
// NutchAnalyzer analyzer = factory.get(doc.get("lang"));
   NutchAnalyzer analyzer = factory.get("zh");
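
To make the lookup behaviour concrete, here is a minimal, self-contained sketch. It does not use the real Nutch classes: AnalyzerLookupSketch, get and getWithFallback are hypothetical stand-ins, and analyzers are represented by their names only. It assumes the factory returns the default analyzer when no language is given, which would match the "(null)" at the end of the Indexer log lines if that value is the document's lang. The getWithFallback variant is an illustration of a gentler change than hardcoding: use "zh" only when no language was detected.

```java
import java.util.HashMap;
import java.util.Map;

public class AnalyzerLookupSketch {
    // Hypothetical stand-in for the default analyzer chosen when no language matches.
    static final String DEFAULT_ANALYZER = "NutchDocumentAnalyzer";

    // Hypothetical registry: language code -> analyzer name, as declared by
    // <parameter name="lang" value="zh" /> in plugin.xml.
    static final Map<String, String> BY_LANG = new HashMap<>();
    static {
        BY_LANG.put("zh", "ChineseAnalyzer");
    }

    // Assumed behaviour: a null or unknown language falls back to the default analyzer.
    static String get(String lang) {
        if (lang == null) {
            return DEFAULT_ANALYZER;
        }
        return BY_LANG.getOrDefault(lang, DEFAULT_ANALYZER);
    }

    // Variant of the hardcoded fix: only substitute fallbackLang when lang is missing.
    static String getWithFallback(String lang, String fallbackLang) {
        return get(lang != null ? lang : fallbackLang);
    }

    public static void main(String[] args) {
        System.out.println(get(null));                    // NutchDocumentAnalyzer
        System.out.println(get("zh"));                    // ChineseAnalyzer
        System.out.println(getWithFallback(null, "zh"));  // ChineseAnalyzer
        System.out.println(getWithFallback("en", "zh"));  // NutchDocumentAnalyzer
    }
}
```

With the hardcoded factory.get("zh"), every document goes through the Chinese analyzer, including English ones; the fallback variant keeps whatever language was detected and only defaults to "zh" when none was.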

----- Original Message -----
From: "zhao xiuwen" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, April 03, 2007 12:33 AM
Subject: Replace CJK language analyzer in nutch


Re: Replace CJK language analyzer in nutch

jqq
Thanks, qi wu. After the modification, Nutch invoked my plug-in. I will try
modifying the LanguageIdentifier plug-in.
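
One possible starting point for that change, sketched under assumptions: rather than the approach LanguageIdentifier itself uses, this swaps in a simple script-based heuristic that counts how many letters in the text belong to the Han script and reports "zh" above a threshold. HanRatioSketch, hanRatio, guessLang and the 0.5 threshold are all hypothetical names and values for illustration; only the java.lang.Character calls are real JDK API. Japanese text containing kanji would also score on this measure, so it only suits the Chinese-plus-English case described above.

```java
public class HanRatioSketch {
    // Fraction of letters in the text that belong to the Han script (0.0 if there are no letters).
    static double hanRatio(String text) {
        int letters = 0;
        int han = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);   // step by code point, not by char
            if (!Character.isLetter(cp)) {
                continue;                   // skip digits, spaces, punctuation
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                han++;
            }
        }
        return letters == 0 ? 0.0 : (double) han / letters;
    }

    // Returns "zh" when the Han share reaches the threshold, otherwise null (language unknown).
    static String guessLang(String text, double threshold) {
        return hanRatio(text) >= threshold ? "zh" : null;
    }

    public static void main(String[] args) {
        // Four Han characters written as Unicode escapes so the source file encoding does not matter.
        System.out.println(guessLang("\u7f51\u6613\u6c7d\u8f66", 0.5)); // zh
        System.out.println(guessLang("Nutch search engine", 0.5));      // null
    }
}
```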
2007/4/3, qi wu <[hidden email]>:
