nutch: creating new plugins: query plugin

nutch: creating new plugins: query plugin

POIRIER David
Hello,

Following the info available on the wiki
(http://wiki.apache.org/nutch/CreateNewFilter), I have created two new
plugins:
- index-scope (based on index-more)
- query-scope (based on query-site)

As you can guess, the first plugin simply adds the "scope" metadata to
every parsed document, giving it a fixed value as a test, while the
second plugin adds the ability to search on "scope" using the Lucene
syntax.

I have deployed the two new plugins, as JARs, in my plugins repository
and modified my nutch-site.xml file so that Nutch looks for them. To be
sure of everything, I performed a complete crawl of a "virgin" source. I
have also modified both plugin.xml files so that the system can find the
right Java classes.
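
A quick way to sanity-check the nutch-site.xml step: Nutch activates only the plugins whose name matches the regular expression in the plugin.includes property. The sketch below, in plain Java with no Nutch dependency, tests plugin ids against such an expression. The expression shown is a made-up example and not the actual setting used here, the class and method names are hypothetical, and the sketch assumes the whole id has to match, which is worth confirming for your Nutch version.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks plugin ids against a
// plugin.includes style regular expression.
final class PluginIncludesCheck {

    // True when the entire plugin id matches the includes expression
    // (assumption: Nutch requires a full match, not a partial one).
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Example value only; substitute the expression from your nutch-site.xml.
        String includes =
            "protocol-http|parse-(text|html)|index-(basic|scope)|query-(basic|site|url|scope)";
        String[] ids = {"index-scope", "query-scope", "query-more"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```

If a new plugin id is reported as not included, the expression probably needs an extra alternative, for example inside the index-(...) or query-(...) group.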

Looking at a result set, everything looks fine: every hit in the set
has the metadata scope=aScope, which is exactly what I am looking
for. Things stop working, though, when I try to search on the metadata
using the Lucene syntax. The query "aWord scope:aScope" returns
nothing...

When I check my log files I can see that the query-scope plugin is
available:
[...]
2008-03-25 16:02:55,015 [http-8080-Processor23] INFO org.apache.nutch.plugin.PluginRepository  - Scope Query Filter (query-scope)
[...]
And that the proper extension point is registered:
[...]
2008-03-25 16:02:55,015 [http-8080-Processor23] INFO org.apache.nutch.plugin.PluginRepository  - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
[...]
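
For longer logs, the same check can be scripted. The following is a rough sketch that pulls the id in parentheses out of PluginRepository log entries shaped like the two above. It assumes each entry sits on a single line, and the class name, method name and pattern are hypothetical illustrations rather than anything shipped with Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper: lists the ids reported by PluginRepository.
final class PluginLogScan {

    // Matches "PluginRepository  - <name> (<id>)" at the end of a line
    // and captures the name (group 1) and the id (group 2).
    private static final Pattern REGISTERED =
        Pattern.compile("PluginRepository\\s+-\\s+(.+?)\\s+\\(([^)]+)\\)\\s*$");

    // Returns the ids found, in the order the log lines were given.
    static List<String> registeredIds(List<String> logLines) {
        List<String> ids = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = REGISTERED.matcher(line);
            if (m.find()) {
                ids.add(m.group(2));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2008-03-25 16:02:55,015 [http-8080-Processor23] INFO "
                + "org.apache.nutch.plugin.PluginRepository  - Scope Query Filter (query-scope)",
            "2008-03-25 16:02:55,015 [http-8080-Processor23] INFO "
                + "org.apache.nutch.plugin.PluginRepository  - Nutch Query Filter "
                + "(org.apache.nutch.searcher.QueryFilter)");
        System.out.println(registeredIds(lines));
    }
}
```

Seeing a plugin id in this list only shows that the plugin was loaded; it says nothing about whether its filter produces the expected Lucene query.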


Here is the plugin.xml file associated with the plugin:

<plugin
   id="query-scope"
   name="a description"
   version="1.0.0"
   provider-name="myName.xyz">

   <runtime>
      <library name="query-scope.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.searcher.site.modified.SiteQueryFilterModified"
              name="Scope Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="SiteQueryFilterModified"
                      class="org.apache.nutch.searcher.site.modified.SiteQueryFilterModified">
        <parameter name="raw-fields" value="scope"/>
      </implementation>
   </extension>
</plugin>



If somebody has any idea... please let me know! Thank you in advance!

David



RE: nutch: creating new plugins: query plugin

POIRIER David
Hello,

I really need your help here please. I tried a few more things; I
deleted my two plugins and instead of creating new ones I modified the
existing index-more and query-more plugins.

The index-more modification is working. Here's what I added:
private Document addScope(Document doc, ParseData data, String url) {
    // Store the value and index it as a single, unanalyzed term.
    doc.add(new Field("scope", "aScope", Field.Store.YES,
                      Field.Index.UN_TOKENIZED));
    return doc;
}

I made sure that the method is called by adding this to the filter
method:
addScope(doc, parse.getData(), url_s);

Using the Nutch API, when I check the details of a hit, I look for:
String scope = detail.getValue("scope");

And, as expected, it always returns "aScope".

The problem comes when I try to filter a query using my modified
query-more plugin. When I execute the query "aKeyword scope:aScope" (the
double quotes are there only for readability in this email), the index
always returns 0 results.

Here's the class I added to the org.apache.nutch.indexer.more
package:
import org.apache.nutch.searcher.RawFieldQueryFilter;
import org.apache.hadoop.conf.Configuration;

/**
 * Handles "scope:" query clauses, causing them to search the field
 * indexed by MoreIndexingFilter.
 *
 * @author John Xing / David Poirier
 */

public class ScopeQueryFilter extends RawFieldQueryFilter {
  private Configuration conf;

  public ScopeQueryFilter() {
    super("scope");
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    setBoost(conf.getFloat("query.scope.boost", 0.0f));
  }

  public Configuration getConf() {
    return this.conf;
  }
}
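
Since the stored value comes back correctly while the term query finds nothing, one thing that may be worth ruling out is a mismatch between the term that was indexed and the term the query filter looks up. A field indexed without an analyzer is kept as one term exactly as written, so a lookup only matches an identical string, letter case included. The stand-alone sketch below models that with a plain map instead of Lucene. All names in it are hypothetical, and whether the query side normalizes the value (for instance by lower-casing it) in a given Nutch version is an assumption to verify in the RawFieldQueryFilter source; nothing in this thread confirms it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of an untokenized field: one term per value, exact matching only.
final class ExactTermIndex {

    // term -> ids of the documents whose field holds exactly that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index the value as a single term, unchanged (no analysis, no lower-casing).
    void add(int docId, String value) {
        postings.computeIfAbsent(value, k -> new HashSet<>()).add(docId);
    }

    // Look the term up exactly as given; case differences mean no match.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    // Hypothetical query-side normalization step.
    static String normalize(String term, boolean lowerCase) {
        return lowerCase ? term.toLowerCase(Locale.ROOT) : term;
    }

    public static void main(String[] args) {
        ExactTermIndex index = new ExactTermIndex();
        index.add(1, "aScope");

        // Term left untouched: matches the indexed value.
        System.out.println(index.search(normalize("aScope", false)).size()); // prints 1

        // Term lower-cased to "ascope": no longer equal to "aScope".
        System.out.println(index.search(normalize("aScope", true)).size());  // prints 0
    }
}
```

A cheap experiment along these lines is to index an all-lowercase test value, such as "ascope", and repeat the query; if that returns hits, the case handling on the query side is the place to look.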

And the plugin.xml file associated with it:
<plugin
   id="query-more"
   name="More Query Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="query-more.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.searcher.more"
              name="Nutch More Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="TypeQueryFilter"
                      class="org.apache.nutch.searcher.more.TypeQueryFilter">
        <parameter name="raw-fields" value="type"/>
      </implementation>
   </extension>

   <extension id="org.apache.nutch.searcher.more"
              name="Nutch More Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="DateQueryFilter"
                      class="org.apache.nutch.searcher.more.DateQueryFilter">
        <parameter name="raw-fields" value="date"/>
      </implementation>
   </extension>

   <extension id="org.apache.nutch.searcher.more"
              name="Nutch More Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="ScopeQueryFilter"
                      class="org.apache.nutch.searcher.more.ScopeQueryFilter">
        <parameter name="raw-fields" value="scope"/>
      </implementation>
   </extension>

</plugin>

If this rings a bell for anybody, please let me know.

Thank you in advance,

David


-----------------------------------------
David Poirier
E-business Consultant - Software Engineer
 


-----Original Message-----
From: POIRIER David [mailto:[hidden email]]
Sent: Tuesday, 25 March 2008 18:09
To: [hidden email]
Subject: nutch: creating new plugins: query plugin




RE: nutch: creating new plugins: query plugin

Brian Ulicny
What happens when you do the keyword query only?

Where are you executing the query from?  Using the NutchBean?  If so,
then the double-quotes would be necessary.

Why don't you try searching directly for the URLs that you think should
be returned, using the url: syntax, to make sure they got indexed and
that you are pointing at the right index?

Brian Ulicny




On Wed, 26 Mar 2008 11:50:43 +0100, "POIRIER David"
<[hidden email]> said:
--
  Brian Ulicny
  bulicny at alum dot mit dot edu
  home: 781-721-5746
  fax: 360-361-5746



RE: nutch: creating new plugins: query plugin

POIRIER David
Brian,

Thank you for your answer.

Q: What happens when you do the keyword query only?
A: It works. Example:
        query: cancer
        results: yes

Q: Where are you executing the query from?  Using the NutchBean?
A: From the NutchBean. Here are a few tests I ran:
        query: cancer scope:aScope
        results: no

        query: cancer scope:"aScope"
        results: no

        query: cancer "scope:aScope"
        results: no

        query: "cancer scope:aScope"
        results: no

Q: Why don't you try searching directly for the URLs that you think should
be returned, using the url: syntax, to make sure they got indexed and that
you are pointing at the right index?
A: Thanks for the tip. I am indeed 100% certain that a metadata field named scope with the value aScope exists for ALL the entries in my index.

Example:
Query: cancer
Results:
    * segment = 20080326104113
    * digest = 678e47f1a52ce036b89e2dc4c6f3571c
    * url = http://www.aWebsite.com/article/511833.aspx
    * title = Arimidex with Tamoxifen efficacy and safety trial for advanced breast cancer (1033IL/0027)
    * tstamp = 20080326094143023
    * contentLength = 45167
    * primaryType = text
    * subType = html
    * scope = aScope
    * boost = 0.028375218

I am going around in circles (if that is how you say it in English)... I went back to my first plugin, which is a modification of the query-site plugin, without success.

If you, or anybody else, can think of something, please let me know.

David





-----------------------------------------
David Poirier
E-business Consultant - Software Engineer
 
Direct: +41 (0)22 596 10 35
 
Cross Systems - Groupe Micropole Univers
Route des Acacias 45 B
1227 Carouge / Genève
Tél: +41 (0)22 308 48 60
Fax: +41 (0)22 308 48 68
 





-----Original Message-----
From: Brian Ulicny [mailto:[hidden email]]
Sent: Wednesday, 26 March 2008 15:30
To: [hidden email]; [hidden email]
Subject: RE: nutch: creating new plugins: query plugin



RE: nutch: creating new plugins: query plugin

Brian Ulicny
Date range queries are part of the query-more functionality, right?  Do
they work?

Brian

On Wed, 26 Mar 2008 15:57:44 +0100, "POIRIER David"
<[hidden email]> said:
--
  Brian Ulicny
  bulicny at alum dot mit dot edu
  home: 781-721-5746
  fax: 360-361-5746



RE: nutch: creating new plugins: query plugin

POIRIER David
I didn't try the date range queries, but I did try the type queries, which are also part of the query-more functionality, and they work.

David

-----------------------------------------
David Poirier
E-business Consultant - Software Engineer
 
Direct: +41 (0)22 596 10 35
 
Cross Systems - Groupe Micropole Univers
Route des Acacias 45 B
1227 Carouge / Genève
Tél: +41 (0)22 308 48 60
Fax: +41 (0)22 308 48 68
 

-----Original Message-----
From: Brian Ulicny [mailto:[hidden email]]
Sent: mercredi, 26. mars 2008 16:06
To: [hidden email]; [hidden email]
Subject: RE: nutch: creating new plugins: query plugin

Date range queries are part of the query-more functionality, right?  Do
they work?

Brian

On Wed, 26 Mar 2008 15:57:44 +0100, "POIRIER David"
<[hidden email]> said:

> Brian,
>
> Thank you for your answer.
>
> Q: What happens when you do the keyword query only?
> A: It works. Example:
> query: cancer
> results: yes
>
> Q: Where are you executing the query from?  Using the NutchBean?
> A: From the nutchBean. Here's a few tests I made:
> query: cancer scope:aScope
> results: no
>
> query: cancer scope:"aScope"
> results: no
>
> query: cancer "scope:aScope"
> results: no
>
> query: "cancer scope:aScope"
> results: no
>
> Q: Why don't you try searching for the urls directly that you think
> should
> be returned using the url: syntax to make sure they got indexed and you
> are pointing at the right index.
> A: Thanks for the tip. I am indeed 100% certain that an index metadata
> named scope with a value aScope exist for ALL the reference in my index.
>
> Example:
> Query: cancer
> Results:
>     * segment = 20080326104113
>     * digest = 678e47f1a52ce036b89e2dc4c6f3571c
>     * url = http://www.aWebsite.com/article/511833.aspx
>     * title = Arimidex with Tamoxifen efficacy and safety trial for
>     advanced breast cancer (1033IL/0027)
>     * tstamp = 20080326094143023
>     * contentLength = 45167
>     * primaryType = text
>     * subType = html
>     * scope = aScope
>     * boost = 0.028375218
>
> I am turning in circle (if we can say that in english)... I went back to
> my first plugin, which is a modification of the query-site plugin,
> without success.
>
> If you, or anybody, think of something else, please let me know.
>
> David
>
>
>
>
>
> -----------------------------------------
> David Poirier
> E-business Consultant - Software Engineer
>  
> Direct: +41 (0)22 596 10 35
>  
> Cross Systems - Groupe Micropole Univers
> Route des Acacias 45 B
> 1227 Carouge / Genève
> Tél: +41 (0)22 308 48 60
> Fax: +41 (0)22 308 48 68
>  
>
>
>
>
>
> -----Original Message-----
> From: Brian Ulicny [mailto:[hidden email]]
> Sent: mercredi, 26. mars 2008 15:30
> To: [hidden email]; [hidden email]
> Subject: RE: nutch: creating new plugins: query plugin
>
> What happens when you do the keyword query only?
>
> Where are you executing the query from?  Using the NutchBean?  If so,
> then the double-quotes would be necessary.
>
> Why don't you try searching for the urls directly that you think should
> be returned using the url: syntax to make sure they got indexed and you
> are pointing at the right index.
>
> Brian Ulicny
>
>
>
>
> On Wed, 26 Mar 2008 11:50:43 +0100, "POIRIER David"
> <[hidden email]> said:
> > Hello,
> >
> > I really need your help here please. I tried a few more things; I
> > deleted my two plugins and instead of creating new ones I modified the
> > existing index-more and query-more plugins.
> >
> > The index-more modification is working. Here's what I added:
> > private Document addScope(Document doc, ParseData data, String url) {
> > doc.add(new Field("scope", "aScope", Field.Store.YES,
> > Field.Index.UN_TOKENIZED));
> > return doc;
> > }
> >
> > And made sure that the method is called by adding this in the filter
> > method:
> > addScope(doc, parse.getData(), url_s);
> >
> > Using the Nutch API, when I check for the details of a hit, I look for:
> > String scope = detail.getValue("scope");
> >
> > And as expected it always returns "aScope".
> >
> > The problem is when I try to filter a query using my modified query-more
> > plugin. When executing the query "aKeyword scope:aScope" (the double
> > quotes are there only for readability in this email), the index always
> > returns 0 results.
> >
> > Here's the additional class to the org.apache.nutch.indexer.more
> > package:
> > import org.apache.nutch.searcher.RawFieldQueryFilter;
> > import org.apache.hadoop.conf.Configuration;
> >
> > /**
> >  * Handles "scope:" query clauses, causing them to search the field
> >  * indexed by MoreIndexingFilter.
> >  *
> >  * @author John Xing / David Poirier
> >  */
> >
> > public class ScopeQueryFilter extends RawFieldQueryFilter {
> >   private Configuration conf;
> >
> >   public ScopeQueryFilter() {
> >     super("scope");
> >   }
> >
> >   public void setConf(Configuration conf) {
> >     this.conf = conf;
> >     setBoost(conf.getFloat("query.scope.boost", 0.0f));
> >   }
> >
> >   public Configuration getConf() {
> >     return this.conf;
> >   }
> > }
> >
> > And the plugin.xml file associated with it:
> > <plugin
> >    id="query-more"
> >    name="More Query Filter"
> >    version="1.0.0"
> >    provider-name="nutch.org">
> >
> >    <runtime>
> >       <library name="query-more.jar">
> >          <export name="*"/>
> >       </library>
> >    </runtime>
> >
> >    <requires>
> >       <import plugin="nutch-extensionpoints"/>
> >    </requires>
> >
> >    <extension id="org.apache.nutch.searcher.more"
> >               name="Nutch More Query Filter"
> >               point="org.apache.nutch.searcher.QueryFilter">
> >       <implementation id="TypeQueryFilter"
> >  
> > class="org.apache.nutch.searcher.more.TypeQueryFilter">
> >         <parameter name="raw-fields" value="type"/>
> >       </implementation>
> >      
> >    </extension>
> >
> >    <extension id="org.apache.nutch.searcher.more"
> >               name="Nutch More Query Filter"
> >               point="org.apache.nutch.searcher.QueryFilter">
> >       <implementation id="DateQueryFilter"
> >  
> > class="org.apache.nutch.searcher.more.DateQueryFilter">
> >         <parameter name="raw-fields" value="date"/>
> >       </implementation>
> >      
> >    </extension>
> >    
> >    <extension id="org.apache.nutch.searcher.more"
> >               name="Nutch More Query Filter"
> >               point="org.apache.nutch.searcher.QueryFilter">
> >       <implementation id="ScopeQueryFilter"
> >  
> > class="org.apache.nutch.searcher.more.ScopeQueryFilter">
> >         <parameter name="raw-fields" value="scope"/>
> >       </implementation>
> >      
> >    </extension>
> >
> > </plugin>
> >
> > If this tells something to anybody, please let me know.
> >
> > Thank you in advance,
> >
> > David
> >
> >
> > -----------------------------------------
> > David Poirier
> > E-business Consultant - Software Engineer
> >  
> >
> >
> --
>   Brian Ulicny
>   bulicny at alum dot mit dot edu
>   home: 781-721-5746
>   fax: 360-361-5746
>
>
--
  Brian Ulicny
  bulicny at alum dot mit dot edu
  home: 781-721-5746
  fax: 360-361-5746



Re: nutch: creating new plugins: query plugin

Fred Gilmore
I'm watching this thread with interest as I'm stuck in the same place.
From reading three years of list archives, people seem to get over the
hump of indexing custom fields and then get mired on the query side.  My
index is fine.  Luke shows me the fields and the values.  I can change my
index plugin code to not split on commas and it obeys.  I can search it
with Luke and it pulls data.  I realize that only means so much, since
Luke parses those queries with a Lucene class.

But I can't get past the query plugin, no matter how closely I follow
the example on the wiki.  I can look at query-url or query-more; it
doesn't seem to matter.  In fact, right now, if I load the query plugin
listed below (in addition to query-basic), it breaks all searching:
keyword, fielded, whatever.


<plugin
   id="query-placename"
   name="Placename Query Filter"
   version="1.0.0"
   provider-name="utexas.edu">

   <runtime>
      <library name="query-placename.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.searcher.placename.PlacenameQueryFilter"
              name="Placename Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="PlacenameQueryFilter"
                      class="org.apache.nutch.searcher.placename.PlacenameQueryFilter">
        <parameter name="fields" value="placename"/>
      </implementation>
   </extension>
</plugin>

===============
[search1]:nutch> pg PlacenameQueryFilter.java
package org.apache.nutch.searcher.placename;

import org.apache.nutch.searcher.FieldQueryFilter;
import org.apache.hadoop.conf.Configuration;


public class PlacenameQueryFilter extends FieldQueryFilter {

  public PlacenameQueryFilter() {
    super("placename", 5f);
  }

  public void setConf(Configuration conf) {
    super.setConf(conf);
  }

}

The wiki plugin example omits the setConf shown above, while the query-url
code sets it, as does query-more.  Some use RawFieldQueryFilter, some use
QueryFilter; it doesn't seem that should matter.

The plugin gets shuttled over to the Tomcat side, nutch-site.xml
gets updated with a new plugin.includes stanza, and the webapp is
redeployed.  I've tried loading nutch-extensionpoints first here as
well; it doesn't seem to matter.  Maybe the boost is messing things up.
It's set high, but that's because previous threads have indicated it was
the only way to get field-only searches like placename:london working.

  <property>
    <name>searcher.dir</name>
    <value>/usr/local/db/nutch/search1/crawls/missions-test</value>
  </property>

<property>
  <name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|meta)|index-(basic|more|meta)|query-(basic|more|placename|creator|url)|summary-lucene|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
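
One thing that can be checked offline is whether a given plugin id is covered by the plugin.includes expression. The sketch below assumes that the value is treated as a java.util.regex expression that must match the whole plugin id; that is my reading of how Nutch applies it, not something confirmed in this thread. The class and method names are made up for illustration, and how Nutch treats plugins pulled in through <requires> is outside this sketch.

```java
import java.util.regex.Pattern;

final class PluginIncludesCheck {
    // Value copied from the plugin.includes property above (joined onto one logical string).
    static final String PLUGIN_INCLUDES =
        "protocol-http|urlfilter-regex|parse-(text|html|meta)|index-(basic|more|meta)"
        + "|query-(basic|more|placename|creator|url)|summary-lucene|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    /** Returns true when the entire plugin id matches the include expression. */
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"query-placename", "query-creator", "query-scope"};
        for (String id : ids) {
            // Prints true, true, false for the three ids above.
            System.out.println(id + " -> " + isIncluded(PLUGIN_INCLUDES, id));
        }
    }
}
```

A check like this only rules out a typo in the property value; it says nothing about whether the plugin's classes load correctly.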

No other nutch-default.xml or nutch-site.xml settings are altered.  But
there must be something obvious I'm leaving unset, or something conflicting
on the Tomcat side, which is breaking this.

When I remove the query-placename and query-creator plugins, keyword
searching resumes.  url: works, so the syntax is accepted.

After several weeks of trying different things, I'm all out.  But there must
be something I'm missing.  Any ideas at all?

thanks,

Fred Gilmore
University of Texas Austin Libraries

RE: nutch: creating new plugins: query plugin

POIRIER David
Fred,

I must say I am happy to see that I am not the only one!

You are right: using Luke and the
org.apache.lucene.analysis.KeywordAnalyzer, I can search for my added
field (scope). An example: +content:"cancer"  +scope:"aScope". My
understanding is that with this analyzer you can filter your query on
any of the stored fields.

When executing a query through Nutch, the analyzer used is
org.apache.nutch.analysis.NutchAnalyzer. I guess it might execute
similar tasks...
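
To illustrate why the choice of analyzer can matter for a field such as scope, here is a toy model of an untokenized field, written in plain Java with no Lucene or Nutch classes. It only shows the general principle that a value indexed as a single unmodified term is found by an exact term lookup and missed if the query side alters the term in any way (lowercasing is used as the example). Whether any such normalization is what went wrong in this thread is not established; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class UntokenizedFieldDemo {
    // Toy inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes the whole value as one term, unchanged (no splitting, no lowercasing). */
    void addUntokenized(int docId, String field, String value) {
        postings.computeIfAbsent(field + ":" + value, k -> new HashSet<>()).add(docId);
    }

    /** Exact term lookup: the term must equal the indexed value character for character. */
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        UntokenizedFieldDemo index = new UntokenizedFieldDemo();
        index.addUntokenized(1, "scope", "aScope");

        // Term passed through untouched, as a keyword-style analyzer would do: prints [1]
        System.out.println(index.search("scope", "aScope"));

        // Term lowercased before the lookup (a hypothetical normalization step): prints []
        System.out.println(index.search("scope", "aScope".toLowerCase(Locale.ROOT)));
    }
}
```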

The class my query plugin extends is
org.apache.nutch.searcher.RawFieldQueryFilter. I'll look into that
as well.

I'll dig into the details and let you know if I find something.


David

-----------------------------------------
David Poirier
E-business Consultant - Software Engineer







RE: nutch: creating new plugins: query plugin

POIRIER David
Fred,

It works!!!

I added a third plugin, this time implementing org.apache.nutch.parse.HtmlParseFilter, alongside my existing org.apache.nutch.indexer.IndexingFilter and org.apache.nutch.searcher.QueryFilter plugins.

HtmlParseFilter:
I check for certain patterns inside the body of the (HTML) document being parsed. Based on what I find, I add a metadata entry (named scope) to the active Parse object.

IndexingFilter:
For every Parse object, I look for the scope metadata. If I find it, I add the scope field to the index for that particular document:

String scopeMetaContent = data.getMeta("scope");
if (scopeMetaContent != null) {
    Field scopeField = new Field("scope", scopeMetaContent, Field.Store.YES, Field.Index.UN_TOKENIZED);
    scopeField.setBoost(5.0f);
    doc.add(scopeField);
    LOG.info("Added " + scopeMetaContent + " to the scope field");
}

QueryFilter:
I haven't changed a thing in this one...
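
For readers who want to see the hand-off between the parse step and the index step in isolation, here is a minimal sketch in plain Java, without any Nutch or Lucene classes. The patterns and scope values are invented for illustration (the message does not say which patterns were used), and a plain map stands in for the indexed document. It only shows the shape of the logic: detect a scope from the body, and add the field only when a scope was found.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class ScopePipelineSketch {
    // Hypothetical pattern -> scope rules, checked in insertion order.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("clinical\\s+trial", Pattern.CASE_INSENSITIVE), "trials");
        RULES.put(Pattern.compile("press\\s+release", Pattern.CASE_INSENSITIVE), "news");
    }

    /** Parse step: returns the scope of the first matching rule, or null if none match. */
    static String detectScope(String htmlBody) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(htmlBody).find()) {
                return rule.getValue();
            }
        }
        return null;
    }

    /** Index step: adds a "scope" field only when the parse step produced a value. */
    static Map<String, String> toIndexFields(String url, String htmlBody) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);
        String scope = detectScope(htmlBody);
        if (scope != null) {
            fields.put("scope", scope);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toIndexFields("http://example.com/a", "<p>Results of a clinical trial</p>"));
        System.out.println(toIndexFields("http://example.com/b", "<p>Contact us</p>"));
    }
}
```

Documents for which no pattern matches simply get no scope field, which mirrors the null check in the indexing code above.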

I basically followed this tutorial: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 


Going over my code one more time, I see that I now set the field boost value twice: inside the indexing filter and inside the query filter, whereas before I set it only in the query filter. That might be the beginning of an answer.

Anyway, if you want more details, don't hesitate to ask.

Have a nice weekend.

David



------------------------
David Poirier
E-business Consultant - Software Engineer
 
Cross Systems - Groupe Micropole Univers
1227 Carouge / Genève


