[jira] Created: (NUTCH-260) Three new plugins that parse, index and query meta tags defined in the configuration


JIRA jira@apache.org
Three new plugins that parse, index and query meta tags defined in the configuration
------------------------------------------------------------------------------------

         Key: NUTCH-260
         URL: http://issues.apache.org/jira/browse/NUTCH-260
     Project: Nutch
        Type: New Feature

  Components: indexer, searcher  
    Versions: 0.7.2    
 Environment: Built and tested on Linux so far.
    Reporter: Jake Vanderdray
    Priority: Minor


These plugins let you define, in your nutch-site file, the meta tags that you want included in parsing, indexing and searching.  The query plugin must replace query-basic.  The format for adding these settings to nutch-site.xml is:

<property>
  <name>meta.names</name>
  <value>keywords,recommended</value>
  <description>This is a comma separated list of meta tag names that will
  be parsed, indexed and searched against when parse-meta, index-meta and
  query-meta are used.</description>
</property>

<property>
  <name>meta.boosts</name>
  <value>1.0,5.0</value>
  <description>Comma separated list of boost values when searching using
  query-meta.  The order of the values should match the order of meta.names.
  </description>
</property>
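
The positional pairing of meta.names and meta.boosts can be sketched roughly as follows. This is plain modern Java written for illustration, not the plugin source; the class MetaBoostConfig and its parse method are hypothetical names, and rejecting lists of different lengths is an assumption about how a mismatch should be handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: pairs meta.names entries with meta.boosts entries by position. */
final class MetaBoostConfig {
    private MetaBoostConfig() {}

    /**
     * @param names  value of meta.names, e.g. "keywords,recommended"
     * @param boosts value of meta.boosts, e.g. "1.0,5.0"
     * @return meta tag name mapped to its boost, in configuration order
     */
    static Map<String, Float> parse(String names, String boosts) {
        String[] nameParts = names.split(",");
        String[] boostParts = boosts.split(",");
        // Assumption: a length mismatch is treated as a configuration error.
        if (nameParts.length != boostParts.length) {
            throw new IllegalArgumentException(
                "meta.names has " + nameParts.length + " entries but meta.boosts has "
                + boostParts.length);
        }
        Map<String, Float> result = new LinkedHashMap<>();
        for (int i = 0; i < nameParts.length; i++) {
            String name = nameParts[i].trim();
            if (name.isEmpty()) {
                throw new IllegalArgumentException("empty meta tag name at position " + i);
            }
            result.put(name, Float.parseFloat(boostParts[i].trim()));
        }
        return result;
    }
}
```

With the values shown above, "keywords" would get a boost of 1.0 and "recommended" a boost of 5.0.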

Meta tags found are assumed to have either a single value or a comma separated list of values.  The values found are added to the index as Lucene keywords.  For example, <meta name="keywords" content="First Thing, Second Thing"> would result in two keyword fields named "keywords": the first would contain "First Thing" and the second would contain "Second Thing".
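
As a rough illustration of that splitting step, here is a minimal sketch in plain Java.  The class MetaValueSplitter is a hypothetical name, and trimming whitespace and skipping empty entries are assumptions rather than confirmed plugin behaviour.  Each returned string would become one keyword field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a meta tag's value into individual keyword values. */
final class MetaValueSplitter {
    private MetaValueSplitter() {}

    static List<String> split(String content) {
        List<String> values = new ArrayList<>();
        if (content == null) {
            return values;
        }
        for (String part : content.split(",")) {
            String value = part.trim();   // assumption: surrounding whitespace is dropped
            if (!value.isEmpty()) {       // assumption: empty entries are ignored
                values.add(value);
            }
        }
        return values;
    }
}
```

A single-valued tag simply yields a list with one element.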

I had to replace the query-basic plugin in order to allow matches in the meta fields to return hits even if there were no matches in any of the default fields.  The query-basic plugin only returns hits when every search term is found in at least one default field.  I needed hits returned if matches were found in at least one field for every term, and/or the entire search phrase appeared in a meta index field.
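
To make the intended matching rule concrete, here is a small in-memory model of it.  It does not use the Nutch or Lucene APIs, and it reflects one reading of the rule: per-term matching is checked against the default fields, while meta fields are matched only by the whole phrase.  The class name MetaMatchSketch, the case-insensitive comparison and the simple tokenization are all assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory model of the hit rule; not the query-meta implementation. */
final class MetaMatchSketch {
    private MetaMatchSketch() {}

    /**
     * @param query         the raw search phrase
     * @param defaultFields default field name mapped to its text
     * @param metaFields    meta tag name mapped to its indexed keyword values
     */
    static boolean matches(String query,
                           Map<String, String> defaultFields,
                           Map<String, List<String>> metaFields) {
        String phrase = query.trim().toLowerCase(Locale.ROOT);
        if (phrase.isEmpty()) {
            return false;
        }
        // Rule A: the entire phrase equals one meta keyword value.
        for (List<String> values : metaFields.values()) {
            for (String value : values) {
                if (value.trim().toLowerCase(Locale.ROOT).equals(phrase)) {
                    return true;
                }
            }
        }
        // Rule B: every term occurs in at least one default field.
        for (String term : phrase.split("\\s+")) {
            boolean found = false;
            for (String text : defaultFields.values()) {
                List<String> tokens = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
                if (tokens.contains(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```

Under this model a page with the meta value "Next Big Thing" is a hit for that exact phrase even when none of its default fields contain the terms.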

One known bug is that common terms are not stripped out of the fields' values before they are indexed, so "The Next Big Thing" cannot be matched, because the query engine strips "the" out of all queries.  I intend to fix this by stripping common terms out of meta field values before indexing them.
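
The planned fix might look something like the sketch below, which removes common terms from a meta value before it is indexed.  This is not the eventual patch: the class CommonTermStripper is a hypothetical name and the caller supplies the set of common terms, since the list the query engine actually uses is not given here.

```java
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical helper: drops common terms from a meta value before indexing. */
final class CommonTermStripper {
    private CommonTermStripper() {}

    /**
     * @param value       one meta keyword value, e.g. "The Next Big Thing"
     * @param commonTerms lower-case common terms to remove (assumed, caller-supplied)
     * @return the value with common terms removed and words separated by single spaces
     */
    static String strip(String value, Set<String> commonTerms) {
        StringJoiner kept = new StringJoiner(" ");
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty() && !commonTerms.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);   // original casing of kept words is preserved
            }
        }
        return kept.toString();
    }
}
```

If "the" is in the set, "The Next Big Thing" would be indexed as "Next Big Thing", which is what the query looks like after the query engine has removed "the".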

Another issue is that searching for "Next Big Thing" would not match a page whose meta index values are "Next", "Big" and "Thing".  You can consider that a bug or a feature depending on how you look at it.

These plugins were written for and only work on the 0.7.2 branch.

I'm going to attach a tarball of the source of these three plugins after I create the issue.  To use the plugins, you'll need to untar them in your src/plugins directory and add them to the ant build.xml file (and of course enable them in your nutch-site.xml file).  If these end up getting added to the project, I'll write up documentation on the wiki.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-260) Three new plugins that parse, index and query meta tags defined in the configuration

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-260?page=all ]

Jake Vanderdray updated NUTCH-260:
----------------------------------

    Attachment: nutch_customizations.tar

The attachment is a tarball of the plugin source.

