category string gets matched as a term

Dima Gritsenko
Hi,

I have categorized web sites during the crawl so that I can provide filtered results, similar to Google's Video and Images tabs.

But when I enter
category:video MySearchString
Nutch matches both "video" and "MySearchString" as terms. It filters the results correctly and displays links only to pages categorized as video, but the search is not relevant because the string "video" is matched as a term as well.

How do I keep the category string from being matched as a term during search?

Great thanks.
Dima.

Re: category string gets matched as a term

Alvaro Cabrerizo
It looks like your syntax is correct (category:video searchString). Try
adding a LOG.info line to
org.apache.nutch.searcher.LuceneQueryOptimizer (line 178), right at the
beginning of the optimize method:

public TopDocs optimize(BooleanQuery original,
                        Searcher searcher, int numHits,
                        String sortField, boolean reverse)
    throws IOException {
  // Log the query that is actually handed to Lucene.
  LOG.info("Query -> " + original.toString());
  // ... rest of the method unchanged

Recompile Nutch and run a query, for example category:video funny. If your
category plugin works correctly, you'll get an info line in hadoop.log similar
to this:

+(url:funny^0.0 anchor:funny^0.0 content:funny title:funny^0.0
host:funny^0.0) +category:video

The first part, +(url:funny^0.0 anchor:funny^0.0 content:funny
title:funny^0.0 host:funny^0.0), means that funny must appear in at least
one of those fields (url, anchor, ...). The second part, +category:video,
filters the results so that only the documents tagged as video are returned.
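
To make that two-part structure concrete, here is a toy Java sketch of the split. It is not Nutch's query code: the class QuerySplitSketch and its render method are made up for illustration, the field list and boosts are copied from the log line above, and treating an unregistered field:term token as a plain term is an assumption based on the behaviour described in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only (hypothetical class, not part of Nutch): shows how a
// registered raw field becomes a filter clause while everything else is
// expanded over the default fields.
class QuerySplitSketch {
    // Default fields and boosts as they appear in the logged query.
    static final String[] FIELDS = {"url", "anchor", "content", "title", "host"};
    static final String[] BOOSTS = {"^0.0", "^0.0", "", "^0.0", "^0.0"};

    static String render(String query, Set<String> rawFields) {
        List<String> termClauses = new ArrayList<>();
        List<String> filterClauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int colon = token.indexOf(':');
            String field = colon > 0 ? token.substring(0, colon) : null;
            if (field != null && rawFields.contains(field)) {
                // Registered raw field: kept as a required filter clause.
                filterClauses.add("+" + token);
            } else {
                // Assumption: an unregistered field prefix is ignored and
                // the value is searched like any other term.
                String term = colon > 0 ? token.substring(colon + 1) : token;
                List<String> parts = new ArrayList<>();
                for (int i = 0; i < FIELDS.length; i++) {
                    parts.add(FIELDS[i] + ":" + term + BOOSTS[i]);
                }
                termClauses.add("+(" + String.join(" ", parts) + ")");
            }
        }
        List<String> all = new ArrayList<>(termClauses);
        all.addAll(filterClauses);
        return String.join(" ", all);
    }
}
```

Calling render with "category" in the raw-field set reproduces the shape of the log line above. Calling it with an empty set puts video into a default-field group of its own, which is the symptom to look for in hadoop.log.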

In your case it looks like the word video is being included in the first
part. Check that your plugin implementation is correct, and that its
plugin.xml and build.xml are correct. Your plugin.xml should look similar to this:

...
<extension id="..."
           name="...."
           point="org.apache.nutch.searcher.QueryFilter">
   <implementation id="..." class="....">
      <parameter name="raw-fields" value="category"/>
   </implementation>
</extension>
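
A quick way to double-check the descriptor is to parse it and list the raw fields it declares. The sketch below uses only the JDK's DOM parser. The class RawFieldsCheck and its method are hypothetical helpers, not part of Nutch, and splitting the value on commas or whitespace is an assumption about how several fields would be listed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Nutch): lists the raw-fields values
// declared by QueryFilter extensions in a plugin.xml document.
class RawFieldsCheck {
    static final String QUERY_FILTER_POINT = "org.apache.nutch.searcher.QueryFilter";

    static List<String> rawFields(String pluginXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("plugin.xml could not be parsed", e);
        }
        List<String> fields = new ArrayList<>();
        NodeList extensions = doc.getElementsByTagName("extension");
        for (int i = 0; i < extensions.getLength(); i++) {
            Element ext = (Element) extensions.item(i);
            if (!QUERY_FILTER_POINT.equals(ext.getAttribute("point"))) {
                continue; // only QueryFilter extensions are of interest
            }
            // Looks at every <parameter> below the extension, at any depth.
            NodeList params = ext.getElementsByTagName("parameter");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                if (!"raw-fields".equals(param.getAttribute("name"))) {
                    continue;
                }
                // Assumption: several fields are separated by commas or spaces.
                for (String f : param.getAttribute("value").split("[,\\s]+")) {
                    if (!f.isEmpty()) {
                        fields.add(f);
                    }
                }
            }
        }
        return fields;
    }
}
```

If the returned list does not contain category, the descriptor is the first place to look. If it does, the next suspects are build.xml and whether the plugin is actually enabled and loaded.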

Hope it helps.

2006/10/3, Dima Gritsenko < [hidden email]>:

>
> Hi,
>
> I have categorized web sites during crawl to provide filtered results
> similar to google Video, Images tabs.
>
> But when I enter
> category:video MySearchString
> nutch matches both the video and MySearchString as terms (though it
> filters results correctly and displays links to only video categorized
> pages) but the search is not relevant since "video" string is matched as
> well.
>
> How do I filter category string off during search?
>
> Great thanks.
> Dima.
>
>