Simple question about query terms

Simple question about query terms

Chaz Hickman
I'm experiencing some trouble in forming simple queries that include
non-alphabetic characters. One specific instance is if I want to search
for the string "@test".

If I build up the query using either addRequiredPhrase, addRequiredTerm,
or Query.parse, the search term loses the "@" sign at the front and
returns all the hits for the word "test".

Is there any way I can stop this happening and have more control over
how the query string is handled?

Thanks,
Chaz

Re: Simple question about query terms

Jasper Kamperman
Yes. When you index your pages, the text is run through an analyzer
that breaks it into tokens. The analyzer also normalizes those tokens:
lowercasing, discarding bothersome characters, and stemming (reducing
the word "looking" to "look" because that is the stem of the verb).
Different analyzers apply different kinds of normalization; you can
check in Luke to see a number of the ones that are available.

Obviously, in order to get the correct search results, you have to  
run your search string through the same analyzer that was used when  
indexing the text. Again, with Luke you can see how your query gets  
parsed.
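
To make the effect concrete, here is a minimal plain-Java sketch of two
toy tokenizers. It does not use the Lucene or Nutch classes; the class
and method names (AnalyzerSketch, lettersAndDigits, whitespaceOnly) are
made up for illustration, and the real analyzers do considerably more.
It only shows why a tokenizer that keeps letters and digits turns
"@test" into "test", while one that splits on whitespace leaves it alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy stand-ins for analyzers; these are NOT Lucene or Nutch classes.
class AnalyzerSketch {

    // Lowercases the text and keeps only runs of letters or digits,
    // so punctuation such as '@' is silently dropped.
    static List<String> lettersAndDigits(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Splits on whitespace only and leaves every other character untouched.
    static List<String> whitespaceOnly(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lettersAndDigits("@test")); // [test]
        System.out.println(whitespaceOnly("@test"));   // [@test]
    }
}
```

If the query goes through the first routine, the search is really for
"test", which matches the behaviour described in the question.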

So getting more control over this means selecting a different
analyzer and using it both during indexing and during search. In the
worst case, if no analyzer supports your needs, you have to write one
yourself.
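
As a rough illustration of "same analysis on both sides", the sketch
below builds a tiny in-memory index in plain Java. Everything in it is
hypothetical (the class SameAnalyzerSketch, its methods, and the
extraTokenChars parameter are invented for this example and are not
part of Lucene or Nutch). The point is only that one analyze() routine
is shared by index() and search(), and that letting it keep "@" makes
"@test" and "test" distinct terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index; not the Lucene or Nutch API.
class SameAnalyzerSketch {
    // Characters, besides letters and digits, that count as part of a token.
    private final String extraTokenChars;
    // Maps each token to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    SameAnalyzerSketch(String extraTokenChars) {
        this.extraTokenChars = extraTokenChars;
    }

    // The single analysis routine shared by indexing and searching.
    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(c) || extraTokenChars.indexOf(c) >= 0) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    void index(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of documents that contain every analyzed query token.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(token, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        SameAnalyzerSketch idx = new SameAnalyzerSketch("@");
        idx.index(1, "please run the test suite");
        idx.index(2, "tagged with @test yesterday");
        System.out.println(idx.search("@test")); // [2]
        System.out.println(idx.search("test"));  // [1]
    }
}
```

Constructing it with an empty string instead of "@" reproduces the
original problem: both documents are indexed under "test", and a query
for "@test" is analyzed down to "test" and matches both.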

Hope this helps,

Jasper

On Jan 30, 2008, at 3:34 AM, Chaz Hickman wrote:

> I'm experiencing some trouble in forming simple queries that  
> include non-alphabetic characters. One specific instance is if I  
> want to search for the string "@test".
>
> If I build up the query using either addRequiredPhrase,  
> addRequiredTerm, or Query.parse, the search term loses the "@" sign  
> at the front and returns all the hits for the word "test".
>
> Is there any way I can stop this happening and have more control  
> over how the query string is handled?
>
> Thanks,
> Chaz
>
>

Re: Simple question about query terms

Chaz Hickman
Jasper,

Thanks for the reply; yes, that helps my understanding. I had a quick
look at the Luke tool, which let me see how different analyzers handle
a given piece of text, including the tokens produced by
org.apache.lucene.analysis.WhitespaceAnalyzer. I thought I'd attempt to
use that as a starting point. As I understand it, I need to use it both
when indexing and when querying.

I'm not certain whether I can use this, given it's a Lucene analyzer and
not a Nutch one, but even if I could, how do I use it for indexing? Is
it as simple as specifying it as a plugin in nutch-site.xml, or do I
need to do something more complicated? I've looked through the mailing
list, but can't find anything definitive.

If I can't use that, are there pre-built Nutch analyzers that can be
used? Where can I find out more details about those?

Any more help you could offer would be very gratefully received.

Thanks,
Chaz
