How to search for "C++"?

How to search for "C++"?

Leonardo Dias-5
Hello there!

We're currently having a problem and are looking for solutions. We use
the Standard Tokenizer to split text into tokens, and we just found out
that we cannot search for "c++" in our index because the tokenizer does
not treat it as a word.
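
To illustrate the symptom, here is a minimal Python sketch. It is not
Lucene's actual StandardTokenizer grammar, only a rough stand-in that
keeps runs of letters and digits and drops everything else; the function
name standard_like_tokenize is made up for this example.

```python
import re


def standard_like_tokenize(text):
    """Toy word tokenizer: keep runs of ASCII letters/digits, lowercased.

    This only approximates the behaviour described in the post; it is
    not the real StandardTokenizer.
    """
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


print(standard_like_tokenize("Senior C++ and C# developer"))
# ['senior', 'c', 'and', 'c', 'developer']
```

Under this assumption both "C++" and "C#" collapse to the single token
"c", so neither can be told apart from plain "C" at search time.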

Since we need this search to work properly (including searches for C#),
we'd like to know how you handle searches for words that contain
symbols, like these programming language names. I thought there might
be a list of "protected words" for the Standard Tokenizer, so that we
could protect these tokens. Another possibility would be the Pattern
Tokenizer, but it seems rather slow when indexing a huge amount of
data, which is our case.

What do you think the best solution would be?

Best,

Leonardo

--


RE: How to search for "C++"?

Jana, Kumar Raja
Hi Leonardo,
1. You can change the field type to "string", in which case no tokenizer
will act on your data and the content will be indexed as is, as a
single token.
2. If you are using Solr 1.4 (latest), you can specify protected words
for WordDelimiterFilterFactory, which will take care of your issue.
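
The idea behind option 2 can be sketched in Python as below. This is a
toy model, not Solr's WordDelimiterFilterFactory: the names PROTECTED,
whitespace_tokenize and word_delimiter_filter are hypothetical, and the
sketch assumes the tokenizer in front of the filter splits on whitespace
only, so that "C++" still reaches the filter with its symbols intact.

```python
import re

# Hypothetical protected-words list, playing the role of a protected-words file.
PROTECTED = {"c++", "c#"}


def whitespace_tokenize(text):
    """Split on whitespace only, so symbols stay attached to their token."""
    return text.split()


def word_delimiter_filter(tokens, protected=PROTECTED):
    """Split each token on non-alphanumeric characters unless it is protected."""
    out = []
    for tok in tokens:
        if tok.lower() in protected:
            # Protected tokens pass through whole (lowercased for matching).
            out.append(tok.lower())
        else:
            out.extend(part.lower() for part in re.findall(r"[A-Za-z0-9]+", tok))
    return out


print(word_delimiter_filter(whitespace_tokenize("C++ and C# jobs, full-time")))
# ['c++', 'and', 'c#', 'jobs', 'full', 'time']
```

Note that in this sketch a token such as "C++," (with trailing
punctuation) would not match the protected list and would still be
split, so the list has to match the tokens exactly as the tokenizer
emits them.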

-Kumar

-----Original Message-----
From: Leonardo Dias [mailto:[hidden email]]
Sent: Thursday, March 26, 2009 6:53 PM
To: [hidden email]
Subject: How to search for "C++"?


Re: How to search for "C++"?

Yonik Seeley-2-2
Synonym mappings are an easy way to handle specific cases like these...
C++ => cplusplus
C# => csharp
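
A minimal Python sketch of how such a mapping could behave, assuming the
same analysis runs at both index time and query time and that tokens are
split on whitespace so "c++" survives until the mapping is applied. The
names SYNONYMS, analyze, build_index and search are made up for this
example; this is not Solr's synonym filter.

```python
# Hypothetical mapping mirroring the two rules above.
SYNONYMS = {"c++": "cplusplus", "c#": "csharp"}


def analyze(text, synonyms=SYNONYMS):
    """Lowercase, split on whitespace, trim punctuation, then map synonyms."""
    tokens = []
    for raw in text.lower().split():
        stripped = raw.strip(".,;:!?()\"'")  # '+' and '#' are deliberately kept
        if stripped:
            tokens.append(synonyms.get(stripped, stripped))
    return tokens


def build_index(docs):
    """Build a tiny inverted index: token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for tok in analyze(text):
            index.setdefault(tok, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Senior C++ developer",
    2: "C# and .NET engineer (C#)",
    3: "C programmer",
}
index = build_index(docs)
print(search(index, "c++"))  # {1}
print(search(index, "C#"))   # {2}
print(search(index, "c"))    # {3}
```

Because the query goes through the same mapping as the documents, a
search for "c++" is rewritten to "cplusplus" and no longer collides with
plain "c".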

-Yonik
http://www.lucidimagination.com


On Thu, Mar 26, 2009 at 9:27 AM, Jana, Kumar Raja <[hidden email]> wrote:

> Hi Leonardo,
> 1. You can change the field type to "string", in which case no tokenizer
> will act on your data and the content will be indexed as is, as a
> single token.
> 2. If you are using Solr 1.4 (latest), you can specify protected words
> for WordDelimiterFilterFactory, which will take care of your issue.
>
> -Kumar