How to index and query "C#" as whole term?


How to index and query "C#" as whole term?

Gnanakumar
Hi,

I'm using Apache Solr v3.1.

How do I configure/allow Solr to both index and query the term "c#" as a
whole word/term?  From "Analysis" page, I could see that the term "c#" is
being reduced/converted into just "c" by solr.WordDelimiterFilterFactory.

Regards,
Gnanam


Re: How to index and query "C#" as whole term?

Gora Mohanty-3
On Mon, May 16, 2011 at 7:05 PM, Gnanakumar <[hidden email]> wrote:
> Hi,
>
> I'm using Apache Solr v3.1.
>
> How do I configure/allow Solr to both index and query the term "c#" as a
> whole word/term?  From "Analysis" page, I could see that the term "c#" is
> being reduced/converted into just "c" by solr.WordDelimiterFilterFactory.
[...]

Yes, as you have discovered, the analyzers for the field type in
question will affect the values indexed.

To index "c#" exactly as is, you can use the "string" type instead
of the "text" type. However, you probably want some filters
to be applied, e.g., LowerCaseFilterFactory. Take a look at the
definition of the fieldType "text" in schema.xml, define a new field
type that has only the tokenizer and filters that you need, and
use that type for your field. This Wiki page should be helpful:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Regards,
Gora
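
A field type along the lines Gora describes might look something like the sketch below. The type name "text_code" and the field name "skills" are made up for illustration; the tokenizer and filter factories are standard Solr classes. Note that a whitespace tokenizer keeps all punctuation attached to tokens, not only the '#'.

```xml
<!-- schema.xml sketch: a field type that keeps "c#" as one token.
     "text_code" and "skills" are hypothetical names. -->
<fieldType name="text_code" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same analyzer is used at index and query time -->
  <analyzer>
    <!-- Split on whitespace only, so "C#" stays a single token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase so "C#" and "c#" match; WordDelimiterFilterFactory is left out -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="skills" type="text_code" indexed="true" stored="true"/>
```

After changing the field type, the affected documents would need to be reindexed, and the "Analysis" page can be used to check that "c#" now survives the chain.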

Re: How to index and query "C#" as whole term?

Jonathan Rochkind
I don't think you'd want to use the string type here. String type is
almost never appropriate for a field you want to actually search on (it
is appropriate for fields to facet on).

But you may want to use Text type with different analyzers selected.  
You probably want Text type so the value is still split into different
tokens on word boundaries; you just don't want an analyzer set that
removes punctuation.

On 5/16/2011 10:46 AM, Gora Mohanty wrote:

> Yes, as you have discovered, the analyzers for the field type in
> question will affect the values indexed.
>
> To index "c#" exactly as is, you can use the "string" type instead
> of the "text" type. However, you probably want some filters
> to be applied, e.g., LowerCaseFilterFactory. Take a look at the
> definition of the fieldType "text" in schema.xml, define a new field
> type that has only the tokenizer and filters that you need, and
> use that type for your field. This Wiki page should be helpful:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> Regards,
> Gora
>

RE: How to index and query "C#" as whole term?

Robert Petersen-3
I have always just converted terms like 'C#' or 'C++' into 'csharp' and
'cplusplus' before indexing them and similarly converted those terms if
someone searched on them.  That always has worked just fine for me...
:)

-----Original Message-----
From: Jonathan Rochkind [mailto:[hidden email]]
Sent: Monday, May 16, 2011 8:28 AM
To: [hidden email]
Subject: Re: How to index and query "C#" as whole term?

I don't think you'd want to use the string type here. String type is
almost never appropriate for a field you want to actually search on (it
is appropriate for fields to facet on).

But you may want to use Text type with different analyzers selected.  
You probably want Text type so the value is still split into different
tokens on word boundaries; you just don't want an analyzer set that
removes punctuation.


Re: How to index and query "C#" as whole term?

Markus Jelsma-2
Before indexing, so outside Solr? Using the SynonymFilter would be
easier, I guess.

On Monday 16 May 2011 17:44:24 Robert Petersen wrote:

> I have always just converted terms like 'C#' or 'C++' into 'csharp' and
> 'cplusplus' before indexing them and similarly converted those terms if
> someone searched on them.  That always has worked just fine for me...
>
> :)

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

RE: How to index and query "C#" as whole term?

Robert Petersen-3
Sorry I am also using a synonyms.txt for this in the analysis stack.  I
was not clear, sorry for any confusion.  I am not doing it outside of
Solr but on the way into the index it is converted...  :)

-----Original Message-----
From: Markus Jelsma [mailto:[hidden email]]
Sent: Monday, May 16, 2011 8:51 AM
To: [hidden email]
Subject: Re: How to index and query "C#" as whole term?

Before indexing, so outside Solr? Using the SynonymFilter would be
easier, I guess.
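
A minimal sketch of the synonyms approach discussed here. It assumes a whitespace tokenizer, so that "c#" reaches the synonym filter intact, and it assumes the synonym filter sits before solr.WordDelimiterFilterFactory in the chain. The type name "text_syn" and the exact mappings are illustrative, not taken from anyone's actual configuration.

```xml
<!-- synonyms.txt entries (shown here as a comment). To my knowledge a "#"
     only starts a comment at the beginning of a line, so "c#" is usable
     as a term:

       c#  => csharp
       c++ => cplusplus
-->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer for both indexing and querying, so the same rewrite
       is applied to documents and to search terms -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Rewrite c# and c++ before any later filter can strip the punctuation -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, the word delimiter filter only ever sees "csharp" or "cplusplus", so it has no punctuation left to remove from those terms.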


Re: How to index and query "C#" as whole term?

Erick Erickson
The other advantage to the synonyms approach is it will be much less
of a headache down the road.

For instance, imagine you've defined a field type with just
WhitespaceTokenizerFactory and LowerCaseFilterFactory. That'll fix your
example just fine. It'll also cause all punctuation to be included in
the tokens, so if you indexed "try to find me." (note the period) and
searched for "me" (without the period), you'd not get a hit.

Then, let's say you get clever and do a regex manipulation via
PatternReplaceCharFilterFactory to leave in '#' but remove other
punctuation. Then any miscellaneous input that contains a # will give
surprising results. Consider 15# (for 15 pounds): it won't match 15 in
a search now.

So whatever solution you choose, think about it pretty carefully before
you jump <G>..

Best
Erick
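
For reference, the char-filter variant Erick cautions against might be configured roughly as follows. The type name "text_keep_hash" and the regular expression are illustrative assumptions. With this pattern, "find me." is indexed as "find" and "me", and "c#" survives, but "15#" is also kept as a single token and "c++" is reduced to "c", which is the kind of surprise described above.

```xml
<!-- schema.xml sketch; "text_keep_hash" is a hypothetical name -->
<fieldType name="text_keep_hash" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer. This one replaces every
         character that is not a letter, digit, whitespace or '#' with a space. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^\p{L}\p{N}\s#]" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```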

On Mon, May 16, 2011 at 2:10 PM, Robert Petersen <[hidden email]> wrote:

> Sorry I am also using a synonyms.txt for this in the analysis stack.  I
> was not clear, sorry for any confusion.  I am not doing it outside of
> Solr but on the way into the index it is converted...  :)

RE: How to index and query "C#" as whole term?

Gnanakumar
Thank you all for your valuable suggestions.  I'll set it up in
synonyms.txt using solr.SynonymFilterFactory. Hope this fits the bill.

-----Original Message-----
From: Erick Erickson [mailto:[hidden email]]
Sent: Tuesday, May 17, 2011 2:12 AM
To: [hidden email]
Subject: Re: How to index and query "C#" as whole term?

The other advantage to the synonyms approach is it will be much less
of a headache down the road.

For instance, imagine you've defined a field type with just
WhitespaceTokenizerFactory and LowerCaseFilterFactory. That'll fix your
example just fine. It'll also cause all punctuation to be included in
the tokens, so if you indexed "try to find me." (note the period) and
searched for "me" (without the period), you'd not get a hit.

Then, let's say you get clever and do a regex manipulation via
PatternReplaceCharFilterFactory to leave in '#' but remove other
punctuation. Then any miscellaneous input that contains a # will give
surprising results. Consider 15# (for 15 pounds): it won't match 15 in
a search now.

So whatever solution you choose, think about it pretty carefully before
you jump <G>..

Best
Erick


