Question about word treatment...

escher2k
(1) How does one ensure that Solr treats words like .Net and 3D correctly? Right now, they get translated into Net and 3 d respectively.

(2) Is it possible to force Lucene to treat a multiword (e.g. Ruby on Rails) as one word? I am not sure if there is a mechanism to do this by creating a special text file (like the one that exists for synonyms, for instance).

Thanks.

Re: Question about word treatment...

Otis Gospodnetic-2
Hi,

Didn't see anyone answering your questions...

1) You'll have to write your own analyzer and tokenizer that do the right thing for your input. From what you've described so far, maybe you can simply use the WhitespaceAnalyzer or something similar.

2) Again, you'd have to write your own analyzer and tokenizer that keep track of a sliding window of the last N tokens and look that window up in your synonym table. When the window matches a phrase in the lookup table, the tokenizer returns those last N tokens as a single token. Something like that....
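
To make the sliding-window idea in point 2 concrete, here is a minimal sketch in plain Python. It is not Lucene or Solr code and does not use their APIs; the function name `merge_phrases` and the phrase table are hypothetical, chosen only for illustration. It also looks ahead from the current token instead of back over the last N tokens, which gives the same result on a token list that is already in memory.

```python
def merge_phrases(tokens, phrases):
    """Merge known multi-word phrases into single tokens.

    tokens  -- list of strings from an upstream tokenizer
    phrases -- iterable of phrases, e.g. ["ruby on rails"]
               (a hypothetical lookup table, not a Solr file format)
    """
    # Index the lookup table as lowercased word tuples for fast matching.
    table = {tuple(p.lower().split()) for p in phrases}
    if not table:
        return list(tokens)
    longest = max(len(p) for p in table)

    out = []
    i = 0
    while i < len(tokens):
        # Try the widest window first so that longer phrases win.
        for n in range(min(longest, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in table:
                # Emit the whole phrase as one token, keeping original case.
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here: pass the token through unchanged.
            out.append(tokens[i])
            i += 1
    return out


print(merge_phrases("I like Ruby on Rails a lot".split(), ["ruby on rails"]))
```

A real implementation would live inside a TokenFilter and would also have to decide how to set token positions and offsets for the merged token, which this sketch ignores.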

Otis

--
Lucene Consulting -- http://lucene-consulting.com/


----- Original Message ----
From: escher2k <[hidden email]>
To: [hidden email]
Sent: Friday, May 4, 2007 4:08:03 PM
Subject: Question about word treatment...


(1) How does one ensure that Solr treats words like .Net and 3D correctly? Right now, they get translated into Net and 3 d respectively.

(2) Is it possible to force Lucene to treat a multiword (e.g. Ruby on Rails) as one word? I am not sure if there is a mechanism to do this by creating a special text file (like the one that exists for synonyms, for instance).

Thanks.

Re: Question about word treatment...

Chris Hostetter-3
In reply to this post by escher2k
: (1) How does one ensure that Solr treats words like .Net and 3D correctly?
: Right now, they get translated into Net and 3 d respectively.

Solr doesn't do anything special with your input by default; it only
does what your schema.xml tells it to do. If you use the example schema,
then some text fields might be configured to use the WordDelimiterFilter
(which would split 3D into 3, D). If you don't like that behavior you
can change it. There are a lot of Tokenizer and TokenFilter options
available out of the box, all of which are well documented on the Wiki,
and as you play with them it's easy to see what they do using the ANALYSIS
link on the Solr admin screen.
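
To illustrate why the choice of analysis chain matters for the inputs in the question, here is a small Python sketch comparing plain whitespace splitting with a rough approximation of delimiter-style splitting. This is not the WordDelimiterFilter implementation and makes no claim about its exact rules; the function names and the regular expression are assumptions made for illustration only.

```python
import re


def whitespace_tokens(text):
    # Split on whitespace only; punctuation and digits stay attached.
    return text.split()


def delimiter_tokens(text):
    # Rough approximation (an assumption, not Solr's actual rules):
    # drop non-alphanumeric characters and break each word wherever
    # a run of letters meets a run of digits.
    tokens = []
    for word in text.split():
        tokens.extend(re.findall(r"[A-Za-z]+|[0-9]+", word))
    return tokens


sample = ".Net and 3D"
print(whitespace_tokens(sample))  # ['.Net', 'and', '3D']
print(delimiter_tokens(sample))   # ['Net', 'and', '3', 'D']
```

The second output resembles the behavior described in the question (before any lowercasing), while the first keeps .Net and 3D intact, which is the kind of difference the analysis page makes visible for a real field type.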


-Hoss


Re: Question about word treatment...

Yonik Seeley-2
In reply to this post by escher2k
On 5/4/07, escher2k <[hidden email]> wrote:
> (2) Is it possible to force Lucene to treat a multiword (e.g. Ruby on Rails)
> as one word? I am not sure if there is a mechanism to do this by creating a
> special text file (like the one that exists for synonyms, for instance).

Solr's SynonymFilter can handle multi-token synonyms.  That can be
used for things like Ruby on Rails.

-Yonik