Doubt in the way Lucene works

Roopesh P Raj
Hi,

I am using Solr in my project, with a schema almost identical to the
one in the example folder that ships with the Solr download. Most of
the fields I use are of type "text", and the rest are of type "string".

Some of the search results are as follows:

When I search for "attach", I get documents containing "attach",
"attachment", and "attachments".
When I search for "attachment", I again get documents containing
"attach", "attachment", and "attachments".

When I search for "newsletter", I get documents containing "newsletter".
But when I search for "news", no results appear.
When I search for "letter", there are no results either.

Why does this happen?
Why is Lucene not returning documents containing "newsletter" when the
search string is "letter" or "news"?

I am also pasting the "text" fieldType declaration. Please help me.

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Regards
Roopesh


------------------
DigitalGlue, India



RE: Doubt in the way Lucene works

Stu Hood
Hello Roopesh,

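What you are seeing is called 'stemming'. A stemmer reduces each token to a language-specific root form, mostly by stripping suffixes. So when you search for 'attach', you also get 'attachment' and 'attachments', because all three are reduced to the same stem. Your schema applies EnglishPorterFilterFactory in both the index and query analyzers, which is why a search for 'attachment' behaves the same way.

Newsletter is an interesting example: you will never get a match when you search for 'letter', because stemming only strips word endings, and Lucene matches whole indexed terms, not substrings. 'newsletter' is indexed as a single term, and a stemmer does not split compound words. The same reasoning covers 'news': the stemmer is purely rule-based and knows nothing about meaning, and its rules never reduce 'newsletter' to the same stem as 'news', so the two terms do not match. Your WordDelimiterFilterFactory does not help here either, since it splits on delimiters and case changes (for example 'news-letter' or 'NewsLetter'), not inside a plain lowercase word.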
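
To make the whole-term matching concrete, here is a minimal sketch in Python. It is only a toy: `toy_stem` is a crude suffix stripper with an assumed suffix list, not the Porter algorithm that Solr uses, and `build_index` and `search` are hypothetical helpers that imitate an inverted index with exact term lookup. The sample documents are made up for illustration.

```python
# Toy illustration only: a crude suffix stripper standing in for a real stemmer.
# It is NOT the Porter algorithm; the suffix list below is an assumption.
SUFFIXES = ("ments", "ment", "s")


def toy_stem(token):
    """Lowercase the token and strip the first matching suffix, if any."""
    token = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stemmed, whitespace-separated token to the ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Look up the stemmed query as one whole term; there is no substring matching."""
    return index.get(toy_stem(query), set())


# Made-up sample documents.
docs = {
    1: "please attach the file",
    2: "see the attachment",
    3: "two attachments enclosed",
    4: "our monthly newsletter",
}
index = build_index(docs)

print(search(index, "attach"))      # documents 1, 2 and 3 share the stem "attach"
print(search(index, "newsletter"))  # document 4
print(search(index, "news"))        # empty: "news" becomes "new", which is not an indexed term
print(search(index, "letter"))      # empty: "letter" is not an indexed term
```

The point of the sketch is that the index only knows complete terms. 'attach', 'attachment', and 'attachments' collapse to one term, while 'newsletter' stays a single term that neither 'news' nor 'letter' can reach.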

I can't find any good Solr-specific stemming links, but check out the Wikipedia page: http://en.wikipedia.org/wiki/Stemming

Thanks,
Stu


Re: Doubt in the way Lucene works

Roopesh P Raj
Hi Stu,

Thank you very much for your reply. It cleared up a lot of things.

Thanks,
Roopesh
