Doing Shingle but also keep special single word

scott.chu
I am building an index with the Shingle filter. I know it produces 2-grams at minimum, but I also want to keep some special single words, e.g. IBM, Microsoft, etc. That is, I want minimum 2-gram shingles but also want these single words in my index. Is that possible?

Scott

Re: Doing Shingle but also keep special single word

Brendan Grainger
Hi Scott,

Is there a reason why you wouldn't just index these special words into another field and then search over both fields? That would also have the nice property of being able to boost on the special word field if you wanted.

HTH
Brendan

On Aug 20, 2010, at 6:19 AM, scott chu (朱炎詹) wrote:

> I am building an index with the Shingle filter. I know it produces 2-grams at minimum, but I also want to keep some special single words, e.g. IBM, Microsoft, etc. That is, I want minimum 2-gram shingles but also want these single words in my index. Is that possible?
>
> Scott

Re: Doing Shingle but also keep special single word

scott.chu
Hi Brendan,

    Thanks for the reply. The real issue is that I can't predict when a new
important special word that users are interested in will appear, because I am
building an index of daily news articles. Therefore, I don't know when and
which single words should be added to that new field. I once thought about
manually maintaining a special-word dictionary, but it would take too much
effort, so I gave up on that idea.

However, your suggestion still sounds like a good trade-off to me; I'll
consider it seriously.

Scott

----- Original Message -----
From: "Brendan Grainger" <[hidden email]>
To: <[hidden email]>
Sent: Friday, August 20, 2010 10:06 PM
Subject: Re: Doing Shingle but also keep special single word


Re: Doing Shingle but also keep special single word

iorixxx
In reply to this post by scott.chu
> I am building index with Shingle
> filter. We know it's minimum 2-gram but I also want keep
> some special single word, e.g. IBM, Microsoft, etc. i.e. I
> want to do a minimum 2-gram but also want to have these
> single word in my index, Is it possible?

Does the outputUnigrams="true" parameter not work for you?

After that you can add <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>, with keepwords.txt containing IBM and Microsoft.
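
To make the suggestion concrete, here is a minimal sketch in plain Python of what shingling with and without unigrams produces. It is not Solr's implementation; the function names are hypothetical. The keep-list step is applied to single-word tokens only, which is an assumption about the intended result: a real keep-word filter placed after the shingle filter sees every token, shingles included.

```python
def shingles(tokens, size=2, output_unigrams=False):
    """Return word n-grams (shingles) of the given size, optionally with the single tokens."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)  # the single word itself
        if i + size <= len(tokens):
            out.append(" ".join(tokens[i:i + size]))  # the shingle starting at this word
    return out


def keep_special_unigrams(tokens, keep_words):
    """Keep every multi-word shingle, but only the single words listed in keep_words."""
    keep = {w.lower() for w in keep_words}
    return [t for t in tokens if " " in t or t.lower() in keep]


words = "IBM buys a startup".split()
with_unigrams = shingles(words, output_unigrams=True)
print(shingles(words))   # shingles only
print(with_unigrams)     # shingles plus every single word
print(keep_special_unigrams(with_unigrams, ["IBM", "Microsoft"]))
```

The last line keeps the shingles and drops every single word except the listed ones, which is the index content the original question asks for.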


     
Re: Doing Shingle but also keep special single word

scott.chu
Won't setting outputUnigrams="true" make the index about twice as large as
when it's set to false?

Scott

----- Original Message -----
From: "Ahmet Arslan" <[hidden email]>
To: <[hidden email]>
Sent: Saturday, August 21, 2010 1:15 AM
Subject: Re: Doing Shingle but also keep special single word


Re: Doing Shingle but also keep special single word

iorixxx
> Isn't set outputUnigrams="true" will
> make index size about twice than when it's set to false?

Sure, the index will be bigger. I didn't know that was a problem for you. But if you have a list of special single words that you want to keep, KeepWordFilter can eliminate the other tokens, so the index size will be okay.
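
As a rough check on the "about twice" estimate, the following sketch counts the tokens a bigram shingler would emit for one field, with and without unigrams. The function name is hypothetical, and token counts are only a proxy: the on-disk size also depends on how many distinct terms there are and on what else is stored, so treat this as an estimate.

```python
def emitted_tokens(num_words, shingle_size=2):
    """Count tokens emitted for a field of num_words words."""
    # A field of n words yields n - size + 1 shingles (none if it is too short).
    shingle_count = max(num_words - shingle_size + 1, 0)
    return {
        "shingles_only": shingle_count,
        "with_unigrams": shingle_count + num_words,  # every word is emitted as well
    }


counts = emitted_tokens(500)
print(counts)
print(counts["with_unigrams"] / counts["shingles_only"])
```

For a 500-word article this gives 499 tokens without unigrams and 999 with them, so the token count roughly doubles.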

Re: Doing Shingle but also keep special single word

MitchK
Hi,

KeepWordFilter is not a solution for this problem, since it means someone has to maintain a word dictionary. As explained, that would take too much effort.

You can simply add outputUnigrams=true and check analysis.jsp for this field. That way you can see how much bigger a single field becomes with this option.
However, I am quite sure that the difference between using outputUnigrams=true and indexing into a separate field is not significant.

I would suggest the additional-field approach, since it gives you more flexibility in boosting the different fields.

Unfortunately, I haven't understood your explanation of the use case. But it sounds a little bit like tagging?

Kind regards,
- Mitch

Re: Doing Shingle but also keep special single word

scott.chu
I think I didn't state my problem very well, so allow me to rephrase my case:

1. We have over ten million news articles to build into a Solr index.
2. We copy several fields, such as title, author, body, and captions of
attached photos, into a new field for default search.
3. We then want to use the shingle filter on this new field.
4. We can't predict which new single-word nouns our users may be interested
in, because it's "news", you know. For example, the word "ECFA" has only
recently become very popular in the news here, so I want users to be able to
type in 'ECFA' and have Solr return relevant news articles.
5. I wish to keep the index as small as possible.
6. I also wish to do the same thing described in 5 when I search by explicitly
specifying the field names of those fields.

I don't quite understand the additional-field approach. Do you mean making
another field that stores the special words but is not indexed?

Scott

----- Original Message -----
From: "MitchK" <[hidden email]>
To: <[hidden email]>
Sent: Sunday, August 22, 2010 11:48 PM
Subject: Re: Doing Shingle but also keep special single word


> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Doing Shingle but also keep special single word

iorixxx
Can I ask why you need/use the shingle filter?


     
Re: Doing Shingle but also keep special single word

MitchK
In reply to this post by scott.chu
No, I mean that you use an additional (indexed) field for searching, e.g. a whitespace-tokenized one, so that every word separated by whitespace becomes a token.
So you have two fields (a shingle-token field and a single-token field), and you can search across both.
This provides several benefits: for example, you can boost the shingle field at query time, since a match in the shingle field means that an exact phrase matched.

Additionally, you can search with single-word queries as well as multi-word queries.
Furthermore, you can apply synonyms to your single-token field.
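
A minimal sketch of how a client might query both fields with a boost on the shingle field, using Solr's dismax parser and its qf parameter. The field names text_shingle and text_single and the boost value are assumptions made for illustration; they do not come from the thread.

```python
from urllib.parse import parse_qs, urlencode


def build_query(user_input, shingle_boost=2.0):
    """Build the query string for a dismax search over both fields."""
    params = {
        "q": user_input,
        "defType": "dismax",
        # Hypothetical field names: matches in the shingle field count more.
        "qf": f"text_shingle^{shingle_boost} text_single",
    }
    return urlencode(params)


query_string = build_query("ECFA")
print(query_string)
print(parse_qs(query_string)["qf"])  # decoded again, to show what the server receives
```

A single-word query such as ECFA can then match in the single-token field, while multi-word queries can also score through the shingle field.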

If you want to keep your index as small as possible but as large as needed, try to understand Lucene's Similarity implementation to decide whether you can set the field options omitNorms="true" or omitTermFreqAndPositions="true".
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/Similarity.html
Keep in mind what happens if you set one of those options.

A small example of the consequences of setting omitNorms=true:
doc1: "this is a short example doc"
doc2: "this is a longer example doc for presenting the effect of omitNorms"

If you search for "doc" while omitNorms=false, your response will look like this:
doc1,
doc2
This is because the norm value for doc1 is larger than the norm value for doc2, since doc1 is shorter than doc2 (have a look at the provided link).

If omitNorms=true, the scores for both docs will be equal.
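
The two example documents can be run through a toy scorer to see the effect. This is a simplified sketch, not Lucene's scoring: it multiplies the raw term count by a length norm of 1/sqrt(number of tokens), which as far as I know is the formula in Lucene's DefaultSimilarity, and it ignores idf, query normalization, and the single-byte encoding of norms. The function name is hypothetical.

```python
import math

DOCS = {
    "doc1": "this is a short example doc",
    "doc2": "this is a longer example doc for presenting the effect of omitNorms",
}


def toy_score(text, term, omit_norms):
    """Term count times length norm; the norm is fixed at 1.0 when norms are omitted."""
    tokens = text.lower().split()
    tf = tokens.count(term.lower())
    norm = 1.0 if omit_norms else 1.0 / math.sqrt(len(tokens))
    return tf * norm


for omit in (False, True):
    scores = {name: round(toy_score(text, "doc", omit), 3) for name, text in DOCS.items()}
    print("omitNorms =", omit, scores)
```

With norms, doc1 (6 tokens) scores higher than doc2 (12 tokens); without them, both score the same.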

Kind regards,
- Mitch

Re: Doing Shingle but also keep special single word

scott.chu
In reply to this post by iorixxx
The request comes from our business team; they want users of our product to be
able to type in part of a word that exists in the title or body field. But now
I also doubt whether this request is really necessary.

Scott

----- Original Message -----
From: "Ahmet Arslan" <[hidden email]>
To: <[hidden email]>
Sent: Monday, August 23, 2010 8:35 PM
Subject: Re: Doing Shingle but also keep special single word


Re: Doing Shingle but also keep special single word

scott.chu
In reply to this post by MitchK
Thanks! I'll put more effort into understanding your suggestion and that norm
thing.

----- Original Message -----
From: "MitchK" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, August 24, 2010 5:28 AM
Subject: Re: Doing Shingle but also keep special single word



Why it's boosted up?

scott.chu
In reply to this post by MitchK
On Lucene's web page, there is this paragraph:

"Indexing time boosts are preprocessed for storage efficiency and written to
the directory (when writing the document) in a single byte (!) as follows:
For each field of a document, all boosts of that field (i.e. all boosts
under the same field name in that doc) are multiplied. The result is
multiplied by the boost of the document, and also multiplied by a "field
length norm" value that represents the length of that field in that doc (so
shorter fields are automatically boosted up). "

I thought that the greater the value, the higher the boost. Then why are short
fields boosted up? Isn't the norm value for short fields smaller?

Re: Why it's boosted up?

MitchK
Hi Scott,

> (so shorter fields are automatically boosted up).

The theory behind that is the following (in simple terms):
Let's say you have two documents, and each contains only one field (as in my example).
We also have a query that contains two words.
Let's say doc1 contains 10 words and doc2 contains 20 words.
The query matches both docs with both words.
The idea behind boosting shorter fields more strongly than longer ones is this:
In doc1, 2/10 = 0.2 => 20% of the words match your query.
In doc2, 2/20 = 0.1 => 10% of the words match your query.

So doc1 should get a better score, because its ratio of matching words to total words is greater than doc2's.
This is the idea behind using norms as an index-time boosting factor. NOTE: This does not mean that doc1 gets boosted by 20% and doc2 by 10%! It only illustrates the idea behind such norms.
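
The arithmetic above can be put next to an actual length norm. This sketch assumes the 1/sqrt(number of words) formula, which as far as I know is what Lucene's DefaultSimilarity uses for lengthNorm(); the helper names are hypothetical, and the stored norm is additionally rounded into a single byte, which is not modeled here.

```python
import math


def match_ratio(matching_words, total_words):
    """Share of the field's words that match the query (the intuition only)."""
    return matching_words / total_words


def length_norm(total_words):
    """Assumed norm formula: shorter fields get a larger value."""
    return 1.0 / math.sqrt(total_words)


for name, total in (("doc1", 10), ("doc2", 20)):
    print(name, match_ratio(2, total), round(length_norm(total), 3))
```

Both columns rank doc1 above doc2, but the norm is not the same number as the match ratio, which is the point of the note above.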

From the Similarity class's documentation of lengthNorm():

> Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large, and larger values when numTokens is small.

However, as the developer of a search application, it is your task to decide whether this theory applies to your application. In some cases using norms makes no sense; in others it does.
If you think norms matter for your project, omitting them is not a good way to save disk space.
Furthermore, if you think the theory does apply to the business needs of your application but its impact is currently too heavy, have a look at SweetSpotSimilarity in Lucene.

> The request is from our business team; they want users of our product to be
> able to type in part of a word that exists in the title or body field.

You mean something like typing "note" and also getting results like "notebook"?
The correct approach for that is not the ShingleFilter but n-grams or edge n-grams.
Shingles do something like this:
"This is my shingle sentence" -> "This is", "is my", "my shingle", "shingle sentence". The filter breaks the sentence into overlapping word pairs. The benefit of doing so is that, if a query matches one of these shingles, you have found a short phrase without using the performance-hungry PhraseQuery feature.

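To illustrate the "note" / "notebook" case, here is a minimal sketch of front-anchored (edge) n-grams in plain Python. It is not Solr's filter; the function name and the minimum gram size of 2 are assumptions made for the example.

```python
def edge_ngrams(word, min_gram=1, max_gram=None):
    """Return the prefixes of word whose lengths lie between min_gram and max_gram."""
    max_gram = len(word) if max_gram is None else min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, max_gram + 1)]


indexed = edge_ngrams("notebook", min_gram=2)
print(indexed)
print("note" in indexed)  # a query for "note" finds the indexed word "notebook"
```

Because every prefix is indexed as its own token, a plain term query for "note" matches without any wildcard search.
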
Kind regards,
- Mitch

Re: Doing Shingle but also keep special single word

iorixxx
In reply to this post by scott.chu
> The request is from our business team; they want users of our product to be
> able to type in part of a word that exists in the title or body field. But now
> I also doubt whether this request is really necessary.

"Partial string of a word"? I think there is a misunderstanding here. ShingleFilter operates at the token level:

please divide this text => "please divide", "divide this", "this text"

If you want to match part of a single word, EdgeNGramFilter and NGramFilter are the filters for that purpose.
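
The difference between the two n-gram variants can be sketched in plain Python: ordinary n-grams cover substrings anywhere in the word, while edge n-grams cover only its beginning. This is an illustration under assumed gram sizes, not the Solr filters themselves, and the function name is hypothetical.

```python
def ngrams(word, min_gram, max_gram):
    """Return every substring of word with a length between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(word) - size + 1):
            grams.append(word[start:start + size])
    return grams


word = "notebook"
all_grams = ngrams(word, 4, 4)
prefixes = [word[:n] for n in range(1, len(word) + 1)]  # what edge n-grams would give

print(all_grams)
print("book" in all_grams, "book" in prefixes)
```

A query for "book" matches only with the ordinary n-grams, at the cost of more tokens per word than the prefix-only variant.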


     
Re: Why it's boosted up?

iorixxx
In reply to this post by scott.chu
> Then why short fields are boost up?

In other words, longer documents are penalized, because they are likely to contain many terms. Without this mechanism, longer documents would take over and usually show up on the first page.


     
Re: Why it's boosted up?

scott.chu
In reply to this post by MitchK
Thanks for your clear explanation! I got it :)
----- Original Message -----
From: "MitchK" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, August 24, 2010 3:37 PM
Subject: Re: Why it's boosted up?


Re: Why it's boosted up?

scott.chu
In reply to this post by iorixxx
Thanks! That makes sense :)

----- Original Message -----
From: "Ahmet Arslan" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, August 24, 2010 4:30 PM
Subject: Re: Why it's boosted up?


Re: Doing Shingle but also keep special single word

scott.chu
In reply to this post by iorixxx
Thanks! It seems I really was going in the wrong direction.

----- Original Message -----
From: "Ahmet Arslan" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, August 24, 2010 4:21 PM
Subject: Re: Doing Shingle but also keep special single word

