common words not stop words?? how to ??

rubdabadub
Hi:

I was wondering how you guys are dealing with "common words". What I
mean by common words is the ones that fall outside the "stop words"
category. Of course, "stop words" is subjective, i.e. it's up to the
implementor. What I would like to know is how to increase or decrease
the boost value based on such "common words". Should I have two fields,
"Common_Words_Plus" and "Common_Words_Minus": plus for words that
need to be boosted up, and minus for words that need to be boosted
down?

That sounds like a not-so-professional quick fix. Does anyone
have a better solution? How are you dealing with this?

Regards

Re: common words not stop words?? how to ??

Walter Underwood, Netflix
Lucene/Solr does this automatically. That is how a tf.idf
engine works: it boosts rare words.
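
To make "boosts rare words" concrete, here is a minimal sketch of the idf half of tf.idf in plain Python. It uses the textbook formula log(N / df), which is an assumption for illustration; Lucene's actual scoring differs in its details. The function name `idf_weights` and the sample titles are made up for this example.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Return {term: idf} for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Textbook idf (an assumption, not Lucene's exact formula):
    # terms found in fewer documents get a larger weight.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical titles, invented for this sketch.
titles = [
    "the weather in london",
    "the weather in paris",
    "the next big virus",
]
weights = idf_weights([title.split() for title in titles])

# Print terms from most common (lowest weight) to rarest (highest weight).
for term, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{term:10s} {weight:.3f}")
```

A word that occurs in every title ("the") ends up with a weight of zero, while a word that occurs in only one title ("virus") gets the highest weight, so common words are played down without anyone maintaining a list of them.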

Do you have examples of problems or are you worrying about
something that might happen?

wunder

On 2/19/07 1:22 AM, "rubdabadub" <[hidden email]> wrote:

> Hi:
>
> I was wondering how you guys are dealing with "common words". What I
> mean by common words is the ones that fall outside the "stop words"
> category. Of course, "stop words" is subjective, i.e. it's up to the
> implementor. What I would like to know is how to increase or decrease
> the boost value based on such "common words". Should I have two fields,
> "Common_Words_Plus" and "Common_Words_Minus": plus for words that
> need to be boosted up, and minus for words that need to be boosted
> down?
>
> That sounds like a not-so-professional quick fix. Does anyone
> have a better solution? How are you dealing with this?
>
> Regards

Re: common words not stop words?? how to ??

rubdabadub
Walter:

Thanks for the feedback.

On 2/19/07, Walter Underwood <[hidden email]> wrote:
> Lucene/Solr does this automatically. That is how a tf.idf
> engine works: it boosts rare words.
>
> Do you have examples of problems or are you worrying about
> something that might happen?

Actually, my use case is the following. Let's say, hypothetically, that
you have a field holding sentence-long titles. If you read those titles,
you can pretty much group 100 of them into 5 subject matters. A
hypothetical example (the total number of titles is 125; 25 of them
cannot be grouped):

22 titles are about: How good is Person X
14 titles are about: How bad is Product Y
10 titles are about: London weather
36 titles are about: How cool is the movie Z
18 titles are about: The next big MS virus

What I am trying to achieve is this:

I would like to weed out "London weather" as a group because it is not
interesting in my use case. Let's say it is noise, not signal. So I
thought I could use some "common words" for that. Furthermore, I was
thinking that with common words I could boost certain fields, i.e. if
Person X is a known person, for example a "Prime Minister" or a "movie
star", then a certain word attached to another known word means the
title is important. Maybe I defined my problem wrongly. I hope the
above gives you an overview.
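
To show roughly what I have in mind, here is a minimal sketch in plain Python, outside of Solr. Everything in it is hypothetical: the term lists, the boost factors, the sample titles and the function name `title_boost` are made up for illustration, and how the resulting number would be fed to the search engine is left open.

```python
# Hypothetical, hand-curated lists; the words and factors are assumptions.
NOISE_TERMS = {"weather"}  # titles containing these words are weeded out
BOOST_TERMS = {"prime minister": 2.0, "movie star": 1.5}  # known phrases

def title_boost(title):
    """Return None for a noise title, otherwise a boost factor >= 1.0."""
    text = title.lower()
    if set(text.split()) & NOISE_TERMS:
        return None  # noise, not signal
    boost = 1.0
    for phrase, factor in BOOST_TERMS.items():
        # Naive substring match, good enough for a sketch.
        if phrase in text:
            boost *= factor
    return boost

# Hypothetical titles, one per kind of case.
titles = [
    "London weather this week",
    "How good is the Prime Minister",
    "The next big MS virus",
]

# Keep only the titles that are not noise, together with their boost.
kept = {}
for title in titles:
    boost = title_boost(title)
    if boost is not None:
        kept[title] = boost
print(kept)
```

So the "minus" list drops a whole group, and the "plus" list raises titles that mention a known phrase; everything else keeps a neutral boost of 1.0.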

Regards