Question: Lucene MoreLikeThis score values all the same:

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Question: Lucene MoreLikeThis score values all the same:

vybe3142
As a test, I tried to compare a few documents on various topics (a few on
linux, and another on the U.S. constitution) to a source document on linux
using a query formed by MoreLikeThis.

1. Looking at the hits, they have the same score. I'd expect them to be
different, based on their relevance to the source document. Any ideas?
2. I can't figure out how the search picks up the article on the US
constitution as similar. Any hints? I can tweak the search a bit to limit
this by increasing the min. term frequency via setMinTermFreq() but given
that the interesting terms are so vastly different, I wouldn't have thought
this necessary.

This is my output. I can paste my source code in too if needed.

Thanks

=================================================================



read file c:/tmp/similarity2/linux.txt

read file c:/tmp/similarity2/linuxkernel.txt

read file c:/tmp/similarity2/linuxtoo.txt

read file c:/tmp/similarity2/constitution.txt
index size 4
Field option:title
Field option:name
MLT parms:
    maxQueryTerms  : 25
    minWordLen     : 3
    maxWordLen     : 0
    fieldNames     : title
    boost          : false
    minTermFreq    : 5
    minDocFreq     : 1

query formed for source doc linux.txt is title:linux title:can title:want
title:system title:operating title:our title:information title:you
title:linus title:released title:found title:kernel title:use title:its
title:about title:page title:more
Interesting terms for c:/tmp/similarity2/linux.txt :linux can want system
operating our information you linus released found kernel use its about page
more
Interesting terms for c:/tmp/similarity2/linuxtoo.txt :linux you free
software can like your operating system journal other systems development
commercial productivity aug-22-08 many web even which video network popular
platforms source
Interesting terms for c:/tmp/similarity2/linuxkernel.txt :linux kernel
version torvalds support edit code system stable gpl retrieved linus sco
released changes series drivers also only number operating were binary new
has
Interesting terms for c:/tmp/similarity2/constitution.txt :shall president
states united votes senate vice office electors congress law person them
have may from which number time all state other


number of matches : 4
1.  [c:/tmp/similarity2/linux.txt]  score: 0.46413344
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< why the same score
2.  [c:/tmp/similarity2/linuxkernel.txt]  score: 0.46413344
3.  [c:/tmp/similarity2/linuxtoo.txt]  score: 0.46413344
4.  [c:/tmp/similarity2/constitution.txt]  score: 0.46413344
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< why ???
Reply | Threaded
Open this post in threaded view
|

Re: Question: Lucene MoreLikeThis score values all the same:

hossman

: 1. Looking at the hits, they have the same score. I'd expect them to be
: different, based on their relevance to the source document. Any ideas?
        ...
: This is my output. I can paste my source code in too if needed.

The output of arbitrary "secret" code isn't really a very useful for the
purposes of debugging ... since we have no idea what your code does, we
can't really begin to guess why it produces the output that it does.

Off the cuff, based on your choice of wording and the fact, i suspect
you are using the "Hits" class which attempts to normalize scores -- which
could explain a big part of your confusion.

In general: when you want to understand why a Document scores they way it
does against a query: use explain().



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]