Boost Document

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Boost Document

John Pailet
Hello,

I have a problem with the BOOST DOCUMENT method.

When I have 2 documents that have exactly the same data, but different boost value.
The order does not respect the boost value. It the following exemples, the first document of the search is the document with the lower boost value... is it a bug ?

Here is my test code:

public void testIt() throws IOException, ParseException
        {
                RAMDirectory ramDir = new RAMDirectory();
                IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(), true);
               
                Document doc1 = new Document();
                Field word = new Field("WORD","home", Field.Store.YES,Field.Index.TOKENIZED);
                word.setBoost(15);
                doc1.add(word);
                Field id = new Field("ID","111", Field.Store.YES,Field.Index.NO);
                doc1.add(id);
                doc1.setBoost(3163);
               
                Document doc2 = new Document();
                word = new Field("WORD","home", Field.Store.YES,Field.Index.TOKENIZED);
                word.setBoost(15);
                doc2.add(word);
                id = new Field("ID","222", Field.Store.YES,Field.Index.NO);
                doc2.add(id);
                doc2.setBoost(3150);
               
               
                writer.addDocument(doc2);
                writer.addDocument(doc1);
               
                writer.optimize();
                writer.close();
               
                Searcher searcher = new IndexSearcher(ramDir);
               
                QueryParser queryParserWord = new QueryParser("WORD",new StandardAnalyzer());
                Query query = queryParserWord.parse("home");
               
                Hits hits = searcher.search(query);
               
                for (int i = 0; i < hits.length(); i++) {
                        System.out.println("----- DOC " + hits.doc(i).get("ID") +"-----" + hits.doc(i).get("WORD")+"-");
                        System.out.println(searcher.explain(query,hits.id(i)).toString());
                }
        }


Thanks a lot !
Reply | Threaded
Open this post in threaded view
|

Re: Boost Document

Chris Hostetter-3

: When I have 2 documents that have exactly the same data, but different boost
: value.
: The order does not respect the boost value. It the following exemples, the
: first document of the search is the document with the lower boost value...
: is it a bug ?

i would suggest you look at the explain output ... but it seems you
already are: what does that tell you? what scores do the indvidual
documents get, and how do their explanations differ?

if i had to guess, i would suspect that your boosts are big enough that
they become equivilent when byte encoded and stored as fieldNorms (that's
what happens to document boosts, they get written as part of hte fieldNorm
for each field in the document) and the only reason you are getting the
order that you are is because doc2 has a lower internal id then doc1 ....
if i'm right, then the scores and the explanations would be equal.

...does that jive with what you are seeing?


:
: Here is my test code:
:
: public void testIt() throws IOException, ParseException
: {
: RAMDirectory ramDir = new RAMDirectory();
: IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(),
: true);
:
: Document doc1 = new Document();
: Field word = new Field("WORD","home",
: Field.Store.YES,Field.Index.TOKENIZED);
: word.setBoost(15);
: doc1.add(word);
: Field id = new Field("ID","111", Field.Store.YES,Field.Index.NO);
: doc1.add(id);
: doc1.setBoost(3163);
:
: Document doc2 = new Document();
: word = new Field("WORD","home", Field.Store.YES,Field.Index.TOKENIZED);
: word.setBoost(15);
: doc2.add(word);
: id = new Field("ID","222", Field.Store.YES,Field.Index.NO);
: doc2.add(id);
: doc2.setBoost(3150);
:
:
: writer.addDocument(doc2);
: writer.addDocument(doc1);
:
: writer.optimize();
: writer.close();
:
: Searcher searcher = new IndexSearcher(ramDir);
:
: QueryParser queryParserWord = new QueryParser("WORD",new
: StandardAnalyzer());
: Query query = queryParserWord.parse("home");
:
: Hits hits = searcher.search(query);
:
: for (int i = 0; i < hits.length(); i++) {
: System.out.println("----- DOC " + hits.doc(i).get("ID") +"-----" +
: hits.doc(i).get("WORD")+"-");
: System.out.println(searcher.explain(query,hits.id(i)).toString());
: }
: }
:
:
: Thanks a lot !
: --
: View this message in context: http://www.nabble.com/Boost-Document-tf2654631.html#a7405959
: Sent from the Lucene - Java Users mailing list archive at Nabble.com.
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [hidden email]
: For additional commands, e-mail: [hidden email]
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Boost Document

John Pailet
Hi Chris,

You are right !!!
Here is the explain output:

----- DOC 222-----home-
40960.0 = fieldWeight(WORD:home in 0), product of:
  1.0 = tf(termFreq(WORD:home)=1)
  1.0 = idf(docFreq=2)
  40960.0 = fieldNorm(field=WORD, doc=0)

----- DOC 111-----home-
40960.0 = fieldWeight(WORD:home in 1), product of:
  1.0 = tf(termFreq(WORD:home)=1)
  1.0 = idf(docFreq=2)
  40960.0 = fieldNorm(field=WORD, doc=1)

As you can see, doc 222 and doc 111 get exactly the same score, even if data are exactly the same and even id Doc111 boost value is greater than  Doc222 boost value !!!

Is the internal document ID part of score computation ?

How can I solve this problem of gertting same score for different boost values?

Thanks a lot !

John




Chris Hostetter wrote
: When I have 2 documents that have exactly the same data, but different boost
: value.
: The order does not respect the boost value. It the following exemples, the
: first document of the search is the document with the lower boost value...
: is it a bug ?

i would suggest you look at the explain output ... but it seems you
already are: what does that tell you? what scores do the indvidual
documents get, and how do their explanations differ?

if i had to guess, i would suspect that your boosts are big enough that
they become equivilent when byte encoded and stored as fieldNorms (that's
what happens to document boosts, they get written as part of hte fieldNorm
for each field in the document) and the only reason you are getting the
order that you are is because doc2 has a lower internal id then doc1 ....
if i'm right, then the scores and the explanations would be equal.

...does that jive with what you are seeing?


:
: Here is my test code:
:
: public void testIt() throws IOException, ParseException
: {
: RAMDirectory ramDir = new RAMDirectory();
: IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(),
: true);
:
: Document doc1 = new Document();
: Field word = new Field("WORD","home",
: Field.Store.YES,Field.Index.TOKENIZED);
: word.setBoost(15);
: doc1.add(word);
: Field id = new Field("ID","111", Field.Store.YES,Field.Index.NO);
: doc1.add(id);
: doc1.setBoost(3163);
:
: Document doc2 = new Document();
: word = new Field("WORD","home", Field.Store.YES,Field.Index.TOKENIZED);
: word.setBoost(15);
: doc2.add(word);
: id = new Field("ID","222", Field.Store.YES,Field.Index.NO);
: doc2.add(id);
: doc2.setBoost(3150);
:
:
: writer.addDocument(doc2);
: writer.addDocument(doc1);
:
: writer.optimize();
: writer.close();
:
: Searcher searcher = new IndexSearcher(ramDir);
:
: QueryParser queryParserWord = new QueryParser("WORD",new
: StandardAnalyzer());
: Query query = queryParserWord.parse("home");
:
: Hits hits = searcher.search(query);
:
: for (int i = 0; i < hits.length(); i++) {
: System.out.println("----- DOC " + hits.doc(i).get("ID") +"-----" +
: hits.doc(i).get("WORD")+"-");
: System.out.println(searcher.explain(query,hits.id(i)).toString());
: }
: }
:
:
: Thanks a lot !
: --
: View this message in context: http://www.nabble.com/Boost-Document-tf2654631.html#a7405959
: Sent from the Lucene - Java Users mailing list archive at Nabble.com.
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Reply | Threaded
Open this post in threaded view
|

Re: Boost Document

Erick Erickson
I can answer a small part of your question... Doc IDs have nothing to do
with scoring.  Each time you index a document, it get a doc id greater than
any already in the index, and they get reassigned if you delete docs and
optimize.... They *may* be used when scoring to break ties but that doesn't
do you any good.......

For the other part of your question, I'll defer to the experts....

Erick

On 11/18/06, John Pailet <[hidden email]> wrote:

>
>
> Hi Chris,
>
> You are right !!!
> Here is the explain output:
>
> ----- DOC 222-----home-
> 40960.0 = fieldWeight(WORD:home in 0), product of:
>   1.0 = tf(termFreq(WORD:home)=1)
>   1.0 = idf(docFreq=2)
>   40960.0 = fieldNorm(field=WORD, doc=0)
>
> ----- DOC 111-----home-
> 40960.0 = fieldWeight(WORD:home in 1), product of:
>   1.0 = tf(termFreq(WORD:home)=1)
>   1.0 = idf(docFreq=2)
>   40960.0 = fieldNorm(field=WORD, doc=1)
>
> As you can see, doc 222 and doc 111 get exactly the same score, even if
> data
> are exactly the same and even id Doc111 boost value is greater
> than  Doc222
> boost value !!!
>
> Is the internal document ID part of score computation ?
>
> How can I solve this problem of gertting same score for different boost
> values?
>
> Thanks a lot !
>
> John
>
>
>
>
>
> Chris Hostetter wrote:
> >
> >
> > : When I have 2 documents that have exactly the same data, but different
> > boost
> > : value.
> > : The order does not respect the boost value. It the following exemples,
> > the
> > : first document of the search is the document with the lower boost
> > value...
> > : is it a bug ?
> >
> > i would suggest you look at the explain output ... but it seems you
> > already are: what does that tell you? what scores do the indvidual
> > documents get, and how do their explanations differ?
> >
> > if i had to guess, i would suspect that your boosts are big enough that
> > they become equivilent when byte encoded and stored as fieldNorms
> (that's
> > what happens to document boosts, they get written as part of hte
> fieldNorm
> > for each field in the document) and the only reason you are getting the
> > order that you are is because doc2 has a lower internal id then doc1
> ....
> > if i'm right, then the scores and the explanations would be equal.
> >
> > ...does that jive with what you are seeing?
> >
> >
> > :
> > : Here is my test code:
> > :
> > : public void testIt() throws IOException, ParseException
> > :     {
> > :             RAMDirectory ramDir = new RAMDirectory();
> > :             IndexWriter writer = new IndexWriter(ramDir, new
> StandardAnalyzer(),
> > : true);
> > :
> > :             Document doc1 = new Document();
> > :             Field word = new Field("WORD","home",
> > : Field.Store.YES,Field.Index.TOKENIZED);
> > :             word.setBoost(15);
> > :             doc1.add(word);
> > :             Field id = new Field("ID","111", Field.Store.YES,
> Field.Index.NO);
> > :             doc1.add(id);
> > :             doc1.setBoost(3163);
> > :
> > :             Document doc2 = new Document();
> > :             word = new Field("WORD","home",
> > Field.Store.YES,Field.Index.TOKENIZED);
> > :             word.setBoost(15);
> > :             doc2.add(word);
> > :             id = new Field("ID","222", Field.Store.YES,Field.Index.NO
> );
> > :             doc2.add(id);
> > :             doc2.setBoost(3150);
> > :
> > :
> > :             writer.addDocument(doc2);
> > :             writer.addDocument(doc1);
> > :
> > :             writer.optimize();
> > :             writer.close();
> > :
> > :             Searcher searcher = new IndexSearcher(ramDir);
> > :
> > :             QueryParser queryParserWord = new QueryParser("WORD",new
> > : StandardAnalyzer());
> > :             Query query = queryParserWord.parse("home");
> > :
> > :             Hits hits = searcher.search(query);
> > :
> > :             for (int i = 0; i < hits.length(); i++) {
> > :                     System.out.println("----- DOC " + hits.doc(i).get("ID")
> +"-----" +
> > : hits.doc(i).get("WORD")+"-");
> > :                     System.out.println(searcher.explain(query,hits.id
> (i)).toString());
> > :             }
> > :     }
> > :
> > :
> > : Thanks a lot !
> > : --
> > : View this message in context:
> > http://www.nabble.com/Boost-Document-tf2654631.html#a7405959
> > : Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> > :
> > :
> > : ---------------------------------------------------------------------
> > : To unsubscribe, e-mail: [hidden email]
> > : For additional commands, e-mail: [hidden email]
> > :
> >
> >
> >
> > -Hoss
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Boost-Document-tf2654631.html#a7417101
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Boost Document

Chris Hostetter-3

: with scoring.  Each time you index a document, it get a doc id greater than
: any already in the index, and they get reassigned if you delete docs and
: optimize.... They *may* be used when scoring to break ties but that doesn't
: do you any good.......

they aren't really used to break ties -- that implies there is code
somewhere that deliberately wants the order to be deterministic and when
it sees two identicle scores, then does a sort on docid.  In reality, the
fact that docs come back in docid order is just a side effect of the fact
that the Scorers iterate over documents in order and that the sorting code
generates a "stable sort"

Looking at why the scores are teh same...

: > ----- DOC 222-----home-
: >   40960.0 = fieldNorm(field=WORD, doc=0)

: > ----- DOC 111-----home-
: > 40960.0 = fieldWeight(WORD:home in 1), product of:
: >   40960.0 = fieldNorm(field=WORD, doc=1)

: > > :             doc1.setBoost(3163);

: > > :             doc2.setBoost(3150);

...norms (which is where field and doc boosts go) get encoded as a single
byte, so they loose a lot of precision, unfortunately the methods used to
translate from float->byte->float aren't subclassable (see
Similarity.encodeNorm) if you write a bit of test code to call those
mehtods on various values and see what you'll get you'll notice that
bigger numbers are more affected then smaller numbers, so you may wnat to
just use smaller boosts (ie: divide by 100) and see if that helps.

alternately: stop using boosts to approach this problem, add these numbers
as a new field, and use the FunctionQuery class from Solr to achieve your
goal (search the list archives for more detailed Discussions of
FunctionQuery)


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]