Re: Index-time Boosting

Tracey Jaquith
Hi all,

[initially I replied to a thread which went to Mike Klaas's email, so after
 his helpful reply, I'm trying to merge this back into the list discussion]

Quick intro.  Server Engineer at Internet Archive.
I just spent a mere 3 days porting nearly our entire site to use your
*wonderful* project!

I, too, am looking for a kind of "boosting".
If I understand your reply here, if I reindex *all* my documents with
   <field name="title" boost="100">i'm super, thanks for asking!</field>
and make sure that any subsequent incremental (re)indexing of documents
uses that same extra ' boost="100" ', then I should be making the relevance
of the title in our documents 100x (or whatever that translates to) "heavier"
than other non-title fields, correct?

I know this prolly isn't the relevant place to otherwise gush,
but THANK YOU for this fantastic (and maintained!) code
and we look forward to using this in the near future on our site!
Go opensource!

--tracey jaquith

[We are most interested in always having "title", "description", and a few other
 fields boosted.  We have both user queries of phrases/words as well as
 "field-specific" queries (eg: "mediatype:movies AND collection:prelinger")
 so my thought is std might be better than dismax.
 I've tried some experiments, adjusting the boosts at index time and running
 the std handler to see the ordering of the results change for "fieldless queries"
 (eg: "q=tracey+pooh").  I have 33 fields using <copyField dest="text" source="..."/>
 (where "text" is our default field to query)
 to allow for checking across most of our std XML fields.  I gather that a boost
 applied to "title" on indexing a document must somehow "propagate" to the
 "text" field?  Otherwise, I'm not sure how playing with boosts to fields
 not named "text" would cause any change on the ranking of results for queries
 like "q=tracey+pooh".  Am I starting to catch on?]
 


Re: Index-time Boosting

Yonik Seeley-2
On 12/5/06, Tracey Jaquith <[hidden email]> wrote:

> Quick intro.  Server Engineer at Internet Archive.
> I just spent a mere 3 days porting nearly our entire site to use your
> *wonderful* project!
>
> I, too, am looking for a kind of "boosting".
> If I understand your reply here, if I reindex *all* my documents with
>    <field name="title" boost="100">i'm super, thanks for asking!</field>
> and make sure that any subsequent incremental (re)indexing of documents
> uses that same extra ' boost="100" ', then I should be making the relevance
> of the title in our documents 100x (or whatever that translates to) "heavier"
> than other non-title fields, correct?
>
> I know this prolly isn't the relevant place to otherwise gush,
> but THANK YOU for this fantastic (and maintained!) code
> and we look forward to using this in the near future on our site!
> Go opensource!

Welcome aboard!

From a "fresh" user perspective, what was your hardest or most
confusing part of starting to use Solr?

> [We are most interested in always having "title", "description", and a few other
>  fields boosted.  We have both user queries of phrases/words as well as
>  "field-specific" queries (eg: "mediatype:movies AND collection:prelinger")
>  so my thought is std might be better than dismax.

Yes, for the example above you want the standard request handler
because you are searching for different things in different fields
rather than the same thing in different fields.

However, there are multiple ways of doing everything...
It looks like at least some of your clauses are restrictions rather
than full-text queries, and can be more efficiently modeled as
filters.  Since filters are cached separately, this can lead to a
large increase in performance.

So in either the standard or dismax handlers, you could do
q="foo bar"&fq=mediatype:movies&fq=collection:prelinger
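As a rough illustration of splitting a request into a scored part (q) and cached restrictions (fq), here is a minimal Python sketch. The base URL and the helper name `build_filtered_query` are assumptions made for the example, not part of Solr; only the q and fq parameter names come from the message above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port, and path for your own install.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_filtered_query(user_query, filters):
    """Return a select URL with the full-text part in q and one fq per restriction."""
    params = [("q", user_query)]
    params.extend(("fq", f) for f in filters)
    return SOLR_SELECT + "?" + urlencode(params)

url = build_filtered_query('"foo bar"', ["mediatype:movies", "collection:prelinger"])
print(url)
```

Passing each restriction as its own fq lets each one be cached and reused independently of the others.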

>  I've tried some experiments, adjusting the boosts at index time and running
>  the std handler to see the ordering of the results change for "fieldless queries"
>  (eg: "q=tracey+pooh").  I have 33 fields using <copyField dest="text" source="..."/>
>  (where "text" is our default field to query)
>  to allow for checking across most of our std XML fields.  I gather that a boost
>  applied to "title" on indexing a document must somehow "propagate" to the
>  "text" field?

Background: for an indexed field name there is a single boost value
per document.  This is true even if the field is multi-valued... all
values for that document "share" the same boost.  This is a Lucene
restriction so we can't fix it in Solr in any way.

Solr *does* propagate the index-time boost when doing copyField, but
this just ends up being multiplied into all the other boosts for
values for that document.   Matches on the resulting text field will
*always* score higher, regardless of which "part" matched.  Does that
make sense?
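A toy model of that behaviour, in Python, may help. It only mirrors the description above (the boosts of everything copied into one field are multiplied into a single per-document factor); the function name and the boost values are hypothetical, and real Lucene scoring involves more factors than this.

```python
from math import prod

def copied_field_boost(source_boosts):
    """Toy model: boosts of values copied into one field multiply into a single factor."""
    return prod(source_boosts.values())

# Hypothetical index-time boosts for one document; unboosted fields default to 1.0.
doc_boosts = {"title": 100.0, "description": 1.0, "collection": 1.0}
text_boost = copied_field_boost(doc_boosts)

# The same factor applies whether the match came from the title or the description.
for matched_part in ("title", "description"):
    print(matched_part, "match in 'text' is scaled by", text_boost)
```

The point of the sketch is that the factor is attached to the document's "text" field as a whole, so it cannot tell a title hit from a description hit.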

Index time boosts can make sense if you want to boost the importance
of certain *documents*.  Query time boosts make more sense when you
want certain fields or certain search terms to count more than others.

So if you want to search across your general text field, while at the
same time boosting the title field, you could do:

q="foo bar" title:"foo bar"^10

Or you could search across all the fields individually, giving them
all different boosts:
q=subject:foo^3 title:foo^10 body:foo

The dismax handler has a different way of specifying fields to search
across and boosts:
q=foo&qf=subject^3,title^10,body

If you really want index-time boosts, there was a bug fix to
index-time field boosts on 11/3, so make sure you are using a later
version.

-Yonik

Re: Index-time Boosting

Chris Hostetter-3

: > [We are most interested in always having "title", "description", and a few other
: >  fields boosted.  We have both user queries of phrases/words as well as
: >  "field-specific" queries (eg: "mediatype:movies AND collection:prelinger")
: >  so my thought is std might be better than dismax.

it depends ... based on what you've said so far, i would think dismax
would work perfectly for you...

 * put the fields you use and their relative weights in the qf param
        qf=body^0.5+title^2.0+author^1.5
 * put query string you get from the user in the q param
        q=tracey+pooh
 * put field constraints on other fields in fq params
        fq=mediatype:movies&fq=collection:prelinger
 * put any score boosting clauses in a bq param (or bf if it's a function)
        bq=promote:yes^3
        bf=recip(rord(docDate),1,2,3)^2
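
The four bullets above could be assembled into one request roughly like this. It is a minimal Python sketch under assumptions: the helper `dismax_params` is hypothetical, and the handler name passed as qt is a guess that depends on what your solrconfig.xml registers. The parameter names (q, qf, fq, bq, bf) and example values are the ones from the list.

```python
from urllib.parse import urlencode

def dismax_params(user_input, field_weights, filters=(), boost_query=None, boost_function=None):
    """Assemble request parameters following the four bullets above."""
    params = [
        ("qt", "dismax"),  # assumed handler name; use whatever solrconfig.xml registers
        ("q", user_input),
        ("qf", " ".join(f"{field}^{weight}" for field, weight in field_weights.items())),
    ]
    params.extend(("fq", f) for f in filters)
    if boost_query:
        params.append(("bq", boost_query))
    if boost_function:
        params.append(("bf", boost_function))
    return params

params = dismax_params(
    "tracey pooh",
    {"body": 0.5, "title": 2.0, "author": 1.5},
    filters=["mediatype:movies", "collection:prelinger"],
    boost_query="promote:yes^3",
    boost_function="recip(rord(docDate),1,2,3)^2",
)
query_string = urlencode(params)
print(query_string)
```

Keeping the user's raw input in q and everything structural in the other parameters means the search box text never has to be parsed as Lucene syntax.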



-Hoss


Re: Index-time Boosting

Yonik Seeley-2
One last thing to keep in mind is the tradeoffs:
Querying a single all-encompassing "text" field will be faster, but the
scoring won't be as relevant.  The types of queries dismax generates
can get you better relevance, at the cost of performance.

-Yonik


Re: Index-time Boosting

Tracey Jaquith
In reply to this post by Yonik Seeley-2
Hi Yonik!

Yonik Seeley wrote:

> On 12/5/06, Tracey Jaquith <[hidden email]> wrote:
>> Quick intro.  Server Engineer at Internet Archive.
>> I just spent a mere 3 days porting nearly our entire site to use your
>> *wonderful* project!
>>
>> I, too, am looking for a kind of "boosting".
>> If I understand your reply here, if I reindex *all* my documents with
>>    <field name="title" boost="100">i'm super, thanks for asking!</field>
>> and make sure that any subsequent incremental (re)indexing of documents
>> uses that same extra ' boost="100" ', then I should be making the relevance
>> of the title in our documents 100x (or whatever that translates to) "heavier"
>> than other non-title fields, correct?
>>
>> I know this prolly isn't the relevant place to otherwise gush,
>> but THANK YOU for this fantastic (and maintained!) code
>> and we look forward to using this in the near future on our site!
>> Go opensource!
>
> Welcome aboard!
>
> From a "fresh" user perspective, what was your hardest or most
> confusing part of starting to use Solr?
>
Thanks!  Well, we presently have a (very badly) homegrown version of an
SE that has lucene + jetty under the hood.  It locks up a lot (badly
threaded), hangs on updates, and generally has "persona non grata"
status with developers here, where no one wants to touch it.  So the
*easiest* thing about Solr was the fact that it uses lucene query syntax
(like ours).  The hardest parts were:
1) I tried to make ant run from the included ant.jar (w/o getting the
latest ant from apache), and spent an hour or so on that before just getting ant.
2) Our SE starts responses with document "1".  Initially (totally my
oversight from going a little too fast) I just directly "translated"
that concept, so I was crushed to find a lot of my documents weren't
coming back like they should.  Once I figured out I needed to make
"start=0", not "start=1", everything was great.
3) boosts!  I spent just about 2 days porting our entire site over (have
a nice PHP toggle "define('SOLR', 1);" now in a single place to cut over
to it) and only 2 hours (clocktime) to index our site (about 450K
documents).  But now I've spent about 1-1/2 days experimenting and not
quite getting the boosts right 8-)

>> [We are most interested in always having "title", "description", and a few other
>>  fields boosted.  We have both user queries of phrases/words as well as
>>  "field-specific" queries (eg: "mediatype:movies AND collection:prelinger")
>>  so my thought is std might be better than dismax.
>
> Yes, for the example above you want the standard request handler
> because you are searching for different things in different fields
> rather than the same thing in different fields.
>
> However, there are multiple ways of doing everything...
> It looks like at least some of your clauses are restrictions rather
> than full-text queries, and can be more efficiently modeled as
> filters.  Since filters are cached separately, this can lead to a
> large increase in performance.
>
> So in either the standard or dismax handlers, you could do
> q="foo bar"&fq=mediatype:movies&fq=collection:prelinger
>
OK, great to know.  I'll prolly stick with our current "pass through" of
our queries in lucene syntax version, and in the future, for speedups,
start moving some of the filters to "&fq="....

>>  I've tried some experiments, adjusting the boosts at index time and running
>>  the std handler to see the ordering of the results change for "fieldless queries"
>>  (eg: "q=tracey+pooh").  I have 33 fields using <copyField dest="text" source="..."/>
>>  (where "text" is our default field to query)
>>  to allow for checking across most of our std XML fields.  I gather that a boost
>>  applied to "title" on indexing a document must somehow "propagate" to the
>>  "text" field?
>
> Background: for an indexed field name there is a single boost value
> per document.  This is true even if the field is multi-valued... all
> values for that document "share" the same boost.  This is a Lucene
> restriction so we can't fix it in Solr in any way.
ok, that's no problem for us -- our main two fields to boost are "singletons"
anyway.  the other two fields we boost can have multiple values, but
most of the time, in practice, they won't matter.  of course, great to know.
> Solr *does* propagate the index-time boost when doing copyField, but
> this just ends up being multiplied into all the other boosts for
> values for that document.   Matches on the resulting text field will
> *always* score higher, regardless of which "part" matched.  Does that
> make sense?
OK, that *mostly* is making sense.  Let me see if I'm understanding it.
I'm thinking (after still thrashing around a bit) that the way that
seems to be getting the results I *expect* (or at least, that we are likely
used to here with our current IA SE) is something like (std req handler):
     &q="commute" title:"commute"^10
where I did no index boosting, and "title" (and other fields) were being
copied into the default-to-search-for-unspecified-query "text" field.

That nicely makes items with "commute" in the title show up higher in
the results than those with commute only in the "text" field.

Were I to switch course and index boost each document with
   <field name="title" boost="10">
I would think the documents would come back in the same order for
    &q="commute"
as the first scheme, because the relevance of the title copied into
"text" boosted the document's relevance.
I could see that other queries with more complex AND clauses could
rank results differently under the two schemes above.

My new plan is something like:
for each "clause" we get in a raw search bar query, if it doesn't have a ":"
in it, "expand" it to:
  q=text:"commute" title:"commute"^100 description:"commute"^15 collection:"commute"^10 language:"commute"^10
I think I could even then stop copyField-ing title, description, collection,
and language into "text".
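
The expansion step in that plan can be sketched in Python as follows. This is only an illustration, not anything that ships with Solr or Lucene: `expand_clause` and `FIELD_BOOSTS` are hypothetical names, the weights are the ones listed in the plan, and no escaping of query-syntax characters is attempted.

```python
# Field weights taken from the plan above; "text" is the catch-all default field.
FIELD_BOOSTS = {"text": 1, "title": 100, "description": 15, "collection": 10, "language": 10}

def expand_clause(clause, boosts=FIELD_BOOSTS):
    """Expand a fieldless clause into boosted per-field phrase clauses.

    Clauses that already name a field (contain ":") pass through untouched.
    """
    if ":" in clause:
        return clause
    phrase = clause.strip('"')
    parts = []
    for field, boost in boosts.items():
        suffix = "" if boost == 1 else f"^{boost}"
        parts.append(f'{field}:"{phrase}"{suffix}')
    return " ".join(parts)

print(expand_clause("commute"))
print(expand_clause("mediatype:movies"))
```

When several expanded clauses are joined with AND or OR, each expansion would likely need to be wrapped in parentheses so the boolean operators still apply to the clause as a whole.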

> Index time boosts can make sense if you want to boost the importance
> of certain *documents*.  Query time boosts make more sense when you
> want certain fields or certain search terms to count more than others.
>
> So if you want to search across your general text field, while at the
> same time boosting the title field, you could do:
>
> q="foo bar" title:"foo bar"^10
>
> Or you could search across all the fields individually, giving them
> all different boosts:
> q=subject:foo^3 title:foo^10 body:foo
thanks!  these two examples were perfect and got me the approach that I think
will work for us!
>
> The dismax handler has a different way of specifying fields to search
> across and boosts:
> q=foo&qf=subject^3,title^10,body
>
> If you really want index-time boosts, there was a bug fix to
> index-time field boosts on 11/3, so make sure you are using a later
> version.
something about dismax (for me, or for my mangling of it) with various attempts
didn't seem to always be getting me every result I expected, so I've mostly
"chickened out" of dismax for now 8-)


Now I have one new mystery that's popped up for me.
With std req handler, this simple query
    q=title:commute
is *not* returning me all documents that have the word "commute" in the title.
There must be some other filter/clause or something happening that I'm not
aware of?
(For example, I do "indent=on&fl=title&q=commute" in a wget and grep the results
 for <title> and then grep -i for commute, there are 23 hits.  But doing
 "&q=title:commute" only returns one of those hits..)

I can provide the url to our open test server so anyone interested can look at our
config/schema and the query results if need be.

Thanks much!
--tracey
(and as we're about 95% integrated, these will be my most verbose posts)


       --Tracey Jaquith - http://www.archive.org/~tracey --

Re: Index-time Boosting

Mike Klaas
On 12/5/06, Tracey Jaquith <[hidden email]> wrote:

> Now I have one new mystery that's popped up for me.
> With std req handler, this simple query
>     q=title:commute
> is *not* returning me all documents that have the word "commute" in the
> title.
> There must be some other filter/clause or something happening that I'm not
> aware of?
> (For example, I do "indent=on&fl=title&q=commute" in a wget and grep the
> results
>  for <title> and then grep -i for commute, there are 23 hits.  But doing
>  "&q=title:commute" only returns one of those hits..)

Indeed--those are different queries.  The "fl" parameter controls
the stored fields returned by Solr; it does not affect which documents
are returned.  The first query asks for the titles of all documents
containing the word "commute", the second for all documents with
"commute" in their title.

see http://wiki.apache.org/solr/CommonQueryParameters

I'm not sure what problems you are experiencing with dismax, but it is
important to note that you cannot specify a raw lucene query in the
"q" parameter of a dismax handler.  If you want to search for a word
across fields, you can specify the qf (query fields) parameter.

eg.
q=commute
qf=title^10 body
(see http://incubator.apache.org/solr/docs/api/org/apache/solr/request/DisMaxRequestHandler.html)

Turning on debugQuery=true is invaluable for determining what factors
are influencing scoring.

Was your previous solution QueryParser-based?  If so, you should be
able to use the exact same queries as before, passed to
StandardRequestHandler (assuming the fields are also set up
identically).

cheers,
-Mike

Re: Index-time Boosting

Yonik Seeley-2
In reply to this post by Tracey Jaquith
On 12/5/06, Tracey Jaquith <[hidden email]> wrote:

> Now I have one new mystery that's popped up for me.
> With std req handler, this simple query
>     q=title:commute
> is *not* returning me all documents that have the word "commute" in the
> title.
> There must be some other filter/clause or something happening that I'm not
> aware of?
> (For example, I do "indent=on&fl=title&q=commute" in a wget and grep the
> results
>  for <title> and then grep -i for commute, there are 23 hits.  But doing
>  "&q=title:commute" only returns one of those hits..)

title in your schema is of type "string" which indexes the whole value verbatim.
There is only one document with title:commute
Most likely you want to change the type of that field to "text" or
some other analyzed type that at least breaks apart words by
whitespace.
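
The difference can be mimicked with a small Python sketch. It is a simplification made for illustration: the two helper functions are hypothetical stand-ins for a verbatim field and a whitespace-split field, the title value is invented, and a real analyzed "text" type typically does more than split on whitespace.

```python
def index_terms_string(value):
    """A "string" field keeps the whole value as one term, unchanged."""
    return {value}

def index_terms_whitespace(value):
    """A minimal analyzed field: split on whitespace (no lowercasing or stemming)."""
    return set(value.split())

title = "My morning commute"  # hypothetical title value

print("commute" in index_terms_string(title))      # False
print("commute" in index_terms_whitespace(title))  # True
print("commute" in index_terms_string("commute"))  # True
```

Only a document whose entire title is exactly "commute" matches title:commute on the verbatim field, which is consistent with the single hit reported above.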


-Yonik

Re: Index-time Boosting

Tracey Jaquith
wow, that makes sense now.  my bad.
OK, great.  further testing shows "you mean what you say" -- not
only verbatim, but case sensitive.

so for my dwindling number of remaining "string" types, in my XSL
transform (on the input to index the doc) i'll lowercase them all, too 8-)

thanks!!
--t

--
       --Tracey Jaquith - http://www.archive.org/~tracey --

Re: Index-time Boosting

Chris Hostetter-3
In reply to this post by Mike Klaas

: > (For example, I do "indent=on&fl=title&q=commute" in a wget and grep the
: > results
: >  for <title> and then grep -i for commute, there are 23 hits.  But doing
: >  "&q=title:commute" only returns one of those hits..)
:
: Indeed--those are different queries.  The "fl" parameter controls
: the stored fields returned by Solr; it does not affect which documents
: are returned.  The first query asks for the titles of all documents
: containing the word "commute", the second for all documents with
: "commute" in their title.

to clarify: the first query asks for the titles of all documents
containing the word "commute" in whatever field your schema.xml lists as
the <defaultSearchField>



-Hoss


Re: Index-time Boosting

Chris Hostetter-3
In reply to this post by Tracey Jaquith

: so for my dwindling number of remaining "string" types, in my XSL
: transform (on the input to index the doc) i'll lowercase them all, too 8-)

I don't believe that is strictly necessary; these two field types should
be functionally equivalent...

   <fieldtype name="string"  class="solr.StrField"/>
   <fieldtype name="tstring" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
   </fieldtype>

...so i'm pretty sure you could just use...

   <fieldtype name="lowerCaseString" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
   </fieldtype>


-Hoss


Re: Index-time Boosting

Mike Klaas
In reply to this post by Chris Hostetter-3
On 12/5/06, Chris Hostetter <[hidden email]> wrote:

> : Indeed--those are different queries.  The "fl" parameter controls
> : the stored fields returned by Solr; it does not affect which documents
> : are returned.  The first query asks for the titles of all documents
> : containing the word "commute", the second for all documents with
> : "commute" in their title.
>
> to clarify: the first query asks for the titles of all documents
> containing the word "commute" in whatever field your schema.xml lists as
> the <defaultSearchField>

Thanks for correcting that.  What I said is only true if your
defaultSearchField is a copyField of your other fields.

-Mike

Re: Index-time Boosting

Tracey Jaquith
In reply to this post by Mike Klaas
Hi Mike,

OK, I guess my "problem" is more of a partially still coming up
to speed / partially wanting to be lazy.

If I make a dismax handler called "dissed", I'd like it to "work"
whether or not i pass in "commute" or "title:commute" to the query.
(Now I *do* realize those are two completely different kinds of
 queries -- 1st would be all docs with "commute" in the default field
 (which is most of our fields copied into it, so the document)
 and the 2nd would be all docs with "commute" in the "title" field
 (now that I've redone the field "type" to be "text" and not "string"
 as Yonik pointed out)).

So this returns no documents because, I gather, you can't feed
the "field:value" lucene syntax directly in to a dismax handler
(although you can to a standard handler):
   indent=on&fl=identifier&q=title:commute&qt=dissed

So I think simply my breaking up the queries from our search bar
(example "raw" formats:
   grateful dead
   "grateful dead"
   mediatype:movies AND collection:prelinger )
into an expanded query of:
  description:"[clause]"^10 text:"[clause]"^1 ....
for each clause will work the best for us.

Is there any lucene or solr class / method that can break up
a string into clauses (eg: split on AND, OR, NOT, ()s, etc.)?

--tracey


--
       --Tracey Jaquith - http://www.archive.org/~tracey --

Re: Index-time Boosting

Tracey Jaquith
In reply to this post by Mike Klaas
oh, and yes, i've always understood, thankfully, that queries
of
   "q=commute&fl=title"
and
   "q=title:commute&fl=title"
are *quite* different
(but that is probably mostly due to my prior experience with
 lucene with our current broken SE 8-)

-t

--
       --Tracey Jaquith - http://www.archive.org/~tracey --

Re: Index-time Boosting

Tracey Jaquith
In reply to this post by Chris Hostetter-3
ok, great to know -- all this is invaluable.
i'm stashing away "ideas" like this for the future (because..)

i think for now i'll stick with XSL transforming the fields to lowercase
because we already need this small XSLT from our item XML to
XML that solr can index.

-t

--
       --Tracey Jaquith - http://www.archive.org/~tracey --

Re: Index-time Boosting

Tracey Jaquith
In reply to this post by Yonik Seeley-2
ahh, after rereading this about 20 times today 8-)
i think i finally "get it" (your final question below).

if i do index-time boosts, and search only "text" (default field)
the boosts will propagate into "text", but only insofar as the
document will weigh higher when a phrase is found in the "text"
field (regardless of whether that "hit" really was due to something
copyField-ed in with boost 1, boost 100, etc.)

so that solution would have the effect of making certain documents
have higher scores in the "text" field, not the effect we'd like.

[example documentA]
  [description] i like to commute
   [title] commuting thoughts
copyField text to:
  [text] i like to commute commuting thoughts

we, the Archive, want query hits in title to boost ^100.
if we do q=commute (which searches "text")
with index-time boosting, solr/lucene won't know
the hit due to "title" should effect a much higher ranking
compared to documents with commute in "text" but
not in "title".   however, the above document *will* have a higher
score, in general, because the "title" portion was nearly
half of the "text" field.  Yet A will have a
higher ranking even for matches like "q=like"
compared to documentB like:
  [description] i like bread
  [text] i like bread
(when in reality, we'd like them to have near equal weighting).
So index boosts won't do for us.  I'm learning!

--tracey

>>  I've tried some experiments, adjusting the boosts at index time and running
>>  the std handler to see the ordering of the results change for "fieldless queries"
>>  (eg: "q=tracey+pooh").  I have 33 fields using <copyField dest="text" source="..."/>
>>  (where "text" is our default field to query)
>>  to allow for checking across most of our std XML fields.  I gather that a boost
>>  applied to "title" on indexing a document must somehow "propagate" to the
>>  "text" field?
>
> Background: for an indexed field name there is a single boost value
> per document.  This is true even if the field is multi-valued... all
> values for that document "share" the same boost.  This is a Lucene
> restriction so we can't fix it in Solr in any way.
>
> Solr *does* propagate the index-time boost when doing copyField, but
> this just ends up being multiplied into all the other boosts for
> values for that document.   Matches on the resulting text field will
> *always* score higher, regardless of which "part" matched.  Does that
> make sense?

Re: Index-time Boosting

Yonik Seeley-2
Yep, sounds like you got it.  Query-time boosting is what you want.

> however, the above document *will* have a higher
> score, in general, because the "title" portion was nearly
> half of the "text" field.

Well, if you boost *all* of the "title" fields by 100, it also has the
net effect of boosting *all* the "text" fields by 100... it's going to
be a wash when searching on the text field.
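
A quick Python sketch of why it is a wash: multiplying every score by the same constant leaves the ordering unchanged. The document ids and base scores below are invented for illustration, and the model ignores any precision loss in how boosts are actually stored in the index.

```python
def rank(scores):
    """Return document ids ordered from highest score to lowest."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical unboosted scores for a query against the "text" field.
base_scores = {"docA": 0.42, "docB": 0.37, "docC": 0.58}

# Boosting every document's title by 100 scales every "text" score by 100.
boosted_scores = {doc: score * 100 for doc, score in base_scores.items()}

print(rank(base_scores))
print(rank(boosted_scores))
```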

FWIW, I don't recall any Solr collections in CNET using index
boosts... query-time boosts are far more flexible.

Some Lucene users have used index-time boosts to boost more recent
documents in the index, but with Solr's function query, that can be
done at query time too.
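
For a feel of how such a function behaves, here is a minimal Python sketch of the recip(rord(docDate),1,2,3) example given earlier in the thread. It assumes recip(x,m,a,b) computes a/(m*x+b) and that rord gives 1 for the newest date value; the sample ordinals are made up.

```python
def recip(x, m, a, b):
    """Reciprocal function of the form a / (m*x + b)."""
    return a / (m * x + b)

# rord(docDate) is taken here as the reverse ordinal: 1 for the newest value, 2 for the next, ...
for reverse_ordinal in (1, 10, 100, 1000):
    print(reverse_ordinal, recip(reverse_ordinal, 1, 2, 3))
```

The newest document gets the largest value and the contribution tails off smoothly for older ones, so no reindexing is needed as documents age.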

-Yonik
