Question about similarity manipulation...

escher2k
We have a requirement for an additive score, and I am not sure whether it is possible
or what the right way to go about it is. Assume there are three fields -
(a) Project name  - contains text
(b) Project description - contains text
(c) Profile score - numeric

The basic idea is to implement the score as a linear function of term frequency for (a) and (b), and
add the score from (c). For example, if the project name is "Web PHP - Web Dev and Web Design" and the project
description is "Web Design and Development", then a search for "Web" should yield:
   score for project name = 150 * 1 + 10 * 2, where the multiplying factor is 150 for the first occurrence
                            of "Web" and 10 for each of the next two occurrences.
   score for project description = 200 * 1, where the multiplying factor is 200.
   score for profile = 10 * 0.5, where 10 is the multiplying factor and 0.5 is a system-generated user score.

The total score would then be 170 + 200 + 5 = 375.
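
To make the arithmetic above concrete, here is a minimal plain-Java sketch of the desired scoring. It does not use Lucene or Solr at all; the class and method names are hypothetical, the tokenizer is a naive stand-in for a real analyzer, and the factor of 0 for repeated occurrences in the description is an assumption, since the post only gives a description factor for the first occurrence.

```java
import java.util.Locale;

// Hypothetical illustration only: none of these names exist in Lucene or Solr.
final class AdditiveScoreSketch {

    // Naive term count: lower-case the text and split on non-alphanumeric runs.
    // A real index would use the field's analyzer instead.
    static int termFreq(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // firstFactor for the first occurrence, restFactor for each further one.
    static double fieldScore(int termFreq, double firstFactor, double restFactor) {
        if (termFreq <= 0) {
            return 0.0;
        }
        return firstFactor + restFactor * (termFreq - 1);
    }

    // Sum of the name score, the description score, and the weighted profile score.
    static double totalScore(String name, String desc, double profile, String term) {
        double nameScore = fieldScore(termFreq(name, term), 150.0, 10.0);
        // Assumption: repeated occurrences in the description add nothing.
        double descScore = fieldScore(termFreq(desc, term), 200.0, 0.0);
        double profileScore = 10.0 * profile;
        return nameScore + descScore + profileScore;
    }

    public static void main(String[] args) {
        double total = totalScore("Web PHP - Web Dev and Web Design",
                                  "Web Design and Development", 0.5, "Web");
        System.out.println(total);
    }
}
```

With the example project above, the three parts come to 170, 200 and 5, for the total of 375.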

DisjunctionMaxQuery seems to yield only the maximum score. From my understanding, I would
need to do the following -
(1) Create a new similarity function
(2) Write a new Query class extension
(3) Write a new linear function ??

It seems to me that some of the pieces are probably already there, and I do not understand the
capabilities/possibilities very well.

Thanks in advance.

Re: Question about similarity manipulation...

Chris Hostetter-3

: The DisjunctionMaxQuery seems to yield the maximum score only. From my

NOTE: setting the "tiebreaker" value of a DisjunctionMaxQuery to "1.0"
makes it generate the sum of the subquery scores rather than just the maximum
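
As a rough illustration of why a tiebreaker of 1.0 gives a sum: DisjunctionMaxQuery scores a document as the maximum subquery score plus the tiebreaker times the sum of the remaining subquery scores. The plain-Java sketch below mirrors that formula only; the class and method names are hypothetical, it is not Lucene code, and it assumes non-negative scores.

```java
// Hypothetical illustration of the dismax combination rule; not Lucene code.
final class DisMaxSketch {

    // max + tieBreaker * (sum of the other scores); assumes scores >= 0.
    static double combine(double[] subScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double[] fieldScores = {170.0, 200.0};
        System.out.println(combine(fieldScores, 0.0)); // maximum only
        System.out.println(combine(fieldScores, 1.0)); // plain sum
    }
}
```

With a tiebreaker of 0.0 only the best-matching field counts; with 1.0 every field contributes in full, which is the additive behaviour asked for.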

: understanding, I would
: need to do the following -
: (1) Create a new similarity function
: (2) Write a new Query class extension
: (3) Need to write a new linear function ??

you'll definitely need a new Similarity class with custom tf and
queryNorm functions.  I don't think you'd need a new Query class .. what
you are looking for should be fairly straightforward to implement using
BooleanQueries, TermQueries, and FunctionQueries.  You shouldn't need to
write a new linear function ValueSource -- i can't think of why the
current one wouldn't work for you.

the java-user@lucene list is a good place to ask general questions about
customizing scoring by writing your own Similarity, and it has a larger
user base than the solr lists.



-Hoss

Re: Question about similarity manipulation...

escher2k
Thanks Hoss. I have written the new similarity class. There are two problems with the existing
linear function -
(a) the input doesn't seem to be the score that the similarity computation returns for the field;
instead it depends on the field's data type.
(b) Also, the function I want is a slight variation of the linear function. Essentially it is a step
function: if term freq = 1, return a particular value, and if term freq > 1, apply a linear function.

But I think (a) is the bigger problem.

For instance on this data set -
<doc>
  <str name="desc">ABCDE XYZ</str>
  <str name="id">40</str>
  <str name="name">abcde XYZ GHI</str>
  <float name="profile_score">55</float>
</doc>
<doc>
  <str name="desc">ABCDE ABCDE XYZ</str>
  <str name="id">30</str>
  <str name="name">ABCDE XYZ GHI</str>
  <float name="profile_score">45</float>
</doc>


the following URL returns data -
http://dev01:8983/solr/select/?qt=dismax&q=abcde+ghi&qf=name&bf=linear(id,1,10)&debugQuery=1

whereas
http://dev01:8983/solr/select/?qt=dismax&q=abcde+ghi&qf=name&bf=linear(name,1,10)&debugQuery=1
throws a RuntimeException -
java.lang.RuntimeException: there are more terms than documents in field "name", but it's impossible to sort on tokenized fields
        at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:274)
        at org.apache.solr.search.function.OrdFieldSource.getValues(OrdFieldSource.java:55)
        at org.apache.solr.search.function.LinearFloatFunction.getValues(LinearFloatFunction.java:49)
        at org.apache.solr.search.function.FunctionQuery$AllScorer.<init>(FunctionQuery.java:100)

Once again, thanks for your help.