Hello, I have a solr index running that is working very well as a search.
But I want to add the ability (if possible) to use it to do matching. The problem is that by default it only checks that all the input terms are present, and it doesn't give me any indication of how many terms in the target field were not specified by the input.

For example, if I'm trying to match the song title "dust in the wind", I'm correctly getting a match if the input query is "dust in wind". But I don't want to get a match if the input is just "dust". Although as a search "dust" should return this result, I'm looking for some way to filter it out based on some indication that the input isn't close enough to the output. Perhaps I could get information that the number of input terms is much less than the number of terms in the field. Or something else along those lines?

I realize that this isn't the typical use case for a search, but I'm just looking for some suggestions as to how I could improve the above example a bit.

Thanks,
Chris |
It's not possible right now because Lucene doesn't support this. When running a disjunction query, it only records how many terms match the document.

I think this is a common requirement for many users. I suggest Lucene should split the scorer into a matcher and a scorer. The matcher just returns which docs match and why/how each doc matched. Especially for a disjunction query, it should tell which terms matched, and possibly other information such as tf/idf and the distance between terms (to support proximity search). That's the matcher's job. Then the scorer (a ranking algorithm) uses a flexible algorithm to score the document, and the collector can collect it.

On Wed, Apr 11, 2012 at 10:28 AM, Chris Book <[hidden email]> wrote:
> Hello, I have a solr index running that is working very well as a search.
> [...] |
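To make the matcher/scorer split above a bit more concrete, here is a toy sketch in plain Python. It does not use any Lucene API; `Match`, `matcher`, `coverage_scorer` and `search` are made-up names, the tokenization is plain whitespace splitting, and the scoring rule is just one arbitrary choice to show that the ranking step can be swapped out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass
class Match:
    """What the (hypothetical) matcher reports for one matching document."""
    doc_id: int
    term_freqs: Dict[str, int]  # matched query term -> frequency in the doc


def matcher(query_terms: List[str], docs: List[str]) -> Iterator[Match]:
    """Yield every doc containing at least one query term, with match details."""
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        freqs = {t: tokens.count(t) for t in query_terms if t in tokens}
        if freqs:
            yield Match(doc_id, freqs)


def coverage_scorer(match: Match, query_terms: List[str]) -> float:
    """One possible ranking rule: fraction of query terms that matched."""
    return len(match.term_freqs) / len(query_terms)


def search(query: str, docs: List[str],
           scorer: Callable[[Match, List[str]], float] = coverage_scorer
           ) -> List[Tuple[float, int]]:
    """Run the matcher, score each match, return (score, doc_id) best first."""
    terms = list(dict.fromkeys(query.lower().split()))  # distinct, order kept
    scored = [(scorer(m, terms), m.doc_id) for m in matcher(terms, docs)]
    return sorted(scored, reverse=True)
```

Because the matcher hands over the matched terms themselves, a different scorer (or a filter that rejects weak matches) can be plugged in without touching the matching code.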
In reply to this post by Chris Book
I have done that by getting the top X hits, finding the best match among them (a combination of Levenshtein distance, contains... I tweaked the code until testing showed good results), and then deciding whether the candidate was a match or not, again based on custom code plus a user-defined leniency value.
xab |
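For anyone wanting to try this kind of post-filtering, a minimal sketch follows. It is not xab's actual code: it only uses Levenshtein distance (no "contains" check), the similarity formula is an assumption, and `best_match` and the `leniency` scale (0 = exact matches only, 1 = accept anything) are hypothetical. `top_hits` stands for the titles returned by the search.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(query: str, candidate: str) -> float:
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(query), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(query, candidate) / longest


def best_match(query: str, top_hits: List[str],
               leniency: float = 0.3) -> Optional[str]:
    """Pick the closest hit, or None if even the closest one is too far off."""
    if not top_hits:
        return None
    q = query.lower()
    best = max(top_hits, key=lambda hit: similarity(q, hit.lower()))
    return best if similarity(q, best.lower()) >= 1.0 - leniency else None
```

With the example from the original question, `best_match("dust in wind", ["dust in the wind"])` returns the title, while `best_match("dust", ["dust in the wind"])` returns `None`.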
In reply to this post by Li Li
Hi,
This use case is similar to the problem of matching boolean expressions. You can find a recent thread about it. I have an idea: we could introduce a disjunction query with a dynamic mm (the minShouldMatch parameter, http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)), i.e. 'match these clauses disjunctively, but for every document use the value from the field cache of field xxxCount as the minShouldMatch parameter'. Norms could also be used as a source for dynamic mm values.

Wdyt?

On Wed, Apr 11, 2012 at 10:08 AM, Li Li <[hidden email]> wrote:
> it's not possible now because lucene don't support this.
> [...]

--
Sincerely yours
Mikhail Khludnev
[hidden email]
<http://www.griddynamics.com>
<[hidden email]> |
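The dynamic mm idea can be simulated outside Lucene to see how it would behave. The sketch below is only an illustration under stated assumptions: `Doc.term_count` plays the role of the per-document xxxCount field, `slack` is a made-up knob for how many field terms the query may omit, and terms are split on whitespace. No such query type exists in Lucene as far as this thread says; this is plain Python.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    term_count: int  # stands in for the per-document xxxCount field


def index_title(title: str) -> Doc:
    """Store the number of distinct terms alongside the title at index time."""
    return Doc(title=title, term_count=len(set(title.lower().split())))


def matches_with_dynamic_mm(query: str, doc: Doc, slack: int = 1) -> bool:
    """Disjunctive match whose minimum-should-match comes from the document.

    The required number of matching clauses is the document's own term
    count minus `slack`, so longer titles demand more matching terms.
    """
    query_terms = set(query.lower().split())
    doc_terms = set(doc.title.lower().split())
    matched = len(query_terms & doc_terms)
    required = max(1, doc.term_count - slack)
    return matched >= required


doc = index_title("dust in the wind")
print(matches_with_dynamic_mm("dust in wind", doc))  # True: 3 of 4 terms, 3 required
print(matches_with_dynamic_mm("dust", doc))          # False: 1 of 4 terms
```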
I searched my mail but found nothing. The thread that comes up for the keywords "boolean expression" is "Indexing Boolean Expressions" from joaquin.delgado.

To tell which terms matched, for BooleanScorer2, a simple method is to modify DisjunctionSumScorer and add a BitSet to record the matched scorers. When the collector collects a document, it can get the scorer and recursively find the matched terms.

But I think it may be better to add a component, maybe named matcher, that does the matching job, with the scorer using the information from the matcher to do the ranking.

On Wed, Apr 11, 2012 at 4:32 PM, Mikhail Khludnev <[hidden email]> wrote:
> Hi,
> This use case is similar to matching boolean expression problem.
> [...] |
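The bookkeeping behind the BitSet suggestion is small. Here is a toy version in plain Python, using an integer as the bitset; it is not a patch to DisjunctionSumScorer and touches no Lucene classes, and both function names are invented for the example. Each query term stands in for one sub-scorer.

```python
from typing import List, Tuple


def disjunction_with_bitset(query_terms: List[str],
                            doc_text: str) -> Tuple[int, int]:
    """Return (match_count, bitset) for one document.

    Bit i of the bitset is set when query_terms[i] occurs in the document,
    so the caller learns which clauses matched, not only how many.
    """
    doc_terms = set(doc_text.lower().split())
    matched_bits = 0
    for i, term in enumerate(query_terms):
        if term in doc_terms:
            matched_bits |= 1 << i
    return bin(matched_bits).count("1"), matched_bits


def matched_terms(query_terms: List[str], matched_bits: int) -> List[str]:
    """Decode the bitset back into the query terms that matched."""
    return [t for i, t in enumerate(query_terms) if (matched_bits >> i) & 1]


terms = ["dust", "storm", "wind"]
count, bits = disjunction_with_bitset(terms, "dust in the wind")
print(count, matched_terms(terms, bits))  # 2 ['dust', 'wind']
```

A collector that receives the bitset along with the doc could then apply whatever acceptance rule it likes, which is roughly what the separate matcher component would formalize.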