
using solr to do a 'match'


using solr to do a 'match'

Chris Book
Hello, I have a Solr index running that is working very well as a search.
But I want to add the ability (if possible) to use it to do matching.  The
problem is that by default it only checks that all the input terms are
present, and it doesn't give me any indication of how many terms in the
target field were not specified by the input.

For example, if I'm trying to match to the song title "dust in the wind",
I'm correctly getting a match if the input query is "dust in wind".  But I
don't want to get a match if the input is just "dust".  Although as a
search "dust" should return this result, I'm looking for some way to filter
this out based on some indication that the input isn't close enough to the
output.  Perhaps if I could get information that the number of input
terms is much less than the number of terms in the field.  Or something
else along those lines?

I realize that this isn't the typical use case for a search, but I'm just
looking for some suggestions as to how I could improve the above example a
bit.

Thanks,
Chris
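
A minimal sketch of the term-count idea from the post above, assuming the check runs in client code on titles that Solr has already returned. The whitespace tokenizer, the function names and the 0.75 threshold are illustrative assumptions, not anything Solr provides.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def coverage(query, title):
    """Fraction of the title's distinct terms that also occur in the query."""
    title_terms = set(tokenize(title))
    if not title_terms:
        return 0.0
    query_terms = set(tokenize(query))
    return len(title_terms & query_terms) / len(title_terms)


def is_match(query, title, min_coverage=0.75):
    """Accept a hit only when the query covers enough of the title."""
    return coverage(query, title) >= min_coverage


if __name__ == "__main__":
    title = "dust in the wind"
    print(is_match("dust in wind", title))  # 3 of 4 title terms covered
    print(is_match("dust", title))          # 1 of 4 title terms covered
```

With this threshold "dust in wind" is kept and "dust" is filtered out; the right cutoff would have to be tuned against real data.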

Re: using solr to do a 'match'

Li Li
This isn't possible right now because Lucene doesn't support it.
When running a disjunction query, it only records how many terms match the
document.
I think this is a common requirement for many users.
I suggest Lucene should split the scorer into a matcher and a scorer.
The matcher would just return which docs matched and why/how each doc matched.
For a disjunction query in particular, it should tell which terms matched, and
possibly other information such as tf/idf and the distance between terms
(to support proximity search).
That's the matcher's job. The scorer (a ranking algorithm) would then use a
flexible algorithm to score the document, and the collector can collect it.
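
The matcher/scorer split proposed above can be sketched roughly as follows. This is plain Python over an in-memory dictionary, not Lucene code; `MatchInfo`, `match` and `score` are hypothetical names, and the scoring rule is just one possible choice.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MatchInfo:
    doc_id: int
    matched_terms: dict  # term -> term frequency in the document


def match(query_terms, docs):
    """Matcher: yield each document containing at least one query term,
    together with which terms matched and how often."""
    for doc_id, text in docs.items():
        tf = Counter(text.lower().split())
        hits = {t: tf[t] for t in query_terms if tf[t] > 0}
        if hits:
            yield MatchInfo(doc_id, hits)


def score(info, query_terms):
    """Scorer: one possible ranking rule, the fraction of query terms matched.
    query_terms is expected to be a set of distinct terms."""
    return len(info.matched_terms) / len(query_terms)


if __name__ == "__main__":
    docs = {1: "dust in the wind", 2: "wind of change", 3: "yesterday"}
    query = {"dust", "wind"}
    for info in match(query, docs):
        print(info.doc_id, sorted(info.matched_terms), score(info, query))
```

Because the matcher hands over the matched terms themselves, a different scorer (or a hard filter) can be swapped in without touching the matching step.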

On Wed, Apr 11, 2012 at 10:28 AM, Chris Book <[hidden email]> wrote:


Re: using solr to do a 'match'

jmlucjav
In reply to this post by Chris Book
I have done that by getting the top X hits, finding the best match among them (a combination of Levenshtein distance, contains... tweaked the code till testing showed good results), and then deciding whether the candidate was a match or not, again based on custom code plus a user-defined leniency value.

xab
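
A rough sketch of the post-filtering approach described above, assuming the top X hits have already been fetched from Solr as plain strings. Only the Levenshtein part is shown; the way `leniency` is interpreted here (maximum edit distance as a fraction of the longer string) is an assumption, not the poster's actual code.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def best_match(query, candidates, leniency=0.3):
    """Return the closest candidate, or None if even the best one is too far.

    leniency is the maximum allowed edit distance, expressed as a fraction
    of the length of the longer of the two strings.
    """
    if not candidates:
        return None
    q = query.lower()
    best = min(candidates, key=lambda c: levenshtein(q, c.lower()))
    limit = leniency * max(len(q), len(best))
    return best if levenshtein(q, best.lower()) <= limit else None


if __name__ == "__main__":
    hits = ["dust in the wind", "dust"]
    print(best_match("dust in wind", hits))
    print(best_match("dust", ["dust in the wind"]))
```

A higher leniency accepts looser matches; in practice the value would be tuned by testing, as the post describes.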

Re: using solr to do a 'match'

Mikhail Khludnev
In reply to this post by Li Li
Hi,

This use case is similar to the problem of matching boolean expressions. You can
find a recent thread about it. I have an idea that we can introduce a
disjunction query with a dynamic mm (the minShouldMatch parameter,
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)),
i.e. 'match these clauses disjunctively, but for every document use the value
from the field cache of field xxxCount as the minShouldMatch parameter'. Also,
norms can be used as a source for dynamic mm values.

Wdyt?
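
To make the dynamic mm idea concrete, here is a small simulation in plain Python. It does not use any Lucene API: the `index` dictionary stands in for the inverted index plus a per-document xxxCount value, and the `slack` parameter is a hypothetical addition that lets a document match even when a few of its terms are missing from the query.

```python
def dynamic_mm_search(query_terms, index, slack=0):
    """Disjunctive match where each document supplies its own minShouldMatch.

    index maps doc_id -> (set of terms, stored term count).
    A document matches when the number of query terms it contains reaches
    its own stored count minus `slack` (but never less than 1).
    """
    query = set(query_terms)
    hits = []
    for doc_id, (terms, term_count) in index.items():
        matched = len(query & terms)
        required = max(1, term_count - slack)
        if matched >= required:
            hits.append(doc_id)
    return hits


if __name__ == "__main__":
    index = {
        1: ({"dust", "in", "the", "wind"}, 4),
        2: ({"dust"}, 1),
    }
    print(dynamic_mm_search(["dust"], index))                          # only doc 2
    print(dynamic_mm_search(["dust", "in", "wind"], index, slack=1))   # docs 1 and 2
```

With `slack=0` a document matches only if every one of its terms appears in the query, which is the strict reading of using xxxCount as minShouldMatch.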

On Wed, Apr 11, 2012 at 10:08 AM, Li Li <[hidden email]> wrote:

--
Sincerely yours
Mikhail Khludnev
[hidden email]

<http://www.griddynamics.com>
 <[hidden email]>

Re: using solr to do a 'match'

Li Li
I searched my mail but found nothing.
The thread I found with the keywords "boolean expression" is "Indexing Boolean
Expressions" from joaquin.delgado.
To tell which terms matched, for BooleanScorer2, a simple method is to
modify DisjunctionSumScorer and add a BitSet to record the matched scorers.
When the collector collects the document, it can get the scorer and recursively
find the matched terms.
But I think it may be better to add a component, maybe named matcher, that
does the matching job, and have the scorer use the information from the
matcher to do the ranking.
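
The BitSet idea can be illustrated with a toy model. This is not the code of BooleanScorer2 or DisjunctionSumScorer; `TermClause` and `Disjunction` are hypothetical classes, and a Python integer is used as the bit set.

```python
class TermClause:
    """Leaf clause: matches when its term occurs in the document."""

    def __init__(self, term):
        self.term = term

    def matches(self, doc_terms):
        return self.term in doc_terms

    def matched_terms(self):
        return [self.term]


class Disjunction:
    """OR over sub-clauses that remembers which of them matched."""

    def __init__(self, clauses):
        self.clauses = clauses
        self.matched = 0  # bit i is set when clauses[i] matched the last document

    def matches(self, doc_terms):
        self.matched = 0
        # Evaluate every clause (no short-circuit) so the bit set is complete.
        for i, clause in enumerate(self.clauses):
            if clause.matches(doc_terms):
                self.matched |= 1 << i
        return self.matched != 0

    def matched_terms(self):
        """Walk the matched sub-clauses recursively and collect their terms."""
        out = []
        for i, clause in enumerate(self.clauses):
            if (self.matched >> i) & 1:
                out.extend(clause.matched_terms())
        return out


if __name__ == "__main__":
    query = Disjunction([
        TermClause("dust"),
        Disjunction([TermClause("wind"), TermClause("rain")]),
        TermClause("fire"),
    ])
    if query.matches({"dust", "in", "the", "wind"}):
        print(query.matched_terms())
```

A collector in this model would call `matched_terms()` right after `matches()` returns, since the recorded bits are overwritten for the next document.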

On Wed, Apr 11, 2012 at 4:32 PM, Mikhail Khludnev <
[hidden email]> wrote:
